Utils

The data utilities in this module convert time-series data from a pandas DataFrame into a PyTorch torch.utils.data.Dataset and/or torch.utils.data.DataLoader. The most common pattern is to use the from_dataframe() classmethod.

Additionally, utility functions are provided for handling missing data and for adding calendar features (i.e., weekly/daily/yearly seasonal dummy features that can be passed to any neural network).

class torchcast.utils.TimeSeriesDataLoader(dataset: torch.utils.data.dataset.Dataset, batch_size: Optional[int], pad_X: Optional[float] = 0.0, **kwargs)

This is a convenience wrapper around DataLoader(collate_fn=TimeSeriesDataset.make_collate_fn()). Additionally, it provides a from_dataframe() classmethod so that the data-loader can be created directly from a pandas dataframe. This can be more memory-efficient than the alternative route of first creating a TimeSeriesDataset from a dataframe, and then passing that object to a data-loader.

classmethod from_dataframe(dataframe: DataFrame, group_colname: str, time_colname: str, dt_unit: Optional[str], measure_colnames: Optional[Sequence[str]] = None, X_colnames: Optional[Sequence[str]] = None, y_colnames: Optional[Sequence[str]] = None, pad_X: Optional[float] = 0.0, **kwargs) → TimeSeriesDataLoader

Parameters

dataframe – A pandas DataFrame.
group_colname – Name of the group column.
time_colname – Name of the time column.
dt_unit – A numpy.timedelta64 (or string that will be converted to one) that indicates the time-units used – i.e., how far we advance with every timestep. Can be None if the data are in arbitrary (non-datetime) units.
measure_colnames – A list of names of columns that include the actual time-series data in the dataframe. Optional if X_colnames and y_colnames are passed.
X_colnames – In many settings we have a set of columns corresponding to predictors and a set of columns corresponding to the actual time-series data. The former should be passed as X_colnames and the latter as y_colnames.
y_colnames – See above.
pad_X – When stacking time series of unequal length, we left-align them, so the shorter ones get trailing missing values. Setting pad_X selects the padding value for these. Defaults to 0-padding.
kwargs – Other arguments to pass to TimeSeriesDataset.from_dataframe().

Returns

A TimeSeriesDataLoader, i.e., an iterable that yields TimeSeriesDataset batches.
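The left-align-and-pad behavior described for pad_X can be sketched in plain NumPy. This is only an illustration of the idea, not torchcast's implementation; the helper name pad_and_stack is hypothetical.

```python
import numpy as np

def pad_and_stack(series, pad_value=0.0):
    """Left-align 2D arrays (timesteps x measures) and pad the tail of shorter ones."""
    max_len = max(s.shape[0] for s in series)
    num_measures = series[0].shape[1]
    # Start from an array filled with the padding value.
    out = np.full((len(series), max_len, num_measures), pad_value, dtype=float)
    for i, s in enumerate(series):
        # Each series fills the leading timesteps; the rest stays padded.
        out[i, : s.shape[0], :] = s
    return out

short = np.ones((3, 2))  # 3 timesteps, 2 measures
long = np.ones((5, 2))   # 5 timesteps, 2 measures
stacked = pad_and_stack([short, long], pad_value=0.0)
print(stacked.shape)  # (2, 5, 2)
```

The result has shape (groups, timesteps, measures), and the last two timesteps of the shorter series hold the padding value.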

class torchcast.utils.TimeSeriesDataset(*tensors: torch.Tensor, group_names: Sequence[Any], start_times: Union[numpy.ndarray, Sequence], measures: Sequence[Sequence[str]], dt_unit: Optional[str])

TimeSeriesDataset includes additional information about each of the Tensors’ dimensions: the name for each group in the first dimension, the start (date)time (and optionally datetime-unit) for the second dimension, and the name of the measures for the third dimension.

Note that unlike torch.utils.data.TensorDataset, indexing a TimeSeriesDataset returns another TimeSeriesDataset, not a tuple of tensors. So when using TimeSeriesDataset, use TimeSeriesDataLoader (equivalent to DataLoader(collate_fn=TimeSeriesDataset.make_collate_fn())).

classmethod from_dataframe(dataframe: DataFrame, group_colname: Optional[str], time_colname: str, dt_unit: Optional[str], measure_colnames: Optional[Sequence[str]] = None, X_colnames: Optional[Sequence[str]] = None, y_colnames: Optional[Sequence[str]] = None, pad_X: Optional[float] = 0.0, **kwargs) → TimeSeriesDataset

Parameters

dataframe – A pandas DataFrame.
group_colname – Name of the group column.
time_colname – Name of the time column.
dt_unit – A numpy.timedelta64 (or string that will be converted to one) that indicates the time-units used – i.e., how far we advance with every timestep. Can be None if the data are in arbitrary (non-datetime) units.
measure_colnames – A list of names of columns that include the actual time-series data in the dataframe. Optional if X_colnames and y_colnames are passed.
X_colnames – In many settings we have a set of columns corresponding to predictors and a set of columns corresponding to the actual time-series data. The former should be passed as X_colnames and the latter as y_colnames.
y_colnames – See above.
pad_X – When stacking time series of unequal length, we left-align them, so the shorter ones get trailing missing values. Setting pad_X selects the padding value for these. Defaults to 0-padding.
kwargs – The dtype and/or the device.

Returns

A TimeSeriesDataset.

get_groups(groups: Sequence[Any]) → torchcast.utils.data.TimeSeriesDataset

Get the subset of the batch corresponding to groups. Note that the ordering in the output will match the original ordering (not that of groups), and that duplicates will be dropped.
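The ordering and de-duplication rule can be illustrated with plain Python lists. This is a sketch of the documented semantics only; the variable names are hypothetical and no torchcast code is involved.

```python
group_names = ["a", "b", "c", "d"]   # order in which groups are stored
requested = ["c", "a", "c"]          # requested order, with a duplicate

# Walk the stored order and keep groups that were requested.
requested_set = set(requested)
selected = [name for name in group_names if name in requested_set]
print(selected)  # ['a', 'c']
```

The output follows the stored order ("a" before "c") and the repeated "c" appears once.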

split_measures(*measure_groups, which: Optional[int] = None) → torchcast.utils.data.TimeSeriesDataset

Take a dataset with one tensor, split it into a dataset with multiple tensors.

Parameters

measure_groups – Each argument should be a list of measure-names, or an indexer (i.e. list of ints or a slice).
which – If there are already multiple measure groups, the split will occur within one of them; must specify which.

Returns

A TimeSeriesDataset, now with multiple tensors for the measure-groups.
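Conceptually, splitting measures slices the third (measure) dimension of a tensor by name. The NumPy sketch below illustrates this under that assumption; split_by_measures is a hypothetical helper, not the library method.

```python
import numpy as np

measures = ["y", "x1", "x2"]
# Shape: (groups, timesteps, measures)
data = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

def split_by_measures(array, measure_names, *measure_groups):
    """Return one array per measure-group, selected along the last dimension."""
    pieces = []
    for group in measure_groups:
        idx = [measure_names.index(name) for name in group]
        pieces.append(array[:, :, idx])
    return pieces

y, X = split_by_measures(data, measures, ["y"], ["x1", "x2"])
print(y.shape, X.shape)  # (2, 4, 1) (2, 4, 2)
```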

times(which: Optional[int] = None) → numpy.ndarray

A 2D array of datetimes (or integers if dt_unit is None) for this dataset.

Parameters

which – If this dataset has multiple tensors with different numbers of timesteps, which should be used for constructing the times array? Defaults to the one with the most timesteps.

Returns

A 2D numpy array of datetimes (or integers if dt_unit is None).
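A 2D array of this kind can be built from per-group start times and a time unit using NumPy broadcasting. The sketch below shows the arithmetic only; it assumes a daily dt_unit and does not call torchcast.

```python
import numpy as np

start_times = np.array(["2021-01-01", "2021-01-04"], dtype="datetime64[D]")
num_timesteps = 3
dt_unit = np.timedelta64(1, "D")

# Offsets from the start: 0, 1, 2 time units.
offsets = np.arange(num_timesteps) * dt_unit
# Broadcast to shape (groups, timesteps).
times = start_times[:, None] + offsets[None, :]
print(times.shape)  # (2, 3)
```

Each row starts at that group's start time and advances by one dt_unit per column.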

train_val_split(train_frac: float = None, dt: Union[numpy.datetime64, dict] = None, quiet: bool = False) → Tuple[torchcast.utils.data.TimeSeriesDataset, torchcast.utils.data.TimeSeriesDataset]

Parameters

train_frac – The proportion of the data to keep for training. This is calculated on a per-group basis, by taking the last observation for each group (i.e., the last observation that has a non-nan value on any measure). If neither train_frac nor dt are passed, train_frac=.75 is used.
dt – A datetime to use in dividing train/validation (first datetime for validation), or a dictionary of group-names : date-times.

Returns

Two TimeSeriesDatasets: one with data before the split, the other with data at or after the split.
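The per-group train_frac calculation can be sketched in NumPy as follows. This is an illustration, not the library's code: the helper name train_lengths is hypothetical, and rounding the cut point down is an assumption.

```python
import numpy as np

def train_lengths(tensor, train_frac=0.75):
    """Number of leading timesteps per group that would go to training."""
    # A timestep counts as observed if at least one measure is non-nan.
    observed = ~np.isnan(tensor).all(axis=2)  # shape (groups, timesteps)
    # Position of the last observed timestep in each group.
    # Assumes every group has at least one observed timestep.
    num_times = tensor.shape[1]
    last_idx = num_times - 1 - np.argmax(observed[:, ::-1], axis=1)
    # Assumption: round the cut point down.
    return np.floor((last_idx + 1) * train_frac).astype(int)

data = np.full((2, 8, 1), np.nan)
data[0, :8, 0] = 1.0  # group 0: 8 observed timesteps
data[1, :4, 0] = 1.0  # group 1: 4 observed timesteps, then nan padding
cuts = train_lengths(data)
print(cuts)  # [6 3]
```

Because the fraction is applied to each group's own observed length, the shorter group is split at timestep 3 rather than at 6.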

with_new_start_times(start_times: Union[numpy.ndarray, Sequence], quiet: bool = False) → torchcast.utils.data.TimeSeriesDataset

Subset a TimeSeriesDataset so that some/all of the groups have later start times.

Parameters

start_times – An array/list of new datetimes.
quiet – If True, will not emit a warning for groups having only nan after the start-time.

Returns

A new TimeSeriesDataset.
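For a single group, moving to a later start time amounts to dropping the leading timesteps that fall before it. A minimal NumPy sketch, assuming a daily dt_unit and using hypothetical variable names:

```python
import numpy as np

dt_unit = np.timedelta64(1, "D")
old_start = np.datetime64("2021-01-01")
new_start = np.datetime64("2021-01-03")
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

# How many timesteps lie between the old and new start.
offset = int((new_start - old_start) / dt_unit)
trimmed = values[offset:]
print(offset, trimmed.tolist())  # 2 [12.0, 13.0, 14.0]
```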

with_new_tensors(*tensors: torch.Tensor) → torchcast.utils.data.TimeSeriesDataset

Create a new TimeSeriesDataset with different tensors, but all other attributes the same.

torchcast.utils.add_season_features(data: DataFrame, K: int, period: Union[numpy.timedelta64, str], time_colname: Optional[str] = None) → DataFrame

Add season features to data by taking a date[time] column and passing it through a Fourier transform.

Parameters

data – A dataframe with a date[time] column.
K – The degrees of freedom for the Fourier transform. Higher K means more flexible seasons can be captured.
period – Either a np.timedelta64, or one of {'weekly', 'yearly', 'daily'}.
time_colname – The name of the date[time] column. Default is to try and guess with the following (in order): 'datetime', 'date', 'timestamp', 'time'.

Returns

A copy of the original dataframe, now with K*2 additional columns capturing the seasonal pattern.
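The K*2 columns correspond to sine and cosine pairs at K harmonics of the period. The NumPy sketch below shows one common way to build such features; it is an assumption about the general technique, not torchcast's exact code, and fourier_features is a hypothetical helper.

```python
import numpy as np

def fourier_features(times, K, period):
    """Return an array with 2*K columns: sin and cos for harmonics 1..K."""
    # Elapsed time since the Unix epoch, measured in units of the period.
    elapsed = (times - np.datetime64("1970-01-01")) / period
    cols = []
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * elapsed))
        cols.append(np.cos(2 * np.pi * k * elapsed))
    return np.column_stack(cols)

# Two weeks of daily timestamps, weekly period, K=2 harmonics.
times = np.arange("2021-01-01", "2021-01-15", dtype="datetime64[D]")
features = fourier_features(times, K=2, period=np.timedelta64(7, "D"))
print(features.shape)  # (14, 4)
```

Rows that are exactly one period apart (here, seven days) get the same feature values, which is what lets a model learn a repeating seasonal pattern from them.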

torchcast.utils.complete_times(data: DataFrame, group_colname: str, time_colname: Optional[str] = None, dt_unit: Optional[str] = None)

Given a dataframe of time series, convert implicit missings within each time series to explicit missings.

Parameters

data – A pandas dataframe.
group_colname – The column name for the groups.
time_colname – The column name for the times. Will attempt to guess based on common labels.
dt_unit – A numpy.timedelta64 or string representing the datetime increments. If not supplied, will try to guess based on the smallest difference in the data.

Returns

A dataframe where implicit missings are converted to explicit missings, but the min/max time for each group is preserved.
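The conversion from implicit to explicit missings can be sketched with pandas alone. This illustrates the idea, not the library's implementation; fill_time_gaps is a hypothetical helper and the daily frequency is an assumption of the example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-04", "2021-01-02", "2021-01-03"]
    ),
    "y": [1.0, 2.0, 4.0, 5.0, 6.0],
})

def fill_time_gaps(data, group_colname, time_colname, freq):
    """Insert a NaN row for every timestep missing between a group's min and max time."""
    pieces = []
    for name, grp in data.groupby(group_colname):
        # Full range of times for this group only (its min/max are preserved).
        full_index = pd.date_range(
            grp[time_colname].min(), grp[time_colname].max(), freq=freq
        )
        filled = (
            grp.set_index(time_colname)
            .reindex(full_index)
            .rename_axis(time_colname)
            .reset_index()
        )
        filled[group_colname] = name  # restore the group label on new rows
        pieces.append(filled)
    return pd.concat(pieces, ignore_index=True)

completed = fill_time_gaps(df, "group", "time", freq="D")
print(len(completed))  # 6
```

Group "a" gains an explicit row for 2021-01-03 with y set to NaN, while group "b" is unchanged because it has no gaps.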