Utils
The data utilities in this module convert time-series data from a pandas DataFrame into a PyTorch
torch.utils.data.Dataset and/or torch.utils.data.DataLoader. The most common pattern is to use the
from_dataframe() classmethod.
Additionally, utility functions are provided for handling missing data and adding calendar features (i.e. weekly/daily/yearly seasonal dummy features that can be passed to any neural network).
- class torchcast.utils.TimeSeriesDataLoader(dataset: torch.utils.data.dataset.Dataset, batch_size: Optional[int], pad_X: Optional[float] = 0.0, **kwargs)
This is a convenience wrapper around DataLoader(collate_fn=TimeSeriesDataset.make_collate_fn()). Additionally, it provides a from_dataframe() classmethod so that the data-loader can be created directly from a pandas dataframe. This can be more memory-efficient than the alternative route of first creating a TimeSeriesDataset from a dataframe, and then passing that object to a data-loader.
- classmethod from_dataframe(dataframe: DataFrame, group_colname: str, time_colname: str, dt_unit: Optional[str], measure_colnames: Optional[Sequence[str]] = None, X_colnames: Optional[Sequence[str]] = None, y_colnames: Optional[Sequence[str]] = None, pad_X: Optional[float] = 0.0, **kwargs) TimeSeriesDataLoader
- Parameters
dataframe – A pandas DataFrame.
group_colname – Name of the group column.
time_colname – Name of the time column.
dt_unit – A numpy.timedelta64 (or string that will be converted to one) that indicates the time-units used – i.e., how far we advance with every timestep. Can be None if the data are in arbitrary (non-datetime) units.
measure_colnames – A list of names of columns that include the actual time-series data in the dataframe. Optional if X_colnames and y_colnames are passed.
X_colnames – In many settings we have a set of columns corresponding to predictors and a set of columns corresponding to the actual time-series data. The former should be passed as X_colnames and the latter as y_colnames.
y_colnames – See above.
pad_X – When stacking time series of unequal length, we left-align them and so get trailing missing values. Setting pad_X allows you to select the padding value for these. Default is 0-padding.
kwargs – Other arguments to pass to TimeSeriesDataset.from_dataframe().
- Returns
An iterable that yields TimeSeriesDataset objects.
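The padding behavior described for pad_X can be sketched in plain Python. This is only an illustration of left-aligning and trailing-padding, not torchcast's implementation; the helper name left_align_and_pad and the example group names are hypothetical.

```python
def left_align_and_pad(series_by_group, pad_value=0.0):
    """Stack variable-length series into equal-length rows, padding the tail."""
    max_len = max(len(values) for values in series_by_group.values())
    return {
        group: list(values) + [pad_value] * (max_len - len(values))
        for group, values in series_by_group.items()
    }


# Two groups with predictor series of unequal length (hypothetical data).
predictors = {"store_a": [1.0, 2.0, 3.0, 4.0], "store_b": [5.0, 6.0]}
padded = left_align_and_pad(predictors, pad_value=0.0)
print(padded["store_b"])  # the shorter series gains two trailing pad values
```

Every series starts at index 0 (left-aligned), so only the shorter ones gain padding, and only at the end.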
- class torchcast.utils.TimeSeriesDataset(*tensors: torch.Tensor, group_names: Sequence[Any], start_times: Union[numpy.ndarray, Sequence], measures: Sequence[Sequence[str]], dt_unit: Optional[str])
TimeSeriesDataset includes additional information about each of the Tensors’ dimensions: the name for each group in the first dimension, the start (date)time (and optionally datetime-unit) for the second dimension, and the name of the measures for the third dimension.
Note that unlike torch.utils.data.TensorDataset, indexing a TimeSeriesDataset returns another TimeSeriesDataset, not a tuple of tensors. So when using TimeSeriesDataset, use TimeSeriesDataLoader (equivalent to DataLoader(collate_fn=TimeSeriesDataset.collate)).
- classmethod from_dataframe(dataframe: DataFrame, group_colname: Optional[str], time_colname: str, dt_unit: Optional[str], measure_colnames: Optional[Sequence[str]] = None, X_colnames: Optional[Sequence[str]] = None, y_colnames: Optional[Sequence[str]] = None, pad_X: Optional[float] = 0.0, **kwargs) TimeSeriesDataset
- Parameters
dataframe – A pandas DataFrame.
group_colname – Name of the group column.
time_colname – Name of the time column.
dt_unit – A numpy.timedelta64 (or string that will be converted to one) that indicates the time-units used – i.e., how far we advance with every timestep. Can be None if the data are in arbitrary (non-datetime) units.
measure_colnames – A list of names of columns that include the actual time-series data in the dataframe. Optional if X_colnames and y_colnames are passed.
X_colnames – In many settings we have a set of columns corresponding to predictors and a set of columns corresponding to the actual time-series data. The former should be passed as X_colnames and the latter as y_colnames.
y_colnames – See above.
pad_X – When stacking time series of unequal length, we left-align them and so get trailing missing values. Setting pad_X allows you to select the padding value for these. Default is 0-padding.
kwargs – The dtype and/or the device.
- Returns
- get_groups(groups: Sequence[Any]) torchcast.utils.data.TimeSeriesDataset
Get the subset of the batch corresponding to groups. Note that the ordering in the output will match the original ordering (not that of groups), and that duplicates will be dropped.
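The ordering and de-duplication rule above can be illustrated with a plain-Python sketch. The helper select_groups is hypothetical and operates on a list of group names rather than on a TimeSeriesDataset; it assumes the original group names are unique.

```python
def select_groups(original_groups, requested):
    """Keep requested groups, in the dataset's original order, without duplicates."""
    wanted = set(requested)
    return [name for name in original_groups if name in wanted]


original_groups = ["a", "b", "c", "d"]
# The request is out of order and repeats "d".
selected = select_groups(original_groups, ["d", "b", "d"])
print(selected)  # ['b', 'd']
```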
- split_measures(*measure_groups, which: Optional[int] = None) torchcast.utils.data.TimeSeriesDataset
Take a dataset with one tensor and split it into a dataset with multiple tensors.
- Parameters
measure_groups – Each argument should be a list of measure-names, or an indexer (i.e. list of ints or a slice).
which – If there are already multiple measure groups, the split will occur within one of them; must specify which.
- Returns
A TimeSeriesDataset, now with multiple tensors for the measure-groups.
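As a rough picture of what splitting by measure-names means, the sketch below splits the measure dimension of a small nested list into one piece per measure-group. The function split_columns and the measure names are hypothetical; the sketch handles names only (not int or slice indexers) and does not reflect how torchcast stores or splits its tensors.

```python
def split_columns(rows, measures, *measure_groups):
    """Split each row's values into one list per measure-group (names only)."""
    pieces = []
    for names in measure_groups:
        positions = [measures.index(name) for name in names]
        pieces.append([[row[p] for p in positions] for row in rows])
    return pieces


measures = ["sales", "price", "promo"]
rows = [[10.0, 2.5, 0.0], [12.0, 2.5, 1.0]]  # two timesteps of one group
y, X = split_columns(rows, measures, ["sales"], ["price", "promo"])
print(y)  # values of the "sales" measure only
print(X)  # values of the "price" and "promo" measures
```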
- times(which: Optional[int] = None) numpy.ndarray
A 2D array of datetimes (or integers if dt_unit is None) for this dataset.
- Parameters
which – If this dataset has multiple tensors with different numbers of timesteps, which should be used for constructing the times array? Defaults to the one with the most timesteps.
- Returns
A 2D numpy array of datetimes (or integers if dt_unit is None).
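Conceptually, such an array has one row per group, starting at that group's start time and advancing by one time unit per column. The sketch below builds the same shape with nested lists; build_times is a hypothetical helper, and treating the no-unit case as integer offsets from the start is an assumption.

```python
from datetime import datetime, timedelta


def build_times(start_times, num_timesteps, step=None):
    """Return one row of times per group: start, start + step, start + 2 * step, ..."""
    if step is None:
        # Assumed behavior for arbitrary units: integer offsets from the start.
        return [[start + t for t in range(num_timesteps)] for start in start_times]
    return [[start + t * step for t in range(num_timesteps)] for start in start_times]


starts = [datetime(2024, 1, 1), datetime(2024, 1, 3)]
daily = build_times(starts, num_timesteps=3, step=timedelta(days=1))
integers = build_times([0, 5], num_timesteps=3)
print(daily[1])   # three consecutive days starting 2024-01-03
print(integers)   # [[0, 1, 2], [5, 6, 7]]
```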
- train_val_split(train_frac: float = None, dt: Union[numpy.datetime64, dict] = None, quiet: bool = False) Tuple[torchcast.utils.data.TimeSeriesDataset, torchcast.utils.data.TimeSeriesDataset]
- Parameters
train_frac – The proportion of the data to keep for training. This is calculated on a per-group basis, by taking the last observation for each group (i.e., the last observation that has a non-nan value on any measure). If neither train_frac nor dt is passed, train_frac=.75 is used.
dt – A datetime to use in dividing train/validation (first datetime for validation), or a dictionary of group-names : date-times.
- Returns
Two TimeSeriesDatasets, one with data before the split, the other with data at or after (>=) the split.
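The per-group train_frac logic can be sketched for a single series as follows. This is an illustration under stated assumptions, not torchcast's code: split_by_train_frac is a hypothetical helper, and the rounding of the split point (truncation via int) is assumed.

```python
import math


def split_by_train_frac(values, train_frac=0.75):
    """Split one group's series at train_frac of its observed length."""
    observed = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not observed:
        raise ValueError("series has no non-missing observations")
    observed_length = observed[-1] + 1  # timesteps up to the last non-nan value
    split_at = int(observed_length * train_frac)  # rounding rule is an assumption
    return values[:split_at], values[split_at:]


# Eight observed timesteps followed by trailing missing values.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, math.nan, math.nan]
train, val = split_by_train_frac(series)
print(len(train), len(val))  # 6 4
```

Because the fraction is taken relative to the last non-nan observation, trailing missing values do not shift the split point.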
- with_new_start_times(start_times: Union[numpy.ndarray, Sequence], quiet: bool = False) torchcast.utils.data.TimeSeriesDataset
Subset a TimeSeriesDataset so that some/all of the groups have later start times.
- Parameters
start_times – An array/list of new datetimes.
quiet – If True, will not emit a warning for groups having only nan after the start-time.
- Returns
A new TimeSeriesDataset.
- with_new_tensors(*tensors: torch.Tensor) torchcast.utils.data.TimeSeriesDataset
Create a new TimeSeriesDataset with different tensors, but all other attributes the same.
- torchcast.utils.add_season_features(data: DataFrame, K: int, period: Union[numpy.timedelta64, str], time_colname: Optional[str] = None) DataFrame
Add season features to data by taking a date[time] column and passing it through a Fourier transform.
- Parameters
data – A dataframe with a date[time] column.
K – The degrees of freedom for the Fourier transform. Higher K means more flexible seasons can be captured.
period – Either a np.timedelta64, or one of {'weekly', 'yearly', 'daily'}.
time_colname – The name of the date[time] column. Default is to try and guess with the following (in order): ‘datetime’, ‘date’, ‘timestamp’, ‘time’.
- Returns
A copy of the original dataframe, now with K*2 additional columns capturing the seasonal pattern.
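A common way to build K*2 seasonal columns is with K sine/cosine pairs at increasing frequencies. The sketch below assumes that construction; the function fourier_features, the choice of reference epoch, and the column ordering are assumptions for illustration and may differ from what add_season_features does internally.

```python
import math
from datetime import datetime, timedelta


def fourier_features(timestamps, K, period):
    """Return K sine/cosine pairs (2 * K values) per timestamp."""
    epoch = datetime(1970, 1, 1)  # arbitrary reference point (an assumption)
    rows = []
    for ts in timestamps:
        # Position within the seasonal cycle, expressed as an angle.
        phase = 2 * math.pi * ((ts - epoch) / period)
        row = []
        for k in range(1, K + 1):
            row.append(math.sin(k * phase))
            row.append(math.cos(k * phase))
        rows.append(row)
    return rows


week = timedelta(days=7)
stamps = [datetime(2024, 1, 1), datetime(2024, 1, 8), datetime(2024, 1, 3)]
features = fourier_features(stamps, K=2, period=week)
print(len(features[0]))  # 4 columns for K=2
```

Timestamps exactly one period apart (here, the first two) get the same feature values up to floating-point error, which is what lets a model learn a repeating weekly pattern.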
- torchcast.utils.complete_times(data: DataFrame, group_colname: str, time_colname: Optional[str] = None, dt_unit: Optional[str] = None)
Given a dataframe of time series, convert implicit missings within each time series to explicit missings.
- Parameters
data – A pandas dataframe.
group_colname – The column name for the groups.
time_colname – The column name for the times. Will attempt to guess based on common labels.
dt_unit – A numpy.datetime64 or string representing the datetime increments. If not supplied will try to guess based on the smallest difference in the data.
- Returns
A dataframe where implicit missings are converted to explicit missings, but the min/max time for each group is preserved.
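To make the implicit-versus-explicit distinction concrete, the sketch below works on plain tuples instead of a dataframe. An absent row is an implicit missing; after completion, every timestep between a group's first and last time is present, with None marking the gaps. The function complete_times_sketch and the record layout are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta


def complete_times_sketch(records, step):
    """records: iterable of (group, time, value); returns rows with gaps filled by None."""
    by_group = {}
    for group, time, value in records:
        by_group.setdefault(group, {})[time] = value

    completed = []
    for group, observed in by_group.items():
        # Walk from the group's own first time to its own last time.
        current, last = min(observed), max(observed)
        while current <= last:
            completed.append((group, current, observed.get(current)))
            current += step
    return completed


records = [
    ("a", date(2024, 1, 1), 1.0),
    ("a", date(2024, 1, 4), 4.0),  # Jan 2 and Jan 3 are implicitly missing
    ("b", date(2024, 1, 2), 2.0),
]
completed = complete_times_sketch(records, timedelta(days=1))
print(len(completed))  # 5
```

Group "b" is not extended to match group "a": each group keeps its own min/max time, as described in the return value above.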