Multivariate Forecasts: Beijing Multi-Site Air-Quality Data
We’ll demonstrate several features of torchcast using a dataset from the UCI Machine Learning Data Repository. It includes data on air pollutants and weather from 12 sites.
[2]:
df_aq = load_air_quality_data('weekly')
SPLIT_DT = np.datetime64('2016-02-22')
df_aq
loading from gh...
[2]:
| week | station | PM2p5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | WSPM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-02-25 | Aotizhongxin | 38.263889 | 57.791667 | 36.541667 | 56.750000 | 958.236111 | 37.583333 | 2.525000 | 1022.777778 | -15.666667 | 0.0 | 2.130556 |
| 1 | 2013-02-25 | Changping | 31.986111 | 47.152778 | 31.870773 | 44.969203 | 870.355072 | 45.925725 | 1.915278 | 1019.633333 | -16.231944 | 0.0 | 1.475000 |
| 2 | 2013-02-25 | Dingling | 28.083333 | 37.816919 | 15.955314 | 34.916667 | 627.838164 | 49.222222 | 1.915278 | 1019.633333 | -16.231944 | 0.0 | 1.475000 |
| 3 | 2013-02-25 | Dongsi | 45.083333 | 60.680556 | 31.687198 | 60.160024 | 1165.036836 | 43.635870 | 2.268056 | 1024.697222 | -16.913889 | 0.0 | 1.775000 |
| 4 | 2013-02-25 | Guanyuan | 38.472222 | 55.208333 | 35.467995 | 62.855676 | 1075.746981 | 38.277778 | 2.525000 | 1022.777778 | -15.666667 | 0.0 | 2.130556 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2515 | 2017-02-27 | Nongzhanguan | 46.807359 | 76.195652 | 13.304348 | 56.108696 | 1032.608696 | 41.152174 | 9.647917 | 1016.014583 | -9.964583 | 0.0 | 1.787500 |
| 2516 | 2017-02-27 | Shunyi | 47.020833 | 58.625000 | 13.750000 | 61.791667 | 1033.333333 | 40.750000 | 8.289583 | 1016.695833 | -8.525000 | 0.0 | 1.827083 |
| 2517 | 2017-02-27 | Tiantan | 43.662500 | 70.429167 | 10.577273 | 64.052381 | 1012.954545 | 37.800000 | 9.647917 | 1016.014583 | -9.964583 | 0.0 | 1.787500 |
| 2518 | 2017-02-27 | Wanliu | 36.979167 | 60.750000 | 12.663949 | 65.371377 | 1061.322464 | 35.964015 | 8.868750 | 1014.450000 | -9.300000 | 0.0 | 1.558333 |
| 2519 | 2017-02-27 | Wanshouxigong | 39.550595 | 55.199405 | 9.835404 | 57.981366 | 993.788820 | 37.146998 | 9.647917 | 1016.014583 | -9.964583 | 0.0 | 1.787500 |
2520 rows × 13 columns
Univariate Forecasts
Let’s try to build a model to predict total particulate-matter (PM2.5 and PM10).
First, we’ll make our target the sum of these two types. We’ll log-transform since this is strictly positive.
[3]:
from torchcast.process import LocalTrend, Season
# create a dataset:
df_aq['PM'] = df_aq['PM10'] + df_aq['PM2p5']
df_aq['PM_log10'] = np.log10(df_aq['PM'])
dataset_pm_univariate = TimeSeriesDataset.from_dataframe(
dataframe=df_aq,
dt_unit='W',
measure_colnames=['PM_log10'],
group_colname='station',
time_colname='week'
)
dataset_pm_univariate_train, _ = dataset_pm_univariate.train_val_split(dt=SPLIT_DT)
# create a model:
kf_pm_univariate = KalmanFilter(
measures=['PM_log10'],
processes=[
LocalTrend(id='trend'),
Season(id='day_in_year', period=365.25 / 7, dt_unit='W', K=5, fixed=True)
]
)
# fit:
kf_pm_univariate.fit(
dataset_pm_univariate_train.tensors[0],
start_offsets=dataset_pm_univariate_train.start_datetimes
)
[3]:
KalmanFilter(processes=[LocalTrend(id='trend'), Season(id='day_in_year')], measures=['PM_log10'])
Let’s see how our forecasts look:
[4]:
# helper for transforming log back to original:
def inverse_transform(df):
df = df.copy()
# bias-correction for log-transform (see https://otexts.com/fpp2/transformations.html#bias-adjustments)
df['mean'] = df['mean'] + .5 * (df['upper'] - df['lower']) / 1.96 ** 2
# inverse the log10:
df[['actual', 'mean', 'upper', 'lower']] = 10 ** df[['actual', 'mean', 'upper', 'lower']]
df['measure'] = df['measure'].str.replace('_log10', '')
return df
# generate forecasts:
forecast = kf_pm_univariate(
dataset_pm_univariate_train.tensors[0],
start_offsets=dataset_pm_univariate_train.start_datetimes,
out_timesteps=dataset_pm_univariate.tensors[0].shape[1]
)
df_forecast = inverse_transform(forecast.to_dataframe(dataset_pm_univariate))
print(forecast.plot(df_forecast, max_num_groups=3, split_dt=SPLIT_DT))
Subsetting to groups: ['Gucheng', 'Tiantan', 'Guanyuan']
Evaluating Performance: Expanding Window
To evaluate our forecasts, we will not use the long-range forecasts above. Instead, we will use an expanding window approach to evaluate a shorter forecast horizon. In this approach, we generate N-step-ahead forecasts at every timepoint:

This approach is straightforward in torchcast, using the n_step argument. Here we’ll generate 4-week-ahead predictions. Note that we’re still separating the validation time-period.
[5]:
with torch.no_grad():
pred_4step = kf_pm_univariate(
dataset_pm_univariate.tensors[0],
start_offsets=dataset_pm_univariate.start_datetimes,
n_step=4
)
df_univariate_error = pred_4step.\
to_dataframe(dataset_pm_univariate, group_colname='station', time_colname='week').\
pipe(inverse_transform).\
merge(df_aq.loc[:,['station', 'week', 'PM']]).\
assign(
error = lambda df: np.abs(df['mean'] - df['actual']),
validation = lambda df: df['week'] > SPLIT_DT
).\
groupby(['station','validation'])\
['error'].mean().\
reset_index()
df_univariate_error.groupby('validation')['error'].agg(['mean','std'])
[5]:
| mean | std | |
|---|---|---|
| validation | ||
| False | 70.568771 | 2.194658 |
| True | 61.550529 | 4.278333 |
Multivariate Forecasts
Can we improve our model by splitting the pollutant we are forecasting into its two types (2.5 and 10), and modeling them in a multivariate manner?
[6]:
# create a dataset:
df_aq['PM10_log10'] = np.log10(df_aq['PM10'])
df_aq['PM2p5_log10'] = np.log10(df_aq['PM2p5'])
dataset_pm_multivariate = TimeSeriesDataset.from_dataframe(
dataframe=df_aq,
dt_unit='W',
measure_colnames=['PM10_log10','PM2p5_log10'],
group_colname='station',
time_colname='week'
)
dataset_pm_multivariate_train, _ = dataset_pm_multivariate.train_val_split(dt=SPLIT_DT)
# create a model:
_processes = []
for m in dataset_pm_multivariate.measures[0]:
_processes.extend([
LocalTrend(id=f'{m}_trend', measure=m),
Season(id=f'{m}_day_in_year', period=365.25 / 7, dt_unit='W', K=5, measure=m, fixed=True)
])
kf_pm_multivariate = KalmanFilter(measures=dataset_pm_multivariate.measures[0], processes=_processes)
# fit:
kf_pm_multivariate.fit(
dataset_pm_multivariate_train.tensors[0],
start_offsets=dataset_pm_multivariate_train.start_datetimes
)
[6]:
KalmanFilter(processes=[LocalTrend(id='PM10_log10_trend'), Season(id='PM10_log10_day_in_year'), LocalTrend(id='PM2p5_log10_trend'), Season(id='PM2p5_log10_day_in_year')], measures=['PM10_log10', 'PM2p5_log10'])
We can generate our 4-step-ahead predictions for validation as we did before:
[7]:
with torch.no_grad():
pred_4step = kf_pm_multivariate(
dataset_pm_multivariate.tensors[0],
start_offsets=dataset_pm_multivariate.start_datetimes,
n_step=4
)
pred_4step.means.shape
[7]:
torch.Size([12, 210, 2])
At this point, though, we run into a problem: we we have forecasts for both PM2.5 and PM10, but we ultimately want a forecast for their sum. With untransformed data, we could take advantage of the fact that sum of correlated normals is still normal:
[8]:
torch.sum(pred_4step.means, 2)
[8]:
tensor([[3.9723, 4.0158, 4.0289, ..., 3.8664, 3.9616, 4.0321],
[3.9723, 4.0158, 4.0289, ..., 3.7751, 3.8638, 3.9208],
[3.9723, 4.0158, 4.0289, ..., 3.7689, 3.8624, 3.9209],
...,
[3.9723, 4.0158, 4.0289, ..., 3.8679, 3.9543, 4.0209],
[3.9723, 4.0158, 4.0289, ..., 3.8622, 3.9517, 4.0192],
[3.9723, 4.0158, 4.0289, ..., 3.9409, 4.0311, 4.1021]])
In our case this unfortunately won’t work: we have log-transformed our measures. This seems like it was the right choice (i.e. our residuals look reasonably normal and i.i.d):
[9]:
pred_4step.plot(pred_4step.to_dataframe(dataset_pm_multivariate, type='components').query("process=='residuals'"))
Subsetting to groups: ['Shunyi']
/home/docs/checkouts/readthedocs.org/user_builds/torchcast/envs/latest/lib/python3.7/site-packages/plotnine/facets/facet.py:393: PlotnineWarning: If you need more space for the x-axis tick text use ... + theme(subplots_adjust={'wspace': 0.25}). Choose an appropriate value for 'wspace'.
[9]:
<ggplot: (8767450191265)>
In this case, we can’t take the sum of our forecasts to get the forecast of the sum, and there’s no simple closed-form expression for the sum of lognormals.
One option that is fairly easy in torchcast is to use a Monte-Carlo approach: we’ll just generate random-samples based on the means and covariances underlying our forecast. In that case, the sum of the PM2.5 + PM10 forecasted-samples is the forecasted PM sum we are looking for:
[10]:
# generate draws from the forecast distribution:
mc_draws = 10 ** torch.distributions.MultivariateNormal(*pred_4step).rsample((500,))
# sum across 2.5 and 10, then mean across draws:
mc_predictions = mc_draws.sum(-1, keepdim=True).mean(0)
# convert to a dataframe and summarize error:
_df_pred = TimeSeriesDataset.tensor_to_dataframe(
mc_predictions,
times=dataset_pm_multivariate.times(),
group_names=dataset_pm_multivariate.group_names,
group_colname='station',
time_colname='week',
measures=['predicted']
)
df_multivariate_error = _df_pred.\
merge(df_aq.loc[:,['station', 'week', 'PM']]).\
assign(
error = lambda df: np.abs(df['predicted'] - df['PM']),
validation = lambda df: df['week'] > SPLIT_DT
).\
groupby(['station','validation'])\
['error'].mean().\
reset_index()
df_multivariate_error.groupby('validation')['error'].agg(['mean','std'])
[10]:
| mean | std | |
|---|---|---|
| validation | ||
| False | 63.06858 | 2.392846 |
| True | 60.35954 | 5.037340 |
We see that this approach has reduced our error: substantially in the training period, and moderately in the validation period. We can look at the per-site differences to reduce common sources of noise and see that the reduction is consistent (it holds for all but one site):
[11]:
df_multivariate_error.\
merge(df_univariate_error, on=['station', 'validation']).\
assign(error_diff = lambda df: df['error_x'] - df['error_y']).\
boxplot('error_diff', by='validation')
[11]:
<AxesSubplot:title={'center':'error_diff'}, xlabel='validation'>
[ ]: