Multivariate Forecasts: Beijing Multi-Site Air-Quality Data

We’ll demonstrate several features of torchcast using a dataset from the UCI Machine Learning Data Repository. It includes data on air pollutants and weather from 12 sites.

[2]:
df_aq = load_air_quality_data('weekly')

SPLIT_DT = np.datetime64('2016-02-22')

df_aq
loading from gh...
[2]:
week station PM2p5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN WSPM
0 2013-02-25 Aotizhongxin 38.263889 57.791667 36.541667 56.750000 958.236111 37.583333 2.525000 1022.777778 -15.666667 0.0 2.130556
1 2013-02-25 Changping 31.986111 47.152778 31.870773 44.969203 870.355072 45.925725 1.915278 1019.633333 -16.231944 0.0 1.475000
2 2013-02-25 Dingling 28.083333 37.816919 15.955314 34.916667 627.838164 49.222222 1.915278 1019.633333 -16.231944 0.0 1.475000
3 2013-02-25 Dongsi 45.083333 60.680556 31.687198 60.160024 1165.036836 43.635870 2.268056 1024.697222 -16.913889 0.0 1.775000
4 2013-02-25 Guanyuan 38.472222 55.208333 35.467995 62.855676 1075.746981 38.277778 2.525000 1022.777778 -15.666667 0.0 2.130556
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2515 2017-02-27 Nongzhanguan 46.807359 76.195652 13.304348 56.108696 1032.608696 41.152174 9.647917 1016.014583 -9.964583 0.0 1.787500
2516 2017-02-27 Shunyi 47.020833 58.625000 13.750000 61.791667 1033.333333 40.750000 8.289583 1016.695833 -8.525000 0.0 1.827083
2517 2017-02-27 Tiantan 43.662500 70.429167 10.577273 64.052381 1012.954545 37.800000 9.647917 1016.014583 -9.964583 0.0 1.787500
2518 2017-02-27 Wanliu 36.979167 60.750000 12.663949 65.371377 1061.322464 35.964015 8.868750 1014.450000 -9.300000 0.0 1.558333
2519 2017-02-27 Wanshouxigong 39.550595 55.199405 9.835404 57.981366 993.788820 37.146998 9.647917 1016.014583 -9.964583 0.0 1.787500

2520 rows × 13 columns

Univariate Forecasts

Let’s try to build a model to predict total particulate-matter (PM2.5 and PM10).

First, we’ll make our target the sum of these two types. We’ll log-transform since this is strictly positive.

[3]:
from torchcast.process import LocalTrend, Season

# create a dataset:
df_aq['PM'] = df_aq['PM10'] + df_aq['PM2p5']
df_aq['PM_log10'] = np.log10(df_aq['PM'])
dataset_pm_univariate = TimeSeriesDataset.from_dataframe(
    dataframe=df_aq,
    dt_unit='W',
    measure_colnames=['PM_log10'],
    group_colname='station',
    time_colname='week'
)
dataset_pm_univariate_train, _ = dataset_pm_univariate.train_val_split(dt=SPLIT_DT)

# create a model:
kf_pm_univariate = KalmanFilter(
    measures=['PM_log10'],
    processes=[
        LocalTrend(id='trend'),
        Season(id='day_in_year', period=365.25 / 7, dt_unit='W', K=5, fixed=True)
    ]
)

# fit:
kf_pm_univariate.fit(
    dataset_pm_univariate_train.tensors[0],
    start_offsets=dataset_pm_univariate_train.start_datetimes
)
[3]:
KalmanFilter(processes=[LocalTrend(id='trend'), Season(id='day_in_year')], measures=['PM_log10'])

Let’s see how our forecasts look:

[4]:
# helper for transforming log back to original:
def inverse_transform(df):
    df = df.copy()
    # bias-correction for log-transform (see https://otexts.com/fpp2/transformations.html#bias-adjustments)
    df['mean'] = df['mean'] + .5 * (df['upper'] - df['lower']) / 1.96 ** 2
    # inverse the log10:
    df[['actual', 'mean', 'upper', 'lower']] = 10 ** df[['actual', 'mean', 'upper', 'lower']]
    df['measure'] = df['measure'].str.replace('_log10', '')
    return df

# generate forecasts:
forecast = kf_pm_univariate(
        dataset_pm_univariate_train.tensors[0],
        start_offsets=dataset_pm_univariate_train.start_datetimes,
        out_timesteps=dataset_pm_univariate.tensors[0].shape[1]
)

df_forecast = inverse_transform(forecast.to_dataframe(dataset_pm_univariate))
print(forecast.plot(df_forecast, max_num_groups=3, split_dt=SPLIT_DT))
Subsetting to groups: ['Gucheng', 'Tiantan', 'Guanyuan']
../_images/examples_air_quality_6_1.png

Evaluating Performance: Expanding Window

To evaluate our forecasts, we will not use the long-range forecasts above. Instead, we will use an expanding window approach to evaluate a shorter forecast horizon. In this approach, we generate N-step-ahead forecasts at every timepoint:

title

This approach is straightforward in torchcast, using the n_step argument. Here we’ll generate 4-week-ahead predictions. Note that we’re still separating the validation time-period.

[5]:
with torch.no_grad():
    pred_4step = kf_pm_univariate(
        dataset_pm_univariate.tensors[0],
        start_offsets=dataset_pm_univariate.start_datetimes,
        n_step=4
    )

df_univariate_error = pred_4step.\
    to_dataframe(dataset_pm_univariate, group_colname='station', time_colname='week').\
    pipe(inverse_transform).\
    merge(df_aq.loc[:,['station', 'week', 'PM']]).\
    assign(
        error = lambda df: np.abs(df['mean'] - df['actual']),
        validation = lambda df: df['week'] > SPLIT_DT
    ).\
    groupby(['station','validation'])\
    ['error'].mean().\
    reset_index()
df_univariate_error.groupby('validation')['error'].agg(['mean','std'])
[5]:
mean std
validation
False 70.568771 2.194658
True 61.550529 4.278333

Multivariate Forecasts

Can we improve our model by splitting the pollutant we are forecasting into its two types (2.5 and 10), and modeling them in a multivariate manner?

[6]:
# create a dataset:
df_aq['PM10_log10'] = np.log10(df_aq['PM10'])
df_aq['PM2p5_log10'] = np.log10(df_aq['PM2p5'])
dataset_pm_multivariate = TimeSeriesDataset.from_dataframe(
    dataframe=df_aq,
    dt_unit='W',
    measure_colnames=['PM10_log10','PM2p5_log10'],
    group_colname='station',
    time_colname='week'
)
dataset_pm_multivariate_train, _ = dataset_pm_multivariate.train_val_split(dt=SPLIT_DT)

# create a model:
_processes = []
for m in dataset_pm_multivariate.measures[0]:
    _processes.extend([
        LocalTrend(id=f'{m}_trend', measure=m),
        Season(id=f'{m}_day_in_year', period=365.25 / 7, dt_unit='W', K=5, measure=m, fixed=True)
    ])
kf_pm_multivariate = KalmanFilter(measures=dataset_pm_multivariate.measures[0], processes=_processes)

# fit:
kf_pm_multivariate.fit(
    dataset_pm_multivariate_train.tensors[0],
    start_offsets=dataset_pm_multivariate_train.start_datetimes
)
[6]:
KalmanFilter(processes=[LocalTrend(id='PM10_log10_trend'), Season(id='PM10_log10_day_in_year'), LocalTrend(id='PM2p5_log10_trend'), Season(id='PM2p5_log10_day_in_year')], measures=['PM10_log10', 'PM2p5_log10'])

We can generate our 4-step-ahead predictions for validation as we did before:

[7]:
with torch.no_grad():
    pred_4step = kf_pm_multivariate(
        dataset_pm_multivariate.tensors[0],
        start_offsets=dataset_pm_multivariate.start_datetimes,
        n_step=4
    )
pred_4step.means.shape
[7]:
torch.Size([12, 210, 2])

At this point, though, we run into a problem: we we have forecasts for both PM2.5 and PM10, but we ultimately want a forecast for their sum. With untransformed data, we could take advantage of the fact that sum of correlated normals is still normal:

[8]:
torch.sum(pred_4step.means, 2)
[8]:
tensor([[3.9723, 4.0158, 4.0289,  ..., 3.8664, 3.9616, 4.0321],
        [3.9723, 4.0158, 4.0289,  ..., 3.7751, 3.8638, 3.9208],
        [3.9723, 4.0158, 4.0289,  ..., 3.7689, 3.8624, 3.9209],
        ...,
        [3.9723, 4.0158, 4.0289,  ..., 3.8679, 3.9543, 4.0209],
        [3.9723, 4.0158, 4.0289,  ..., 3.8622, 3.9517, 4.0192],
        [3.9723, 4.0158, 4.0289,  ..., 3.9409, 4.0311, 4.1021]])

In our case this unfortunately won’t work: we have log-transformed our measures. This seems like it was the right choice (i.e. our residuals look reasonably normal and i.i.d):

[9]:
pred_4step.plot(pred_4step.to_dataframe(dataset_pm_multivariate, type='components').query("process=='residuals'"))
Subsetting to groups: ['Shunyi']
/home/docs/checkouts/readthedocs.org/user_builds/torchcast/envs/latest/lib/python3.7/site-packages/plotnine/facets/facet.py:393: PlotnineWarning: If you need more space for the x-axis tick text use ... + theme(subplots_adjust={'wspace': 0.25}). Choose an appropriate value for 'wspace'.
../_images/examples_air_quality_16_2.png
[9]:
<ggplot: (8767450191265)>

In this case, we can’t take the sum of our forecasts to get the forecast of the sum, and there’s no simple closed-form expression for the sum of lognormals.

One option that is fairly easy in torchcast is to use a Monte-Carlo approach: we’ll just generate random-samples based on the means and covariances underlying our forecast. In that case, the sum of the PM2.5 + PM10 forecasted-samples is the forecasted PM sum we are looking for:

[10]:
# generate draws from the forecast distribution:
mc_draws = 10 ** torch.distributions.MultivariateNormal(*pred_4step).rsample((500,))
# sum across 2.5 and 10, then mean across draws:
mc_predictions = mc_draws.sum(-1, keepdim=True).mean(0)

# convert to a dataframe and summarize error:
_df_pred = TimeSeriesDataset.tensor_to_dataframe(
    mc_predictions,
    times=dataset_pm_multivariate.times(),
    group_names=dataset_pm_multivariate.group_names,
    group_colname='station',
    time_colname='week',
    measures=['predicted']
)
df_multivariate_error = _df_pred.\
    merge(df_aq.loc[:,['station', 'week', 'PM']]).\
    assign(
        error = lambda df: np.abs(df['predicted'] - df['PM']),
        validation = lambda df: df['week'] > SPLIT_DT
    ).\
    groupby(['station','validation'])\
    ['error'].mean().\
    reset_index()
df_multivariate_error.groupby('validation')['error'].agg(['mean','std'])
[10]:
mean std
validation
False 63.06858 2.392846
True 60.35954 5.037340

We see that this approach has reduced our error: substantially in the training period, and moderately in the validation period. We can look at the per-site differences to reduce common sources of noise and see that the reduction is consistent (it holds for all but one site):

[11]:
df_multivariate_error.\
    merge(df_univariate_error, on=['station', 'validation']).\
    assign(error_diff = lambda df: df['error_x'] - df['error_y']).\
    boxplot('error_diff', by='validation')
[11]:
<AxesSubplot:title={'center':'error_diff'}, xlabel='validation'>
../_images/examples_air_quality_20_1.png
[ ]: