Multivariate Forecasts: Beijing Multi-Site Air-Quality Data

We’ll demonstrate several features of torchcast using a dataset from the UCI Machine Learning Data Repository. It includes data on air pollutants and weather from 12 sites.

[2]:

df_aq = load_air_quality_data('weekly')

SPLIT_DT = np.datetime64('2016-02-22')

df_aq

loading from gh...

[2]:

	week	station	PM2p5	PM10	SO2	NO2	CO	O3	TEMP	PRES	DEWP	RAIN	WSPM
0	2013-02-25	Aotizhongxin	38.263889	57.791667	36.541667	56.750000	958.236111	37.583333	2.525000	1022.777778	-15.666667	0.0	2.130556
1	2013-02-25	Changping	31.986111	47.152778	31.870773	44.969203	870.355072	45.925725	1.915278	1019.633333	-16.231944	0.0	1.475000
2	2013-02-25	Dingling	28.083333	37.816919	15.955314	34.916667	627.838164	49.222222	1.915278	1019.633333	-16.231944	0.0	1.475000
3	2013-02-25	Dongsi	45.083333	60.680556	31.687198	60.160024	1165.036836	43.635870	2.268056	1024.697222	-16.913889	0.0	1.775000
4	2013-02-25	Guanyuan	38.472222	55.208333	35.467995	62.855676	1075.746981	38.277778	2.525000	1022.777778	-15.666667	0.0	2.130556
...	...	...	...	...	...	...	...	...	...	...	...	...	...
2515	2017-02-27	Nongzhanguan	46.807359	76.195652	13.304348	56.108696	1032.608696	41.152174	9.647917	1016.014583	-9.964583	0.0	1.787500
2516	2017-02-27	Shunyi	47.020833	58.625000	13.750000	61.791667	1033.333333	40.750000	8.289583	1016.695833	-8.525000	0.0	1.827083
2517	2017-02-27	Tiantan	43.662500	70.429167	10.577273	64.052381	1012.954545	37.800000	9.647917	1016.014583	-9.964583	0.0	1.787500
2518	2017-02-27	Wanliu	36.979167	60.750000	12.663949	65.371377	1061.322464	35.964015	8.868750	1014.450000	-9.300000	0.0	1.558333
2519	2017-02-27	Wanshouxigong	39.550595	55.199405	9.835404	57.981366	993.788820	37.146998	9.647917	1016.014583	-9.964583	0.0	1.787500

2520 rows × 13 columns

Univariate Forecasts

Let’s try to build a model to predict total particulate-matter (PM2.5 and PM10).

First, we’ll make our target the sum of these two types. We’ll log-transform since this is strictly positive.

[3]:

from torchcast.process import LocalTrend, Season

# create a dataset:
df_aq['PM'] = df_aq['PM10'] + df_aq['PM2p5']
df_aq['PM_log10'] = np.log10(df_aq['PM'])
dataset_pm_univariate = TimeSeriesDataset.from_dataframe(
    dataframe=df_aq,
    dt_unit='W',
    measure_colnames=['PM_log10'],
    group_colname='station',
    time_colname='week'
)
dataset_pm_univariate_train, _ = dataset_pm_univariate.train_val_split(dt=SPLIT_DT)

# create a model:
kf_pm_univariate = KalmanFilter(
    measures=['PM_log10'],
    processes=[
        LocalTrend(id='trend'),
        Season(id='day_in_year', period=365.25 / 7, dt_unit='W', K=5, fixed=True)
    ]
)

# fit:
kf_pm_univariate.fit(
    dataset_pm_univariate_train.tensors[0],
    start_offsets=dataset_pm_univariate_train.start_datetimes
)

[3]:

KalmanFilter(processes=[LocalTrend(id='trend'), Season(id='day_in_year')], measures=['PM_log10'])

Let’s see how our forecasts look:

[4]:

# helper for transforming log back to original:
def inverse_transform(df):
    df = df.copy()
    # bias-correction for log-transform (see https://otexts.com/fpp2/transformations.html#bias-adjustments)
    df['mean'] = df['mean'] + .5 * (df['upper'] - df['lower']) / 1.96 ** 2
    # inverse the log10:
    df[['actual', 'mean', 'upper', 'lower']] = 10 ** df[['actual', 'mean', 'upper', 'lower']]
    df['measure'] = df['measure'].str.replace('_log10', '')
    return df

# generate forecasts:
forecast = kf_pm_univariate(
        dataset_pm_univariate_train.tensors[0],
        start_offsets=dataset_pm_univariate_train.start_datetimes,
        out_timesteps=dataset_pm_univariate.tensors[0].shape[1]
)

df_forecast = inverse_transform(forecast.to_dataframe(dataset_pm_univariate))
print(forecast.plot(df_forecast, max_num_groups=3, split_dt=SPLIT_DT))

Subsetting to groups: ['Gucheng', 'Tiantan', 'Guanyuan']

Evaluating Performance: Expanding Window

To evaluate our forecasts, we will not use the long-range forecasts above. Instead, we will use an expanding window approach to evaluate a shorter forecast horizon. In this approach, we generate N-step-ahead forecasts at every timepoint:

title

This approach is straightforward in torchcast, using the n_step argument. Here we’ll generate 4-week-ahead predictions. Note that we’re still separating the validation time-period.

[5]:

with torch.no_grad():
    pred_4step = kf_pm_univariate(
        dataset_pm_univariate.tensors[0],
        start_offsets=dataset_pm_univariate.start_datetimes,
        n_step=4
    )

df_univariate_error = pred_4step.\
    to_dataframe(dataset_pm_univariate, group_colname='station', time_colname='week').\
    pipe(inverse_transform).\
    merge(df_aq.loc[:,['station', 'week', 'PM']]).\
    assign(
        error = lambda df: np.abs(df['mean'] - df['actual']),
        validation = lambda df: df['week'] > SPLIT_DT
    ).\
    groupby(['station','validation'])\
    ['error'].mean().\
    reset_index()
df_univariate_error.groupby('validation')['error'].agg(['mean','std'])

[5]:

	mean	std
validation
False	70.568771	2.194658
True	61.550529	4.278333

Multivariate Forecasts

Can we improve our model by splitting the pollutant we are forecasting into its two types (2.5 and 10), and modeling them in a multivariate manner?

[6]:

# create a dataset:
df_aq['PM10_log10'] = np.log10(df_aq['PM10'])
df_aq['PM2p5_log10'] = np.log10(df_aq['PM2p5'])
dataset_pm_multivariate = TimeSeriesDataset.from_dataframe(
    dataframe=df_aq,
    dt_unit='W',
    measure_colnames=['PM10_log10','PM2p5_log10'],
    group_colname='station',
    time_colname='week'
)
dataset_pm_multivariate_train, _ = dataset_pm_multivariate.train_val_split(dt=SPLIT_DT)

# create a model:
_processes = []
for m in dataset_pm_multivariate.measures[0]:
    _processes.extend([
        LocalTrend(id=f'{m}_trend', measure=m),
        Season(id=f'{m}_day_in_year', period=365.25 / 7, dt_unit='W', K=5, measure=m, fixed=True)
    ])
kf_pm_multivariate = KalmanFilter(measures=dataset_pm_multivariate.measures[0], processes=_processes)

# fit:
kf_pm_multivariate.fit(
    dataset_pm_multivariate_train.tensors[0],
    start_offsets=dataset_pm_multivariate_train.start_datetimes
)

[6]:

KalmanFilter(processes=[LocalTrend(id='PM10_log10_trend'), Season(id='PM10_log10_day_in_year'), LocalTrend(id='PM2p5_log10_trend'), Season(id='PM2p5_log10_day_in_year')], measures=['PM10_log10', 'PM2p5_log10'])

We can generate our 4-step-ahead predictions for validation as we did before:

[7]:

with torch.no_grad():
    pred_4step = kf_pm_multivariate(
        dataset_pm_multivariate.tensors[0],
        start_offsets=dataset_pm_multivariate.start_datetimes,
        n_step=4
    )
pred_4step.means.shape

[7]:

torch.Size([12, 210, 2])

At this point, though, we run into a problem: we we have forecasts for both PM2.5 and PM10, but we ultimately want a forecast for their sum. With untransformed data, we could take advantage of the fact that sum of correlated normals is still normal:

[8]:

torch.sum(pred_4step.means, 2)

[8]:

tensor([[3.9723, 4.0158, 4.0289,  ..., 3.8664, 3.9616, 4.0321],
        [3.9723, 4.0158, 4.0289,  ..., 3.7751, 3.8638, 3.9208],
        [3.9723, 4.0158, 4.0289,  ..., 3.7689, 3.8624, 3.9209],
        ...,
        [3.9723, 4.0158, 4.0289,  ..., 3.8679, 3.9543, 4.0209],
        [3.9723, 4.0158, 4.0289,  ..., 3.8622, 3.9517, 4.0192],
        [3.9723, 4.0158, 4.0289,  ..., 3.9409, 4.0311, 4.1021]])

In our case this unfortunately won’t work: we have log-transformed our measures. This seems like it was the right choice (i.e. our residuals look reasonably normal and i.i.d):

[9]:

pred_4step.plot(pred_4step.to_dataframe(dataset_pm_multivariate, type='components').query("process=='residuals'"))

Subsetting to groups: ['Shunyi']

/home/docs/checkouts/readthedocs.org/user_builds/torchcast/envs/latest/lib/python3.7/site-packages/plotnine/facets/facet.py:393: PlotnineWarning: If you need more space for the x-axis tick text use ... + theme(subplots_adjust={'wspace': 0.25}). Choose an appropriate value for 'wspace'.

../_images/examples_air_quality_16_2.png

[9]:

<ggplot: (8767450191265)>

In this case, we can’t take the sum of our forecasts to get the forecast of the sum, and there’s no simple closed-form expression for the sum of lognormals.

One option that is fairly easy in torchcast is to use a Monte-Carlo approach: we’ll just generate random-samples based on the means and covariances underlying our forecast. In that case, the sum of the PM2.5 + PM10 forecasted-samples is the forecasted PM sum we are looking for:

[10]:

# generate draws from the forecast distribution:
mc_draws = 10 ** torch.distributions.MultivariateNormal(*pred_4step).rsample((500,))
# sum across 2.5 and 10, then mean across draws:
mc_predictions = mc_draws.sum(-1, keepdim=True).mean(0)

# convert to a dataframe and summarize error:
_df_pred = TimeSeriesDataset.tensor_to_dataframe(
    mc_predictions,
    times=dataset_pm_multivariate.times(),
    group_names=dataset_pm_multivariate.group_names,
    group_colname='station',
    time_colname='week',
    measures=['predicted']
)
df_multivariate_error = _df_pred.\
    merge(df_aq.loc[:,['station', 'week', 'PM']]).\
    assign(
        error = lambda df: np.abs(df['predicted'] - df['PM']),
        validation = lambda df: df['week'] > SPLIT_DT
    ).\
    groupby(['station','validation'])\
    ['error'].mean().\
    reset_index()
df_multivariate_error.groupby('validation')['error'].agg(['mean','std'])

[10]:

	mean	std
validation
False	63.06858	2.392846
True	60.35954	5.037340

We see that this approach has reduced our error: substantially in the training period, and moderately in the validation period. We can look at the per-site differences to reduce common sources of noise and see that the reduction is consistent (it holds for all but one site):

[11]:

df_multivariate_error.\
    merge(df_univariate_error, on=['station', 'validation']).\
    assign(error_diff = lambda df: df['error_x'] - df['error_y']).\
    boxplot('error_diff', by='validation')

[11]:

<AxesSubplot:title={'center':'error_diff'}, xlabel='validation'>

../_images/examples_air_quality_20_1.png

[ ]: