Tidy time series analysis and forecasting

Accuracy evaluation

10th June 2024 @ UseR! 2024

Mitchell O’Hara-Wild, Nectric

Evaluating models

Inspecting model errors

Accurate models have small errors.

Good models capture all patterns in the data.

Model errors contain patterns not captured by the model!

library(fpp3)
# Total Victorian tourism trips per quarter
vic_tourism <- tourism |> 
  filter(State == "Victoria") |> 
  summarise(Trips = sum(Trips))
# Estimate ETS and ARIMA models
fit <- vic_tourism |> 
  model(
    ETS(Trips), 
    ARIMA(log(Trips))
  )
augment(fit)

Evaluating models

# A tsibble: 160 x 6 [1Q]
# Key:       .model [2]
   .model     Quarter Trips .fitted .resid   .innov
   <chr>        <qtr> <dbl>   <dbl>  <dbl>    <dbl>
 1 ETS(Trips) 1998 Q1 6010.   5457.  554.   0.101  
 2 ETS(Trips) 1998 Q2 4795.   4824.  -28.6 -0.00594
 3 ETS(Trips) 1998 Q3 4317.   4370.  -52.8 -0.0121 
 4 ETS(Trips) 1998 Q4 4675.   4841. -167.  -0.0344 
 5 ETS(Trips) 1999 Q1 5304.   5599. -295.  -0.0526 
 6 ETS(Trips) 1999 Q2 4562.   4545.   16.8  0.00369
 7 ETS(Trips) 1999 Q3 3784.   4139. -356.  -0.0859 
 8 ETS(Trips) 1999 Q4 4201.   4395. -194.  -0.0441 
 9 ETS(Trips) 2000 Q1 5567.   5055.  512.   0.101  
10 ETS(Trips) 2000 Q2 4502.   4468.   33.8  0.00757
# ℹ 150 more rows

Fitted values and residuals

  • \(\hat{y}_t\): 1-step in-sample fits (forecasts) are in .fitted
  • \(e_t\): response (actual) errors are in .resid
  • \(\varepsilon_t\): innovation (model) errors are in .innov

Plotting in-sample fits

vic_tourism |> 
  autoplot(Trips) + 
  autolayer(augment(fit), .fitted)

Fitted values and response residuals

The response residuals are the difference between actual and predicted values: \(e_t = y_t - \hat{y}_t\).
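As a quick check, we can recompute these directly from the augmented data (a minimal sketch using the fit from above):

# Recompute the response residuals and confirm they match .resid
augment(fit) |> 
  mutate(e = Trips - .fitted) |> 
  filter(abs(e - .resid) > 1e-8) # should return zero rows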

Summarising model accuracy

Response residuals are often used to calculate accuracy ‘measures’.

Common accuracy measures can be calculated with accuracy().

accuracy(fit)
# A tibble: 2 × 10
  .model            .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
  <chr>             <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 ETS(Trips)        Training  45.0  247.  194. 0.611  3.98 0.712 0.697  0.0259
2 ARIMA(log(Trips)) Training  57.8  243.  193. 1.02   3.99 0.705 0.685 -0.0253

Summarising model accuracy

These point forecast accuracy measures are:

  • ME: Mean error (indicates bias)

  • RMSE: Root mean squared error

    (measures accuracy of the forecast mean)

  • MAE: Mean absolute error

    (measures accuracy of the forecast median)

  • MPE/MAPE: Percentage errors (problematic; instead use…)

  • MASE: Mean absolute scaled error

    (scale-free accuracy of the forecast median)
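For reference, writing \(e_t = y_t - \hat{y}_t\) for the response residuals, these are defined as:

\[
\begin{aligned}
\text{ME} &= \text{mean}(e_t), \qquad
\text{RMSE} = \sqrt{\text{mean}(e_t^2)}, \qquad
\text{MAE} = \text{mean}(|e_t|), \\
\text{MPE} &= \text{mean}(100 e_t / y_t), \qquad
\text{MAPE} = \text{mean}(100 |e_t / y_t|), \qquad
\text{MASE} = \text{mean}(|e_t|) / Q,
\end{aligned}
\]

where \(Q = \text{mean}(|y_t - y_{t-m}|)\) is the in-sample MAE of seasonal naïve forecasts with seasonal period \(m\).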

Summarising model accuracy

Exercise 10

Evaluate ETS and ARIMA forecast accuracy for total Takings of Australian tourist accommodation (aus_accommodation_total).
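One possible approach (a sketch, assuming aus_accommodation_total from the workshop materials, with a Takings column):

aus_accommodation_total |> 
  model(
    ETS(Takings), 
    ARIMA(log(Takings))
  ) |> 
  # In-sample (training) accuracy of each model
  accuracy()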

Forecasting accuracy

Evaluating out-of-sample forecasts gives more realistic forecast accuracy results.

For this, we create a training dataset by withholding some data for evaluating accuracy.

Then, we produce forecasts from the model covering the withheld ‘test’ period.

fc <- vic_tourism |> 
  # Keep some data for evaluating forecasts
  filter(Quarter < yearquarter("2016 Q1")) |> 
  model(
    ETS(Trips), 
    ARIMA(log(Trips))
  ) |> 
  forecast(h = "2 years")

Forecasting accuracy

Then we can again use accuracy() with our forecasts:

accuracy(fc, vic_tourism)
# A tibble: 2 × 10
  .model            .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>             <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA(log(Trips)) Test   604.  756.  626.  9.47  9.88  2.53  2.31 0.438
2 ETS(Trips)        Test   571.  725.  596.  8.94  9.43  2.41  2.22 0.432

Include the data

Unlike fitted models, forecasts don’t know what the actual values are.

You need to pass in the full dataset to accuracy().

i.e. accuracy(<forecasts>, <full_data>).

Forecasting accuracy

Exercise 11

Evaluate ETS and ARIMA forecast accuracy for total Takings of Australian tourist accommodation (aus_accommodation_total).

  1. Withhold 4 years of data for forecast evaluation,
  2. Estimate ETS and ARIMA models on the filtered data,
  3. Produce forecasts for the 4 years of withheld data,
  4. Evaluate forecast accuracy using accuracy().

Which model is more accurate for forecasting?

Does this differ from the in-sample model accuracy?
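A sketch of one possible solution (assuming aus_accommodation_total is a quarterly tsibble indexed by Quarter):

fc <- aus_accommodation_total |> 
  # 1. Withhold the last 4 years (16 quarters) for evaluation
  filter(Quarter <= max(Quarter) - 16) |> 
  # 2. Estimate both models on the filtered data
  model(
    ETS(Takings), 
    ARIMA(log(Takings))
  ) |> 
  # 3. Forecast the withheld period
  forecast(h = "4 years")
# 4. Evaluate against the full dataset
accuracy(fc, aus_accommodation_total)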

Forecasting accuracy

Small sample problems!

Test sets in time series can be problematic.

Here we’re judging the best model based on just the most recent 2 years of data.

Cross-validation accuracy

To overcome this, we can use time-series cross-validation.

This creates many training sets, each producing forecasts from a different starting point.

Cross-validation accuracy

The stretch_tsibble() function can be added after filter() to create time-series cross-validation folds of data.

fc_cv <- vic_tourism |> 
  # Keep some data for evaluating forecasts
  filter(Quarter < yearquarter("2016 Q1")) |> 
  # Cross-validate the remaining data: the first fold uses 10 years
  # (4*10 quarters) of training data, each later fold adds 2 years
  stretch_tsibble(.step = 4*2, .init = 4*10) |> 
  model(
    ETS(Trips), 
    ARIMA(log(Trips))
  ) |> 
  forecast(h = "2 years")
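Each fold is identified by an .id column, which stretch_tsibble() adds as a key and which carries through to the forecasts. A quick way to inspect the folds (a sketch):

# Count forecasts per fold and model
fc_cv |> 
  as_tibble() |> 
  count(.id, .model)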

Cross-validation accuracy

Once more, we use accuracy() with our forecasts:

accuracy(fc_cv, vic_tourism)
# A tibble: 2 × 10
  .model            .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>             <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA(log(Trips)) Test   219.  471.  370.  3.62  6.98  1.49  1.44 0.614
2 ETS(Trips)        Test   191.  456.  351.  3.06  6.66  1.42  1.39 0.589

Exercise 12

Calculate cross-validated forecast accuracy for total Takings of Australian tourist accommodation (aus_accommodation_total).

How do these results differ from the forecast accuracy calculated earlier?
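A sketch of one possible approach (fold sizes are illustrative; reusing the 4-year test horizon from Exercise 11):

aus_accommodation_total |> 
  filter(Quarter <= max(Quarter) - 16) |> 
  # Illustrative folds: 10 years initially, growing by 1 year each time
  stretch_tsibble(.step = 4, .init = 4*10) |> 
  model(
    ETS(Takings), 
    ARIMA(log(Takings))
  ) |> 
  forecast(h = "4 years") |> 
  accuracy(aus_accommodation_total)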

Residual diagnostics

augment(fit)
# A tsibble: 160 x 6 [1Q]
# Key:       .model [2]
   .model     Quarter Trips .fitted .resid   .innov
   <chr>        <qtr> <dbl>   <dbl>  <dbl>    <dbl>
 1 ETS(Trips) 1998 Q1 6010.   5457.  554.   0.101  
 2 ETS(Trips) 1998 Q2 4795.   4824.  -28.6 -0.00594
 3 ETS(Trips) 1998 Q3 4317.   4370.  -52.8 -0.0121 
 4 ETS(Trips) 1998 Q4 4675.   4841. -167.  -0.0344 
 5 ETS(Trips) 1999 Q1 5304.   5599. -295.  -0.0526 
 6 ETS(Trips) 1999 Q2 4562.   4545.   16.8  0.00369
 7 ETS(Trips) 1999 Q3 3784.   4139. -356.  -0.0859 
 8 ETS(Trips) 1999 Q4 4201.   4395. -194.  -0.0441 
 9 ETS(Trips) 2000 Q1 5567.   5055.  512.   0.101  
10 ETS(Trips) 2000 Q2 4502.   4468.   33.8  0.00757
# ℹ 150 more rows

Recall the ‘innovation’ (model) residuals from augment(), .innov.

Residual diagnostics

Innovation residuals

Any patterns not captured by the model remain in the innovation residuals.

They aren’t so useful for summarising accuracy since they may be on a transformed scale.

With .innov, we can use visualisation to look for patterns. Ideally we don’t find any because this means the model captured everything.

Residual diagnostics

Consider a regression model without seasonality.

fit <- vic_tourism |> 
  model(TSLM(Trips ~ trend()))
augment(fit) |> 
  gg_season(.innov)

Residual diagnostics

Pattern found: seasonality detected

Here we can see that the seasonality wasn’t captured by this model, and can still be found in the residuals. Adding a seasonal term should capture it:

fit <- vic_tourism |> 
  model(TSLM(Trips ~ trend() + season()))
augment(fit) |> 
  gg_season(.innov)

Residual diagnostics

Checking assumptions

It is good practice to check model assumptions.

All model’s we’ve seen today assume \(\varepsilon_t \overset{\mathrm{iid}}{\sim} N(0, \sigma^2)\).

To check this, we can use gg_tsresiduals().

Exercise 13

Follow along

Let’s check the model assumptions for our ETS and ARIMA models on total Australian accommodation Takings.

fit <- aus_accommodation_total |> 
  model(
    ETS(Takings),
    ARIMA(log(Takings))
  )

Residual diagnostics

fit |> 
  select(`ETS(Takings)`) |> 
  gg_tsresiduals()

Residual diagnostics

fit |> 
  select(`ARIMA(log(Takings))`) |> 
  gg_tsresiduals()
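Beyond the visual check, we can test for remaining autocorrelation with a Ljung–Box test (a sketch; lag = 8, two seasonal periods, is a common rule of thumb for quarterly data):

# Large p-values are consistent with white-noise residuals
augment(fit) |> 
  features(.innov, ljung_box, lag = 8)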

That’s all we have time for!

Learn more about forecasting

  • Forecasting: Principles and Practice textbook

Freely available online! https://otexts.com/fpp3/

I appreciate your feedback
