Carbon permit prices and a dynamic regression model.

In the previous lab, we considered forecasting just a single time series. But of course, we are often interested in the effect of one variable on another. When we are using time series data, we can refer to this as dynamic time series regression.

The question we will try to answer in this lab is how the price of carbon quotas in the European Emissions Trading Scheme (ETS) affects power prices in the Nordic market. You can read more about the ETS here, but let us quickly review some main points about what the ETS is and the theory of why it should affect power prices (even if power is produced from nearly 100% carbon-free sources in the Nordic market).

Let's load some packages again, and then read in the data frame we put together in the previous lab, which included carbon prices.

library(tidyverse)
library(fpp3)
library(lubridate)

We should have saved the file somewhere on our local disk:

power_df = read_csv("data/power_df.csv")
## Rows: 128 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (21): Bergen, DK1, DK2, EE, ELE, FI, KT, Kristiansand, LT, LV, Molde, O...
## lgl   (1): FRE
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(power_df)[23] = "ets_price"

Let's visualise our ETS price series again:

ggplot(power_df, aes(x=date, y=ets_price)) +
  geom_line()

We see that the ETS price collapsed after 2008 and then even further after 2012. What happened in these periods that could explain the movements?

What happened around 2018 that led to a resurgence of the ETS price? Has the surge in the ETS price had any immediate effects on power generation in Europe? (You might want to do some searching online for information.)

We can also ask what has happened to carbon prices during the current crisis. You can look at the website of EMBER for an updated view.


Now let us again consider prices in DK1. First, we convert our data frame to tsibble format, and then we convert the power prices to Euro/kWh so that they are on a similar scale to the ETS prices (Euro/tonne of carbon).

power_df$date = yearmonth(power_df$date)
power_ts = tsibble(power_df, index=date)
power_ts = power_ts %>% mutate(
  DK1 = DK1/1000
)

First, let us run a somewhat naive model, where the error terms are modelled as an AR(2) process and the ETS price is used as an exogenous regressor.

(Notice that I wrote that the error terms are modelled as an AR(2) process, not the price series itself. This is the default behaviour of the ARIMA() function in fable, and you can read about why here.)
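To be concrete, the model we are estimating is a linear regression whose error term follows an AR(2) process:

\(y_t = \beta_0 + \beta_1 x_t + \eta_t\), where \(\eta_t = \phi_1 \eta_{t-1} + \phi_2 \eta_{t-2} + \epsilon_t\)

Here \(y_t\) is the DK1 power price, \(x_t\) is the ETS price, and \(\epsilon_t\) is white noise. This is what the output below labels as a LM w/ ARIMA(2,0,0) errors model.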

armax1 = power_ts %>% fill_gaps() %>%
  model(ARIMA(DK1 ~ ets_price + pdq(2,0,0)))
report(armax1)
## Series: DK1 
## Model: LM w/ ARIMA(2,0,0) errors 
## 
## Coefficients:
##          ar1     ar2  ets_price  intercept
##       0.5296  0.2811     0.1332     2.3471
## s.e.  0.0857  0.0859     0.0228     0.3359
## 
## sigma^2 estimated as 0.2261:  log likelihood=-85.59
## AIC=181.18   AICc=181.67   BIC=195.48

The results appear reasonable: a 1 Euro increase in the carbon price is associated with a 0.13 Euro increase in the electricity price (per kWh).
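If you want the coefficient estimates and standard errors as a tidy table - for example, to build a rough confidence interval for the carbon price effect - here is a minimal sketch using tidy() from fabletools (loaded with fpp3):

#coefficients and standard errors as a data frame
tidy(armax1)

#a rough 95% confidence interval for the ets_price coefficient
tidy(armax1) %>%
  dplyr::filter(term == "ets_price") %>%
  mutate(lower = estimate - 1.96 * std.error,
         upper = estimate + 1.96 * std.error)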

There are some problems with this regression model, as it stands, but let us first create some scenarios based on this estimation.

#scenarios

#no change - take last value in series and extend by 12 months
scen1 = new_data(power_ts, 12) %>% mutate(
  ets_price = rep(power_ts$ets_price[128],12)
) 

#constant increase of 0.5 EUR per month
scen2 = new_data(power_ts, 12) %>% mutate(
  ets_price = rep(power_ts$ets_price[128],12)  +  cumsum(rep(.5,12))
) 
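Before forecasting, it can be worth sanity-checking the scenario paths themselves. A minimal sketch that stacks the two scenarios and plots the assumed carbon price paths (the scenario column is just a label added for the plot):

#quick visual check of the two assumed carbon price paths
bind_rows(
  scen1 %>% as_tibble() %>% mutate(scenario = "constant"),
  scen2 %>% as_tibble() %>% mutate(scenario = "rising 0.5 EUR/month")
) %>%
  ggplot(aes(x = date, y = ets_price, color = scenario)) +
  geom_line()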

Then we create two forecasts based on our two scenarios:

armax1_forecast1 = forecast(armax1, new_data=scen1)
armax1_forecast2 = forecast(armax1, new_data=scen2)

First the forecast where we have included a constant carbon permit price.

armax1_forecast1 %>% autoplot(power_ts)

Does this seem reasonable?

Then the forecast with a rising CO2 price:

armax1_forecast2 %>% autoplot(power_ts)

From Lab 5, we know that there is some doubt about whether we should consider the power price series stationary. Moreover, when we include an exogenous regressor in a time series regression, the exogenous variable should also be stationary in order for our inference to be valid.

Looking at the carbon price series, we may well doubt that it is stationary. An augmented Dickey-Fuller test confirms this doubt: with a p-value of 0.96, we cannot reject the null hypothesis of a unit root.

library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(power_ts$ets_price)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  power_ts$ets_price
## Dickey-Fuller = -0.82221, Lag order = 5, p-value = 0.9573
## alternative hypothesis: stationary
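As a cross-check within the fpp3 framework, we can also use the unit root features from feasts. A minimal sketch; note that the KPSS test has the opposite null hypothesis (stationarity), and unitroot_ndiffs suggests how many differences are needed:

#KPSS test (null: stationary) and the suggested number of differences
power_ts %>%
  features(ets_price, list(unitroot_kpss, unitroot_ndiffs))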

So what we want instead is a model where we take the first difference of both the y-variable (power prices) and the x-variable (carbon permit prices).

We could do this manually, first creating differenced series for both power prices and carbon prices, but luckily, by specifying that the model is I(1) in the ARIMA() function, both series will be differenced automatically.
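To see what this does under the hood, here is a sketch of the manual route: create differenced series with difference() and regress the differenced power price on the differenced carbon price. This is for illustration only (the column names dDK1 and dets_price are made up here); in what follows we let ARIMA() handle the differencing:

#manual alternative, for illustration only
power_diff = power_ts %>% fill_gaps() %>%
  mutate(
    dDK1 = difference(DK1),
    dets_price = difference(ets_price)
  )

power_diff %>%
  model(ARIMA(dDK1 ~ dets_price + pdq(2,0,0)))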

Here we can also fit a comparison model without the exogenous regressor, and check whether including it actually improves the goodness of fit.

armax2 = power_ts %>% fill_gaps() %>% model(
  modWithEts = ARIMA(DK1 ~ ets_price + pdq(2,1,0)),
  modWOutEts = ARIMA(DK1 ~ pdq(2,1,0))
  )

We can compare the fit (AIC) of the two models:

glance(armax2) %>% arrange(AICc)
## # A tibble: 2 × 8
##   .model     sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 modWithEts  0.233   -87.1  182.  182.  194. <cpl [2]> <cpl [0]>
## 2 modWOutEts  0.269   -96.7  199.  200.  208. <cpl [2]> <cpl [0]>

What we see is that the AIC and BIC values are lower in the model with carbon prices, and this indicates a better fit to the data, so we are comfortable saying that including carbon prices improves the predictive performance of our model.

We check our residuals:

armax2 %>% 
  select(modWithEts) %>%
  gg_tsresiduals()
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
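We can back up the visual check with a Ljung-Box test on the innovation residuals. A minimal sketch, using lag = 24 and dof = 2 for the two estimated AR parameters; a large p-value would suggest the residuals resemble white noise:

#Ljung-Box test on the residuals of the model with carbon prices
armax2 %>%
  select(modWithEts) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 2)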

Now we again create our forecasts:

fcast3 = armax2 %>% select(modWithEts) %>% forecast(new_data=scen1)
fcast4 = armax2 %>% select(modWithEts) %>% forecast(new_data=scen2)
fcast3 %>% autoplot(power_ts)

fcast4 %>% autoplot(power_ts)

Here we see that rising carbon prices are associated with a forecast of rising power prices, but with quite a bit of uncertainty.

Advanced seasonality

Now we move back to some more advanced techniques for modelling complex seasonality, which has a tendency to pop up in power markets.

We’ll load in hourly consumption data for Norway as a whole and its 5 price areas and clean it up a bit (in the previous lab we had daily consumption data):

cons = read_csv2("http://jmaurit.github.io/analytics/labs/data/consumption-no-areas_2019_hourly.csv")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 8761 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (2): Date, Hours
## dbl (6): NO1, NO2, NO3, NO4, NO5, NO
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#cons["Date"] = as.Date(cons$Date, format="%d/%m/%Y")
cons = cons %>% separate(Hours, sep="-", into=c("start", "end"))

#we use lubridate to create a date-time columns
cons["period"] = dmy_h(paste(cons$Date, cons$start))

#We have one missing value - I will fudge it and replace it with the previous hour's value
cons[["NO"]][cons$period==as_datetime("2019-03-31 02:00:00")] = cons[["NO"]][cons$period==as_datetime("2019-03-31 01:00:00")]

#And we have one duplicate hour
duplicates(cons)
## Using `period` as index variable.
## # A tibble: 2 × 10
##   Date       start end     NO1   NO2   NO3   NO4   NO5    NO period             
##   <chr>      <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>             
## 1 27/10/2019 02     03    3288  3903  2909  2013  1648 13760 2019-10-27 02:00:00
## 2 27/10/2019 02     03    3255  3864  2952  2003  1611 13685 2019-10-27 02:00:00
dupRow = duplicates(cons)[2,]
## Using `period` as index variable.
cons = cons %>% rows_delete(dupRow, by=c("period", "NO"))
## Ignoring extra `y` columns: Date, start, end, NO1, NO2, NO3, NO4, NO5
#convert to tsibble
cons_ts = cons %>% select("NO1":"period") %>% tsibble(index=period)

Let's take a look:

cons_ts %>% select(NO) %>% autoplot()
## Plot variable not specified, automatically selected `.vars = NO`

With so much data, it is hard to make out the patterns at smaller frequencies.

Let's look at a smaller window covering just the last two months of the year:

cons_ts %>%
  dplyr::filter((period>=as_datetime("2019-11-01 00:00:00")) & (period<=as_datetime("2020-01-01 00:00:00"))) %>%  autoplot(NO)

What we see more clearly here are two levels of seasonality. We have the daily (24-hour) cycle of electricity consumption: low during the night, a spike in mid-day, and then a lull in the evening. We also have the weekly seasonality, with lulls during the weekend. If we had multiple years of data, we could also clearly see strong yearly seasonal patterns.
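One way to see these two seasonal periods separately is with seasonal plots. A minimal sketch using gg_season() from feasts (loaded with fpp3), once with a daily period and once with a weekly period:

#daily pattern: each line is one day of hourly consumption
cons_ts %>% gg_season(NO, period = "day")

#weekly pattern: each line is one week
cons_ts %>% gg_season(NO, period = "week")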

To handle the extra seasonality, we will use Fourier terms, where we essentially decompose the seasonal pattern into sine and cosine terms of varying frequencies. You can read more here.

As a simple example to start with, let's model just the daily (24-hour) seasonality, and let's say that the short-term dynamics can be modelled as an AR(2) process. Then we can write the model as:

\(C_t = \beta_1 C_{t-1} + \beta_2 C_{t-2} + \alpha \sin\left(\frac{2\pi t}{24}\right) + \gamma \cos\left(\frac{2\pi t}{24}\right) + \epsilon_t\)
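More generally, a Fourier specification of order K for the daily period adds K sine and cosine pairs, each with its own estimated coefficient:

\(\sum_{k=1}^{K}\left[\alpha_k \sin\left(\frac{2\pi k t}{24}\right) + \gamma_k \cos\left(\frac{2\pi k t}{24}\right)\right]\)

This is what fourier(K = ...) generates for us; K = 1 gives just the first pair, which is what we use in the model below.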


smod = cons_ts %>% model(
  fmod1 = ARIMA(NO ~ fourier(K=1) + pdq(2,0,0) + PDQ(0,0,0))
)


forecast1 = smod %>% forecast(h=24*14)


forecast1 %>% autoplot(cons_ts[cons_ts$period > as_datetime("2019-10-01 00:00:00"),], level = 95)

Here we see that we get the basic daily pattern. Let us experiment with higher-order Fourier terms in order to capture the more complicated daily seasonality. I will also leave out the pdq() specification in order to let the algorithm choose the optimal ARMA dynamics, though I will still specify PDQ(0,0,0) so that the seasonality is modelled through the Fourier terms rather than through seasonal ARIMA terms.

smod2 = cons_ts %>% model(
  fmod1 = ARIMA(NO ~ fourier(K=1) +  PDQ(0,0,0)),
  fmod2 = ARIMA(NO ~ fourier(K=2) +  PDQ(0,0,0)),
  fmod3 = ARIMA(NO ~ fourier(K=3) +  PDQ(0,0,0)),
  fmod4 = ARIMA(NO ~ fourier(K=4) +  PDQ(0,0,0)),
  fmod5 = ARIMA(NO ~ fourier(K=5) + PDQ(0,0,0)),
  fmod6 = ARIMA(NO ~ fourier(K=6) + PDQ(0,0,0)),
  fmod7 = ARIMA(NO ~ fourier(K=7) + PDQ(0,0,0)),
  fmod8 = ARIMA(NO ~ fourier(K=8) + PDQ(0,0,0))
)
smod2 %>%
  forecast(h = 24*14) %>%
  autoplot(cons_ts[cons_ts$period>as_datetime("2019-09-01 00:00:00"),], level = 95) +
  facet_wrap(vars(.model), ncol = 2) +
  guides(colour = "none", fill = "none", level = "none") +
  geom_label(
    aes(x = as_datetime("2019-10-01 00:00:00"), y = 20000, label = paste0("AICc = ", format(AICc))),
    data = glance(smod2)
  )

The highest order of Fourier terms gave us the lowest AICc score. We could investigate higher orders as well, but for now, let us take a closer look at the 8th-order Fourier model:

smod2 %>% select(fmod8) %>% 
  forecast(h = 24*14) %>%
  autoplot(cons_ts %>% dplyr::filter(period>as_datetime("2019-09-01 00:00:00")), level = 95)

It doesn’t look too bad, but we notice that our forecasted time series still appears much more regular than the actual time series. This is, to a certain extent, to be expected, and it is one of the reasons for the uncertainty bands. But it may also be a sign that we need a more complex Fourier model.

What we want to do now is to explicitly model both the daily and weekly seasonality. You can read more here.

#create a new fourier series
smod3 = cons_ts %>%
  model(
    fmod1 = ARIMA(NO ~ PDQ(0, 0, 0) +
                fourier(period = 24, K = 8) + fourier(period = 24*7, K = 5))
  )

forecast2 = smod3 %>% forecast(h = 24*14)

forecast2 %>% autoplot(cons_ts %>% dplyr::filter(period>as_datetime("2019-09-01 00:00:00")))

Now the forecast looks much better. My choice of order for the two seasonal terms was somewhat arbitrary here. For multiple seasonalities, AIC will tend to overestimate the number of orders you need. In practice, you will probably have to experiment until you find orders that seem to do a good job of mimicking the seasonality.
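If you want something a bit more systematic than pure trial and error, one approach is to fit a handful of candidate combinations of orders and compare their AICc values, keeping the caveat above in mind. A small sketch (the particular combinations and model names here are arbitrary):

#compare a few combinations of daily and weekly Fourier orders
smod4 = cons_ts %>%
  model(
    d8_w3 = ARIMA(NO ~ PDQ(0,0,0) + fourier(period = 24, K = 8) + fourier(period = 24*7, K = 3)),
    d8_w5 = ARIMA(NO ~ PDQ(0,0,0) + fourier(period = 24, K = 8) + fourier(period = 24*7, K = 5)),
    d10_w5 = ARIMA(NO ~ PDQ(0,0,0) + fourier(period = 24, K = 10) + fourier(period = 24*7, K = 5))
  )

glance(smod4) %>% arrange(AICc)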

Vector autoregressive models

In the first section of this lab, we included a second series as an exogenous regressor. We took it as given that the series was in fact exogenous - that is, that a change in the price of emissions permits caused a change in power prices (in Denmark). We did not consider the possibility that higher power prices caused higher permit prices, or, more realistically, that both higher permit prices and higher power prices were caused by an unobserved variable - for example, a stronger European economy pushing up both emissions prices and power prices. If this were the case, we could no longer give a causal interpretation to our regression. Worse, our forecast scenarios of power prices given higher emissions prices might be misleading.

Such endogeneity problems can be fiendishly hard to sort out. One tool we can use to try to get some clarity is called a Vector Autoregressive Model (VAR). This allows us to treat essentially every one of the variables as an endogenous variable: affecting and being affected by the other variables.

VAR models can give us forecasts that are more realistic and display more complex dynamics. The downside is that these models can be difficult to interpret.

Daily consumption and price

We’ll take an almost obvious example: prices and consumption in the power market. We know that periods of high consumption can lead to high prices. But we also know that high prices will probably affect consumption, even though electricity demand is known to be quite price inelastic in the short term (why?).

So let's import daily consumption and price data.

cons_daily = read_csv2("http://jmaurit.github.io/analytics/labs/data/consumption-per-country_2019_daily.csv")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 365 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): date
## dbl (9): NO, SE, FI, DK, Nordic, EE, LV, LT, Baltic
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cons_daily["date"] = as.Date(cons_daily$date, format="%d/%m/%Y")
prices_daily = read_csv2("http://jmaurit.github.io/analytics/labs/data/elspot_prices_2019_daily.csv")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): date
## dbl (22): SYS, SE1, SE2, SE3, SE4, FI, DK1, DK2, Oslo, Kr.sand, Bergen, Mold...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prices_daily["date"] = as.Date(prices_daily$date, format="%d/%m/%Y")

Then, selecting consumption for Norway and prices for Oslo, we join the series into a data frame we call NO_df.

NO_df = prices_daily %>% dplyr::select(date, Oslo) %>% inner_join(dplyr::select(cons_daily, date, NO), by="date")
colnames(NO_df)[2:3] = c("Oslo_price_EUR_MWH", "NO_cons_MWH")

Now, instead of a model of consumption on price or price on consumption, we want to allow each to affect the other. A model with one lag would look like:

\(C_t = a_1 + \phi_{11} C_{t-1} + \phi_{12}P_{t-1} + \epsilon_{1,t}\) \(P_t = a_2 + \phi_{21} C_{t-1} + \phi_{22}P_{t-1} + \epsilon_{2,t}\)
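Or, more compactly, in matrix form:

\(\begin{bmatrix} C_t \\ P_t \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} + \begin{bmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{bmatrix} \begin{bmatrix} C_{t-1} \\ P_{t-1} \end{bmatrix} + \begin{bmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \end{bmatrix}\)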

We need to put our series into tsibble format:

NO_ts = tsibble(NO_df, index=date)

By now, we can probably guess that our price and consumption series will not be stationary, so in our modelling we will have to difference the data. We also take logs, both to make the series a bit more linear and to give our regression coefficients an interpretation as elasticities.

NO_ts = NO_ts %>% mutate(
  log_dprice = difference(log(Oslo_price_EUR_MWH)), 
  log_dcons = difference(log(NO_cons_MWH))
)
NO_ts %>% select(log_dprice) %>% autoplot()
## Plot variable not specified, automatically selected `.vars = log_dprice`
## Warning: Removed 1 row(s) containing missing values (geom_path).

NO_ts %>% select(log_dcons) %>% autoplot()
## Plot variable not specified, automatically selected `.vars = log_dcons`
## Warning: Removed 1 row(s) containing missing values (geom_path).

Here we fit our model, allowing the VAR() function to automatically choose the optimal number of lags. I use the report() command to show the estimates, but in general, it is hard to give any meaningful economic interpretation to the individual coefficients of such VAR models.

varMod = NO_ts %>%
  model(
    mod1 = VAR(vars(log_dprice, log_dcons))
  )

varMod %>% report()
## Series: log_dprice, log_dcons 
## Model: VAR(5) 
## 
## Coefficients for log_dprice:
##       lag(log_dprice,1)  lag(log_dcons,1)  lag(log_dprice,2)  lag(log_dcons,2)
##                 -0.4055            0.2723            -0.2939           -0.0236
## s.e.             0.0589            0.1045             0.0626            0.1042
##       lag(log_dprice,3)  lag(log_dcons,3)  lag(log_dprice,4)  lag(log_dcons,4)
##                 -0.1391           -0.1846            -0.1720            0.0729
## s.e.             0.0643            0.1124             0.0626            0.1053
##       lag(log_dprice,5)  lag(log_dcons,5)
##                 -0.0791           -0.3272
## s.e.             0.0587            0.1048
## 
## Coefficients for log_dcons:
##       lag(log_dprice,1)  lag(log_dcons,1)  lag(log_dprice,2)  lag(log_dcons,2)
##                 -0.0294            0.0463            -0.0182           -0.3887
## s.e.             0.0314            0.0557             0.0334            0.0555
##       lag(log_dprice,3)  lag(log_dcons,3)  lag(log_dprice,4)  lag(log_dcons,4)
##                 -0.0664           -0.1023            -0.0658           -0.0931
## s.e.             0.0343            0.0599             0.0333            0.0561
##       lag(log_dprice,5)  lag(log_dcons,5)
##                 -0.0617           -0.3255
## s.e.             0.0313            0.0559
## 
## Residual covariance matrix:
##            log_dprice log_dcons
## log_dprice     0.0051    0.0011
## log_dcons      0.0011    0.0015
## 
## log likelihood = 1150.27
## AIC = -2252.54   AICc = -2248.95 BIC = -2159.27

We can look at the autocorrelations of our residuals from our model:

varMod %>%
  augment() %>%
  ACF(.innov) %>%
  autoplot()

We can see that we failed to fully account for the weekly seasonality in our consumption data. But we will leave that aside for now and look at our forecasts:

varMod %>%
  forecast(h=14) %>%
  autoplot(NO_ts %>% dplyr::filter(date>as_date("2019-09-01")))

We of course get our forecasts in (log) differences as well, but it would be easy enough to transform these back into levels.
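For example, since we modelled the first difference of the logs, the forecast price level at each horizon is the last observed level multiplied by the exponentiated cumulative sum of the forecast log differences. A rough sketch (how the point forecasts are stored in the multivariate fable can differ between fable versions, so the column name .mean_log_dprice below is an assumption - inspect the forecast object first):

fc = varMod %>% forecast(h = 14)

#assumed column name for the point forecast of the price equation
fc_log_dprice = fc %>% as_tibble() %>% pull(.mean_log_dprice)

#back out the price level in EUR/MWh
last_price = tail(NO_ts$Oslo_price_EUR_MWH, 1)
price_levels = last_price * exp(cumsum(fc_log_dprice))
price_levels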

Assignment

  1. In a dynamic regression model, it may make sense to include lagged variables as exogenous regressors. In the model of DK1 prices, include both contemporaneous and lagged carbon permit prices. How does this change your model? (You may want to read Ch 10.6 in fpp3).

  2. From ENTSO-E or Statnett, download hourly consumption data for Norway for 2017 and 2018. Join this with the 2019 data in order to create one long time series for Norwegian consumption. Then model the seasonality in the data (at the monthly, weekly and daily level) with Fourier terms.

  3. Create a VAR model for consumption and prices in 2019 using Danish data (you can find it on ENTSO-E or at the Danish TSO’s energy data site). Create a 30-day forecast. Load in actual data for January 2020: how does your forecast look? Include wind power in Denmark as a variable. How does this affect the model and forecast?