Even more so than oil and gas exploration, electricity markets have a lot of structure, and some basic understanding of how electricity markets and power systems work is important for being able to interpret the data. Markets and systems vary across the globe, and it's hard to provide a comprehensive review. But since the data we will be going through in the lab is from the Nordic market, I'll give a short overview of that market.
Traditionally, the Nordic countries organized their electricity systems as vertically integrated, government owned utilities. What we now know as the generation company Statkraft was an integral part of the Norwegian regulator NVE. In other words, one giant government entity controlled electricity generation, balancing, transmission, and retailing. A similar situation existed in the other Nordic countries.
In 1986 NVE in Norway was split up into component parts: the part responsible for owning and operating electricity plants and the transmission network became Statkraft, while the regulator of the system maintained the name NVE.
In 1990 and continuing throughout the decade, electricity generation and retailing were deregulated. Statkraft was split into two: A generation business that had to compete with other companies to produce electricity (but which was still state-owned), and Statnett, which was a wholly government-owned monopoly responsible for the transmission network and balancing.
Regional generation companies (Like BKK and TrønderEnergi) could now compete with each other to provide the lowest-cost electricity to consumers, given the availability of transmission.
As part of the deregulation, retailing was split from generation. Thus retailers would buy electricity from the wholesale market, and then were free to compete with each other based on the prices and contracts they offered consumers and businesses.
Deregulation spread from Norway to the other Nordic countries, and the Norwegian wholesale market grew into a common electricity market across the Nordic countries called Nord Pool, where electricity generators could compete with each other across national borders. Nord Pool encompasses all the Nordic countries except Iceland and has continued to expand, so that companies from 16 countries now trade on Nord Pool. You can read more on the website of Nord Pool.
Electricity as a commodity has some quirks that differentiate it from other commodities like oil, coal or gas, and these need to be taken into account when thinking about the markets for electricity. I am not a power systems engineer, so I will state these properties fairly imprecisely, but it should be good enough for our purposes.
1.) Electricity cannot be stored. Once electricity is generated, it flows through the network until it does work: lighting a lamp, turning an electric motor, or heating a room. You can transform electricity into other forms of energy that can be stored, such as the chemical energy of a battery or potential energy in the form of water in a reservoir, but electricity itself cannot be stored.
2.) Supply and demand in an electricity network must always be in balance. This follows more or less from the first property. If the demand for electricity at any given time is higher than generation, this can lead to a drop in the frequency of the network, and potentially to severe damage to the network’s infrastructure as well as to generation assets and even appliances. Similar problems can result from generation outstripping demand. A big job of the transmission system operator is therefore to ensure that supply and demand are equal, second by second. This is what is referred to as “balancing” and “frequency regulation”.
3.) Electricity will flow in all possible directions in a network. In an AC network (which is the modern universal standard), generated electricity will flow through all possible paths in the network. That means that if you want to see how much net electricity generated in Denmark ends up in Norway, you also have to consider “loop flows” of electricity that travel through Sweden to Norway. A notable exception is High-Voltage Direct Current (HVDC) cables, where electricity can be directed in a certain direction along a certain route.
The Nordic market is a so-called “zonal” market. That means that the market is split up into zones or areas where a uniform electricity price is established for each hour of the day. Norway has 5 zones, Sweden 4, Denmark 2 and Finland just 1. You can see a map of the zones here. (By the way, the other major geographical pricing system is called “Nodal” market, which allows prices to vary at a much finer level depending on the state of transmission and the marginal generation costs at a location (node), but it also becomes necessarily more complex and less transparent. Within electricity market research, exploring which system is best in different scenarios is a major topic of research. Mette Bjørndal at NHH has written extensively on this.)
There are several distinct markets for trading electricity, but the main market run by Nord Pool is the day-ahead market (sometimes misleadingly called “Nord Pool Spot”.) Up until 1200 the day before delivery, producers submit supply schedules - basically telling Nord Pool how much they are willing to produce for any given price for each hour of the following day.
Wholesale consumers (like retailing companies and large commercial consumers) submit demand schedules - saying how much they want to buy at given prices for each hour of the day.
These supply and demand schedules are then used to create aggregated supply and demand curves for the entire market, where a price that clears the market is established (that is, a price is established where producers sell everything they want given the price, and consumers buy everything they want given the price.)
But this “system price” assumes no congestion or transmission bottlenecks in the system, and in particular between the price areas, which there almost always are. Therefore, Nord Pool adjusts the individual zonal prices until the market clears in each price zone. Thus if there is congestion between eastern Denmark (DK2) and southern Sweden (SE4), with Denmark having too much supply at the system price and Sweden having too little, prices will rise in Sweden and fall in Denmark until supply meets demand. Notably, the net flow of electricity goes from the low-price area to the high-price area (from the surplus area to the deficit area).
Most of the electricity trade in the Nordic market happens on the day-ahead market, but markets for trading closer to real time are essential for maintaining balance in the system and allowing for both producers and wholesale consumers to respond to unexpected demand or supply.
Nord Pool also operates an hour-ahead (intraday) market, sometimes called “Elbas”. This is a continuously traded, bilateral market. That is, trading happens in real time, with suppliers and wholesale consumers submitting supply and demand bids that can be accepted by other participants on the market, up to an hour ahead of delivery. The hour-ahead market has become especially important with increased intermittent (wind, solar) power, as producers have to forecast a day ahead what their production will be. When it appears that their forecast is off, producers can go to the hour-ahead market and try to make up the difference.
In addition, the national transmission system operators (Statnett in Norway, Svenska kraftnät in Sweden, etc.) operate markets for minute-by-minute balancing and frequency reserves.
A lot of analysis of energy markets involves studying data over time. For example, we are interested in how the price of oil changes over time, or how increased wind power will affect market prices.
In the next three labs we will be using time series statistics to study energy markets and to make and interpret forecasts. Some of the material below is based on Modern Econometrics by Wooldridge. But I will mostly refer directly to the discussion in Forecasting: Principles and Practice.
The first lab serves as a quick introduction to the most important issues in time series statistics. If you have taken a time-series course before, feel free to skim through quickly and go on to the assignment.
Let’s start with a simple and intuitive time series model called a distributed lag model. This is where we have a y-variable (or endogenous variable) which we model on an x-variable and its lags. We can write it as:
\(y_t = \alpha_0 + \delta_0 x_t + \delta_1 x_{t-1} + \delta_2 x_{t-2} + u_t\)
Let's say we want to analyse a temporary “shock” to this system - for example how a temporary jump in the oil price might affect industry profits over time.
Let’s say that the shock happens at time t. We assume that in the periods before t, the x-variable (here oil price) is constant (\(x=c\)).
So in the period before the shock happens we have:
\(y_{t-1} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c\)
Then, in period t we have:
\(y_t = \alpha_0 + \delta_0 (c+1) + \delta_1 c + \delta_2 c\)
And so on:
\(y_{t+1} = \alpha_0 + \delta_0 c + \delta_1 (c+1) + \delta_2 c\)
\(y_{t+2} = \alpha_{0} + \delta_{0} c + \delta_{1} c + \delta_{2} (c+1)\)
\(y_{t+3} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c\)
We can look at the effect of the shock over time, which is called the impact propensity:
time | impact propensity |
---|---|
t | \(y_t-y_{t-1} = \delta_0\) |
t+1 | \(y_{t+1}-y_{t-1} = \delta_1\) |
t+2 | \(y_{t+2} - y_{t-1} = \delta_2\) |
t+3 | \(y_{t+3} - y_{t-1} = 0\) |
Now let us say that we want to analyse a permanent effect. For example the effect of a constant CO2 tax on the electricity price.
Again in period t we have:
\(y_t = \alpha_0 + \delta_0 (c+1) + \delta_1 c + \delta_2 c\)
and then in following years:
\(y_{t+1} = \alpha_0 + \delta_0 (c+1) + \delta_1 (c+1) + \delta_2 c\)
\(y_{t+2} = \alpha_0 + \delta_0 (c+1) + \delta_1 (c+1) + \delta_2 (c+1)\)
\(y_{t+3} = \alpha_0 + \delta_0 (c+1) + \delta_1 (c+1) + \delta_2 (c+1)\)
This is what we call “long-run propensity”, or the permanent effect:
time | long-run propensity |
---|---|
t | \(y_t-y_{t-1} = \delta_0\) |
t+1 | \(y_{t+1}-y_{t-1} = \delta_0 + \delta_1\) |
t+2 | \(y_{t+2} - y_{t-1} = \delta_0 + \delta_1 + \delta_2\) |
t+3 | \(y_{t+3} - y_{t-1} = \delta_0 + \delta_1 + \delta_2\) |
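To make this concrete, here is a minimal simulation sketch (with made-up coefficient values) that generates data from a distributed lag process and recovers the impact and long-run propensities by regression:

set.seed(123)
n = 200
x = rnorm(n) #the x-series
u = rnorm(n) #the error term
x_l1 = c(NA, x[1:(n-1)]) #x lagged one period
x_l2 = c(NA, NA, x[1:(n-2)]) #x lagged two periods
y = 1 + 2*x + 1*x_l1 + 0.5*x_l2 + u #true delta0=2, delta1=1, delta2=0.5
dl_mod = lm(y ~ x + x_l1 + x_l2)
coef(dl_mod) #estimates of alpha_0 and the deltas
sum(coef(dl_mod)[2:4]) #long-run propensity: delta0 + delta1 + delta2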
If we want correct standard errors, and in turn valid inference in our time series, then we have to take into account possible serial correlation (or autocorrelation) in our error term. You can read more generally about serial correlation here.
We say that there is serial correlation when:
\(corr(u_t, u_s | X) \neq 0\)
for some \(t\neq s\)
That is to say that there is correlation over time between the different error terms. Intuitively, if we look at the residuals of a regression, then without serial correlation they should look random: centered around 0, sometimes a bit positive, sometimes a bit negative, but with no recognizable pattern.
In econometrics, we often want to log-transform our data. Sometimes we do a log-transform in order to make data more linear. If you transform both sides of an equation, it can also make the results easier to interpret: you can interpret the results as an elasticity, where a percent change in one variable is associated with a percent change in another variable.
Let’s say we want to analyse the effects of wind power (W) on prices (P). These two series are in different units, so it may be easier to interpret the relationship in terms of elasticities. We can log-transform our regression:
\(log(P_t) = \alpha_0 + \delta_0 log(W_t) + \delta_1 log(W_{t-1}) + \delta_2 log(W_{t-2}) + u_t\)
\(\delta_0\) is interpreted as a short-term elasticity: \(\delta_0 = \frac{\% \Delta P_t}{\% \Delta W_t}\)
The long-term elasticity would be: \(\delta_0 + \delta_1 + \delta_2\)
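As a quick sketch with simulated data (all numbers here are made up), the coefficient in a log-log regression can be read directly as an elasticity:

set.seed(321)
W = exp(rnorm(200)) #a made-up, strictly positive "wind" series
P = exp(2 - 0.3*log(W) + rnorm(200, 0, 0.1)) #true elasticity of -0.3
el_mod = lm(log(P) ~ log(W))
coef(el_mod)["log(W)"] #close to -0.3: a 1 percent increase in W goes along with roughly a 0.3 percent fall in P

With lagged log terms included, as in the equation above, the long-run elasticity would be the sum of the estimated lag coefficients.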
We can include dummy variables in our time series regression. For example, let us say that we want to estimate the effect of a new environmental rule on oil production. Perhaps then we run the following regression:
\(log(Q_t) = \beta X_t + \delta D_t + u_t\)
Where \(D_t = [0,0,0,0,1,1,1,1,1,1,1]\), where the 1’s signify the periods after the new rule was put in place.
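As a small illustrative sketch (all numbers are made up), such a regression could be run as follows. The coefficient on the dummy is then, approximately, the proportional change in production associated with the rule:

Q = c(100, 102, 101, 103, 95, 94, 96, 93, 95, 92, 94) #made-up production data
X = rnorm(11, mean = 50, sd = 2) #a made-up control variable
D = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1) #dummy: 1 after the rule is in place
rule_mod = lm(log(Q) ~ X + D)
coef(rule_mod)["D"] #approximate proportional change in Q after the rule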
Often, we want to model a trend in our data.
We can write a linear trend as:
\(y_t = \alpha + \beta*t + u_t\)
We can simulate this in R:
u_t = rnorm(100,0,1) #100 random draws from a normal distribution
alpha = 2.5
beta = .3 # this is our trend coefficient - the series increases by .3 each period, t.
t=seq(1:100)
y = alpha + beta*t+u_t
plot(t,y, type="l")
If we model \(log(y_t)\), then we would get an exponential trend (which corresponds to a constant rate of growth over time, rather than a constant absolute increase over time.)
alpha = 2.5
beta = .03 # Now an exponential growth rate - the series increases by about 3 percent per period.
y = exp(alpha + beta*t + u_t)
# Notice that if you take the (natural) log of both sides, you get log(y) on the left side, and alpha + beta*t + u_t on the right side.
plot(t, y, type="l")
R is an excellent program for doing time-series analysis. We start by importing some useful extra packages.
library(tidyverse)
library(zoo) #provides extra tools for working with time series data and analysis
library(fpp3) #package that comes along with Forecasting: Principles and practice
library(dynlm)
The data we will look at comes from the Nordic power market. We import data on both prices on the day-ahead market, called elspot, and wind production data from Denmark and Sweden:
#Wind production from Sweden and Denmark (in MWh)
wind_prod =read_csv("http://jmaurit.github.io/norwayeconomy/data_series/dk_se_wind_data.csv")
## Rows: 2763 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): DK1, DK2, EE, LV, LT, SE1, SE2, SE3, SE4, FI, DK, SE
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#The price of power (on the day-ahead market) (EUR/MWh)
elspot = read_csv("http://jmaurit.github.io/norwayeconomy/data_series/elspot.csv")
## Rows: 228 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (20): Bergen, DK1, DK2, EE, ELE, FI, KT, Kristiansand, LT, LV, Molde, O...
## lgl (1): FRE
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We’ll start by looking at the two price areas in Denmark: DK1 and DK2. Here I join the data by the date variable:
DK = elspot[c("DK1", "DK2", "date")] %>% inner_join(wind_prod[c("DK1", "DK2", "date")], by="date")
colnames(DK) = c("DK1price","DK2price", "date", "DK1wind", "DK2wind")
tail(DK)
## # A tibble: 6 × 5
## DK1price DK2price date DK1wind DK2wind
## <dbl> <dbl> <date> <dbl> <dbl>
## 1 5214 5337 2018-07-01 11245 1981
## 2 5569 5684 2018-08-01 2489 1775
## 3 4989 5179 2018-09-01 3017 1609
## 4 4748 5090 2018-10-01 32970 6256
## 5 5404 5468 2018-11-01 31766 4920
## 6 4670 4911 2018-12-01 30001 10309
The fpp3 package comes with a handy format for time series data sets called a tsibble. This, as the name implies, works with the tidyverse set of tools we have been using.
DK["date"] = yearmonth(DK$date)
DK_ts = as_tsibble(DK, index=date)
DK_ts %>% dplyr::select(DK1wind, DK2wind, date) %>%
pivot_longer(-date, names_to="variable", values_to="value") %>%
ggplot(aes(x=date, y=value, color=variable)) +
geom_line()
We can notice first that the frequency of the data is monthly. The normal frequency in electricity data is hourly, so here we are looking at quite aggregated data.
We can also see that wind power is very variable, even when aggregated monthly.
Here we also see that there is much more wind power generation in the DK1 area. This is the western, peninsular part of Denmark. There is both more land and better wind resources in this part, so it makes sense that most of the wind power capacity is located there. However, a larger part of consumption takes place in the DK2 area – where Copenhagen is.
Then we can look at prices:
DK_ts %>% select(DK1price, DK2price, date) %>%
pivot_longer(-date, names_to="variable", values_to="value") %>%
ggplot(aes(x=date, y=value, color=variable)) +
geom_line()
The main thing to notice is that prices in the two areas are quite close. DK2 prices have a tendency to be slightly higher than DK1.
Let's see how wind power and prices are correlated in the two price areas.
DK_ts %>% ggplot(aes(x=DK1wind, y=DK1price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
For the DK1 area, it looks like there might be a slight negative relationship between wind power and average prices at the monthly level, but it is hard to say for sure.
Let's look at DK2:
DK %>% ggplot(aes(x=DK2wind, y=DK2price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In DK2 it is hard to discern any trend.
Let's focus on DK1, and start by running a simple regression by OLS:
model1 = lm(DK1price ~ DK2price + DK1wind + DK2wind, data=DK_ts)
summary(model1)
##
## Call:
## lm(formula = DK1price ~ DK2price + DK1wind + DK2wind, data = DK_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -447.48 -96.24 11.53 89.28 1495.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.409e+01 1.238e+02 0.114 0.910
## DK2price 9.599e-01 3.239e-02 29.634 <2e-16 ***
## DK1wind -7.715e-04 2.766e-03 -0.279 0.781
## DK2wind -4.395e-03 9.569e-03 -0.459 0.648
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233.3 on 68 degrees of freedom
## Multiple R-squared: 0.931, Adjusted R-squared: 0.928
## F-statistic: 306 on 3 and 68 DF, p-value: < 2.2e-16
So here we see that the DK2 price is a strong predictor of DK1 prices (which we saw from the graph above.) We get a regression with a very high R-squared (a very good fit). Wind power, as interpreted from this regression, does not appear to affect prices.
But are there things that are wrong with this regression?
Many, in fact:
1.) The DK1 price series probably is not stationary. Stationarity is one of the key concepts in time series and forecasting, and we will come back to it later in this module. You can read more here.
2.) Including prices in DK2 as a predictor probably doesn’t make sense, since prices in DK1 and DK2 are co-determined at the same time in the day-ahead market. By including DK2 we are getting a very high fit, but we are not estimating a particularly useful model.
3.) We have not taken into account any dynamics in the time series, like autocorrelation in the price series.
4.) The data may be too aggregated - we might prefer a finer time scale.
The fpp3 package has some built-in plotting functions that are built on ggplot2. So to plot a time series plot of the DK1 price we just need to write:
autoplot(DK_ts, DK1price) +
ggtitle("Price in DK1 area")
Let's start by creating first-differences of our data. In our regression, we can then interpret the results as how a change in one variable affects a change in another.
DK_diff_ts = DK_ts %>% mutate(
DK1price_diff = c(NA, diff(DK1price)),
DK2price_diff = c(NA, diff(DK2price)),
DK1wind_diff = c(NA, diff(DK1wind)),
DK2wind_diff = c(NA, diff(DK2wind))
)
autoplot(DK_diff_ts, DK1price_diff) +
ggtitle("First-difference of DK1 price")
## Warning: Removed 1 row(s) containing missing values (geom_path).
By creating some lagged variables using mutate, we can run some simple dynamic models
DK_diff_ts = DK_diff_ts %>% mutate(
DK1wind_diff_l1 = lag(DK1wind_diff)
)
We run a simple dynamic model, where change in price is modeled as a function of change in wind power and lagged change in wind power.
dynmod1 = lm(DK1price_diff ~ DK1wind_diff + DK1wind_diff_l1-1, DK_diff_ts)
summary(dynmod1)
##
## Call:
## lm(formula = DK1price_diff ~ DK1wind_diff + DK1wind_diff_l1 -
## 1, data = DK_diff_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1127.8 -251.9 51.3 344.0 1101.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## DK1wind_diff -0.0003691 0.0025634 -0.144 0.886
## DK1wind_diff_l1 -0.0018532 0.0025593 -0.724 0.471
##
## Residual standard error: 465.4 on 68 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.00821, Adjusted R-squared: -0.02096
## F-statistic: 0.2814 on 2 and 68 DF, p-value: 0.7556
In this simple model, we cannot detect any significant effect of wind power on price. Yet we shouldn't conclude too much from this. The data is heavily aggregated – at the monthly level – and we don't have particularly many data points.
Let's now deal directly with two of the issues in the initial linear regression: that the data was too aggregated, and the issue of stationarity.
First, we now use hourly data on the Nordic electricity market. The data is originally from the website of Nord Pool, which I have cleaned and organized.
wt_data = read_csv("http://jmaurit.github.io/analytics/labs/data/wt_data2.csv")
## Rows: 26304 Columns: 90
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): hour, hour_ind, month
## dbl (85): wind_SE1, wind_SE2, wind_SE3, wind_SE4, wind_DK1, wind_DK2, SE_nx...
## dttm (2): date, time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data set includes many columns that represent prices, wind power production and transmission between areas. But we want to go back to our original question of how wind power affects prices on the market.
But first let us pull out a few of the series we are interested in, and format them as tsibble
DK = wt_data %>% select(time, hour, wind_DK1, wind_DK2, DK1EurMW, DK2EurMW) %>% as_tsibble(index=time)
DK %>% autoplot(wind_DK1)
DK %>% autoplot(DK1EurMW)
Now let us again create a simple dynamic model by creating lagged series for wind power.
DK = DK %>% mutate(
wind_DK1_l1 = lag(wind_DK1),
wind_DK1_l2 = lag(wind_DK1,2),
wind_DK1_l3 = lag(wind_DK1,3)
)
DK_mod = lm(DK1EurMW~wind_DK1 + wind_DK1_l1 + wind_DK1_l2 + wind_DK1_l3, data=DK)
summary(DK_mod)
##
## Call:
## lm(formula = DK1EurMW ~ wind_DK1 + wind_DK1_l1 + wind_DK1_l2 +
## wind_DK1_l3, data = DK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8662.5 -873.9 -179.3 693.9 10558.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.973e+03 1.398e+01 284.181 < 2e-16 ***
## wind_DK1 -3.389e-01 8.259e-02 -4.103 4.09e-05 ***
## wind_DK1_l1 9.777e-03 1.510e-01 0.065 0.94836
## wind_DK1_l2 5.088e-02 1.510e-01 0.337 0.73611
## wind_DK1_l3 -2.415e-01 8.259e-02 -2.924 0.00345 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1345 on 26284 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.1026, Adjusted R-squared: 0.1024
## F-statistic: 750.9 on 4 and 26284 DF, p-value: < 2.2e-16
Audio: interpreting the distributed lag regression with hourly data.
Now we get a significant and negative coefficient on the wind variable, as well as on the third lag of the variable. From this regression we could interpret the results as saying that wind power has a tendency to push down prices in the power market. But we still have the potential problem of stationarity. We'll spend a little time on this before we move on.
Let's first run a regression of the price series on its own lag - a so-called autoregressive (AR) model:
DK = DK %>% mutate(
DK1EurMW_l1 = lag(DK1EurMW)
)
ar_mod1 = lm(DK1EurMW ~ DK1EurMW_l1 -1, data=DK)
summary(ar_mod1)
##
## Call:
## lm(formula = DK1EurMW ~ DK1EurMW_l1 - 1, data = DK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4511.2 -90.5 7.5 102.4 6116.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## DK1EurMW_l1 0.9934795 0.0007034 1412 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 416.1 on 26296 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.987, Adjusted R-squared: 0.987
## F-statistic: 1.995e+06 on 1 and 26296 DF, p-value: < 2.2e-16
The key point from reading this regression is that the coefficient on the lagged term is very close to 1. We can interpret this to mean that a shock in a certain hour will carry over nearly completely to the next hour. Shocks to price, in other words, tend not to dissipate quickly. This is what it means to be highly persistent. And highly persistent time series tend to be non-stationary, which makes it hard to interpret a regression.
The simplest form of a persistent (and non-stationary) time series is a random walk: you can read more here. The simple random walk model can be written as:
\[y_t = y_{t-1} + e_t\]
Intuitively, let's say that oil prices are a random walk. That would mean that the price today is the same as the price yesterday plus some random (“stochastic”) jump (either positive or negative). In terms of forecasting, that means the best forecast of the oil price tomorrow is the oil price today!
We can simulate a random walk easily in R:
#We make a matrix that is 50x30
y=matrix(,nrow=50, ncol=30)
#We use a for-loop to fill the matrix with random shocks:
for(s in 1:30){
e = rnorm(50) #this represents our random variable - this command simulates 50 draws from a normal distribution.
y[,s] = cumsum(e) # the y variable is then modelled as the cumulative sum of the 50 shocks (convince yourself that this is the case)
}
colnames(y) = paste0("V", 1:30) #name the columns before converting to a tibble
y = as_tibble(y)
ycols = colnames(y)
y["t"] = 1:50
y_l = pivot_longer(y, all_of(ycols), names_to="variable", values_to="value")
ggplot(y_l, aes(x=t, y=value, group=variable)) +
geom_line(alpha=.5)
I will walk through this code in this audio:
The figure shows 30 simulated time series of 50 periods each.
One thing we can notice is how the variance increases over time. This violates one of our criteria for using OLS - a constant variance.
We can also notice how the series tend to move in all sorts of directions. There is no reversion to some well-defined mean, even though all the series come from exactly the same process. This is getting closer to the definition of non-stationarity, and it gives some intuition on why using a non-stationary series can give misleading results. Imagine telling a story about any one of the random draws – one goes up, another down, another nearly straight – but all come from exactly the same random process. All these stories and interpretations are essentially wrong, since what we really see are just random movements.
Another form of random walk is a random walk with drift:
\[y_t = \alpha_0 + y_{t-1} + e_t\] Notice that in a model with a lagged value of the dependent variable, the constant term leads to a constant time trend. You can read the above equation as saying that the value of y is the same as last period y plus a constant term \(\alpha_0\) plus some random shock. Thus \(\alpha_0\) gets added to the series in each period.
y=matrix(,nrow=50, ncol=30)
for(s in 1:30){
e = rnorm(50)
y[,s] = cumsum(2+e)
}
colnames(y) = paste0("V", 1:30) #name the columns before converting to a tibble
y = as_tibble(y)
ycols = colnames(y)
y["t"] = 1:50
y_l = pivot_longer(y, all_of(ycols), names_to="variable", values_to="value")
ggplot(y_l, aes(x=t, y=value, group=variable)) +
geom_line(alpha=.5)
We could use a random walk with drift to do a pretty good job of modeling oil prices over time, for example: A general drift upwards over time, but with a variance that increases with time and no stable long-run average trend.
## Stationarity - a bit more formally
A little bit more formally, we say a series is covariance stationary when:
* \(E(x_t)\) is stable (does not depend on \(t\))
* \(var(x_t)\) is stable
* \(cov(x_t, x_{t+h})\) depends only on the distance \(h\), not on the time \(t\)
We have already seen some time series that are not stationary:
* Time series with a trend (the mean is not stable)
* Very persistent time series (like a random walk)
Why do we care about stationarity? Because if we want to learn something about the behaviour of a time series or the interaction of two time series, then those time series should be stable over time.
In a normal regression, we had random sampling – that is, the order of the data should not matter – as a criterion for making correct inference. But in time series we don't have random sampling. By definition, the data is ordered by time.
So instead, as a criterion we have what we call “weak dependence”.
We can write this as:
\(corr(x_t, x_{t+h})\rightarrow0\) as h gets big.
Basically, this says that correlations between data will go to zero as the distance between data points gets large.
Another way of thinking about this is to say that a shock to the system will die out over time and not be permanent.
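To see weak dependence (or its absence) in numbers, here is a small sketch comparing sample autocorrelations at increasing lags for an AR(1) series and for a random walk (the parameter values are arbitrary):

set.seed(42)
n = 1000
ar1 = as.numeric(arima.sim(model = list(ar = 0.5), n = n)) #weakly dependent AR(1) with rho = 0.5
rw = cumsum(rnorm(n)) #a random walk, which is not weakly dependent
acf(ar1, lag.max = 20, plot = FALSE)$acf[c(2, 6, 11, 21)] #correlations at lags 1, 5, 10, 20 die out quickly
acf(rw, lag.max = 20, plot = FALSE)$acf[c(2, 6, 11, 21)] #correlations die out only very slowly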
Let's go back to our simple AR(1) process:
\(y_t = \rho y_{t-1} + e_t\)
Let's set \(\rho=.5\) and simulate:
u=rnorm(100,0,1)
y = (1) #first value
rho = .5 #AR term
for(t in 2:100){
y[t] = rho*y[t-1] + u[t]
}
t=seq(1:100)
plot(t,y, type="l")
Now let's set \(\rho=1\):
y=(1)
rho = 1 #AR term
for(t in 2:100){
y[t] = rho*y[t-1] + u[t]
}
t=seq(1:100)
plot(t,y, type="l")
Then we’ll set \(\rho=1.05\), just over 1:
u=rnorm(100,0,1)
y=(1)
rho = 1.05 #AR term
for(t in 2:100){
y[t] = rho*y[t-1] + u[t]
}
t=seq(1:100)
plot(t,y, type="l")
Which of these three look stable?
Audio: stationary, non-stationary, and explosive AR(1) models:
In general, an AR(1) series is weakly dependent and stationary when \(|\rho|<1\).
If we have a highly persistent time series, we will often call it \(I(1)\): integrated of order one, or a unit-root process.
What we want is an \(I(0)\) process: integrated of order 0, or a weakly dependent process.
A good strategy for transforming an \(I(1)\) series, is to take a difference (this will of course also change our interpretation of the results).
We start by generating a random walk series (non-stationary, I(1)).
u=rnorm(100,0,1)
y = (1) #First value
rho = 1 #AR term
for(t in 2:100){
y[t] = rho*y[t-1] + u[t]
}
t=seq(1:100)
plot(t,y, type="l")
Then we take a first difference using the diff command.
diff_y = diff(y)
plot(t[2:100], diff_y, type="l")
Audio: A simple strategy for dealing with non-stationary data
As an example of what can go wrong with non-stationary data, consider a time series model where you regress one series on another.
As an example, consider the series, \(x_t\) and \(y_t\) which are both random walks:
\[x_t = x_{t-1} + \epsilon_t\]
\[y_t = y_{t-1} + u_t\]
Now, let's simulate both of these series a few times and each time run a regression of one on the other.
n<-50
e<-rnorm(n)
a<-rnorm(n)
#independent random walks:
x<-cumsum(a)
y<-cumsum(e)
#plot
spur_data = data.frame(x=x, y=y, t=1:n)
spur_data_l = spur_data %>% gather(x, y, key="variable", value="value")
ggplot(spur_data_l, aes(t, value, color=variable)) +
geom_line()
summary(lm(y~x, data=spur_data))
##
## Call:
## lm(formula = y ~ x, data = spur_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4245 -1.9124 -0.2481 1.9825 5.4561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3465 0.3946 -11.016 9.71e-15 ***
## x 1.0785 0.1357 7.946 2.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.738 on 48 degrees of freedom
## Multiple R-squared: 0.5681, Adjusted R-squared: 0.5591
## F-statistic: 63.14 on 1 and 48 DF, p-value: 2.658e-10
Make sure to run this simulation a handful of times, and note the estimated coefficients. What do you notice about the results? Keep in mind, that we simulated these two series as completely random and independent of each other.
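Rather than rerunning by hand, here is a small sketch that repeats the simulation many times and records how often the coefficient on x comes out “significant” at the 5 percent level (the number of repetitions is arbitrary):

set.seed(99)
pvals = replicate(500, {
  x = cumsum(rnorm(50)) #two independent random walks
  y = cumsum(rnorm(50))
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(pvals < 0.05) #the rejection rate comes out far above the nominal 5 percent

This inflated rejection rate is exactly what is meant by a spurious regression.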
Audio: Spurious regression
How do we know if our series is non-stationary? As we have seen, we can often look at the series to get an idea. But it would be better to have a more formal way of testing for stationarity.
If we have a series that we think is AR(1), then we could in principle just estimate the following regression:
\(y_t = \rho y_{t-1} + u_t\)
We can simulate some data and try:
#dynlm lets us quickly run some simple dynamic equations
library(dynlm)
u=rnorm(100,0,1)
y = (1) #Start value
rho = .8 #AR term
for(t in 2:100){
y[t] = rho*y[t-1] + u[t]
}
mod1 = dynlm(y ~ lag(y)-1)
summary(mod1)
##
## Time series regression with "numeric" data:
## Start = 1, End = 99
##
## Call:
## dynlm(formula = y ~ lag(y) - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.55107 -0.77254 -0.02362 0.32391 2.39575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lag(y) 0.73048 0.06864 10.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9718 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.5314
## F-statistic: 113.3 on 1 and 98 DF, p-value: < 2.2e-16
We should get back a coefficient on lag(y) that is pretty close to 0.8. But what happens if \(\rho=1\)?
u=rnorm(100,0,1)
y = (1) #first value
rho = 1 #AR term
for(t in 2:100){
y[t] = rho*y[t-1] + u[t]
}
mod2 = dynlm(y ~ lag(y)-1)
summary(mod2)
##
## Time series regression with "numeric" data:
## Start = 1, End = 99
##
## Call:
## dynlm(formula = y ~ lag(y) - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.36659 -0.77481 -0.00411 0.67950 2.51691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lag(y) 0.98921 0.01846 53.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9732 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.967, Adjusted R-squared: 0.9667
## F-statistic: 2873 on 1 and 98 DF, p-value: < 2.2e-16
Often we might get a coefficient that is close to 1, but if you run the code a few times, you will see that we end up getting results that can vary by quite a lot.
The point is that running a regression like this to test for stationarity is problematic: if the series IS non-stationary, then the usual inference from the regression is not valid. In other words, such a regression is only a reliable test for stationarity if we already know the series is stationary!
Instead, we run a similar regression which, together with a distribution of the test statistic, we call the Dickey-Fuller test:
For an AR(1) model: \(y_t = \rho y_{t-1} + u_t\)
Our null hypothesis is that the series is non-stationary:
\(H_0: \rho=1\)
versus
\(H_A: \rho<1\)
Then we subtract \(y_{t-1}\) from both sides:
\(y_t - y_{t-1} = \rho y_{t-1} - y_{t-1} + u_t\)
\(\Delta y_t = (\rho - 1) y_{t-1} + u_t\)
Thereby, if we define \(\theta = \rho-1\), we get the following regression model:
\(\Delta y_t = \theta y_{t-1} + u_t\)
Here the test is:
\(H_0: \theta=0\) \(H_A: \theta<0\)
Under the null hypothesis, the t-statistic on \(\theta\) does not follow the usual t-distribution. If we run this test using the appropriate distribution of the test statistic, called the Dickey-Fuller distribution, then that is the Dickey-Fuller test.
We often have to take into account more dynamics and serial correlation in a series. Then we use what we call the Augmented Dickey-Fuller test.
A differenced series with AR(1) serial correlation (in \(\Delta y_t\)) can be written:
\(\Delta y_t = \theta y_{t-1} + \gamma \Delta y_{t-1} + e_t\)
And generally with p lags:
\(\Delta y_t = \theta y_{t-1} + \gamma_1 \Delta y_{t-1} + ... + \gamma_p \Delta y_{t-p} + e_t\)
and then test \(\theta = 0\) with a Dickey-Fuller test as normal.
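To see what this looks like in practice, here is a minimal sketch that runs the Dickey-Fuller regression by hand with dynlm on a simulated random walk. The critical value of roughly -1.95 (the 5 percent level for the version without a constant) is approximate; in practice you would typically use a packaged test, as shown below.

library(dynlm)
set.seed(7)
y = ts(cumsum(rnorm(200))) #a simulated random walk, so the true theta is 0
df_reg = dynlm(d(y) ~ L(y) - 1) #Delta y_t = theta*y_{t-1} + u_t
summary(df_reg)$coefficients #compare the t-value to Dickey-Fuller critical values (roughly -1.95 at 5 percent here), not to the usual t-distribution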
Below we can see how we can quickly make use of the Dickey-Fuller test:
### Testing for stationarity and transforming in practice.
In R, we can run a Dickey-Fuller test automatically by first installing the package tseries and then using the function adf.test (Augmented Dickey-Fuller test).
install.packages("tseries")
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
Here we test the price for DK1:
#the adf.test function does not accept na values, so we get rid of these first
DK1price = DK["DK1EurMW"]
DK1price = DK1price[!is.na(DK1price)]
adf.test(DK1price)
## Warning in adf.test(DK1price): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: DK1price
## Dickey-Fuller = -17.452, Lag order = 29, p-value = 0.01
## alternative hypothesis: stationary
Here we can see that the p-value is at most 0.01 (the test only reports two decimal places, so p-values below 0.01 are reported as 0.01). Since our null hypothesis is non-stationarity and our alternative hypothesis is stationarity, we can reject the null and treat the series as stationary.
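Had the test not let us reject non-stationarity, a natural next step would have been to difference the series and test again. A quick sketch (output not shown):

adf.test(diff(DK1price))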
When a series is correlated over time, as long as it is stationary, we can still estimate our model without bias. However, if we do not correctly account for the serial correlation, it leads to two problems: OLS is no longer efficient, and, more importantly, we can no longer count on our standard errors and p-values being correct.
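One common remedy, short of modelling the dynamics directly, is to keep the OLS estimates but compute heteroskedasticity- and autocorrelation-consistent (Newey-West) standard errors. A minimal sketch using the sandwich and lmtest packages (which you may need to install), applied to the earlier distributed lag model DK_mod; output not shown:

library(lmtest) #coeftest()
library(sandwich) #NeweyWest() for HAC standard errors
coeftest(DK_mod, vcov. = NeweyWest(DK_mod))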
We run a simple regression of (differenced) price on (differenced) wind power, now with no lags.
From our simple regression we obtain our residuals. We plot our residuals against our “fitted” y-values. That is, for each actual y-value, what does the linear model predict should have been the y-value.
diffmod2 = dynlm(diff(DK1EurMW)~diff(wind_DK1), data=DK)
diffmod2_resid = resid(diffmod2)
diffmod2_fitted = fitted(diffmod2)
#fitted vs. residual plot
plot(diffmod2_fitted, diffmod2_resid)
When we look at a fitted vs. residual plot, we are on the look-out for some pattern, which would indicate that we have not modeled our dynamics appropriately. Unfortunately, there is so much data here, that it is hard to make anything out.
Luckily, we can test for serial correlation in a more formal way. Here we run a regression of the residuals on their own lag (note that the residuals need to be converted to a ts object for dynlm's L() to lag them properly; otherwise the series is simply regressed on itself):
resid_ts = ts(diffmod2_resid)
ar_test = dynlm(resid_ts ~ L(resid_ts))
summary(ar_test)
Running this regression, we find a clearly significant coefficient on the lagged residual - that is, there are dynamics left in the residuals, i.e. serial correlation. This is a simple version of what is referred to as the Breusch-Godfrey test.
By using the package lmtest, we can test for serial correlation with just one line of code:
library(lmtest)
bgtest(diffmod2)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: diffmod2
## LM test = 1964.2, df = 1, p-value < 2.2e-16
We can also test for serial correlation of a series with n lags:
\[e_t = \alpha_1 e_{t-1} + \alpha_2 e_{t-2} + ... + \alpha_n e_{t-n} \]
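The bgtest function can do this directly through its order argument. For example, to test for serial correlation up to 24 lags (a natural window for hourly data, though the choice is arbitrary; output not shown):

bgtest(diffmod2, order = 24)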
In practice the most common way to check for dynamics and serial correlation in both series and residuals is to look at autocorrelation functions (ACF) (also called a correlogram) and partial autocorrelation functions (pACF).
#first remove NA values:
mod2_resids = diffmod2_resid[!is.na(diffmod2_resid)]
mod2_resids = ts(mod2_resids)
acf(mod2_resids)
Audio: Reading a ACF/Correlogram
You can think of an ACF as telling the story of how a “shock” propagates through time: if a shock happens at time t=0, how long and in what pattern does it propagate forward?
We can also look at a pACF, which gives us an indication of the dependence at each lag after controlling for the effects of the shorter lags. Often, the pACF can give us a good understanding of what type of AR structure we want to have in our model.
pacf(mod2_resids)
Now we have taken a quick tour of time series statistics. I hope you have gotten a sense of the most important issues: stationarity and serial correlation. If I had to summarize very quickly, I might say that most of time series analysis can be divided into two questions:
1. Is the series stable over time (stationary)? If not, then making estimates from the limited sample we have will not be particularly informative, since in the next (unseen) sample period the time series could look completely different. This is the essence of stationarity. However, if we have a non-stationary series, we can try to transform it into a stationary series. Here differencing is often a simple and effective strategy.
2. Once we have convinced ourselves that we are dealing with a stationary series, we then want to explore the dynamics (correlations) of the series and try to model them in the best possible way.
It is point 2 that will be our jumping-off point for lab 5. We take a series and try to model its dynamics (correlations) through what is called ARIMA modeling. We can then turn this model around to create a forecast.
This forecast is perhaps not that useful as a point estimate of a future value. Instead we can create scenarios and generate uncertainty estimates that can inform us about risk in the future.
In lab 6, we take this modeling one step further and introduce working with exogenous variables. The application we will be working with is estimating the effect of carbon trading permit prices on power prices, and then using such a model to create scenarios for different future levels of permit prices.
Audio: Summary
Are these series stationary? Use and interpret an appropriate test. Does the stationarity or non-stationarity of these series have some economic interpretation? If so, what does it say about these series?
Estimate the effect of wind power on net exchange in these two areas (by using dynlm for example, or potentially another time series regression package). Are there any lagged effects? Should you include any control variables, such as prices, the other area's wind power, etc.? Explain why or why not. Interpret the results.
Check the residuals for serial correlation (look at ACF and pACF figures). What effect does this have on our results or interpretation of results? If there is serial correlation, we should probably model the dynamics directly in our model, for example by including autoregressive (AR) terms in our regression. Will that affect how we interpret the exogenous coefficients (wind power)? Chapter 9 parts 1-2 could be a useful reference in answering this question.
Throughout this lab, we have assumed that wind power is exogenous. That is, when we include it as an independent variable, we interpret the coefficients as causal. Is this justified? Why? What if we included price as an exogenous variable in the regression in c.? Should we interpret price as exogenous?