Background reading: Ch. 7, ISL
As recently as 10 years ago, solar power was seen as an expensive and marginal generation technology, best suited for niche applications like off-grid cabins. This situation has changed dramatically and rapidly. Solar power is now the world’s fastest-growing generation technology. In sunny locales, it is often cheaper than traditional coal or even gas generation.
In this lab we will be analysing data on roof-top solar power installations in one of the early large-scale markets for the technology: California. This market was helped in large part by generous state subsidies called, collectively, the California Solar Initiative.
The data set is from the California Solar Initiative and can be found here, but we will use a version that I have cleaned and wrangled. This data set stretches from 2006 to 2015, which was a period of a lot of dynamism in the industry.
A data key for the variables can be found here.
Notably, the Chinese government made a sustained push to make China a leading solar panel producer nation following the economic fallout of the 2008 financial crisis. Around 2012, the first large waves of Chinese panels started hitting the world markets, and led to a period of falling panel prices.
This period was also marked by a lot of activity at the contractor level, with many new actors offering to install solar panels under several competing business models.
In this lab and the next we will analyze these trends, while in the process learning some techniques within statistics and machine learning for estimating non-linear trends and for improving the predictive performance of regression models. These techniques come under the names of splines, local regression (Loess), Lasso and regularization, and generalized additive models.
We’ll start by loading in some packages we will use. We have seen tidyverse and lubridate before. Later in the lab we will load in some new packages as needed.
library(tidyverse)
library(lubridate)
Then we read in the data file directly from my website:
pv_df = read_csv("http://jmaurit.github.io/analytics/labs/data/pv_df.csv")
## Rows: 106623 Columns: 49
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): manufacturer, csi_id, prog_administrator, program, incentive_desi...
## dbl (29): csi_cost, nameplate, csi_rating, cec_ptc_rating, incentive_amount...
## lgl (3): MASH_1A, MASH_1B, MASH_2
## date (2): csi_complete, date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We limit the data to the years after 2006 and before 2015. The California Solar Initiative started in 2007 and the funds were largely exhausted by 2015 (many years earlier than expected).
The subsidies were designed to be stepped down gradually over time as capacity grew. The subsidies were also designed to be proportional to the electricity generated by the solar panels.
pv_df = pv_df %>% filter(year<2015 & year>2006)
We’ll start by charting the cost of systems over time. Notice the alpha parameter in geom_point; you can play with this number to see what it does. Also notice the geom_smooth command, which fits a linear regression line over the data.
pv_df %>% ggplot(aes(x=date, y=cost_per_kw)) +
geom_point(alpha=.01) +
geom_smooth(method="lm") +
ylim(.25e4,1.25e4)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1408 rows containing non-finite values (stat_smooth).
## Warning: Removed 1408 rows containing missing values (geom_point).
A couple of things stick out in this figure.
There is of course the clear downwards trend in cost. But we might also notice that the variance of costs seems to have been reduced over time. Can you explain this?
The other thing you might notice is the strange horizontal “stripes.” Do you have any explanation for these?
We can use dplyr to get a picture of some of the aggregate trends:
capacity = pv_df %>% arrange(desc(date)) %>% group_by(date) %>% summarise(
newCapacity = sum(nameplate)
)
capacity["cumCapacity"] = cumsum(capacity$newCapacity)
ggplot(capacity, aes(x=date, y=cumCapacity)) +
geom_line()
This chart is a little misleading, because it starts from zero. California already had some solar power before 2006. However, most of the current capacity comes from investments made after that point.
Now let’s look at some aggregated trends in cost per month.
First we’ll create a month-year variable to group on.
pv_df["monthdate"] = as.Date(paste(pv_df$year, pv_df$month, "1", sep="-"))
The cost variable cost_ex_subsid_per_kw is poorly named (by me). This is the cost after the California subsidy was applied. We’ll aggregate both cost and cost less subsidy to the month-year average.
cost = pv_df %>% arrange(desc(monthdate)) %>% group_by(monthdate) %>% summarise(
avgCost = mean(cost_per_kw),
avgCost_less_sub = mean(cost_ex_subsid_per_kw)
)
Then we can chart these two together:
cost %>%
  arrange(desc(monthdate)) %>%
  dplyr::select(monthdate, avgCost, avgCost_less_sub) %>%
  pivot_longer(-monthdate, names_to = "variable", values_to = "values") %>%
  ggplot(aes(x=monthdate, y=values, color=variable)) +
  geom_line()
Here we see a fairly flat trend until about 2010, followed by a steep and sustained fall between about 2010 and 2013. This was the period when Chinese panels started entering the market.
The other thing we see clearly here is how the value of the subsidy (per kW) was substantially reduced over time. Most of the increased capacity that came after 2012 was driven by lower market prices.
An interesting exercise is to estimate the aggregate learning curve of solar panel system prices. This is the idea that prices are pushed lower as manufacturers and installers gain experience and increase their scale. It is usually estimated in terms of the price reduction associated with a certain increase in production, most often a doubling of capacity.
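One common way to formalize this idea is the relationship \(cost = A \cdot cap^{b}\): in logs, \(\log_2(cost) = \log_2(A) + b \log_2(cap)\), so a doubling of capacity raises \(\log_2(cap)\) by exactly one and multiplies cost by \(2^{b}\). The learning rate, the proportional cost reduction per doubling, is then \(1 - 2^{b}\).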
First we need to create a new variable representing the cumulative capacity (since 2006).
pv_df = pv_df %>% arrange(date) %>% mutate(
cum_cap = cumsum(nameplate)
)
We can start by plotting the base 2 log of cost with the base 2 log of cumulative capacity. The slope of this relationship can be interpreted as the linear learning curve. (Why?)
pv_df %>% ggplot(aes(x=log2(cum_cap), y=log2(cost_per_kw))) +
geom_point(alpha=.1) +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
Does the estimated line seem like a good fit?
In order to formally estimate the line shown above, we need to make sure there are no 0 values in the data, since we cannot take the log of 0.
zerovalue = pv_df %>% filter(cost_per_kw==0)
zerovalue$csi_cost
## [1] 0 0 0 0 0 0 0 0 0
Here we see a handful of cases (nine in total) where cost_per_kw was reported to be 0. We can assume that these are data errors and delete them from our data set.
pv_df = pv_df %>% filter(cost_per_kw!=0)
Now let’s try estimating the relationship:
pv_df["log2_cum_cap"] = log2(pv_df$cum_cap)
pv_df["log2_cost_per_kw"] = log2(pv_df$cost_per_kw)
learning_mod1 = lm(log2_cost_per_kw~log2_cum_cap, data=pv_df)
summary(learning_mod1)
##
## Call:
## lm(formula = log2_cost_per_kw ~ log2_cum_cap, data = pv_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4678 -0.1843 -0.0452 0.1787 3.7602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4508764 0.0129701 1191 <2e-16 ***
## log2_cum_cap -0.1630599 0.0007215 -226 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3438 on 106612 degrees of freedom
## Multiple R-squared: 0.3239, Adjusted R-squared: 0.3239
## F-statistic: 5.107e+04 on 1 and 106612 DF, p-value: < 2.2e-16
How do we interpret these results?
Audio: Estimating a learning curve
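One way to read the slope, building on the learning-curve formula above, is to convert it into a learning rate: the proportional cost reduction per doubling of cumulative capacity. A quick sketch:

# The slope is the change in log2(cost) per doubling of cumulative capacity.
# Converting it to a learning rate (proportional cost reduction per doubling):
slope = coef(learning_mod1)["log2_cum_cap"]
1 - 2^slope

With the estimated slope of about -0.16, this works out to a cost reduction of roughly 11% for every doubling of installed capacity.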
We can create a prediction from this. First we want to extrapolate the growth of cumulative capacity. We’ll again do this with a simple linear model, this time using the base-2 log of cumulative capacity.
pv_df %>% ggplot(aes(x=monthdate, y=log2_cum_cap)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
cap_mod = lm(log2_cum_cap ~ monthdate, data=pv_df)
Visually, this model looks OK after the initial start-up period.
Are there ways we could improve upon this estimation?
Now we’ll predict capacity 5 years ahead.
new_data = tibble(
monthdate = seq(ymd("2015-01-01"), ymd("2019-12-31"), by="months")
)
new_data["log2_cum_cap"] = predict(cap_mod, newdata=new_data)
pv_df %>% ggplot(aes(x=monthdate, y=cum_cap)) +
geom_line() +
geom_line(aes(x=monthdate, y=2^log2_cum_cap), data=new_data, color="red")
Does this seem like a reasonable prediction for future capacity? Why or why not?
There appears to be a small jump at 2015 - why does the prediction model have this behavior?
Now we (naively?) predict cost based on the capacity prediction:
interval=predict(learning_mod1, newdata = new_data, interval="prediction")
new_data["log2_cost_pred"] = interval[,1]
new_data["lwr"] = interval[,2]
new_data["upr"] = interval[,3]
Above we made the prediction based on our model and our input data (the predicted cumulative capacity). Now we can plot the forecast together with its prediction interval:
new_data %>% ggplot(aes(x=monthdate, y=2^log2_cost_pred)) +
geom_line() +
geom_ribbon(aes(ymin=2^lwr, ymax=2^upr), alpha=.5, fill="grey") +
theme_light()
Does this seem reasonable? Do some googling for what the “system cost” currently is on average for solar power systems in California. Does the forecast appear to have been roughly correct?
We noticed above that even with the log transformation, the linear estimation was not wholly satisfactory. Another option is to estimate a non-linear learning curve. You can read more about several non-linear estimators in ISL ch 7.1-7.6. But here we will begin with the simple polynomial estimator, and then move on to the regression spline.
The first, and perhaps simplest form of a non-linear estimation is a polynomial regression (ISL 7.1). This type of model is still linear in the parameters, but the data is modeled as having a polynomial form.
For example, if we fit a fourth-degree polynomial version of our model, it would be written as
\(\log_2(cost_i) = \alpha + \beta_1 \log_2(cap_i) + \beta_2 \log_2(cap_i)^2 + \beta_3 \log_2(cap_i)^3 + \beta_4 \log_2(cap_i)^4 + \epsilon_i\)
We can estimate this with the following line:
learning_mod2 = lm(log2_cost_per_kw~poly(log2_cum_cap,4), data=pv_df)
summary(learning_mod2)
##
## Call:
## lm(formula = log2_cost_per_kw ~ poly(log2_cum_cap, 4), data = pv_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2901 -0.1898 -0.0258 0.1629 3.7066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.253e+01 9.774e-04 12819.623 < 2e-16 ***
## poly(log2_cum_cap, 4)1 -7.770e+01 3.191e-01 -243.471 < 2e-16 ***
## poly(log2_cum_cap, 4)2 -3.924e+01 3.191e-01 -122.954 < 2e-16 ***
## poly(log2_cum_cap, 4)3 -1.427e+01 3.191e-01 -44.724 < 2e-16 ***
## poly(log2_cum_cap, 4)4 1.304e+00 3.191e-01 4.087 4.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3191 on 106609 degrees of freedom
## Multiple R-squared: 0.4175, Adjusted R-squared: 0.4175
## F-statistic: 1.91e+04 on 4 and 106609 DF, p-value: < 2.2e-16
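A side note: poly() uses orthogonal polynomials by default, so the coefficients printed above do not map one-to-one onto the \(\beta\)s in the equation written earlier, although the fitted values are identical. If you want coefficients on the raw powers, you can pass raw = TRUE:

# raw = TRUE returns coefficients on log2_cum_cap, log2_cum_cap^2, etc. directly;
# the fitted curve is the same as with the default orthogonal polynomials.
learning_mod2_raw = lm(log2_cost_per_kw~poly(log2_cum_cap, 4, raw=TRUE), data=pv_df)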
And we can visualise the fit of the line.
pv_df["fittedmod2"] = fitted(learning_mod2)
pv_df %>% ggplot(aes(x=log2(cum_cap), y=log2(cost_per_kw))) +
geom_point(alpha=.1) +
geom_line(aes(x=log2(cum_cap), y=fittedmod2), color="red")
Does this seem reasonable?
Again, the fit looks reasonable, though we might notice that the fit is worse at the right and left ends. This is common when using polynomials, so you should be careful when using a polynomial model to predict or interpret results at the edges of the range.
We could improve the fit of this model by increasing the number of polynomial terms. But increasing the number of terms above 3 or 4 is usually frowned upon. The reason is that it tends to “overfit” the data, that is, the estimate has high variance, which leads to poor out-of-sample prediction.
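One way to see this trade-off in practice, and this is my own illustration rather than part of the original lab, is to hold out the most recent observations and compare out-of-sample errors across polynomial degrees (the 2013 split date is an arbitrary choice):

# Hold out observations from 2013 onwards as a test set (arbitrary split,
# purely for illustration) and compare out-of-sample RMSE by polynomial degree.
train = pv_df %>% filter(year < 2013)
test = pv_df %>% filter(year >= 2013)

poly_rmse = function(deg){
  mod = lm(log2_cost_per_kw~poly(log2_cum_cap, deg), data=train)
  sqrt(mean((test$log2_cost_per_kw - predict(mod, newdata=test))^2))
}
sapply(1:6, poly_rmse)

Higher-degree polynomials will typically fit the training data better but can do much worse on the held-out period.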
The above points should lead us to be somewhat wary of using this fit to make forecasts of future price reductions. But, let us see what happens.
We will again try to predict cumulative capacity out in time, but this time we will also use a polynomial regression to make this prediction:
pv_df = pv_df %>% mutate(monthdateNum = as.numeric(monthdate))
Here I convert the monthdate variable to its numeric equivalent, so that it can be used in the polynomial regression below.
cap_mod2 = lm(log2_cum_cap ~ poly(monthdateNum,4), data=pv_df)
pv_df["fitted_cap"] = fitted(cap_mod2)
pv_df %>% ggplot(aes(x=monthdate, y=log2_cum_cap)) +
geom_point() +
geom_line(aes(x=monthdate, y=fitted_cap), color="red")
The fitted model looks pretty good. Now let’s create our prediction:
new_data2 = tibble(
monthdate = seq(ymd("2015-01-01"), ymd("2019-12-31"), by="months"),
monthdateNum = as.numeric(monthdate)
)
new_data2["log2_cum_cap"] = predict(cap_mod2, newdata=new_data2)
pv_df %>% ggplot(aes(x=monthdate, y=cum_cap)) +
geom_line(aes(x=monthdate, y=2^fitted_cap), color="green") +
geom_line(aes(x=monthdate, y=cum_cap)) +
geom_line(aes(x=monthdate, y=2^log2_cum_cap), data=new_data2, color="red")
So we actually don’t need to go any further in this direction: using our polynomial function to predict capacity is probably a bad idea. Here we see a major problem with polynomials again: they tend to fit poorly at the edges of the data, which also means that their out-of-sample predictions tend to be poor. In the language of ISL, they have low bias but high variance, and this leads to poor predictive performance.
So instead of using our polynomial to predict capacity, we will use our original linear predictor, and instead use our polynomial model only to predict costs:
new_data2["log2_cum_cap"] = predict(cap_mod, newdata=new_data2) #original linear capacity model to predict cum capacity.
new_data2["log2_cost_pred"] = predict(learning_mod2, newdata = new_data2)
interval=predict(learning_mod2, newdata = new_data2, interval="prediction")
new_data2["lwr"] = interval[,2]
new_data2["upr"] = interval[,3]
new_data2 %>% ggplot(aes(x=monthdate, y=2^log2_cost_pred)) +
geom_line() +
geom_ribbon(aes(ymin=2^lwr, ymax=2^upr), alpha=.5, fill="grey") +
theme_light()
Using the polynomial model, we get predictions of a much more aggressive fall in prices than our linear model.
Which one of the predictions seems more realistic?
What are the strengths and weaknesses of each of the models?
A more modern way of estimating a non-linear curve would be to use splines or local regression. This is covered in ISL, chapter 7.3-7.6, which you can read to get a somewhat more thorough explanation than I give here.
Splines are a subset of a group of methods that have a similar approach.
They first break up the range (the x-values) of a function into parts, as defined by the number of break points, called “knots”.
Then they apply a “basis function”–for example a cubic polynomial regression–to each of those ranges separately.
The ends of each of the parts are tied together (hence the term “knots”) so that the function over the whole range appears continuous.
Finally, the function is appropriately “smoothed” to give the best fit, that is, adjusting how much the function “wiggles” in relation to the variance in the data.
The nice thing about this class of models is that you can easily adjust the trade-off between fit (low bias) and predictive performance (low variance), either by adjusting the number of pieces the function is broken into (the knots) or by adjusting the smoothness level.
However, for making out-of-sample predictions, splines can suffer from some of the same problems as our cubic regression model, as we’ll see.
In order to fit splines in R, we use the splines package:
#install.packages("splines")
library(splines)
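To make the “basis function” idea above a bit more concrete: bs() simply expands a variable into a matrix of spline basis columns, and lm() then treats each column as an ordinary regressor. A quick, purely illustrative look:

# With 4 interior knots and a cubic (degree 3) basis, bs() produces 4 + 3 = 7
# basis columns; lm() estimates one coefficient per column.
example_basis = bs(pv_df$log2_cum_cap, knots=c(7, 10, 13, 16))
dim(example_basis)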
Here we will fit a cubic regression spline, that is, one where the basis function is a cubic polynomial, with knots specified at 7, 10, 13 and 16:
learning_mod3 = lm(log2_cost_per_kw~bs(log2_cum_cap, knots=c(7, 10, 13, 16)), data=pv_df)
fittedmod3 = fitted(learning_mod3)
pv_df %>% ggplot(aes(x=log2(cum_cap), y=log2(cost_per_kw))) +
geom_point(alpha=.1) +
geom_line(aes(x=log2(cum_cap), y=fittedmod3), color="red")
It seems to provide a decent fit. We could adjust it to our liking by increasing or reducing the number of knots, changing the basis function, or fitting a smoothing spline (see ISL 7.5; a brief sketch follows below). But for now, we’ll just stay with this form.
How do we interpret this line? Does this potentially say something important generally about learning curves?
Here we see that the price fall came relatively late in the learning curve. Only once a significant amount of scale had been reached did we begin to see the sustained fall in prices.
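As a brief aside, here is what the smoothing-spline alternative mentioned above might look like. smooth.spline() in base R chooses the smoothness level by (generalized) cross-validation rather than requiring us to pick knots. This is just a sketch and is not used in the rest of the lab:

# Fit a smoothing spline; the smoothing parameter is chosen by cross-validation.
smooth_mod = smooth.spline(pv_df$log2_cum_cap, pv_df$log2_cost_per_kw)

# Plot the smoothed curve on top of the data.
smooth_fit = tibble(log2_cum_cap = smooth_mod$x, fit = smooth_mod$y)
pv_df %>% ggplot(aes(x=log2_cum_cap, y=log2_cost_per_kw)) +
  geom_point(alpha=.1) +
  geom_line(data=smooth_fit, aes(y=fit), color="red")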
Again using our linear prediction of capacity, we make a prediction of the learning curve going out five years:
new_data3 = tibble(
monthdate = seq(ymd("2015-01-01"), ymd("2019-12-31"), by="months")
)
new_data3["log2_cum_cap"] = predict(cap_mod, newdata=new_data3)
new_data3["log2_cost_pred"] = predict(learning_mod3, newdata=new_data3)
## Warning in bs(log2_cum_cap, degree = 3L, knots = c(7, 10, 13, 16),
## Boundary.knots = c(5.20320115631661, : some 'x' values beyond boundary knots may
## cause ill-conditioned bases
interval=predict(learning_mod3, newdata = new_data3, interval="prediction")
## Warning in bs(log2_cum_cap, degree = 3L, knots = c(7, 10, 13, 16),
## Boundary.knots = c(5.20320115631661, : some 'x' values beyond boundary knots may
## cause ill-conditioned bases
new_data3["lwr"] = interval[,2]
new_data3["upr"] = interval[,3]
Notice the warning message that pops up. It is basically saying that since we are asking for a prediction outside of the original range we estimated over, the predictions could be, and this is not a technical term, weird.
new_data3 %>% ggplot(aes(x=monthdate, y=2^log2_cost_pred)) +
geom_line()
See? Weird.
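One possible remedy, which I add here as an aside rather than as part of the lab, is a natural cubic spline, ns() from the splines package. It is constrained to be linear beyond the boundary knots, so its out-of-sample extrapolation is a straight line rather than an exploding polynomial:

# A natural cubic spline is linear beyond the boundary knots, so its
# extrapolation beyond the sample is a straight line.
learning_mod4 = lm(log2_cost_per_kw~ns(log2_cum_cap, knots=c(7, 10, 13, 16)), data=pv_df)
new_data3["log2_cost_pred_ns"] = predict(learning_mod4, newdata=new_data3)

new_data3 %>% ggplot(aes(x=monthdate, y=2^log2_cost_pred_ns)) +
  geom_line()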
1.) Estimate separate learning curves for pre-2012 and post-2012 (doing log-linear estimation is fine). Can you do this with a single regression? What would be the advantages and disadvantages of doing so?
2.) Estimate the relationship between cumulative capacity and solar power costs with a local linear regression, or LOESS (see sections 7.6 and 7.8.2 in ISL). How is local linear regression similar to and different from splines? Use local linear regression to create a point forecast of costs from 2015 to 2020. Does this suffer from the same problems as the spline? (Hint: in order to create out-of-sample forecasts with the loess() command, you will need to add the parameter control=loess.control(surface="direct") to the command.)
3.) Download an updated data set from the California Solar Initiative. You want to use the data set called the CSI Working Data Set. Use this data to test how accurate the predictions from the models we ran above in the lab were. You may want to refer to ISL ch. 2.2 and ch. 5.1.1. Can you improve on the goodness-of-fit of the model?
4.) Open-ended question. Find another (energy) data set where it would be of interest to apply a non-linear model. Try at least two different types of models (from among 7.1-7.6). What are the advantages of each? Disadvantages? Can you use the model to make a prediction or forecast? Show how and interpret.