Background reading: Ch. 6, ISL

The determinants of price variation in California.

In this lab, let us jump right into the data. As usual, we begin by loading the packages we need.

library(tidyverse)
library(lubridate)

And we again load in the data set of prices in California.
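
If you are starting from a fresh session, a minimal sketch of this loading step might look like the following (the file name pv_data.csv is an assumption; substitute the path to your own copy of the data):

# Assumed file name -- replace with the path to your own copy of the data
pv_df = read_csv("pv_data.csv") %>%
  mutate(date = as_date(date)) # make sure the date column is a proper Date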

Let’s again look at the scatter plot of cost over time:

pv_df %>% ggplot(aes(x=date, y=cost_per_kw)) + 
  geom_point(alpha=.01) +
  ylim(.25e4,1.25e4)
## Warning: Removed 1408 rows containing missing values (geom_point).

In the chart above, in addition to the downward trend, we see quite a lot of variation in price between systems at any given time. Let’s consider a few potential reasons why prices may vary.

In general, we can roughly split the cost factors into two categories. The first is local, short-term factors: geographic differences, differences between contractors, and so on. The second is longer-term trends, like the general downward movement of panel prices over time, which we spent time analyzing in lab 7.

We need to create and alter some variables in order to try to estimate some of these effects.

The variable incentive_amount gives the total incentive/subsidy that the system owner receives. But to compare between systems, we would really like a per-kilowatt number:

pv_df["incentive_per_kw"] = pv_df$incentive_amount/pv_df$nameplate

Here is perhaps a good place for a quick aside. The variable “nameplate” represents nameplate capacity, the theoretical maximum power output of the panels at any given time. Recall the distinction in electricity between power, the output at any instant, measured in watts, kilowatts, megawatts, etc.; and energy, which is measured in watt-hours, kilowatt-hours, etc.
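
For example, a 5 kW system producing at full power for three hours delivers 5 kW × 3 h = 15 kWh of energy.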

Yet the actual power output from the panels will vary substantially. The weather and the position of the sun at any given time will of course be major factors, but so will the angle and direction of the panels, trees and shading in the vicinity, temperature (hotter panels actually tend to be less efficient), supporting equipment like inverters (which convert the DC power from solar panels into AC power), and a host of other factors. In practice, the full nameplate capacity will never be reached.

Solar panels also tend to degrade over time, becoming less efficient. The degree of degradation depends on the quality of the panels, which can be hard to judge at the time of purchase. I wrote an academic article on the subject; if you are interested, you can read the open version here.

Let’s plot incentives against the cost after subsidies:

pv_df %>% ggplot(aes(x=incentive_per_kw, y=log(cost_ex_subsid_per_kw))) +
  geom_point(alpha=.01) +
  geom_smooth(method="lm")
## Warning in log(cost_ex_subsid_per_kw): NaNs produced

## Warning in log(cost_ex_subsid_per_kw): NaNs produced

## Warning in log(cost_ex_subsid_per_kw): NaNs produced
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).

How do we interpret this chart?

What explains the clustered nature of the observations?

Is the upward trend of this chart evidence of cost inflation induced by subsidies, or are there other reasonable explanations?

In the data set I have created the variables county_year_total and zip_year_total. Both give similar information about scale and geography, but zip codes are a finer geographic division.

Let’s focus on aggregating to the county-year level. We aggregate the cost variables up to county-year before charting:

county_data = pv_df %>% group_by(county, year) %>% summarise(
  avg_county_cost = mean(cost_per_kw, na.rm=TRUE),
  avg_county_cost_less_sub = mean(cost_ex_subsid_per_kw, na.rm=TRUE),
  county_year_total = sum(nameplate, na.rm=TRUE), 
  county_lat = mean(latitude, na.rm=TRUE), 
  county_long = mean(longitude, na.rm=TRUE)
)
## `summarise()` has grouped output by 'county'. You can override using the
## `.groups` argument.

Let’s look at just the last year, 2014:

county2014 = county_data %>% filter(year==2014)

county2014 %>% ggplot(aes(x=county_year_total, y=avg_county_cost)) + geom_point()

No obvious trends here. We can also try to map this relationship geographically (again, consider this part optional if you don’t want to sign up for a Google cloud account):

library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
californiaCoord= c(left = -125, bottom = 31.5, right = -113.5, top = 43)
californiaMap = get_stamenmap(californiaCoord, zoom = 6, maptype = "toner-lite")
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
#california=get_map(location = c(lon = -119, lat = 36), zoom=6, maptype="roadmap") 
ggmap(californiaMap) +
  geom_point(aes(x = county_long, y = county_lat, size=county_year_total, color=avg_county_cost), data = county2014) +
  scale_color_continuous(low="blue", high="red")

This map makes it somewhat easier to see where the concentrations of solar power systems are, but it is hard to discern any clear geographic patterns in cost for 2014. Let’s look at an earlier year as well and see if anything changes.

county20102014 = county_data %>% filter(year==2010 | year==2014) 
ggmap(californiaMap) +
  geom_point(aes(x = county_long, y = county_lat, size=county_year_total, color=avg_county_cost), data = county20102014) +
  facet_wrap(~year) +
  scale_color_continuous(low="blue", high="red")

Regression analysis

We can now move on to a more formal analysis, where we wish to estimate the factors that can explain and predict the price variation in our data.

Linear model:

As a baseline, we start with a linear model. We first transform some variables to make the regression results a bit easier to read.

For the county-year and contractor-year totals we divide by 1000, so that coefficients can be interpreted in MW rather than kW. Contractor market share is multiplied by 100 so that it can be interpreted in percentage points.

pv_df$county_year_total_mw = pv_df$county_year_total/1000
pv_df$contractor_year_total_mw = pv_df$contractor_year_total/1000
pv_df$contractor_market_share_perc = pv_df$contractor_market_share*100

For a review of multiple linear regression, see ch. 3.2 of ISL (p. 71).

In our model we will include the following variables:

  • date: date of system completion.

  • county: the county where the system was installed.

  • sector: the sector of the host: residential, business, government, or non-profit.

  • nameplate: the nameplate capacity of the system, in kW.

  • county_year_total_mw: the total amount installed that year in the same county.

  • incentive_per_kw: the incentive amount per kW; this is meant to pick up any indirect, subsidy-induced effect on prices.

  • contractor_year_total_mw: how much the contractor has installed that year. Here we are trying to find out whether there are cost differences associated with the size of the contractor.

  • lease: whether the solar panel system was leased or host-owned.

  • china: whether the panels were produced in China.

An important point about interpreting this regression is that the coefficients are estimated conditional on the date (as well as on the other variables). Thus, we are analyzing the variation in prices at a given point in time. Modeling time correctly becomes crucial for getting good estimates of the other variables.

lm_mod1 = lm(cost_per_kw ~ date + 
  county +
    sector +
    nameplate + 
    county_year_total_mw +
    incentive_per_kw + 
    contractor_year_total_mw +
    lease +
  china,
    data=pv_df)

summary(lm_mod1)
## 
## Call:
## lm(formula = cost_per_kw ~ date + county + sector + nameplate + 
##     county_year_total_mw + incentive_per_kw + contractor_year_total_mw + 
##     lease + china, data = pv_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10097   -870   -219    530  98113 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.508e+04  3.117e+02  80.464  < 2e-16 ***
## date                     -1.247e+00  1.987e-02 -62.764  < 2e-16 ***
## countyAmador             -6.246e+02  1.422e+02  -4.393 1.12e-05 ***
## countyButte              -8.038e+02  6.802e+01 -11.818  < 2e-16 ***
## countyCalaveras          -6.498e+02  1.019e+02  -6.379 1.79e-10 ***
## countyColusa             -6.895e+02  2.055e+02  -3.355 0.000794 ***
## countyContra Costa        2.842e+01  3.703e+01   0.767 0.442796    
## countyEl Dorado          -3.401e+02  5.508e+01  -6.174 6.69e-10 ***
## countyFresno             -4.257e+02  3.642e+01 -11.688  < 2e-16 ***
## countyGlenn              -5.988e+02  1.371e+02  -4.368 1.25e-05 ***
## countyHumboldt           -4.429e+01  1.451e+02  -0.305 0.760213    
## countyImperial           -1.730e+03  7.236e+02  -2.391 0.016799 *  
## countyInyo               -7.940e+02  2.605e+02  -3.048 0.002302 ** 
## countyKern               -2.710e+02  3.770e+01  -7.189 6.56e-13 ***
## countyKings              -6.525e+02  7.219e+01  -9.038  < 2e-16 ***
## countyLake               -2.912e+02  1.187e+02  -2.453 0.014187 *  
## countyLassen             -1.304e+03  5.396e+02  -2.416 0.015712 *  
## countyLos Angeles         3.433e+02  3.529e+01   9.725  < 2e-16 ***
## countyMadera             -5.231e+02  7.083e+01  -7.385 1.54e-13 ***
## countyMarin              -5.182e+02  5.253e+01  -9.865  < 2e-16 ***
## countyMariposa           -7.491e+02  2.237e+02  -3.349 0.000811 ***
## countyMendocino          -1.257e+02  9.955e+01  -1.263 0.206663    
## countyMerced             -7.444e+02  7.933e+01  -9.384  < 2e-16 ***
## countyMono               -7.730e+02  1.747e+02  -4.426 9.63e-06 ***
## countyMonterey           -3.512e+02  7.438e+01  -4.722 2.34e-06 ***
## countyNapa               -1.563e+02  6.392e+01  -2.446 0.014456 *  
## countyNevada             -7.622e+02  7.733e+01  -9.857  < 2e-16 ***
## countyOrange              2.753e+02  3.245e+01   8.483  < 2e-16 ***
## countyPlacer             -3.658e+02  4.323e+01  -8.463  < 2e-16 ***
## countyPlumas             -5.784e+02  1.519e+02  -3.809 0.000140 ***
## countyRiverside           2.606e+02  3.597e+01   7.243 4.42e-13 ***
## countySacramento         -4.581e+02  3.183e+02  -1.439 0.150032    
## countySan Benito         -5.071e+02  1.470e+02  -3.449 0.000563 ***
## countySan Bernardino      2.583e+02  3.431e+01   7.530 5.10e-14 ***
## countySan Diego           1.660e+02  3.256e+01   5.098 3.44e-07 ***
## countySan Francisco       1.242e+03  4.380e+01  28.346  < 2e-16 ***
## countySan Joaquin        -5.243e+02  5.331e+01  -9.836  < 2e-16 ***
## countySan Luis Obispo    -4.125e+02  5.009e+01  -8.235  < 2e-16 ***
## countySan Mateo          -2.941e+01  4.652e+01  -0.632 0.527268    
## countySanta Barbara      -5.090e+02  5.615e+01  -9.064  < 2e-16 ***
## countySanta Clara        -3.261e+02  3.303e+01  -9.871  < 2e-16 ***
## countySanta Cruz         -6.205e+02  5.723e+01 -10.843  < 2e-16 ***
## countyShasta             -6.998e+02  9.990e+01  -7.005 2.49e-12 ***
## countySolano             -1.808e+01  6.149e+01  -0.294 0.768767    
## countySonoma             -3.638e+02  3.914e+01  -9.296  < 2e-16 ***
## countyStanislaus         -8.378e+02  1.084e+02  -7.730 1.09e-14 ***
## countySutter             -5.551e+02  9.587e+01  -5.790 7.06e-09 ***
## countyTehama             -4.439e+02  1.557e+02  -2.851 0.004364 ** 
## countyTrinity             1.630e+03  1.617e+03   1.008 0.313346    
## countyTulare             -3.954e+02  4.452e+01  -8.882  < 2e-16 ***
## countyTuolumne           -5.340e+02  1.442e+02  -3.703 0.000213 ***
## countyVentura            -1.951e+02  3.784e+01  -5.157 2.52e-07 ***
## countyYolo               -1.743e+02  5.900e+01  -2.954 0.003136 ** 
## countyYuba                4.108e+01  1.163e+02   0.353 0.723801    
## sectorGovernment          1.310e+03  9.643e+01  13.580  < 2e-16 ***
## sectorNon-Profit         -6.267e+02  9.469e+01  -6.618 3.65e-11 ***
## sectorResidential         1.851e+02  4.239e+01   4.366 1.27e-05 ***
## nameplate                -1.217e+00  2.113e-01  -5.761 8.40e-09 ***
## county_year_total_mw     -1.958e+01  1.028e+00 -19.050  < 2e-16 ***
## incentive_per_kw          5.610e-01  1.894e-02  29.615  < 2e-16 ***
## contractor_year_total_mw -7.238e-01  6.595e-01  -1.098 0.272374    
## lease                     1.680e+02  1.295e+01  12.975  < 2e-16 ***
## china                    -3.222e+02  1.363e+01 -23.637  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1617 on 104998 degrees of freedom
##   (1562 observations deleted due to missingness)
## Multiple R-squared:  0.3747, Adjusted R-squared:  0.3743 
## F-statistic:  1015 on 62 and 104998 DF,  p-value: < 2.2e-16

Here we see the results of the first linear regression.

Given this regression, see if you can answer the following questions:

  • Is there evidence for scale economies at system level (that is, are bigger systems cheaper per kw?). Interpret the coefficient.

  • Is there evidence of learning or scale effects at the county level?

  • Do leased or host-owned panels tend to cost more? Can you explain the difference?

  • Is there evidence for subsidy induced cost-inflation? Interpret the results.

  • Do systems with Chinese panels tend to be cheaper? What is the magnitude of the effect?

Regularisation

With the addition of the county fixed effects, we end up with quite a lot of regressors in our model. Too many regressors can be problematic if the model ends up “overfitting”. Intuitively, the model ends up fitting the noise in the data, which leads to increased variance in the estimates and poor out-of-sample fit (we saw a clear example of poor out-of-sample fit in the previous lab).

There are a host of tools we can use to avoid over-fitting the model. The simplest might be just to take out regressors that are not significant. Alternatively, we could compare models by way of information criteria (which take into account the “cost” of extra variables), like AIC or BIC.
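
For example, base R’s AIC() and BIC() functions make such comparisons easy. As a sketch, we could compare the full linear model above with a hypothetical reduced model that drops the county fixed effects:

# Hypothetical comparison: full model vs. a version without county fixed effects
# (comparisons assume both models are fit on the same observations)
lm_small = update(lm_mod1, . ~ . - county)
AIC(lm_mod1, lm_small) # lower values indicate a better fit/complexity trade-off
BIC(lm_mod1, lm_small) # BIC penalizes extra parameters more heavily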

But it would be nice to have a method that reduces the number of variables (and increases the degrees of freedom) of a regression as an integral part of the estimation. Ridge regression and the lasso are two such tools; they fall under the heading of “shrinkage” or “regularization”. You can read more in ISL 6.2.

We will use the Lasso in order to regularize our model. We will use the package glmnet in order to estimate the models.

#install.packages("glmnet")
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-6

If we want to estimate a linear regression:

\(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\)

We can define the lasso coefficient estimates as the \(\hat{\beta}^L_j\) that minimize (ISL p. 219):

\(\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j|\)

The first part of this formula is standard: it is just the residual sum of squares that we minimize in ordinary OLS. The innovative part is the second term, which contains the \(\lambda\) parameter. This is a penalty on the absolute size of the coefficients. The idea is that each coefficient is determined by weighing its variable’s contribution to goodness of fit (equivalently, to minimizing the squared error) against its tendency to overfit. How strongly you weight against over-fitting (equivalently, against variance in the estimates) is determined by \(\lambda\).

That means a variable that improves the predictive fit of the model by a large amount will tend to get a coefficient close to its OLS estimate. A variable with little effect on predictive fit, however, will tend to get a coefficient that is close to or exactly zero. (Being able to set a parameter exactly to zero is what differentiates the lasso from ridge regression.)

The glmnet package requires a slightly different form for the data than the standard syntax in lm(). The x and y data need to be in the form of a matrix and vector.

glmnet will also not automatically drop data with NA values. So we need to do that ahead of time.

modelData = pv_df %>% select(cost_per_kw, date, county, sector, nameplate, county_year_total_mw, incentive_per_kw, contractor_year_total_mw, lease, china)

modelData = drop_na(modelData)

Here we use the command model.matrix to create a matrix of the x-variables.

x = model.matrix(cost_per_kw ~ date +
  county +
    sector +
    nameplate + 
    county_year_total_mw +
  contractor_year_total_mw +
    incentive_per_kw +
    lease +
  china,
    data=modelData)[,-1]

# The [,-1] drops the first column, which is the intercept (glmnet adds its own
# intercept). The y-variable is not part of this matrix; we enter it separately:

y = modelData$cost_per_kw

Then we run our lasso model by using the command glmnet. The alpha parameter determines whether a ridge regression (alpha=0) or lasso regression (alpha=1) is run.

Below, I also set lambda equal to 1. There is nothing special about 1; I chose it completely arbitrarily. You can experiment with higher values of lambda and see what happens to the coefficient estimates.

lasso_mod0=glmnet(x,y,alpha=1,lambda=1)
coef(lasso_mod0)
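
Rather than refitting one lambda at a time, you can also let glmnet fit its default sequence of lambda values and plot the coefficient paths, which makes the shrinkage visible at a glance:

lasso_path = glmnet(x, y, alpha=1) # fits a whole grid of lambda values
plot(lasso_path, xvar="lambda", label=TRUE) # each line traces one coefficient shrinking toward zero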

But in general, arbitrarily choosing a lambda value is not a satisfactory way of modeling. Instead, we will use what is called cross-validation to choose an optimal value of lambda.

You can read more details about cross-validation in ISL ch. 5.1. The intuition is that you choose lambda (or another tuning parameter) based on what gives the model the best predictive fit. One way of judging predictive fit is to split the data into a training set and a test set, and then see which parameter values give the best predictions for the test set. Cross-validation refines this idea: the data is split into several folds, each fold is held out in turn while the model is fit on the remaining data, and we choose the parameter values that best predict the held-out observations (cv.glmnet uses 10 folds by default).
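
To make the idea concrete, here is a minimal sketch of the simpler train/test version, using an arbitrary 80/20 split of the data we prepared above:

set.seed(1) # for reproducibility
train = sample(1:nrow(x), size = floor(0.8*nrow(x))) # arbitrary 80/20 split
fit_train = glmnet(x[train,], y[train], alpha=1, lambda=1)
pred_test = predict(fit_train, newx = x[-train,])
mean((y[-train] - pred_test)^2) # test-set mean squared error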

Below we use cross-validation to choose our lambda using the command cv.glmnet()

cv.out=cv.glmnet(x,y,alpha=1)
bestlam=cv.out$lambda.min
print(bestlam)
## [1] 0.6901237
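
cv.glmnet also stores the whole cross-validation curve, so we can inspect how the estimated prediction error varies with lambda; lambda.1se gives a more conservative alternative (the largest lambda whose error is within one standard error of the minimum):

plot(cv.out) # CV error as a function of log(lambda)
cv.out$lambda.1se # a more heavily regularized alternative to lambda.min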

Then we can run our model with the cross-validated lambda:

lasso_mod1=glmnet(x,y,alpha=1,lambda=bestlam)
coef(lasso_mod1)
## 63 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)              25136.5172727
## date                        -1.2529783
## countyAmador              -576.6413909
## countyButte               -766.6758442
## countyCalaveras           -606.9210310
## countyColusa              -631.5018645
## countyContra Costa          50.3737683
## countyEl Dorado           -304.6052057
## countyFresno              -398.0114774
## countyGlenn               -549.7900657
## countyHumboldt               .        
## countyImperial           -1600.0697099
## countyInyo                -727.5364902
## countyKern                -241.0698242
## countyKings               -612.8372731
## countyLake                -246.9395735
## countyLassen             -1200.7322643
## countyLos Angeles          352.0742943
## countyMadera              -484.7288666
## countyMarin               -483.6346709
## countyMariposa            -688.3352874
## countyMendocino            -84.1219484
## countyMerced              -704.0109079
## countyMono                -718.3624199
## countyMonterey            -312.4274373
## countyNapa                -120.2004180
## countyNevada              -723.5395822
## countyOrange               294.2005950
## countyPlacer              -334.1297023
## countyPlumas              -527.5822464
## countyRiverside            269.4625453
## countySacramento          -384.8090327
## countySan Benito          -456.4842647
## countySan Bernardino       273.9024819
## countySan Diego            179.8665897
## countySan Francisco       1264.9972900
## countySan Joaquin         -489.1517514
## countySan Luis Obispo     -378.8373449
## countySan Mateo              .        
## countySanta Barbara       -472.9716883
## countySanta Clara         -301.0445311
## countySanta Cruz          -585.3558350
## countyShasta              -657.4721297
## countySolano                 3.0268459
## countySonoma              -334.5571861
## countyStanislaus          -792.6577737
## countySutter              -513.0212346
## countyTehama              -393.5976187
## countyTrinity             1433.1106285
## countyTulare              -362.7199717
## countyTuolumne            -485.4358361
## countyVentura             -164.3220919
## countyYolo                -138.6882911
## countyYuba                  54.7326753
## sectorGovernment          1289.8823688
## sectorNon-Profit          -617.9891170
## sectorResidential          181.3653802
## nameplate                   -1.1913697
## county_year_total_mw       -18.7269045
## contractor_year_total_mw    -0.6186127
## incentive_per_kw             0.5610079
## lease                      165.7624168
## china                     -321.5498730

Here we see that my initial guess of lambda=1 wasn’t too bad, and the cross-validated model gives similar results. Only a couple of the county fixed effects are set to exactly zero. Comparing with the linear model, we can also see that some of the parameters are “shrunk” towards zero, but the differences aren’t dramatic.
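
One way to see the shrinkage directly is to line up the two sets of estimates side by side. A rough sketch (the lasso and OLS fits use the same regressors, so the coefficient names match up):

# Compare lasso estimates to the OLS estimates from lm_mod1, matched by name
coef_compare = tibble(
  term = rownames(coef(lasso_mod1)),
  lasso = as.numeric(coef(lasso_mod1))
)
coef_compare$ols = coef(lm_mod1)[coef_compare$term]
coef_compare %>% arrange(abs(lasso)) %>% head(10) # the most heavily shrunk terms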

Generalized Additive Model (GAM)

In lab 7, we introduced several ways of estimating non-linear curves, and we used them to estimate the fall in prices of solar systems over time. In the estimations above, we have estimated how various factors affect variation in prices at a given point in time, but we have modeled the path of prices over time as linear. As we saw in lab 7, this may not be an appropriate assumption, and misspecifying the time component of the regression could also lead to biased estimates of the other variables.

What we might want to do is combine a flexible, non-linear estimate of the trend in solar system costs over time with a parametric and interpretable model of the other variables. This is exactly what the class of models called Generalized Additive Models (GAMs) allows us to do.

You can read more about Generalized Additive Models in ISL 7.7. The general idea is that you estimate an equation that looks like:

\(y_i = \beta_0 + f(x_{i1}) + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\)

So basically, you estimate a non-linear (“smoothed”) function of \(x_{i1}\) (like the average trend of costs over time), while the rest of the variables enter linearly (and parametrically) and are interpreted conditional on the function of \(x_{1}\).

In theory, you are not limited to one smooth function: you can fit smooth functions to several variables, and you can even fit a single smooth over two or more variables, creating a non-linear multi-dimensional surface. However, the danger of over-fitting is very much relevant to GAMs, and the computational cost of fitting very complex GAMs can also come into play.
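
For instance, a single smooth surface over longitude and latitude could in principle replace the county dummies; a sketch only, and exactly the kind of richer model where over-fitting and computation time become real concerns:

# Sketch: a 2-dimensional smooth over location instead of county fixed effects
library(mgcv) # also loaded below
pv_df$date_num = as.numeric(pv_df$date) # smooths need numeric inputs, not Dates
gam_spatial = gam(cost_per_kw ~ s(date_num) + s(longitude, latitude) +
    sector + nameplate + lease + china,
    family=gaussian, data=pv_df)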

Here again, remember KISS: Keep It Simple Stupid!

To estimate a GAM in R, we can use the package mgcv:

library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.

Using the function gam, we can quickly fit a GAM. We use s() to indicate which variable we want to estimate non-parametrically. The default smooth in mgcv is a thin plate regression spline, but it is also possible to specify other spline bases and forms of non-linear curves. See ISL 7.8.3.

Here we estimate a model very similar to the linear model we started with, but with the time variable entering as a smoothed term. Instead of using the date variable, I have created a variable time_days, which measures the number of days since January 1st, 2006. (Warning: this might take a few minutes to compute.)

gam_mod1=gam(cost_per_kw ~ s(time_days) + 
    sector +
    nameplate + 
    county_year_total_mw +
     contractor_year_total_mw +
    incentive_per_kw + 
    lease +
    china, 
    family=gaussian,
    data=pv_df)
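
If you want a different spline basis or a larger basis dimension, s() takes bs and k arguments. For example, bs="cr" requests a cubic regression spline (a variant sketch, not run here):

# Variant: cubic regression spline with basis dimension 20 for the time trend
gam_mod1_cr = gam(cost_per_kw ~ s(time_days, bs="cr", k=20) +
    sector + nameplate + county_year_total_mw + contractor_year_total_mw +
    incentive_per_kw + lease + china,
    family=gaussian, data=pv_df)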

We can visualize the smoothed curve over time

plot.gam(gam_mod1, se=TRUE)

Does this seem reasonable?

We can also view the estimated parameters

summary(gam_mod1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cost_per_kw ~ s(time_days) + sector + nameplate + county_year_total_mw + 
##     contractor_year_total_mw + incentive_per_kw + lease + china
## 
## Parametric coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5420.49381   43.62103 124.263  < 2e-16 ***
## sectorGovernment          971.41452   95.93177  10.126  < 2e-16 ***
## sectorNon-Profit         -662.83736   94.28511  -7.030 2.08e-12 ***
## sectorResidential         172.68185   42.17985   4.094 4.24e-05 ***
## nameplate                  -2.39375    0.20822 -11.496  < 2e-16 ***
## county_year_total_mw        3.48261    0.60561   5.751 8.92e-09 ***
## contractor_year_total_mw    2.77259    0.66250   4.185 2.85e-05 ***
## incentive_per_kw            0.89633    0.01791  50.049  < 2e-16 ***
## lease                     128.70390   12.78764  10.065  < 2e-16 ***
## china                    -303.19160   13.63610 -22.234  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(time_days) 8.956  8.999 945.4  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.377   Deviance explained = 37.7%
## GCV = 2.6034e+06  Scale est. = 2.603e+06  n = 105061

Now we’ll use the model to create a prediction based on the average values of the continuous parametric terms, host-owned panels and non-Chinese panels.

First we create the “new” data from which to generate the prediction.

pdat1 = with(pv_df,
list(
time_days = round(seq(min(time_days), max(time_days), length = 200)),
sector = rep("Residential",200),
nameplate = rep(mean(nameplate), 200),
county_year_total_mw = rep(mean(county_year_total_mw), 200),
contractor_year_total_mw = rep(mean(contractor_year_total_mw), 200),
incentive_per_kw = rep(mean(incentive_per_kw), 200),
lease = rep(0, 200),
china = rep(0, 200)
))

Now we use the predict() command (which is actually predict.gam()) to create our prediction.

What is a little different about the output of predict.gam with type = "terms" is that we are given the individual components of the prediction from each of the variables (essentially, each estimated coefficient times the data we entered above).

We then create our overall prediction of cost_per_kw by summing across each row and adding the estimated intercept term.

pred1 = predict(gam_mod1, pdat1, type = "terms", se.fit = TRUE)

pred1_fit = as_tibble(pred1$fit)
pred1_fit["intercept"] = coef(gam_mod1)[1]

pred1_fit = pred1_fit %>% mutate(prediction = rowSums(.))

pred1_fit["days"] = with(pv_df, round(seq(min(time_days), max(time_days), length = 200)))

We also sum up the standard error components and then construct approximate confidence intervals:

pred1_se = as_tibble(pred1$se.fit)
pred1_fit["prediction_se"] = rowSums(pred1_se)
pred1_fit["upper"] = pred1_fit$prediction + 2 * pred1_fit$prediction_se
pred1_fit["lower"] = pred1_fit$prediction - 2 * pred1_fit$prediction_se

Then we can plot the prediction:

ggplot(pred1_fit, aes(x=days, y=prediction, ymin=lower, ymax=upper))+
    geom_ribbon(alpha=.5, fill="grey") +
    geom_line() + 
    labs(x="Days since 1.1.2006", y="Prediction") +
    theme_bw() 

Audio: Using and interpreting predictions from the GAM model.

Summary: What we learned from looking at solar panel data from California

Assignment

1.) Can you create the variable zip_year_total_mw, that is, the total amount of capacity installed in each zip code each year (a finer geographic division than county)? Chart the relationship between total installed capacity in a year and costs at the zip level. Include zip_year_total_mw in the linear model instead of county_year_total_mw. Are the estimated results substantially different?

2.) In lab 7 we used cumulative capacity to model learning curves. In this lab we used days/dates to model change of prices over time. Run regressions (linear and semi-parametric (GAM)) where you use cumulative capacity instead of days/date to model the effects over time. For the semi-parametric model, create a prediction and compare with the results in the lab. Are the results substantially different?

3.) Now create a prediction model from a GAM estimation for Chinese panels over time. In addition, create a new variable that indicates the share (from 0-1) of Chinese panels among installed solar panels per month. Use this series to create a prediction model (so instead of inputting a column of 0s or 1s to indicate Chinese, or non-Chinese panels, you would put in the series representing the share of panels.) Compare the predictive curve of cost over time with the curve representing non-Chinese panels. What does this tell you about the influence of Chinese panels on the average cost of solar panel systems during this time period?

4.) Open ended question: What other questions could you answer with this data set? Show in the form figures, regressions, or other estimations. You could also consider downloading updated data here.

Solution sketch

1.) Can you create the variable zip_year_total_mw, that is, the total amount of capacity installed in each zip code each year (a finer geographic division than county)? Chart the relationship between total installed capacity in a year and costs at the zip level. Include zip_year_total_mw in the linear model instead of county_year_total_mw. Are the estimated results substantially different?

To add an extra variable to the existing data set we use mutate:

pv_df = pv_df %>% group_by(zip, year) %>% mutate(
  zip_year_total_mw = sum(nameplate)/1000
)

If we want the zip-code data as its own data set, we can use summarise. Then we can plot the relationship between installed capacity in a zip code and costs, to see whether any relationship appears.

zip_data = pv_df %>% group_by(zip, year) %>% summarise(
  zip_year_total_mw = sum(nameplate)/1000,
  zip_avg_costs = mean(cost_per_kw, na.rm=TRUE)
)
## `summarise()` has grouped output by 'zip'. You can override using the `.groups`
## argument.
zip_data %>% ggplot(aes(x=zip_year_total_mw, y=zip_avg_costs)) +
  geom_point(alpha=.5) + 
  geom_smooth() +
  facet_wrap(~year, ncol=2, scales="free_y")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

pv_df["incentive_per_kw"] = pv_df$incentive_amount/pv_df$nameplate

pv_df$county_year_total_mw = pv_df$county_year_total/1000
pv_df$contractor_year_total_mw = pv_df$contractor_year_total/1000
pv_df$contractor_market_share_perc = pv_df$contractor_market_share*100

lm_mod1 = lm(cost_per_kw ~ date + 
  county +
    sector +
    nameplate + 
  zip_year_total_mw +
    incentive_per_kw + 
    contractor_year_total_mw +
    lease +
  china,
    data=pv_df)

summary(lm_mod1)
## 
## Call:
## lm(formula = cost_per_kw ~ date + county + sector + nameplate + 
##     zip_year_total_mw + incentive_per_kw + contractor_year_total_mw + 
##     lease + china, data = pv_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10243   -871   -224    540  97972 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.682e+04  2.966e+02  90.435  < 2e-16 ***
## date                     -1.369e+00  1.872e-02 -73.120  < 2e-16 ***
## countyAmador             -5.729e+02  1.424e+02  -4.024 5.72e-05 ***
## countyButte              -7.381e+02  6.803e+01 -10.850  < 2e-16 ***
## countyCalaveras          -5.862e+02  1.020e+02  -5.749 9.00e-09 ***
## countyColusa             -6.135e+02  2.058e+02  -2.981 0.002871 ** 
## countyContra Costa        1.924e+01  3.708e+01   0.519 0.603803    
## countyEl Dorado          -2.470e+02  5.511e+01  -4.481 7.43e-06 ***
## countyFresno             -4.456e+02  3.659e+01 -12.180  < 2e-16 ***
## countyGlenn              -4.903e+02  1.372e+02  -3.575 0.000351 ***
## countyHumboldt           -1.594e+01  1.453e+02  -0.110 0.912679    
## countyImperial           -1.679e+03  7.247e+02  -2.317 0.020492 *  
## countyInyo               -6.770e+02  2.608e+02  -2.596 0.009429 ** 
## countyKern               -2.295e+02  3.796e+01  -6.047 1.48e-09 ***
## countyKings              -4.654e+02  7.284e+01  -6.389 1.68e-10 ***
## countyLake               -2.552e+02  1.189e+02  -2.146 0.031904 *  
## countyLassen             -1.246e+03  5.405e+02  -2.306 0.021096 *  
## countyLos Angeles         3.066e+00  3.032e+01   0.101 0.919431    
## countyMadera             -4.544e+02  7.084e+01  -6.414 1.42e-10 ***
## countyMarin              -4.804e+02  5.257e+01  -9.140  < 2e-16 ***
## countyMariposa           -6.669e+02  2.240e+02  -2.977 0.002907 ** 
## countyMendocino          -8.431e+01  9.968e+01  -0.846 0.397651    
## countyMerced             -6.500e+02  7.928e+01  -8.199 2.45e-16 ***
## countyMono               -6.941e+02  1.749e+02  -3.970 7.21e-05 ***
## countyMonterey           -2.896e+02  7.442e+01  -3.892 9.95e-05 ***
## countyNapa               -9.467e+01  6.399e+01  -1.479 0.139028    
## countyNevada             -6.939e+02  7.736e+01  -8.970  < 2e-16 ***
## countyOrange              1.651e+02  3.200e+01   5.160 2.47e-07 ***
## countyPlacer             -3.154e+02  4.337e+01  -7.272 3.56e-13 ***
## countyPlumas             -4.931e+02  1.520e+02  -3.243 0.001182 ** 
## countyRiverside          -4.427e+01  3.178e+01  -1.393 0.163588    
## countySacramento         -3.907e+02  3.187e+02  -1.226 0.220293    
## countySan Benito         -4.065e+02  1.472e+02  -2.762 0.005738 ** 
## countySan Bernardino      1.054e+02  3.328e+01   3.166 0.001544 ** 
## countySan Diego          -5.057e+01  3.049e+01  -1.658 0.097270 .  
## countySan Francisco       1.286e+03  4.380e+01  29.371  < 2e-16 ***
## countySan Joaquin        -4.603e+02  5.327e+01  -8.641  < 2e-16 ***
## countySan Luis Obispo    -3.632e+02  5.012e+01  -7.247 4.29e-13 ***
## countySan Mateo          -4.488e-01  4.657e+01  -0.010 0.992312    
## countySanta Barbara      -4.386e+02  5.610e+01  -7.817 5.44e-15 ***
## countySanta Clara        -4.019e+02  3.281e+01 -12.249  < 2e-16 ***
## countySanta Cruz         -5.779e+02  5.727e+01 -10.091  < 2e-16 ***
## countyShasta             -6.369e+02  9.999e+01  -6.369 1.90e-10 ***
## countySolano              5.927e+01  6.145e+01   0.964 0.334848    
## countySonoma             -3.571e+02  3.925e+01  -9.098  < 2e-16 ***
## countyStanislaus         -7.068e+02  1.085e+02  -6.514 7.34e-11 ***
## countySutter             -4.665e+02  9.589e+01  -4.864 1.15e-06 ***
## countyTehama             -3.838e+02  1.559e+02  -2.461 0.013840 *  
## countyTrinity             1.617e+03  1.620e+03   0.999 0.317970    
## countyTulare             -3.109e+02  4.469e+01  -6.957 3.49e-12 ***
## countyTuolumne           -4.894e+02  1.444e+02  -3.388 0.000703 ***
## countyVentura            -1.440e+02  3.796e+01  -3.794 0.000148 ***
## countyYolo               -1.116e+02  5.900e+01  -1.891 0.058634 .  
## countyYuba                1.184e+02  1.164e+02   1.018 0.308739    
## sectorGovernment          1.311e+03  9.661e+01  13.573  < 2e-16 ***
## sectorNon-Profit         -6.330e+02  9.484e+01  -6.675 2.48e-11 ***
## sectorResidential         1.809e+02  4.246e+01   4.260 2.05e-05 ***
## nameplate                -1.143e+00  2.127e-01  -5.374 7.71e-08 ***
## zip_year_total_mw        -1.131e+02  1.755e+01  -6.441 1.19e-10 ***
## incentive_per_kw          5.829e-01  1.893e-02  30.788  < 2e-16 ***
## contractor_year_total_mw -3.593e-01  6.602e-01  -0.544 0.586241    
## lease                     1.820e+02  1.296e+01  14.044  < 2e-16 ***
## china                    -3.391e+02  1.362e+01 -24.894  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1619 on 104998 degrees of freedom
##   (1562 observations deleted due to missingness)
## Multiple R-squared:  0.3728, Adjusted R-squared:  0.3724 
## F-statistic:  1006 on 62 and 104998 DF,  p-value: < 2.2e-16

Here we notice that zip_year_total_mw is estimated to be significantly negative and of a much larger magnitude than county_year_total_mw. But considering that zip codes are much smaller, this might not be surprising: a MW of extra capacity within a zip code might indicate much more significant scale or learning effects than a MW of extra capacity in a county.

In this regression I have kept the county fixed effects. You could replace them with zip fixed effects, but that would add many more parameters to the model, so you would probably want to combine it with a lasso in order to exclude zips with no predictive power.
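
A sketch of that approach, using a sparse design matrix so that the many zip dummies stay manageable (this assumes zip is present in pv_df and can be treated as a factor):

# Sketch: zip fixed effects, regularized with the lasso
zipData = pv_df %>% ungroup() %>%
  select(cost_per_kw, date, zip, sector, nameplate, zip_year_total_mw,
    incentive_per_kw, contractor_year_total_mw, lease, china) %>%
  drop_na()
x_zip = sparse.model.matrix(cost_per_kw ~ date + factor(zip) + sector + nameplate +
  zip_year_total_mw + incentive_per_kw + contractor_year_total_mw + lease + china,
  data=zipData)[,-1]
cv_zip = cv.glmnet(x_zip, zipData$cost_per_kw, alpha=1)
coef(cv_zip, s="lambda.min") # many zip effects should be shrunk to exactly zero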

2.) In lab 7 we used cumulative capacity to model learning curves. In this lab we used days/dates to model change of prices over time. Run regressions (linear and semi-parametric (GAM)) where you use cumulative capacity instead of days/date to model the effects over time. For the semi-parametric model, create a prediction and compare with the results in the lab. Are the results substantially different?

I will just show the GAM model, where I now take the log base 2 of the cost variable and of all the continuous variables on the right-hand side, including the cumulative capacity variable.

pv_df = pv_df %>% arrange(date) %>% mutate(
  cum_cap = cumsum(nameplate)
)

#make sure all continuous variables are more than zero (in order to do log transform)
pv_df = pv_df %>% filter(
  cum_cap>0 &
  cost_per_kw>0 &
  nameplate>0 &
  contractor_year_total_mw >0 &
  incentive_per_kw >0
)

pv_df = pv_df %>% mutate(
  log2_cost_per_kw = log2(cost_per_kw),
  log2_cum_cap =  log2(cum_cap),
  log2_nameplate = log2(nameplate),
  log2_county_year_total_mw = log2(county_year_total), # county total in kW; in the log-log model the kW-vs-MW scaling only shifts the intercept
  log2_contractor_year_total_mw = log2(contractor_year_total_mw),
  log2_incentive_per_kw = log2(incentive_per_kw)
)

gam_mod1=gam(log2_cost_per_kw ~ s(log2_cum_cap) + 
    sector +
    log2_nameplate + 
    log2_county_year_total_mw +
    log2_contractor_year_total_mw +
    log2_incentive_per_kw + 
    lease +
    china, 
    family=gaussian,
    data=pv_df)

summary(gam_mod1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log2_cost_per_kw ~ s(log2_cum_cap) + sector + log2_nameplate + 
##     log2_county_year_total_mw + log2_contractor_year_total_mw + 
##     log2_incentive_per_kw + lease + china
## 
## Parametric coefficients:
##                                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   11.4869085  0.0149248  769.654  < 2e-16 ***
## sectorGovernment               0.1066905  0.0175534    6.078 1.22e-09 ***
## sectorNon-Profit              -0.2859152  0.0172414  -16.583  < 2e-16 ***
## sectorResidential             -0.1273664  0.0078176  -16.292  < 2e-16 ***
## log2_nameplate                -0.1314661  0.0012210 -107.672  < 2e-16 ***
## log2_county_year_total_mw     -0.0130877  0.0005932  -22.061  < 2e-16 ***
## log2_contractor_year_total_mw  0.0086935  0.0003729   23.314  < 2e-16 ***
## log2_incentive_per_kw          0.1903649  0.0008610  221.104  < 2e-16 ***
## lease                          0.0284802  0.0025415   11.206  < 2e-16 ***
## china                         -0.0781704  0.0023566  -33.170  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df     F p-value    
## s(log2_cum_cap) 7.917  8.662 12.72  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.494   Deviance explained = 49.4%
## GCV = 0.087942  Scale est. = 0.087927  n = 105052
plot.gam(gam_mod1, se=TRUE)

We read this plot quite a bit differently: an increase of 1 on the x-axis represents a doubling of cumulative capacity. The results seem to indicate a large decrease in costs over roughly the first two doublings, followed by a period of much slower decline.

3.) Now create a prediction model from a GAM estimation for Chinese panels over time. In addition, create a new variable that indicates the share (from 0-1) of Chinese panels among installed solar panels per month. Use this series to create a prediction model (so instead of inputting a column of 0s or 1s to indicate Chinese or non-Chinese panels, you put in the series representing the share of panels). Compare the predictive curve of cost over time with the curve representing non-Chinese panels. What does this tell you about the influence of Chinese panels on solar panel systems?

We will use the model from problem 2.

pdat1 = with(pv_df,
list(
log2_cum_cap = seq(min(log2_cum_cap), max(log2_cum_cap), length = 200),
sector = rep("Residential",200),
log2_nameplate = rep(mean(log2_nameplate), 200),
log2_county_year_total_mw = rep(mean(log2_county_year_total_mw), 200),
log2_contractor_year_total_mw = rep(mean(log2_contractor_year_total_mw), 200),
log2_incentive_per_kw = rep(mean(log2_incentive_per_kw), 200),
lease = rep(0, 200),
china = rep(0, 200)
))
pred1 = predict(gam_mod1, pdat1, type = "terms", se.fit = TRUE)

pred1_fit = as_tibble(pred1$fit)
pred1_fit["intercept"] = coef(gam_mod1)[1]

pred1_fit = pred1_fit %>% mutate(prediction = rowSums(.))

pred1_fit["log2_cum_cap"] = with(pv_df, seq(min(log2_cum_cap), max(log2_cum_cap), length = 200))

predCurve = pred1_fit %>% select(prediction,log2_cum_cap)

Then we can plot the prediction:

ggplot(predCurve, aes(x=log2_cum_cap, y=prediction))+
    geom_line() + 
    labs(x="Doubling of Cumulative Capacity", y="Prediction") +
    theme_bw() 

Now let’s add a prediction with Chinese panels:

pdat2 = with(pv_df,
list(
log2_cum_cap = seq(min(log2_cum_cap), max(log2_cum_cap), length = 200),
sector = rep("Residential",200),
log2_nameplate = rep(mean(log2_nameplate), 200),
log2_county_year_total_mw = rep(mean(log2_county_year_total_mw), 200),
log2_contractor_year_total_mw = rep(mean(log2_contractor_year_total_mw), 200),
log2_incentive_per_kw = rep(mean(log2_incentive_per_kw), 200),
lease = rep(0, 200),
china = rep(1, 200)
))
pred2 = predict(gam_mod1, pdat2, type = "terms", se.fit = TRUE)

pred2_fit = as_tibble(pred2$fit)
pred2_fit["intercept"] = coef(gam_mod1)[1]

pred2_fit = pred2_fit %>% mutate(prediction = rowSums(.))

predCurve["prediction2"] = pred2_fit$prediction

predCurve_long = predCurve %>% pivot_longer(-log2_cum_cap)

predCurve_long %>% ggplot(aes(x=log2_cum_cap, y=value, color=name)) +
  geom_line()

The Chinese indicator enters additively in our model, so it simply shifts the predicted curve down.
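
Since the model is in log2 units, the size of the shift is easy to read off: the china coefficient of about -0.078 corresponds to a factor of 2^(-0.078) ≈ 0.95, so systems with Chinese panels are predicted to be roughly 5% cheaper per kW, all else equal.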

Finally, what if Chinese panels are represented in our prediction not just as a 0 or 1, but as their share of the panels on the market?

We create a categorical variable dividing the data into 200 parts and calculate the Chinese share at each point in time:

pv_df = pv_df %>% arrange(log2_cum_cap)

pv_df["cap_cat"] = cut_number(pv_df$log2_cum_cap, 200)

chinaShare = pv_df %>% group_by(cap_cat) %>% summarise(
  chinaShare = mean(china, na.rm=TRUE),
  log2_cum_cap = mean(log2_cum_cap, na.rm=TRUE)
  )

chinaShare %>% ggplot(aes(x=log2_cum_cap, y=chinaShare)) + geom_line()

pdat3 = with(pv_df,
list(
log2_cum_cap = seq(min(log2_cum_cap), max(log2_cum_cap), length = 200),
sector = rep("Residential",200),
log2_nameplate = rep(mean(log2_nameplate), 200),
log2_county_year_total_mw = rep(mean(log2_county_year_total_mw), 200),
log2_contractor_year_total_mw = rep(mean(log2_contractor_year_total_mw), 200),
log2_incentive_per_kw = rep(mean(log2_incentive_per_kw), 200),
lease = rep(0, 200),
china = chinaShare$chinaShare
))

Now we create the prediction:

pred3 = predict(gam_mod1, pdat3, type = "terms", se.fit = TRUE)

pred3_fit = as_tibble(pred3$fit)
pred3_fit["intercept"] = coef(gam_mod1)[1]

pred3_fit = pred3_fit %>% mutate(prediction = rowSums(.))

predCurve["prediction3"] = pred3_fit$prediction

predCurve_long = predCurve %>% pivot_longer(-log2_cum_cap)

predCurve_long %>% ggplot(aes(x=log2_cum_cap, y=value, color=name)) +
  geom_line()

What we get here is a nice visualisation of the effect of Chinese panels on the overall reduction in costs.