Linear Algebra Application: Least Squares Regression¶

Learning goals¶

  • Understand how a least squares regression is computed
  • Understand how to formulate a least squares model in matrix form
  • Be able to manually compute the solution to a least squares regression problem
  • Be able to interpret and critique a least squares regression model

Literature¶

  • PNM Ch. 16

  • QuantEcon lecture 69

One of the major applications of Linear Algebra techniques is in data analysis. Basically every non-trivial form of regression or machine learning algorithm relies, to some degree, on the basics of matrix algebra.

Ordinary Least Squares (OLS) regression is one of the most common forms of regression and also the one with the closest connection to pure matrix algebra.

Often the actual matrix algebra of OLS is done within a pre-packaged function. In this lab we will work through the actual steps of setting up and computing the OLS coefficients and variance-covariance matrix.

Model with 1 exogenous regressor¶

We start with the simplist case where we have one dependent variable ($y$) and one explanatory variables ($x$)

$$y_i = \beta_0 + \beta_1 x_i + \epsilon$$

Our goal is to estimate $\beta_0$ and $\beta_1$: we will denote estimates with a "hat": $\hat{\beta_0}$, $\hat{\beta_1}$.

As the name of the methodology implies, we will estimate the betas by minimizing the squared residuals. Thus if we define the residual as:

$$e_i = y_i-(\beta_0 + \beta_1 x_i)$$

Then our objective function is:

$$ \min_{\beta_0, \beta_1} \sum_i^n e_i^2 $$

Which we can rewrite

$$ \min_{\beta_0, \beta_1} S = \sum_i^n (y_i-\beta_0 - \beta_1 x_i)^2 $$

We then take the derivative of this function (using the chain rule) with respect to $\beta_0$ and $\beta_1$ to get our first order conditions:

$$\frac{\delta S}{\delta \beta_0} = -2 \sum_i (y_i-\beta_0 - \beta_1 x_i) = 0$$$$\frac{\delta S}{\delta \beta_1} = -2 \sum_i x_i(y_i-\beta_0 - \beta_1 x_i) = 0$$

From these two conditions, we can then find the $\hat{\beta_0}$ and $\hat{\beta_1}$ that minimizes our function:

Starting with the first condition, which we can rewrite

$$\sum_i^n y_i = \sum_i^n \hat{\beta_0} + \sum_i^n \hat{\beta_1} x_i $$$$\sum_i^n y_i = n*\hat{\beta_0} + \hat{\beta_1} \sum_i^n x_i$$

And then from the second first-order condition we can write:

$$\sum_i^n x_i y_i = \hat{\beta_0} \sum_i^n x_i + \hat{\beta_1} \sum_{i} ^n x_i^2$$

Solving for $\hat{\beta_0}$ and $\hat{\beta_1}$ we end up with

$$\hat{\beta_0} = \frac{(\sum_i^n x_i^2)(\sum_i^n y_i)-(\sum_i^n x_i)(\sum_i^n x_i y_i)}{n \sum x_i^2 - (\sum x_i)^2}$$

and

$$\hat{\beta_1} = \frac{n \sum_i^n x_i y_i - (\sum_i^n x_i)(\sum_i^n y_i)}{n \sum_i^n x_i^2 - (\sum_i^n x_i)^2}$$

Exercise:¶

Show the steps between the FOC's and the two equations shown above

First defining variance and covariance, we can rewrite the above equations for our least squares estimates as:

$$ var(x) = E(x-E(x))^2 => \frac{1}{n-1} \sum_i^n (x_i - \bar{x})^2 $$
$$cov(x, y) = E(x-E(x)(y-E(y))) = > \frac{1}{n-1} \sum_i^n (x_i-\bar{x})(y_i-\bar{y})$$
$$\beta_1 = \frac{cov(x,y)}{var(x)}$$$$\beta_0 = \bar{y} - \beta_1*\bar{x}$$

We can now manually calculate out an OLS regression with one explanatory variable. As illustration we use the example from QuantEcon, which looks at the correlation between economic outcomes, as represented by log GDP and political institutions that protect property rights, represented by an index that measures a countries political protections against expropriation

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as spt

We will use the package Pandas to read in our data and store the data set in a data frame. If you are new to Pandas, this lab from my applied statistics course might be helpful.

In [ ]:
df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')
df1.head()
Out[ ]:
shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a cons00a extmort4 logem4 loghjypl baseco
0 AFG 0.000000 1.0 NaN NaN 1.0 2.0 1.0 1.0 93.699997 4.540098 NaN NaN
1 AGO 8.000000 1.0 5.363636 7.770645 3.0 3.0 0.0 1.0 280.000000 5.634789 -3.411248 1.0
2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN NaN NaN NaN NaN NaN
3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 3.0 68.900002 4.232656 -0.872274 1.0
4 ARM 0.000000 0.0 NaN 7.682482 NaN NaN NaN NaN NaN NaN NaN NaN

There are a fair amount of missing data in our variables of interest, so we will filter out those first

In [ ]:
df1 = df1.dropna(subset=['logpgp95', 'avexpr']).copy()

We can do a quick plot

In [ ]:
df1.plot.scatter(x="avexpr", y="logpgp95")
Out[ ]:
<Axes: xlabel='avexpr', ylabel='logpgp95'>

Now we'll pull out our x and y variables and transform them into np arrays:

In [ ]:
y = np.array(df1.logpgp95)
x = np.array(df1.avexpr)

We can estimate variance and covariance

In [ ]:
var_x = np.mean(x-np.mean(x)**2)
var_y = np.mean(x-np.mean(x)**2)

cov_xy = np.mean((x-np.mean(x))*(y-np.mean(y)))

#or using built in functions

var_x = np.var(x)
var_y = np.var(y)

cov_xy = np.cov([x,y])

#notice the cov functions also gives us our variances
cov_xy 

cov_xy = cov_xy[0,1]
cov_xy
Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
Out[ ]:
1.5103186257654644

We can then estimate our coefficients

In [ ]:
beta_1 = cov_xy/var_x
beta_0 = np.mean(y) - beta_1*np.mean(x)
print("beta_0: ", beta_0, "beta_1: ", beta_1 )
beta_0:  4.591243539073464 beta_1:  0.5367065373944262

We can then create our fitted values, y_hat, and plot that together with our data

In [ ]:
y_hat = beta_0 + beta_1*x
In [ ]:
fig, ax = plt.subplots()
ax.scatter(x=x, y=y)
ax.plot(x, y_hat, color="red")
ax.set_xlabel("Protection against appropriation")
ax.set_ylabel("Log GDP")
Out[ ]:
Text(0, 0.5, 'Log GDP')

OLS in matrix form¶

Now we consider a case where we have 3 exogenous variables, thus we have a model we could write as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_p x_{i3} + \epsilon_i$$

This we can write in matrix form

$$\begin{bmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & x_{21} & x_{22} & x_{23}\\ . & . & . & . \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_1 \\ ... \\ \epsilon_n \end{bmatrix} $$

Which we can write compactly as

$$\mathbf{y} = \mathbf{X\beta + \epsilon}$$

We can write our solution as:

$$\mathbf{\hat{\beta}} = \mathbf{(X^TX)^{-1}X^Ty}$$

Exercise¶

With a model with only 1 explanatory variable confirm, using matrix multiplication, that the matrix formulation of the solution for the $\beta$ is equivalent to the solution we described earlier

Now we'll compute our beta's using our matrix algebra approach. We will download a data set again looking at the relationship between economic outcomes (GDP) and protection against expropriation index (avexpr). We also include latitude (lat_abst) to controll for the effects of climate. We will also include indicator variables for asia and africa.

In [ ]:
df2 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable2.dta?raw=true')
df2
Out[ ]:
shortnam africa lat_abst avexpr logpgp95 other asia loghjypl baseco
0 AFG 0.0 0.366667 NaN NaN 0.0 1.0 NaN NaN
1 AGO 1.0 0.136667 5.363636 7.770645 0.0 0.0 -3.411248 1.0
2 ARE 0.0 0.266667 7.181818 9.804219 0.0 1.0 NaN NaN
3 ARG 0.0 0.377778 6.386364 9.133459 0.0 0.0 -0.872274 1.0
4 ARM 0.0 0.444444 NaN 7.682482 0.0 1.0 NaN NaN
... ... ... ... ... ... ... ... ... ...
158 YUG 0.0 0.488889 6.318182 NaN 0.0 0.0 -1.203973 NaN
159 ZAF 1.0 0.322222 6.863636 8.885994 0.0 0.0 -1.386294 1.0
160 ZAR 1.0 0.000000 3.500000 6.866933 0.0 0.0 -3.411248 1.0
161 ZMB 1.0 0.166667 6.636364 6.813445 0.0 0.0 -2.975930 NaN
162 ZWE 1.0 0.222222 6.000000 7.696213 0.0 0.0 -2.733368 NaN

163 rows × 9 columns

We will again drop rows with missing data. We will also add a column with just 1's - this will represent the estimation of the constant term, $\beta_0$

In [ ]:
df2 = df2.dropna(subset=['logpgp95', 'avexpr', 'lat_abst', 'asia', 'africa']).copy()

df2["const"] = 1

y = np.array(df2.loc[:,'logpgp95'])
X = np.array(df2.loc[:,['const', 'avexpr', 'lat_abst', 'asia', 'africa']])
X
Out[ ]:
array([[ 1.        ,  5.36363649,  0.13666667,  0.        ,  1.        ],
       [ 1.        ,  7.18181801,  0.26666668,  1.        ,  0.        ],
       [ 1.        ,  6.38636351,  0.37777779,  0.        ,  0.        ],
       [ 1.        ,  9.31818199,  0.30000001,  0.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.52444446,  0.        ,  0.        ],
       [ 1.        ,  9.68181801,  0.56111109,  0.        ,  0.        ],
       [ 1.        ,  4.4545455 ,  0.14444445,  0.        ,  1.        ],
       [ 1.        ,  5.13636351,  0.26666668,  1.        ,  0.        ],
       [ 1.        ,  8.909091  ,  0.47777778,  0.        ,  0.        ],
       [ 1.        ,  8.        ,  0.2888889 ,  1.        ,  0.        ],
       [ 1.        ,  7.5       ,  0.26833332,  0.        ,  0.        ],
       [ 1.        ,  5.63636351,  0.18888889,  0.        ,  0.        ],
       [ 1.        ,  7.909091  ,  0.11111111,  0.        ,  0.        ],
       [ 1.        ,  7.72727251,  0.24444444,  0.        ,  1.        ],
       [ 1.        ,  9.72727299,  0.66666669,  0.        ,  0.        ],
       [ 1.        , 10.        ,  0.52222222,  0.        ,  0.        ],
       [ 1.        ,  7.81818199,  0.33333334,  0.        ,  0.        ],
       [ 1.        ,  7.77272749,  0.3888889 ,  1.        ,  0.        ],
       [ 1.        ,  7.        ,  0.08888889,  0.        ,  1.        ],
       [ 1.        ,  6.4545455 ,  0.06666667,  0.        ,  1.        ],
       [ 1.        ,  4.68181801,  0.01111111,  0.        ,  1.        ],
       [ 1.        ,  7.31818199,  0.04444445,  0.        ,  0.        ],
       [ 1.        ,  7.0454545 ,  0.11111111,  0.        ,  0.        ],
       [ 1.        ,  8.86363602,  0.54944444,  0.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.62222224,  0.        ,  0.        ],
       [ 1.        ,  6.18181801,  0.21111111,  0.        ,  0.        ],
       [ 1.        ,  6.5       ,  0.31111112,  0.        ,  1.        ],
       [ 1.        ,  6.5454545 ,  0.02222222,  0.        ,  0.        ],
       [ 1.        ,  6.77272749,  0.30000001,  0.        ,  1.        ],
       [ 1.        ,  9.68181801,  0.44444445,  0.        ,  0.        ],
       [ 1.        ,  5.72727251,  0.08888889,  0.        ,  1.        ],
       [ 1.        ,  9.72727299,  0.71111113,  0.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.51111114,  0.        ,  0.        ],
       [ 1.        ,  7.81818199,  0.01111111,  0.        ,  1.        ],
       [ 1.        ,  9.77272701,  0.60000002,  0.        ,  0.        ],
       [ 1.        ,  6.27272749,  0.08888889,  0.        ,  1.        ],
       [ 1.        ,  6.5454545 ,  0.12222222,  0.        ,  1.        ],
       [ 1.        ,  8.27272701,  0.14755556,  0.        ,  1.        ],
       [ 1.        ,  7.81818199,  0.43333334,  0.        ,  0.        ],
       [ 1.        ,  5.13636351,  0.17      ,  0.        ,  0.        ],
       [ 1.        ,  5.88636351,  0.05555556,  0.        ,  0.        ],
       [ 1.        ,  8.13636398,  0.24611111,  1.        ,  0.        ],
       [ 1.        ,  5.31818199,  0.16666667,  0.        ,  0.        ],
       [ 1.        ,  3.72727275,  0.21111111,  0.        ,  0.        ],
       [ 1.        ,  9.        ,  0.52222222,  0.        ,  0.        ],
       [ 1.        ,  7.590909  ,  0.0556    ,  1.        ,  0.        ],
       [ 1.        ,  8.27272701,  0.22220001,  1.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.58888888,  0.        ,  0.        ],
       [ 1.        ,  4.77272749,  0.35555556,  1.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.72222221,  0.        ,  0.        ],
       [ 1.        ,  8.54545498,  0.34777778,  1.        ,  0.        ],
       [ 1.        ,  9.45454502,  0.47222221,  0.        ,  0.        ],
       [ 1.        ,  7.090909  ,  0.20166667,  0.        ,  0.        ],
       [ 1.        ,  6.77272749,  0.34444445,  1.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.40000001,  1.        ,  0.        ],
       [ 1.        ,  6.0454545 ,  0.01111111,  0.        ,  1.        ],
       [ 1.        ,  8.63636398,  0.41111112,  1.        ,  0.        ],
       [ 1.        ,  7.0454545 ,  0.32555553,  1.        ,  0.        ],
       [ 1.        ,  6.0454545 ,  0.07777778,  1.        ,  0.        ],
       [ 1.        , 10.        ,  0.54944444,  0.        ,  0.        ],
       [ 1.        ,  7.090909  ,  0.35555556,  0.        ,  1.        ],
       [ 1.        ,  4.4545455 ,  0.22222222,  0.        ,  1.        ],
       [ 1.        ,  7.5       ,  0.25555557,  0.        ,  0.        ],
       [ 1.        ,  4.        ,  0.18888889,  0.        ,  1.        ],
       [ 1.        ,  7.22727251,  0.39444444,  0.        ,  0.        ],
       [ 1.        ,  7.36363649,  0.51111114,  1.        ,  0.        ],
       [ 1.        ,  6.5       ,  0.20166667,  0.        ,  1.        ],
       [ 1.        ,  6.81818199,  0.14777778,  0.        ,  1.        ],
       [ 1.        ,  7.9545455 ,  0.02555555,  1.        ,  0.        ],
       [ 1.        ,  5.        ,  0.17777778,  0.        ,  1.        ],
       [ 1.        ,  5.5454545 ,  0.11111111,  0.        ,  1.        ],
       [ 1.        ,  5.22727251,  0.14444445,  0.        ,  0.        ],
       [ 1.        , 10.        ,  0.58111107,  0.        ,  0.        ],
       [ 1.        ,  9.95454502,  0.68888891,  0.        ,  0.        ],
       [ 1.        ,  9.72727299,  0.45555556,  0.        ,  0.        ],
       [ 1.        ,  7.13636351,  0.23333333,  1.        ,  0.        ],
       [ 1.        ,  6.0454545 ,  0.33333334,  1.        ,  0.        ],
       [ 1.        ,  5.909091  ,  0.1       ,  0.        ,  0.        ],
       [ 1.        ,  5.77272749,  0.11111111,  0.        ,  0.        ],
       [ 1.        ,  5.4545455 ,  0.14444445,  1.        ,  0.        ],
       [ 1.        ,  7.68181801,  0.5777778 ,  0.        ,  0.        ],
       [ 1.        ,  9.18181801,  0.43666667,  0.        ,  0.        ],
       [ 1.        ,  6.9545455 ,  0.25555557,  0.        ,  0.        ],
       [ 1.        ,  7.72727251,  0.28111109,  1.        ,  0.        ],
       [ 1.        ,  7.27272749,  0.51111114,  0.        ,  0.        ],
       [ 1.        ,  8.45454502,  0.66666669,  0.        ,  0.        ],
       [ 1.        ,  7.590909  ,  0.27777779,  1.        ,  0.        ],
       [ 1.        ,  4.        ,  0.16666667,  0.        ,  1.        ],
       [ 1.        ,  6.        ,  0.15555556,  0.        ,  1.        ],
       [ 1.        ,  9.31818199,  0.01355556,  1.        ,  0.        ],
       [ 1.        ,  5.81818199,  0.09222222,  0.        ,  1.        ],
       [ 1.        ,  5.        ,  0.15000001,  0.        ,  0.        ],
       [ 1.        ,  4.68181801,  0.04444445,  0.        ,  0.        ],
       [ 1.        ,  9.5       ,  0.68888891,  0.        ,  0.        ],
       [ 1.        ,  5.81818199,  0.3888889 ,  1.        ,  0.        ],
       [ 1.        ,  6.909091  ,  0.08888889,  0.        ,  1.        ],
       [ 1.        ,  7.63636351,  0.16666667,  1.        ,  0.        ],
       [ 1.        ,  7.4545455 ,  0.12222222,  0.        ,  0.        ],
       [ 1.        ,  6.4545455 ,  0.37777779,  0.        ,  1.        ],
       [ 1.        ,  7.4545455 ,  0.43333334,  1.        ,  0.        ],
       [ 1.        ,  6.63636351,  0.06666667,  0.        ,  1.        ],
       [ 1.        ,  4.4545455 ,  0.01111111,  0.        ,  1.        ],
       [ 1.        ,  7.        ,  0.36666667,  0.        ,  0.        ],
       [ 1.        , 10.        ,  0.42222223,  0.        ,  0.        ],
       [ 1.        ,  7.13636351,  0.08888889,  0.        ,  0.        ],
       [ 1.        ,  6.409091  ,  0.17777778,  1.        ,  0.        ],
       [ 1.        ,  6.36363649,  0.16666667,  1.        ,  0.        ],
       [ 1.        ,  6.86363649,  0.32222223,  0.        ,  1.        ],
       [ 1.        ,  3.5       ,  0.        ,  0.        ,  1.        ],
       [ 1.        ,  6.63636351,  0.16666667,  0.        ,  1.        ],
       [ 1.        ,  6.        ,  0.22222222,  0.        ,  1.        ]])

Now we will do our matrix algebra calculations to get our estimated betas. Recall our formula in matrix algebra format:

$$\mathbf{\hat{\beta}} = \mathbf{(X^TX)^{-1}X^Ty}$$

We do the calculations in python:

In [ ]:
# Direct least square regression
betas = np.dot((np.dot(np.linalg.inv(np.dot(X.T,X)),X.T)),y)
print(betas)
Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
[ 5.84003086  0.39413419  0.31198749 -0.16974205 -0.92969933]

Let us try to plot some of our results

In [ ]:
y_hat2 = np.dot(X[:,0:3], betas[0:3])
y_hat2
Out[ ]:
array([7.9966617 , 8.75382759, 8.47497704, 9.60624127, 9.83750188,
       9.83102605, 7.64078443, 7.94764403, 9.50046896, 9.08323414,
       8.87975396, 8.12044543, 8.99193935, 8.9618768 , 9.88187343,
       9.94429961, 9.02543956, 9.02485702, 8.62670245, 8.40478712,
       7.68876196, 8.73824274, 8.65155068, 9.5049127 , 9.86800732,
       8.34236075, 8.49896591, 8.42675136, 8.60299061, 9.79462751,
       8.12507702, 9.89573954, 9.83334205, 8.92491025, 9.87898925,
       8.34005948, 8.45795011, 9.14663095, 9.05663831, 7.91748523,
       8.17738064, 9.12363372, 7.98810616, 7.37494053, 9.55016541,
       8.84921417, 9.16991908, 9.85760773, 7.83205486, 9.89920606,
       9.3165892 , 9.71371777, 8.69771805, 8.61685672, 9.79867676,
       8.22621773, 9.37217875, 8.71845465, 8.2470169 , 9.9527926 ,
       8.74572946, 7.66505012, 8.87576746, 7.47549861, 8.81160782,
       8.90175208, 8.46482061, 8.57341435, 8.98316226, 7.86616628,
       8.06034938, 7.94534257, 9.96267219, 9.97838217, 9.8160094 ,
       8.72551283, 8.32674704, 8.20020443, 8.14992544, 8.03491862,
       9.04795746, 9.59513385, 8.66078519, 8.97331633, 8.86592174,
       9.38024782, 8.91853087, 7.46856556, 8.25336742, 9.51687418,
       8.16194752, 7.85749996, 7.69916155, 9.79923043, 8.25450381,
       8.5908721 , 8.90178076, 8.81625395, 8.5018499 , 8.91331673,
       8.47644782, 7.59918609, 8.71336564, 9.91310086, 8.68044797,
       8.42153723, 8.40015552, 8.64575401, 7.21950055, 8.50764657,
       8.27416658])

Then I create fitted values of some of the regression lines - notice that the latitute variable is fixed at its mean value:

In [ ]:
 
In [ ]:
y_hat2 = betas[0] + betas[1]*X[:,1] + betas[2]*np.mean(X[:,2])
y_asia = betas[0] + betas[1]*X[:,1] + betas[2]*np.mean(X[:,2]) +  betas[3]
y_africa = betas[0] + betas[1]*X[:,1] + betas[2]*np.mean(X[:,2]) +  betas[4]

fig, ax = plt.subplots()
ax.scatter(x=X[:,1], y=y)
ax.plot(X[:,1], y_hat2, color="red", label="ROW")
#ax.plot(x, y_hat2, color="orange")
ax.plot(X[:,1], y_asia, color="yellow", label="Asia")
ax.plot(X[:,1], y_africa, color="purple", label="Africa")
ax.set_xlabel("Protection again appropriation")
ax.set_ylabel("Log GDP")
ax.legend()
Out[ ]:
<matplotlib.legend.Legend at 0x7fc36c323df0>

Above we visualized a few of the regressions. The red line represents the original regression with no controlling variable.

The yellow line represents the relationship between gdp and institional variable, keeping the effects of lattitude constant. We see that the slope is now smaller, indicating some of the effect that we placed on the institutional index actually comes from

Since we have dummy variables for Africa and Asia, the yellow line represents the regression line for "other" areas.

The purple line represents the addition of the (negative) coefficient on the Africa dummy. The fact that this line is parallel to the yellow line is an assumption in the model.

If we put in an interaction effect with the institutional variable we could also allow for the slope to be different if we believe that this is likely.

Uncertainty in the model¶

An important part of a regression model is of course also to estimate the uncertainty in both the model $\sigma^2$ and in the coefficient estimates (the standard errors (SE))

If we use the assumption that our variance is constant in our data ("homoskedasticity"), then we can calculate our covariance-variance matrix as below:

$$E((\hat{\beta}-\beta)(\hat{\beta}-\beta) | \mathbf{X}) = \sigma^2 (X^TX)^{-1}$$

Exercise:¶

Using matrix algebra, show how if we only have one exogenous regressor, how you write the formula for the standard error of $\beta_0$ and $\beta_1$ in summation format.

First we need an estimate of our $\sigma^2$, which we get by taking the variance of our residuals

In [ ]:
fitted = np.dot(X,betas)
resid = y-fitted
sigmasq_hat = np.var(resid)
sigmasq_hat
Out[ ]:
0.3731718257041534

To get the individual standard errors, we would then look at the diagnol of this matrix

Then we are ready to estimate our variance-covariance matrix

In [ ]:
vcov = sigmasq_hat*np.linalg.inv(np.dot(X.T,X))
vcov
Out[ ]:
array([[ 0.10959253, -0.01437491,  0.03069742, -0.01341976, -0.02785089],
       [-0.01437491,  0.00241174, -0.01270975,  0.00024173,  0.00180609],
       [ 0.03069742, -0.01270975,  0.18828835,  0.01165603,  0.01716112],
       [-0.01341976,  0.00024173,  0.01165603,  0.02239995,  0.01020106],
       [-0.02785089,  0.00180609,  0.01716112,  0.01020106,  0.02605804]])
In [ ]:
np.sqrt(np.diag(vcov))
Out[ ]:
array([0.33104763, 0.04910942, 0.43392205, 0.14966614, 0.16142502])

For reference, here are our estimated betas

In [ ]:
betas
Out[ ]:
array([ 5.84003086,  0.39413419,  0.31198749, -0.16974205, -0.92969933])

Recalling our statistics courses, we could then construct a rough 95 \% confidence interval for our estimate on the institutional parameter of our regression we would get:

$$CI_{95}= .39 +/- 2*.049$$

Exercise:¶

The above estimated variance-covariance matrix is only valid under some strict assumptions, including that we have constant variance. Looking at our data, this may not be the case. One way of dealing with such heteroskedasticity, or non-constant variance, is to adjust our standard error with what is called a White "Sandwich" Estimator:

$$E((\hat{\beta}-\beta)(\hat{\beta}-\beta) | \mathbf{X}) =(X^TX)^{-1}((X^T \hat{\Omega} X)^{-1})(X^TX)^{-1}$$

Where

$$ \hat{\Omega} = \begin{bmatrix} \hat{\epsilon_1}^2 & 0 & ... & 0 \\ 0 & \hat{\epsilon_2}^2 & ... & 0 \\ ... & ... & ... & ... \\ 0 & 0 & ... & \hat{\epsilon_n}^2 \end{bmatrix}$$

(Where $\hat{\epsilon_n}$ represents the nth residual)

Estimate the white "sandwhich" variance-covariance matrix of the above regression

The easy but less interesting way of doing it¶

There are of course easier ways of estimating an OLS regression. We can for example use the statsmodels package:

In [ ]:
import statsmodels.formula.api as smf

sm_mod1 = smf.ols(formula = "logpgp95 ~ avexpr + lat_abst + africa + asia", data=df2).fit()
sm_mod1.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: logpgp95 R-squared: 0.713
Model: OLS Adj. R-squared: 0.703
Method: Least Squares F-statistic: 65.97
Date: Thu, 13 Oct 2022 Prob (F-statistic): 6.64e-28
Time: 14:24:45 Log-Likelihood: -102.79
No. Observations: 111 AIC: 215.6
Df Residuals: 106 BIC: 229.1
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5.8400 0.339 17.239 0.000 5.168 6.512
avexpr 0.3941 0.050 7.843 0.000 0.295 0.494
lat_abst 0.3120 0.444 0.703 0.484 -0.568 1.192
africa -0.9297 0.165 -5.628 0.000 -1.257 -0.602
asia -0.1697 0.153 -1.108 0.270 -0.473 0.134
Omnibus: 4.293 Durbin-Watson: 1.869
Prob(Omnibus): 0.117 Jarque-Bera (JB): 3.909
Skew: -0.456 Prob(JB): 0.142
Kurtosis: 3.109 Cond. No. 57.8


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Comparing the coefficients and standard errors we see we get the same results

But don't you feel better knowing how these numbers are actually calculated?

Explaining multi-colinearity (again)¶

In an earlier lab we looked at the relationship between multi-colinearity and determinants. Let us go over this again with a little more analysis.

Let's try to show the problem through simulation

We will generate a variable x1 from a normal random distribution. Then we will create a y variable that is created to be dependent on x1. We will also create an x2 variable that is simply a linear function of x1:

In [ ]:
x1 = spt.norm.rvs(2, 1, 100)
y = 5 + .8*x1 + spt.norm.rvs(0,1, 100)
x2 = -1 + .5*x1

We create a matrice of exogenous variables, including a column of 1s to represent the intercept term.

In [ ]:
X = np.array([np.ones(100), x1, x2])
X = X.T
X
Out[ ]:
array([[ 1.00000000e+00,  2.24964626e+00,  1.24823131e-01],
       [ 1.00000000e+00,  3.81456561e+00,  9.07282805e-01],
       [ 1.00000000e+00,  5.17403849e-01, -7.41298075e-01],
       [ 1.00000000e+00,  1.67431089e+00, -1.62844554e-01],
       [ 1.00000000e+00,  1.95091599e+00, -2.45420073e-02],
       [ 1.00000000e+00,  2.35926971e+00,  1.79634856e-01],
       [ 1.00000000e+00,  1.65340017e+00, -1.73299914e-01],
       [ 1.00000000e+00,  7.36555608e-01, -6.31722196e-01],
       [ 1.00000000e+00,  1.45347024e+00, -2.73264879e-01],
       [ 1.00000000e+00,  2.13130482e+00,  6.56524088e-02],
       [ 1.00000000e+00,  3.18125251e+00,  5.90626253e-01],
       [ 1.00000000e+00,  1.54054118e+00, -2.29729410e-01],
       [ 1.00000000e+00,  1.93583553e+00, -3.20822365e-02],
       [ 1.00000000e+00,  1.34317053e+00, -3.28414733e-01],
       [ 1.00000000e+00,  1.68213799e+00, -1.58931007e-01],
       [ 1.00000000e+00,  2.73227427e+00,  3.66137135e-01],
       [ 1.00000000e+00,  2.45485352e+00,  2.27426761e-01],
       [ 1.00000000e+00,  1.81812924e+00, -9.09353790e-02],
       [ 1.00000000e+00,  1.85095984e+00, -7.45200787e-02],
       [ 1.00000000e+00,  1.86706914e+00, -6.64654323e-02],
       [ 1.00000000e+00,  1.05679630e+00, -4.71601850e-01],
       [ 1.00000000e+00,  3.76551342e+00,  8.82756709e-01],
       [ 1.00000000e+00,  3.12831519e+00,  5.64157596e-01],
       [ 1.00000000e+00,  2.40769067e+00,  2.03845336e-01],
       [ 1.00000000e+00,  3.93988055e+00,  9.69940275e-01],
       [ 1.00000000e+00,  1.28011591e+00, -3.59942045e-01],
       [ 1.00000000e+00,  2.84802480e+00,  4.24012400e-01],
       [ 1.00000000e+00,  3.20400322e+00,  6.02001608e-01],
       [ 1.00000000e+00,  2.06730639e+00,  3.36531973e-02],
       [ 1.00000000e+00,  1.99254546e+00, -3.72727053e-03],
       [ 1.00000000e+00,  2.76913052e+00,  3.84565262e-01],
       [ 1.00000000e+00,  1.57794858e+00, -2.11025710e-01],
       [ 1.00000000e+00,  1.93260433e+00, -3.36978374e-02],
       [ 1.00000000e+00,  1.98356581e+00, -8.21709601e-03],
       [ 1.00000000e+00,  2.90446724e+00,  4.52233618e-01],
       [ 1.00000000e+00,  3.59677291e+00,  7.98386453e-01],
       [ 1.00000000e+00,  1.00167889e+00, -4.99160554e-01],
       [ 1.00000000e+00,  1.69688641e+00, -1.51556795e-01],
       [ 1.00000000e+00,  1.43470876e+00, -2.82645618e-01],
       [ 1.00000000e+00,  2.20500013e-01, -8.89749993e-01],
       [ 1.00000000e+00,  2.56282500e+00,  2.81412500e-01],
       [ 1.00000000e+00,  1.23594604e+00, -3.82026980e-01],
       [ 1.00000000e+00,  6.17424761e-01, -6.91287619e-01],
       [ 1.00000000e+00,  2.73810502e+00,  3.69052508e-01],
       [ 1.00000000e+00,  2.74579727e+00,  3.72898636e-01],
       [ 1.00000000e+00,  3.02555638e+00,  5.12778191e-01],
       [ 1.00000000e+00,  1.89628290e+00, -5.18585486e-02],
       [ 1.00000000e+00,  1.68030493e+00, -1.59847537e-01],
       [ 1.00000000e+00,  1.11459486e+00, -4.42702570e-01],
       [ 1.00000000e+00,  2.07450003e+00,  3.72500131e-02],
       [ 1.00000000e+00,  2.35755300e+00,  1.78776499e-01],
       [ 1.00000000e+00,  4.02177173e+00,  1.01088586e+00],
       [ 1.00000000e+00,  2.07828310e+00,  3.91415504e-02],
       [ 1.00000000e+00,  2.74733801e+00,  3.73669006e-01],
       [ 1.00000000e+00,  1.94558233e+00, -2.72088348e-02],
       [ 1.00000000e+00,  1.90050180e+00, -4.97491000e-02],
       [ 1.00000000e+00,  1.67819002e+00, -1.60904989e-01],
       [ 1.00000000e+00,  4.58211520e-01, -7.70894240e-01],
       [ 1.00000000e+00,  4.01781179e+00,  1.00890590e+00],
       [ 1.00000000e+00,  3.93879475e+00,  9.69397377e-01],
       [ 1.00000000e+00,  1.08592138e+00, -4.57039310e-01],
       [ 1.00000000e+00,  1.38425388e+00, -3.07873060e-01],
       [ 1.00000000e+00,  5.41010222e-01, -7.29494889e-01],
       [ 1.00000000e+00,  2.59532318e+00,  2.97661588e-01],
       [ 1.00000000e+00,  3.49813957e+00,  7.49069784e-01],
       [ 1.00000000e+00,  2.39118159e+00,  1.95590795e-01],
       [ 1.00000000e+00,  2.90932588e+00,  4.54662939e-01],
       [ 1.00000000e+00,  2.34337239e+00,  1.71686194e-01],
       [ 1.00000000e+00,  1.01692573e+00, -4.91537136e-01],
       [ 1.00000000e+00,  3.49265880e-01, -8.25367060e-01],
       [ 1.00000000e+00,  1.94903574e+00, -2.54821292e-02],
       [ 1.00000000e+00, -1.03993561e+00, -1.51996780e+00],
       [ 1.00000000e+00,  4.65642653e-01, -7.67178674e-01],
       [ 1.00000000e+00,  9.89880968e-01, -5.05059516e-01],
       [ 1.00000000e+00,  2.15551741e+00,  7.77587053e-02],
       [ 1.00000000e+00,  1.81800904e+00, -9.09954775e-02],
       [ 1.00000000e+00,  5.78740105e-01, -7.10629947e-01],
       [ 1.00000000e+00,  1.30734770e+00, -3.46326151e-01],
       [ 1.00000000e+00,  3.18507972e+00,  5.92539858e-01],
       [ 1.00000000e+00,  2.50197831e+00,  2.50989156e-01],
       [ 1.00000000e+00, -1.45167784e-01, -1.07258389e+00],
       [ 1.00000000e+00, -2.42449095e-01, -1.12122455e+00],
       [ 1.00000000e+00,  2.87677921e+00,  4.38389607e-01],
       [ 1.00000000e+00,  3.98254834e+00,  9.91274171e-01],
       [ 1.00000000e+00,  2.35161575e+00,  1.75807873e-01],
       [ 1.00000000e+00,  2.46438420e+00,  2.32192102e-01],
       [ 1.00000000e+00,  7.70736933e-01, -6.14631533e-01],
       [ 1.00000000e+00,  1.90955118e+00, -4.52244102e-02],
       [ 1.00000000e+00,  1.29065910e+00, -3.54670452e-01],
       [ 1.00000000e+00,  1.26041117e+00, -3.69794415e-01],
       [ 1.00000000e+00,  2.26250133e-01, -8.86874933e-01],
       [ 1.00000000e+00,  2.33058399e+00,  1.65291996e-01],
       [ 1.00000000e+00,  1.70372000e+00, -1.48140001e-01],
       [ 1.00000000e+00,  1.21666149e+00, -3.91669255e-01],
       [ 1.00000000e+00,  2.20075770e+00,  1.00378848e-01],
       [ 1.00000000e+00,  1.07282043e+00, -4.63589787e-01],
       [ 1.00000000e+00,  2.53076536e+00,  2.65382681e-01],
       [ 1.00000000e+00,  1.84841750e+00, -7.57912502e-02],
       [ 1.00000000e+00,  3.51701107e+00,  7.58505537e-01],
       [ 1.00000000e+00,  4.02116854e+00,  1.01058427e+00]])

Recall our formula for our OLS coefficients:

$$\mathbf{\hat{\beta}} = \mathbf{(X^TX)^{-1}X^Ty}$$

And let's especially take a look at the term $\mathbf{(X^TX)^{-1}}$

Here we need to find the inverse of the matrix multiplication of $\mathbf{X}$ times its transpose.

Recall that in order for an inverse to exist, then the determinant must be non-zero. let's take a look:

In [ ]:
XTX = np.dot(X.T,X)
XTX
Out[ ]:
array([[100.        , 197.80413845,  -1.09793077],
       [197.80413845, 499.93708703,  52.16440506],
       [ -1.09793077,  52.16440506,  27.1801333 ]])
In [ ]:
np.linalg.det(XTX)
Out[ ]:
-4.246897825393734e-10

Because of the way that a computer makes calculations (called floating point arithmetic), you can not get exact numbers. But if we look at the determinant above, we notice that it gets extremly close to zero. As a practical matter, the determinant is zero and the matrix $\mathbf{X^TX}$ is singular. That is to say, that an inverse does not exist.

Let's see what happens if we try to calculate the inverse anyways:

In [ ]:
np.linalg.inv(np.dot(X.T,X))
Out[ ]:
array([[-2.55886342e+13,  1.27943171e+13, -2.55886342e+13],
       [ 1.27943171e+13, -6.39715856e+12,  1.27943171e+13],
       [-2.55886342e+13,  1.27943171e+13, -2.55886342e+13]])

Here we see that the calculated inverse ends up exploding to huge numbers. To try to add some intuition to what is happening, the term $\mathbf{X^TX}$ is used in calcuating the variance and covariance of the exogenous variables. This in term is used to calculate both coefficients and standard errors. The result is coefficients that are highly unreliable and very large standard errors.

We can caculate the coefficients below, putting the calculation in a function:

In [ ]:
def calcCoefs(X, y):
    betas = np.dot((np.dot(np.linalg.inv(np.dot(X.T,X)),X.T)),y)
    return(betas)

calcCoefs(X,y)
Out[ ]:
array([ 0.23564321,  2.4245653 , -3.51814661])

We can also try to calculate the standard errors.

In [ ]:
def calcSE(X, y):
    betas = np.dot((np.dot(np.linalg.inv(np.dot(X.T,X)),X.T)),y)
    fitted = np.dot(X,betas)
    resid = y-fitted
    sigmasq_hat = np.var(resid)
    vcov = sigmasq_hat*np.linalg.inv(np.dot(X.T,X))
    return(np.sqrt(np.diag(vcov)))
   
calcSE(X, y)     
/var/folders/fs/kr83xdyx2jq_8gww_xfmtxfr0000gn/T/ipykernel_37476/1324029103.py:7: RuntimeWarning: invalid value encountered in sqrt
  return(np.sqrt(np.diag(vcov)))
Out[ ]:
array([nan, nan, nan])

Here we see that we end up problems calculating the standard deviation.

Now let's do a simple change to our data generation. We add some random noise to the x2 variable. Now x1 and x2 are highly correlated, but there is no longer perfect multicolinearity

In [ ]:
x1 = spt.norm.rvs(2, 1, 100)
x2 = -1 + .5*x1 + spt.norm.rvs(0,1, 100)
y = 5 + .8*x1 + spt.norm.rvs(0,1, 100)

X = np.array([np.ones(100), x1, x2])
X = X.T
In [ ]:
calcCoefs(X, y)
Out[ ]:
array([ 4.81708036,  0.90988719, -0.055699  ])
In [ ]:
calcSE(X,y)
Out[ ]:
array([0.24051707, 0.11395535, 0.09371561])

Now we manage to estimate a coefficient and standard error on the x1 variable that are reasonable.

Assignment¶

  • Find another data set with an interesting question you could explore with an OLS regression.

  • Manually (with matrix algebra) calculate your beta coefficients as well the var-covariance matrix.

  • Construct 95% confidence intervalls for your beta coefficients. Are the coefficients statistically significant at a 95% level

  • Interpret your results, including with a visualisation

  • Show a plot of the fitted values versus the residuals. What does this tell you about the model?

  • Validate your results by using statsmodels

  • Describe (potential) problems with the regressions. Recall from your previous statistics courses that simple OLS is ubiased, consistent and efficent subject to several assumptions: Correct specification, strict exogeneity, X variables are linearly independent, homoscedasticity, and normality. Are these reasonable assumptions in your case? (The assumptions are placed in rough order of importance.)

In [ ]: