One of the major applications of Linear Algebra techniques is in data analysis. Basically every non-trivial form of regression or machine learning algorithm relies, to some degree, on the basics of matrix algebra.
Ordinary Least Squares (OLS) regression is one of the most common forms of regression and also the one with the closest connection to pure matrix algebra.
Often the actual matrix algebra of OLS is done within a pre-packaged function. In this lab we will work through the actual steps of setting up and computing the OLS coefficients and variance-covariance matrix.
We start with the simplest case, where we have one dependent variable ($y$) and one explanatory variable ($x$):
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
Our goal is to estimate $\beta_0$ and $\beta_1$: we will denote estimates with a "hat": $\hat{\beta_0}$, $\hat{\beta_1}$.
As the name of the methodology implies, we will estimate the betas by minimizing the squared residuals. Thus if we define the residual as:
$$e_i = y_i-(\beta_0 + \beta_1 x_i)$$
Then our objective function is:
$$ \min_{\beta_0, \beta_1} \sum_i^n e_i^2 $$
which we can rewrite as:
$$ \min_{\beta_0, \beta_1} S = \sum_i^n (y_i-\beta_0 - \beta_1 x_i)^2 $$
We then take the derivative of this function (using the chain rule) with respect to $\beta_0$ and $\beta_1$ to get our first-order conditions:
$$\frac{\partial S}{\partial \beta_0} = -2 \sum_i (y_i-\beta_0 - \beta_1 x_i) = 0$$
$$\frac{\partial S}{\partial \beta_1} = -2 \sum_i x_i(y_i-\beta_0 - \beta_1 x_i) = 0$$
From these two conditions, we can then find the $\hat{\beta_0}$ and $\hat{\beta_1}$ that minimize our function.
Starting with the first condition, which we can rewrite as:
$$\sum_i^n y_i = \sum_i^n \hat{\beta_0} + \sum_i^n \hat{\beta_1} x_i$$
$$\sum_i^n y_i = n\hat{\beta_0} + \hat{\beta_1} \sum_i^n x_i$$
And then from the second first-order condition we can write:
$$\sum_i^n x_i y_i = \hat{\beta_0} \sum_i^n x_i + \hat{\beta_1} \sum_{i}^n x_i^2$$
Solving for $\hat{\beta_0}$ and $\hat{\beta_1}$ we end up with:
$$\hat{\beta_0} = \frac{(\sum_i^n x_i^2)(\sum_i^n y_i)-(\sum_i^n x_i)(\sum_i^n x_i y_i)}{n \sum_i^n x_i^2 - (\sum_i^n x_i)^2}$$
and
$$\hat{\beta_1} = \frac{n \sum_i^n x_i y_i - (\sum_i^n x_i)(\sum_i^n y_i)}{n \sum_i^n x_i^2 - (\sum_i^n x_i)^2}$$
Show the steps between the FOCs and the two equations shown above.
Defining the variance of $x$ as $\text{Var}(x) = \frac{1}{n}\sum_i^n (x_i - \bar{x})^2$ and the covariance of $x$ and $y$ as $\text{Cov}(x,y) = \frac{1}{n}\sum_i^n (x_i - \bar{x})(y_i - \bar{y})$, we can rewrite the above equations for our least squares estimates as:
$$\hat{\beta_1} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}, \qquad \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$
We can now manually calculate an OLS regression with one explanatory variable. As an illustration we use the example from QuantEcon, which looks at the relationship between economic outcomes, represented by log GDP, and political institutions that protect property rights, represented by an index that measures a country's political protection against expropriation.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as spt
We will use the package Pandas to read in our data and store the data set in a data frame. If you are new to Pandas, this lab from my applied statistics course might be helpful.
df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')
df1.head()
| | shortnam | euro1900 | excolony | avexpr | logpgp95 | cons1 | cons90 | democ00a | cons00a | extmort4 | logem4 | loghjypl | baseco |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | 0.000000 | 1.0 | NaN | NaN | 1.0 | 2.0 | 1.0 | 1.0 | 93.699997 | 4.540098 | NaN | NaN |
1 | AGO | 8.000000 | 1.0 | 5.363636 | 7.770645 | 3.0 | 3.0 | 0.0 | 1.0 | 280.000000 | 5.634789 | -3.411248 | 1.0 |
2 | ARE | 0.000000 | 1.0 | 7.181818 | 9.804219 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | ARG | 60.000004 | 1.0 | 6.386364 | 9.133459 | 1.0 | 6.0 | 3.0 | 3.0 | 68.900002 | 4.232656 | -0.872274 | 1.0 |
4 | ARM | 0.000000 | 0.0 | NaN | 7.682482 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
There is a fair amount of missing data in our variables of interest, so we will filter out those rows first.
df1 = df1.dropna(subset=['logpgp95', 'avexpr']).copy()
We can do a quick plot
df1.plot.scatter(x="avexpr", y="logpgp95")
<Axes: xlabel='avexpr', ylabel='logpgp95'>
Now we'll pull out our x and y variables and transform them into np arrays:
y = np.array(df1.logpgp95)
x = np.array(df1.avexpr)
We can estimate the variance and covariance manually:
var_x = np.mean((x-np.mean(x))**2)
var_y = np.mean((y-np.mean(y))**2)
cov_xy = np.mean((x-np.mean(x))*(y-np.mean(y)))
#or using built in functions
var_x = np.var(x)
var_y = np.var(y)
cov_xy = np.cov([x,y])
#notice the cov function also gives us our variances (on the diagonal)
cov_xy
cov_xy = cov_xy[0,1]
cov_xy
1.5103186257654644
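Note that np.cov divides by $n-1$ by default, while np.var divides by $n$. If we want the divisors to be exactly consistent with our manual formulas, we could pass ddof=0 to np.cov (a small sketch, using the same x and y as above):
#np.cov with ddof=0 divides by n, matching np.var's default
cov_xy_pop = np.cov(x, y, ddof=0)[0,1]
cov_xy_pop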
We can then estimate our coefficients
beta_1 = cov_xy/var_x
beta_0 = np.mean(y) - beta_1*np.mean(x)
print("beta_0: ", beta_0, "beta_1: ", beta_1 )
beta_0: 4.591243539073464 beta_1: 0.5367065373944262
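As a quick cross-check (a sketch, not part of the manual derivation), scipy.stats, which we imported as spt, can fit the same simple regression; its estimates should be very close to ours, with any small difference coming from the divisor issue noted above:
#compare against scipy's built-in simple linear regression
res = spt.linregress(x, y)
print("intercept: ", res.intercept, "slope: ", res.slope)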
We can then create our fitted values, y_hat, and plot that together with our data
y_hat = beta_0 + beta_1*x
fig, ax = plt.subplots()
ax.scatter(x=x, y=y)
ax.plot(x, y_hat, color="red")
ax.set_xlabel("Protection against appropriation")
ax.set_ylabel("Log GDP")
Text(0, 0.5, 'Log GDP')
Now we consider a case where we have 3 exogenous variables, thus we have a model we could write as:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$$
This we can write in matrix form:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & x_{21} & x_{22} & x_{23}\\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Which we can write compactly as:
$$\mathbf{y} = \mathbf{X\beta + \epsilon}$$
We can write our solution as:
$$\mathbf{\hat{\beta}} = \mathbf{(X^TX)^{-1}X^Ty}$$
For a model with only one explanatory variable, confirm, using matrix multiplication, that this matrix formulation of the solution for $\hat{\beta}$ is equivalent to the solution we described earlier.
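To see where this formula comes from (a sketch of the general derivation; the exercise above asks you to verify the one-variable case), we again minimize the sum of squared residuals, now written in matrix form:
$$S(\beta) = (\mathbf{y}-\mathbf{X\beta})^T(\mathbf{y}-\mathbf{X\beta})$$
Differentiating with respect to $\beta$ and setting the result to zero gives the normal equations:
$$-2\mathbf{X}^T(\mathbf{y}-\mathbf{X\hat{\beta}}) = 0 \quad\Rightarrow\quad \mathbf{X}^T\mathbf{X}\hat{\beta} = \mathbf{X}^T\mathbf{y}$$
Provided $\mathbf{X}^T\mathbf{X}$ is invertible, multiplying both sides by $(\mathbf{X}^T\mathbf{X})^{-1}$ yields the formula above.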
Now we'll compute our betas using our matrix algebra approach. We will download a data set again looking at the relationship between economic outcomes (GDP) and the protection against expropriation index (avexpr). We also include latitude (lat_abst) to control for the effects of climate, as well as indicator variables for Asia and Africa.
df2 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable2.dta?raw=true')
df2
| | shortnam | africa | lat_abst | avexpr | logpgp95 | other | asia | loghjypl | baseco |
---|---|---|---|---|---|---|---|---|---|
0 | AFG | 0.0 | 0.366667 | NaN | NaN | 0.0 | 1.0 | NaN | NaN |
1 | AGO | 1.0 | 0.136667 | 5.363636 | 7.770645 | 0.0 | 0.0 | -3.411248 | 1.0 |
2 | ARE | 0.0 | 0.266667 | 7.181818 | 9.804219 | 0.0 | 1.0 | NaN | NaN |
3 | ARG | 0.0 | 0.377778 | 6.386364 | 9.133459 | 0.0 | 0.0 | -0.872274 | 1.0 |
4 | ARM | 0.0 | 0.444444 | NaN | 7.682482 | 0.0 | 1.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
158 | YUG | 0.0 | 0.488889 | 6.318182 | NaN | 0.0 | 0.0 | -1.203973 | NaN |
159 | ZAF | 1.0 | 0.322222 | 6.863636 | 8.885994 | 0.0 | 0.0 | -1.386294 | 1.0 |
160 | ZAR | 1.0 | 0.000000 | 3.500000 | 6.866933 | 0.0 | 0.0 | -3.411248 | 1.0 |
161 | ZMB | 1.0 | 0.166667 | 6.636364 | 6.813445 | 0.0 | 0.0 | -2.975930 | NaN |
162 | ZWE | 1.0 | 0.222222 | 6.000000 | 7.696213 | 0.0 | 0.0 | -2.733368 | NaN |
163 rows × 9 columns
We will again drop rows with missing data. We will also add a column with just 1's - this will represent the estimation of the constant term, $\beta_0$
df2 = df2.dropna(subset=['logpgp95', 'avexpr', 'lat_abst', 'asia', 'africa']).copy()
df2["const"] = 1
y = np.array(df2.loc[:,'logpgp95'])
X = np.array(df2.loc[:,['const', 'avexpr', 'lat_abst', 'asia', 'africa']])
X
array([[ 1.        ,  5.36363649,  0.13666667,  0.        ,  1.        ],
       [ 1.        ,  7.18181801,  0.26666668,  1.        ,  0.        ],
       [ 1.        ,  6.38636351,  0.37777779,  0.        ,  0.        ],
       ...,
       [ 1.        ,  6.63636351,  0.16666667,  0.        ,  1.        ],
       [ 1.        ,  6.        ,  0.22222222,  0.        ,  1.        ]])
Now we will do our matrix algebra calculations to get our estimated betas. Recall our formula in matrix algebra format:
$$\mathbf{\hat{\beta}} = \mathbf{(X^TX)^{-1}X^Ty}$$
We do the calculations in Python:
# Direct least square regression
betas = np.dot((np.dot(np.linalg.inv(np.dot(X.T,X)),X.T)),y)
print(betas)
[ 5.84003086  0.39413419  0.31198749 -0.16974205 -0.92969933]
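Explicitly inverting $\mathbf{X^TX}$ works fine here, but in practice a linear solver is more numerically stable than forming the inverse. A minimal sketch of two standard alternatives (using the same X and y):
#solve the normal equations without forming an explicit inverse
betas_solve = np.linalg.solve(np.dot(X.T,X), np.dot(X.T,y))
#or let numpy solve the least-squares problem directly
betas_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betas_solve, betas_lstsq)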
Let us try to plot some of our results
y_hat2 = np.dot(X[:,0:3], betas[0:3])
y_hat2
array([7.9966617 , 8.75382759, 8.47497704, 9.60624127, 9.83750188, 9.83102605, 7.64078443, 7.94764403, 9.50046896, 9.08323414, 8.87975396, 8.12044543, 8.99193935, 8.9618768 , 9.88187343, 9.94429961, 9.02543956, 9.02485702, 8.62670245, 8.40478712, 7.68876196, 8.73824274, 8.65155068, 9.5049127 , 9.86800732, 8.34236075, 8.49896591, 8.42675136, 8.60299061, 9.79462751, 8.12507702, 9.89573954, 9.83334205, 8.92491025, 9.87898925, 8.34005948, 8.45795011, 9.14663095, 9.05663831, 7.91748523, 8.17738064, 9.12363372, 7.98810616, 7.37494053, 9.55016541, 8.84921417, 9.16991908, 9.85760773, 7.83205486, 9.89920606, 9.3165892 , 9.71371777, 8.69771805, 8.61685672, 9.79867676, 8.22621773, 9.37217875, 8.71845465, 8.2470169 , 9.9527926 , 8.74572946, 7.66505012, 8.87576746, 7.47549861, 8.81160782, 8.90175208, 8.46482061, 8.57341435, 8.98316226, 7.86616628, 8.06034938, 7.94534257, 9.96267219, 9.97838217, 9.8160094 , 8.72551283, 8.32674704, 8.20020443, 8.14992544, 8.03491862, 9.04795746, 9.59513385, 8.66078519, 8.97331633, 8.86592174, 9.38024782, 8.91853087, 7.46856556, 8.25336742, 9.51687418, 8.16194752, 7.85749996, 7.69916155, 9.79923043, 8.25450381, 8.5908721 , 8.90178076, 8.81625395, 8.5018499 , 8.91331673, 8.47644782, 7.59918609, 8.71336564, 9.91310086, 8.68044797, 8.42153723, 8.40015552, 8.64575401, 7.21950055, 8.50764657, 8.27416658])
Then we create fitted values for some of the regression lines; notice that the latitude variable is fixed at its mean value:
y_hat2 = betas[0] + betas[1]*X[:,1] + betas[2]*np.mean(X[:,2])
y_asia = betas[0] + betas[1]*X[:,1] + betas[2]*np.mean(X[:,2]) + betas[3]
y_africa = betas[0] + betas[1]*X[:,1] + betas[2]*np.mean(X[:,2]) + betas[4]
fig, ax = plt.subplots()
ax.scatter(x=X[:,1], y=y)
ax.plot(X[:,1], y_hat2, color="red", label="ROW")
#ax.plot(x, y_hat2, color="orange")
ax.plot(X[:,1], y_asia, color="yellow", label="Asia")
ax.plot(X[:,1], y_africa, color="purple", label="Africa")
ax.set_xlabel("Protection again appropriation")
ax.set_ylabel("Log GDP")
ax.legend()
<matplotlib.legend.Legend at 0x7fc36c323df0>
Above we visualized a few of the fitted regression lines.
The red line ("ROW", rest of world) shows the relationship between log GDP and the institutional variable for countries that are neither in Asia nor Africa, holding latitude fixed at its mean. Its slope (0.39) is smaller than in our simple regression (0.54), indicating that some of the effect we previously attributed to the institutional index is instead picked up by the control variables.
The yellow line adds the (negative) coefficient on the Asia dummy, and the purple line adds the (larger, negative) coefficient on the Africa dummy, shifting the line down for those regions.
The fact that these lines are parallel is an assumption built into the model: the dummies shift the intercept but not the slope.
If we put in an interaction effect with the institutional variable, we could also allow the slopes to differ, if we believe that is likely; a sketch of how to set that up is shown below.
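For example, a minimal sketch of adding an interaction term (a hypothetical extension, not part of the original lab; recall that in our X matrix column 1 is avexpr and column 4 is the africa dummy):
#add an interaction column avexpr*africa so African countries can have their own slope
X_int = np.column_stack([X, X[:,1]*X[:,4]])
betas_int = np.dot(np.dot(np.linalg.inv(np.dot(X_int.T,X_int)), X_int.T), y)
print(betas_int) #the last element is the difference in slope for African countries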
An important part of a regression model is of course also to estimate the uncertainty, both in the model (the error variance $\sigma^2$) and in the coefficient estimates (the standard errors, SE).
If we assume that the error variance is constant across our data ("homoskedasticity"), then we can calculate our variance-covariance matrix as below:
$$E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T \mid \mathbf{X}\right] = \sigma^2 (X^TX)^{-1}$$
Using matrix algebra, show how, if we only have one exogenous regressor, you would write the formulas for the standard errors of $\hat{\beta_0}$ and $\hat{\beta_1}$ in summation format.
First we need an estimate of our $\sigma^2$, which we get by taking the variance of our residuals
fitted = np.dot(X,betas)
resid = y-fitted
sigmasq_hat = np.var(resid)
sigmasq_hat
0.3731718257041534
Then we are ready to estimate our variance-covariance matrix. To get the individual standard errors, we take the square root of the diagonal of this matrix.
vcov = sigmasq_hat*np.linalg.inv(np.dot(X.T,X))
vcov
array([[ 0.10959253, -0.01437491, 0.03069742, -0.01341976, -0.02785089], [-0.01437491, 0.00241174, -0.01270975, 0.00024173, 0.00180609], [ 0.03069742, -0.01270975, 0.18828835, 0.01165603, 0.01716112], [-0.01341976, 0.00024173, 0.01165603, 0.02239995, 0.01020106], [-0.02785089, 0.00180609, 0.01716112, 0.01020106, 0.02605804]])
np.sqrt(np.diag(vcov))
array([0.33104763, 0.04910942, 0.43392205, 0.14966614, 0.16142502])
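Note that np.var divides the sum of squared residuals by $n$; regression software typically divides by the residual degrees of freedom, $n-k$, which gives slightly larger standard errors. A quick sketch of that adjustment (same X, y, and residuals as above):
n, k = X.shape
sigmasq_unbiased = np.sum(resid**2)/(n - k) #divide by n-k instead of n
np.sqrt(np.diag(sigmasq_unbiased*np.linalg.inv(np.dot(X.T,X))))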
For reference, here are our estimated betas
betas
array([ 5.84003086, 0.39413419, 0.31198749, -0.16974205, -0.92969933])
Recalling our statistics courses, we can construct a rough 95% confidence interval for the estimate on the institutional parameter of our regression:
$$CI_{95} = 0.39 \pm 2 \times 0.049$$
The above estimated variance-covariance matrix is only valid under some strict assumptions, including constant error variance. Looking at our data, this may not be the case. One way of dealing with such heteroskedasticity, or non-constant variance, is to adjust our standard errors with what is called a White "sandwich" estimator:
$$E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T \mid \mathbf{X}\right] = (X^TX)^{-1}(X^T \hat{\Omega} X)(X^TX)^{-1}$$
where
$$ \hat{\Omega} = \begin{bmatrix} \hat{\epsilon}_1^2 & 0 & \cdots & 0 \\ 0 & \hat{\epsilon}_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\epsilon}_n^2 \end{bmatrix}$$
(where $\hat{\epsilon}_i$ represents the $i$th residual)
Estimate the White "sandwich" variance-covariance matrix of the above regression.
There are of course easier ways of estimating an OLS regression. We can for example use the statsmodels package:
import statsmodels.formula.api as smf
sm_mod1 = smf.ols(formula = "logpgp95 ~ avexpr + lat_abst + africa + asia", data=df2).fit()
sm_mod1.summary()
Dep. Variable: | logpgp95 | R-squared: | 0.713 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.703 |
Method: | Least Squares | F-statistic: | 65.97 |
Date: | Thu, 13 Oct 2022 | Prob (F-statistic): | 6.64e-28 |
Time: | 14:24:45 | Log-Likelihood: | -102.79 |
No. Observations: | 111 | AIC: | 215.6 |
Df Residuals: | 106 | BIC: | 229.1 |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 5.8400 | 0.339 | 17.239 | 0.000 | 5.168 | 6.512 |
avexpr | 0.3941 | 0.050 | 7.843 | 0.000 | 0.295 | 0.494 |
lat_abst | 0.3120 | 0.444 | 0.703 | 0.484 | -0.568 | 1.192 |
africa | -0.9297 | 0.165 | -5.628 | 0.000 | -1.257 | -0.602 |
asia | -0.1697 | 0.153 | -1.108 | 0.270 | -0.473 | 0.134 |
Omnibus: | 4.293 | Durbin-Watson: | 1.869 |
---|---|---|---|
Prob(Omnibus): | 0.117 | Jarque-Bera (JB): | 3.909 |
Skew: | -0.456 | Prob(JB): | 0.142 |
Kurtosis: | 3.109 | Cond. No. | 57.8 |
Comparing the coefficients we see that we get exactly the same results. The standard errors are nearly identical; they differ slightly because we divided the squared residuals by $n$ (via np.var) while statsmodels divides by the residual degrees of freedom, $n-k$.
But don't you feel better knowing how these numbers are actually calculated?
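As a reference point for the sandwich exercise above, statsmodels can also report heteroskedasticity-robust standard errors directly (a sketch; cov_type="HC0" corresponds to the basic White estimator):
sm_mod1_robust = smf.ols(formula = "logpgp95 ~ avexpr + lat_abst + africa + asia", data=df2).fit(cov_type="HC0")
print(sm_mod1_robust.bse) #robust standard errors to compare against your manual calculation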
In an earlier lab we looked at the relationship between multicollinearity and determinants. Let us go over this again with a little more analysis.
Let's try to show the problem through simulation
We will generate a variable x1 from a normal distribution. Then we will create a y variable that depends on x1. We will also create an x2 variable that is simply a linear function of x1:
x1 = spt.norm.rvs(2, 1, 100)
y = 5 + .8*x1 + spt.norm.rvs(0,1, 100)
x2 = -1 + .5*x1
We create a matrix of exogenous variables, including a column of 1s to represent the intercept term.
X = np.array([np.ones(100), x1, x2])
X = X.T
X
array([[ 1.00000000e+00,  2.24964626e+00,  1.24823131e-01],
       [ 1.00000000e+00,  3.81456561e+00,  9.07282805e-01],
       [ 1.00000000e+00,  5.17403849e-01, -7.41298075e-01],
       ...,
       [ 1.00000000e+00,  3.51701107e+00,  7.58505537e-01],
       [ 1.00000000e+00,  4.02116854e+00,  1.01058427e+00]])
Recall our formula for our OLS coefficients:
$$\mathbf{\hat{\beta}} = \mathbf{(X^TX)^{-1}X^Ty}$$And let's especially take a look at the term $\mathbf{(X^TX)^{-1}}$
Here we need to find the inverse of the matrix multiplication of $\mathbf{X}$ times its transpose.
Recall that for an inverse to exist, the determinant must be non-zero. Let's take a look:
XTX = np.dot(X.T,X)
XTX
array([[100. , 197.80413845, -1.09793077], [197.80413845, 499.93708703, 52.16440506], [ -1.09793077, 52.16440506, 27.1801333 ]])
np.linalg.det(XTX)
-4.246897825393734e-10
Because of the way a computer makes calculations (floating point arithmetic), we cannot expect exact numbers. But looking at the determinant above, we see that it is extremely close to zero. As a practical matter, the determinant is zero and the matrix $\mathbf{X^TX}$ is singular; that is to say, an inverse does not exist.
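We can confirm this more directly by checking the rank and the condition number (a small sketch using NumPy helpers):
#X has 3 columns but only 2 of them are linearly independent
print(np.linalg.matrix_rank(X))
#a huge condition number for X^T X signals (near-)singularity
print(np.linalg.cond(XTX))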
Let's see what happens if we try to calculate the inverse anyways:
np.linalg.inv(np.dot(X.T,X))
array([[-2.55886342e+13, 1.27943171e+13, -2.55886342e+13], [ 1.27943171e+13, -6.39715856e+12, 1.27943171e+13], [-2.55886342e+13, 1.27943171e+13, -2.55886342e+13]])
Here we see that the calculated "inverse" explodes to huge numbers. To add some intuition to what is happening: the term $\mathbf{X^TX}$ is used in calculating the variances and covariances of the exogenous variables, which in turn are used to calculate both the coefficients and the standard errors. The result is coefficients that are highly unreliable and very large standard errors.
We can calculate the coefficients below, putting the calculation in a function:
def calcCoefs(X, y):
    betas = np.dot((np.dot(np.linalg.inv(np.dot(X.T,X)),X.T)),y)
    return(betas)
calcCoefs(X,y)
array([ 0.23564321, 2.4245653 , -3.51814661])
We can also try to calculate the standard errors.
def calcSE(X, y):
    betas = np.dot((np.dot(np.linalg.inv(np.dot(X.T,X)),X.T)),y)
    fitted = np.dot(X,betas)
    resid = y-fitted
    sigmasq_hat = np.var(resid)
    vcov = sigmasq_hat*np.linalg.inv(np.dot(X.T,X))
    return(np.sqrt(np.diag(vcov)))
calcSE(X, y)
/var/folders/fs/kr83xdyx2jq_8gww_xfmtxfr0000gn/T/ipykernel_37476/1324029103.py:7: RuntimeWarning: invalid value encountered in sqrt return(np.sqrt(np.diag(vcov)))
array([nan, nan, nan])
Here we see that we end up with problems calculating the standard errors.
Now let's make a simple change to our data generation: we add some random noise to the x2 variable. Now x1 and x2 are highly correlated, but there is no longer perfect multicollinearity.
x1 = spt.norm.rvs(2, 1, 100)
x2 = -1 + .5*x1 + spt.norm.rvs(0,1, 100)
y = 5 + .8*x1 + spt.norm.rvs(0,1, 100)
X = np.array([np.ones(100), x1, x2])
X = X.T
calcCoefs(X, y)
array([ 4.81708036, 0.90988719, -0.055699 ])
calcSE(X,y)
array([0.24051707, 0.11395535, 0.09371561])
Now we manage to estimate a coefficient and standard error on the x1 variable that are reasonable.
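We can check just how correlated the two regressors are (a quick sketch using np.corrcoef):
#correlation between x1 and x2: high, but no longer exactly 1
np.corrcoef(x1, x2)[0,1]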
Find another data set with an interesting question you could explore with an OLS regression.
Manually (with matrix algebra) calculate your beta coefficients as well as the variance-covariance matrix.
Construct 95% confidence intervals for your beta coefficients. Are the coefficients statistically significant at a 95% level?
Interpret your results, including with a visualisation
Show a plot of the fitted values versus the residuals. What does this tell you about the model?
Validate your results by using statsmodels
Describe (potential) problems with the regressions. Recall from your previous statistics courses that simple OLS is unbiased, consistent, and efficient subject to several assumptions: correct specification, strict exogeneity, linearly independent X variables, homoskedasticity, and normality. Are these reasonable assumptions in your case? (The assumptions are listed in rough order of importance.)