import numpy as np
from numpy.linalg import norm
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
from matplotlib import cm
from scipy.interpolate import interp2d
from scipy.linalg import inv, solve, det, eig
fig, ax = plt.subplots(figsize=(10, 8))
# Set the axes through the origin
for spine in ['left', 'bottom']:
    ax.spines[spine].set_position('zero')
for spine in ['right', 'top']:
    ax.spines[spine].set_color('none')

ax.set(xlim=(0, 5), ylim=(-5, 5))
ax.grid()

vecs = ((3, 2), (1, -2), (4, 0))
for v in vecs:
    ax.annotate('', xy=v, xytext=(0, 0),
                arrowprops=dict(facecolor='blue',
                                shrink=0,
                                alpha=0.7,
                                width=0.5))
    ax.text(1.1 * v[0], 1.1 * v[1], str(v))
plt.show()
# Illustrate vector addition: a + b = c, i.e. (3, 2) + (1, -2) = (4, 0)
fig, ax = plt.subplots()
plt.plot([0, 3], [0, 2], color="blue")    # vector a = (3, 2)
plt.plot([0, 1], [0, -2], color="blue")   # vector b = (1, -2)
plt.plot([0, 4], [0, 0], color="blue")    # vector c = (4, 0)
plt.plot([3, 4], [2, 0], color="orange")  # b translated to the tip of a ("tip-to-tail")
plt.text(3, 2, "a")
plt.text(1, -2, "b")
plt.text(4, 0, "c")
plt.text(3.5, 1, "a+b")                   # following a and then b lands on c = a + b
plt.show()
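As a quick numerical check of what the figure illustrates, here is a minimal sketch using the same coordinates as the plot:

a = np.array((3, 2))
b = np.array((1, -2))
c = np.array((4, 0))

# componentwise addition reproduces c
print(a + b)                      # [4 0]
print(np.array_equal(a + b, c))   # True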
Below is a system of equations:

$$
\begin{aligned}
x_1 + 2x_3 + 3x_4 &= 80 \\
x_1 + 2x_2 + 2x_4 &= 100 \\
x_1 + 2x_3 + 2x_4 &= 75 \\
2x_1 + x_2 + x_3 &= 100
\end{aligned}
$$

Put this system in matrix form and solve numerically.
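Writing the system as $Ax = b$, the coefficient matrix and right-hand side are

$$
A =
\begin{bmatrix}
1 & 0 & 2 & 3 \\
1 & 2 & 0 & 2 \\
1 & 0 & 2 & 2 \\
2 & 1 & 1 & 0
\end{bmatrix},
\qquad
x =
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix},
\qquad
b =
\begin{bmatrix} 80 \\ 100 \\ 75 \\ 100 \end{bmatrix}.
$$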
A = np.array([[1, 0, 2, 3], [1, 2, 0, 2], [1,0,2,2], [2,1,1,0]])
b = np.array([80, 100, 75, 100])
A
array([[1, 0, 2, 3],
       [1, 2, 0, 2],
       [1, 0, 2, 2],
       [2, 1, 1, 0]])
b
array([ 80, 100, 75, 100])
x = np.linalg.solve(A, b)
print(x)
[22.5 33.75 21.25 5. ]
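As a sanity check, we can verify that the computed solution satisfies the original system; the same answer can also be obtained with the matrix inverse imported above (a quick sketch reusing A, b, and x):

# A @ x should reproduce b up to floating-point error
print(np.allclose(A @ x, b))   # True

# equivalent solution via the inverse (solve() is generally preferred numerically)
print(inv(A) @ b)              # [22.5  33.75 21.25  5.  ]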
In an OLS regression, the predicted (or "fitted") values are orthogonal to the residuals of the model.

We simulate data below from the model $y=\alpha + \beta_1 x + \beta_2 z + e$, where $e\sim N(0,1)$:
import numpy as np

# initialize a random number generator
rng = np.random.default_rng()

# generate 10,000 random draws from a N(3, 2) distribution
x = rng.normal(3, 2, 10000)

# z is correlated with x through zeta
zeta = .5
z = zeta*x + rng.normal(0, 1, 10000)

# error term and true parameters
e = rng.normal(0, 1, 10000)
beta1 = .3
beta2 = .2
alpha = 2

y = alpha + beta1*x + beta2*z + e
Now we can use statsmodels to estimate an incorrectly specified model that includes only x as an explanatory variable in the OLS regression.
# import statsmodels
import statsmodels.api as sm

# add_constant() adds a column of ones so that an intercept term is estimated
mod1 = sm.OLS(y, sm.add_constant(x)).fit()

# save the residuals and fitted values for later use
resids1 = mod1.resid
fitted1 = mod1.fittedvalues

mod1.summary()
| Dep. Variable: | y | R-squared: | 0.376 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.376 |
| Method: | Least Squares | F-statistic: | 6022. |
| Date: | Fri, 14 Apr 2023 | Prob (F-statistic): | 0.00 |
| Time: | 15:35:35 | Log-Likelihood: | -14396. |
| No. Observations: | 10000 | AIC: | 2.880e+04 |
| Df Residuals: | 9998 | BIC: | 2.881e+04 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.0245 | 0.018 | 110.761 | 0.000 | 1.989 | 2.060 |
| x1 | 0.3934 | 0.005 | 77.604 | 0.000 | 0.384 | 0.403 |

| Omnibus: | 0.211 | Durbin-Watson: | 1.970 |
|---|---|---|---|
| Prob(Omnibus): | 0.900 | Jarque-Bera (JB): | 0.210 |
| Skew: | -0.011 | Prob(JB): | 0.900 |
| Kurtosis: | 3.000 | Cond. No. | 6.80 |
Here we see a result familiar from multivariate statistics: omitted-variable bias. If a variable z is left out of the model and is correlated with both x and y, then the estimated coefficient on x is biased. Here we get an estimate of roughly 0.4 (0.3934), while the true value is 0.3.
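Since z was generated as $z = \zeta x + u$, with $u$ the N(0,1) noise term independent of $x$, we have $\mathrm{Cov}(x, z) = \zeta\,\mathrm{Var}(x)$, and the omitted-variable bias formula gives a large-sample coefficient of $\beta_1 + \beta_2\,\mathrm{Cov}(x,z)/\mathrm{Var}(x) = \beta_1 + \beta_2\zeta = 0.3 + 0.2 \times 0.5 = 0.4$, which is what the regression finds. A minimal check, reusing the simulated variables defined above:

# theoretical large-sample value of the x coefficient in the short regression:
# beta1 + beta2 * Cov(x, z) / Var(x); since z = zeta*x + noise, Cov(x, z) = zeta * Var(x)
print(beta1 + beta2 * zeta)                                    # 0.4

# sample analogue using the simulated data (should be close to the estimate above)
print(beta1 + beta2 * np.cov(x, z)[0, 1] / np.var(x, ddof=1))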
Estimate the correctly specified model with both x and z as explanatory variables (remember to include a constant term). Save the residuals and fitted values in objects. Show that in both the correctly and incorrectly specified models, the vector of residuals and the vector of fitted values are orthogonal to each other. Can you provide intuition for this?
# stack x and z into a single regressor matrix
X = np.column_stack((x, z))
mod2 = sm.OLS(y, sm.add_constant(X)).fit()

# save the residuals and fitted values of the correctly specified model
resids2 = mod2.resid
fitted2 = mod2.fittedvalues

mod2.summary()
| Dep. Variable: | y | R-squared: | 0.402 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.402 |
| Method: | Least Squares | F-statistic: | 3358. |
| Date: | Fri, 14 Apr 2023 | Prob (F-statistic): | 0.00 |
| Time: | 15:35:37 | Log-Likelihood: | -14185. |
| No. Observations: | 10000 | AIC: | 2.838e+04 |
| Df Residuals: | 9997 | BIC: | 2.840e+04 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.0271 | 0.018 | 113.266 | 0.000 | 1.992 | 2.062 |
| x1 | 0.2901 | 0.007 | 41.322 | 0.000 | 0.276 | 0.304 |
| x2 | 0.2073 | 0.010 | 20.801 | 0.000 | 0.188 | 0.227 |

| Omnibus: | 0.137 | Durbin-Watson: | 1.974 |
|---|---|---|---|
| Prob(Omnibus): | 0.934 | Jarque-Bera (JB): | 0.117 |
| Skew: | -0.005 | Prob(JB): | 0.943 |
| Kurtosis: | 3.013 | Cond. No. | 7.63 |
print(np.dot(resids1, fitted1))
print(np.dot(resids2, fitted2))
-7.176481631177012e-12
-3.1533886613033246e-11
Taking the dot product of each pair of vectors, we see that both are orthogonal (the numbers shown are just floating-point error, i.e. the computer's inability to hit exactly zero). The orthogonality of the residuals and fitted values is almost the definition of OLS: the sum of squared residuals is minimized exactly where the residual vector is orthogonal to the fitted values.
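To make this intuition precise, here is a short derivation in standard OLS notation (not introduced above): with regressor matrix $X$ (including the constant column), the OLS first-order conditions (the normal equations) are $X'\hat{e} = 0$. Since the fitted values are $\hat{y} = X\hat{\beta}$,

$$
\hat{y}'\hat{e} = (X\hat{\beta})'\hat{e} = \hat{\beta}'X'\hat{e} = 0,
$$

so the fitted values are orthogonal to the residuals by construction, whether or not the model is correctly specified.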