The first tool of Many
Thus, instead of:
\[y_i = \beta_0 + \beta_1 x_i + e_i \]
we have to consider:
\[y_i = \beta_0 + \beta_1 x_{1i} +\beta_2 x_{2i} + \dots + \beta_k x_{ki} + u_i \]
How many can we add? And why does it help?
The model now explicitly accounts for variables that were previously hidden in \(e_i\).
We add \(x_{2i},x_{3i},\dots,x_{ki}\) to the model, so they are no longer part of \(e_i\).
Allows for richer model specifications and nonlinearities:
Before: \(y_i = \beta_0 + \beta_1 x_{1i} + e_i\)
Now : \(y_i = \beta_0 + \beta_1 x_{1i} +\beta_2 x^2_{1i} + \beta_3 x^{1/2}_{1i} + \beta_4 x^{-1}_{1i} + \beta_5 x_{2i}+\dots+e_i\)
Thus, we can get closer to the unknown population function and explicitly handle some endogeneity problems (we control for them).
With great power…
Being able to add more controls is good, but does the underlying math change much?
Not really, but let's switch to matrix notation:
\[y=\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} ; X=\begin{bmatrix}x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & \dots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{bmatrix}; \beta =\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}; e=\begin{bmatrix}e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \]
\[y=X\beta + e \]
Linear in Parameters: \(y = X\beta + e\) (and this is the population function)
Random Sampling from the population of interest (so the errors \(e_i\) and \(e_j\) are independent)
No Perfect Collinearity:
This is the multivariate analogue of \(Var(x)>0\) in the SLRM, and it deserves more attention. For example, in the following model \(\beta_1\), \(\beta_2\), and \(\beta_3\) cannot be separately identified, because \(X_1+X_2\) is an exact linear combination of the other regressors:
\[\begin{aligned} y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1+X_2) + e \\ &= \beta_0 + (\beta_1+\beta_3) X_1 + (\beta_2+\beta_3) X_2 + e \end{aligned} \]
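To see A3 in action, here is a minimal sketch (using Stata's built-in auto data and a made-up variable, purely for illustration): when one regressor is an exact linear combination of others, Stata drops it automatically.
sysuse auto, clear
* x3 is an exact linear combination of mpg and weight
gen x3 = mpg + weight
* one of the collinear regressors will be omitted automatically
reg price mpg weight x3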
Zero Conditional mean (Exogeneity): \(E(e_i|X)=0\)
Requires that the errors and the explanatory variables are uncorrelated. This is “easier” to achieve, because we can now move variables from the error into the model.
However, there could still be things you cannot control for (and they remain lurking in your errors).
I call this the most important assumption, because it is the hardest to deal with.
Homoskedasticity: \(Var(e_i|X)=\sigma^2_e\). Just as with the SLRM, this assumption will help with the estimation of standard errors.
As before, not much has changed. We are still interested in finding the \(\beta's\) that minimize the (squared) error of the model when compared to the observed data:
\[\hat \beta = \min_\beta \sum (y_i-X_i'\beta)^2 = \min_\beta \sum (y_i-\beta_0-\beta_1 x_{1i}-\dots-\beta_k x_{ki})^2 \]
The corresponding first-order conditions (FOC) generate \(K+1\) equations to identify the \(K+1\) parameters:
\[\begin{aligned} \sum (y_i-X_i'\beta) &= 0 \\ \sum x_{1i}(y_i-X_i'\beta) &= 0 \\ \sum x_{2i}(y_i-X_i'\beta) &= 0 \\ \vdots \\ \sum x_{ki}(y_i-X_i'\beta) &= 0 \end{aligned} \rightarrow X'(y-X\beta) =0 \rightarrow \hat \beta = (X'X)^{-1}X'y \]
Interlude (for those curious)
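A minimal sketch of the matrix formula computed by hand in Mata (it assumes the same oaxaca example data used later in these notes, loaded with the frause command):
frause oaxaca, clear
drop if missing(lnwage, educ, exper, tenure)
mata:
    y = st_data(., "lnwage")
    X = st_data(., "educ exper tenure"), J(rows(y), 1, 1)   // regressors plus a constant
    b = invsym(cross(X, X)) * cross(X, y)                   // (X'X)^{-1} X'y
    b'
end
* compare with the built-in estimator
reg lnwage educ exper tenure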
Interpretation of the MLRM is similar to the SLRM. For most cases, you simply look at the coefficients and interpret the effects in terms of changes:
\[\begin{aligned} y_i = \hat\beta_0 + \hat\beta_1 x_{1i} + \hat\beta_2 x_{2i} + e_i \\ \Delta y_i = \hat\beta_1 \Delta x_{1i} + \hat\beta_2 \Delta x_{2i} + \Delta e_i \end{aligned} \]
Under A1-A5, we can use the expression above to make interpretations.
\(\Delta e_i=0\) by assumption, and \(\Delta x_{2i}=0\) because we are explicitly controlling for it (we hold it fixed, which relies on extrapolation).
You could also analyze the effects of \(\Delta x_{1i}\) and \(\Delta x_{2i}\) simultaneously!
\(log(wage) = 0.284 + 0.092 educ + 0.004 exper + 0.022 tenure\)
Notes:
Think of Interpretations as counterfactual: \(y_{post} - y_{pre}\)
Assumption: Other factors (unobserved \(e\)) remain fixed (is it always credible??)
Effects can be combined. What if a person gains 1 year of education but loses 3 years of tenure? (See the worked example below.)
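Using the estimates above, one way to work this out (holding experience and the unobservables fixed):
\[\Delta \widehat{\log(wage)} = 0.092(1) + 0.022(-3) = 0.092 - 0.066 = 0.026 \]
so the predicted wage would be roughly 2.6% higher.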
Under A1-A5, you can still interpret results as “counterfactual” at the individual level. However, it is more common to do it based on conditional means:
\[\frac {\Delta E(y|X)}{\Delta X_k} \simeq E(y|X_{-k},X_k+1)-E(y|X) \]
This mostly changes the language:
The expected effect of a one-unit increase in \(X_k\).
An alternative way of interpreting (and understanding) the MLRM is the partialling-out interpretation.
This interpretation is based on the Frisch-Waugh-Lovell Theorem, which states that the following models should give you the SAME \(\beta_1\):
\[\begin{aligned} y &= \color{blue}{\beta_1 } X_1 + \beta_2 X_2 + e \\ (I-P_{X^c_2}) y &= \color{green}{\beta_1} (I-P_{X^c_2}) X_1 + e \\ P_{X^c_2} &= X^c_2 (X^{c\prime}_2 X^{c}_2)^{-1} X^{c\prime}_2 : \text{Projection Matrix} \end{aligned} \]
Partialling out
\(\beta_1\) can be interpreted as the effect of \(X_1\) on \(y\), after all variation related to \(X_2\) has been “eliminated”.
Thus \(\beta_1\) is the effect uniquely driven by \(X_1\).
qui {
frause oaxaca, clear
drop if lnwage==.
* (1) full regression of lnwage on educ, exper, and tenure
reg lnwage educ exper tenure
est sto m1
* (2) FWL: partial exper and tenure out of educ and lnwage, then regress the residuals
reg educ exper tenure
predict r_educ , res
reg lnwage exper tenure
predict r_lnwage , res
reg r_lnwage r_educ
est sto m2
* (3) short regression that ignores exper and tenure
reg lnwage educ
est sto m3
}
esttab m1 m2 m3, se
------------------------------------------------------------
                      (1)             (2)             (3)
                   lnwage        r_lnwage          lnwage
------------------------------------------------------------
educ               0.0870***                       0.0800***
                 (0.00516)                       (0.00539)
exper              0.0113***
                 (0.00154)
tenure            0.00837***
                 (0.00188)
r_educ                             0.0870***
                                 (0.00516)
_cons               2.140***        8.93e-10         2.434***
                  (0.0650)        (0.0124)         (0.0636)
------------------------------------------------------------
N                    1434            1434            1434
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Recall the estimator of the \(\beta's\) when you have multiple explanatory variables:
\[\begin{aligned} 0 &: \hat \beta = (X'X)^{-1} X'y \\ A1 \text{ & } A2 &: \hat \beta = (X'X)^{-1} X'(X\beta + e) \\ 1 &: \hat \beta = (X'X)^{-1} X'X\beta + (X'X)^{-1} X'e \\ A3 &: det(X'X)\neq 0 \rightarrow (X'X)^{-1} \text{ exists} \\ 2 &: \hat \beta = \beta + (X'X)^{-1} X'e \\ A4 &: E(e|X)=0 \rightarrow E[(X'X)^{-1} X'e]=0 \\ 3 &: E(\hat\beta)= \beta \text{ unbiased} \end{aligned} \]
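A small Monte Carlo sketch of what unbiasedness means in practice (all names and parameter values here are made up): across repeated samples, the average of \(\hat\beta_1\) should be close to the true value of 0.5.
capture program drop sim_ols
program define sim_ols, rclass
    clear
    set obs 200
    gen x = rnormal()
    gen y = 1 + 0.5*x + rnormal()
    reg y x
    return scalar b1 = _b[x]
end
set seed 101
simulate b1=r(b1), reps(500) nodots: sim_ols
summarize b1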
Let's start from (2): the \(\hat\beta's\) are random functions of the errors, so their variance will depend on \(e\).
\[\begin{aligned} 1 &: \hat \beta = \beta + (X'X)^{-1} X'e \\ 2 &:\hat \beta - \beta = (X'X)^{-1} X'e \\ 3 &: Var(\hat \beta - \beta) = Var((X'X)^{-1} X'e) \\ 4 &: Var(\hat \beta - \beta) = (X'X)^{-1} X' Var(e) X (X'X)^{-1} \\ \end{aligned} \]
\(Var(e)\) is the full variance-covariance matrix of the errors: the variance of each \(e_i\) and the covariance of every pair.
By assumption A2, \(cov(e_i,e_j)=0\) for \(i\neq j\); and by assumption A5, \(Var(e_i)=\sigma^2_e\) for all \(i\).
\[\begin{aligned} Var(\hat \beta - \beta) &= (X'X)^{-1} X' \sigma_e^2 I X (X'X)^{-1} \\ Var(\hat \beta - \beta) &= \sigma_e^2 (X'X)^{-1} \\ Var(\hat \beta_j - \beta_j) &= \frac{\sigma_e^2}{SST_j (1-R^2_j)} \end{aligned} \]
But we do not know \(\sigma^2_e\), so we also have to “estimate it”:
\[\hat \sigma^2_e = \frac{\sum \hat e^2}{N-K-1} \]
This is an unbiased estimator of \(\sigma^2_e\) if A1-A5 hold.
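A quick way to see this estimator at work (a sketch using the same oaxaca data): the sum of squared residuals divided by \(N-K-1\) matches the square of the root MSE that Stata reports.
qui: frause oaxaca, clear
qui: reg lnwage educ exper tenure
predict ehat, residuals
qui: gen ehat2 = ehat^2
qui: summarize ehat2
display r(sum)/(e(N)-e(df_m)-1)   // sum of squared residuals / (N-K-1)
display e(rmse)^2                 // Stata's reported root MSE, squared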
\[\begin{aligned} Var(\hat \beta - \beta) &= \sigma_e^2 (X'X)^{-1} \\ Var(\hat \beta_j - \beta_j) &= \frac{\sigma_e^2}{SST_j (1-R^2_j)} \\ & = \frac{\sigma_e^2}{(N-1)Var(X_j) (1-R^2_j)} = \frac{\sigma_e^2}{(N-1)Var(X_j)}VIF_j \end{aligned} \]
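The \(R^2_j\) and \(VIF_j\) terms can be inspected directly after a regression (a sketch, again with the oaxaca data):
qui: frause oaxaca, clear
qui: reg lnwage educ exper tenure
* reports VIF_j and 1/VIF_j = 1 - R^2_j for each regressor
estat vif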
To consider:
In the MLRM framework, it is easier to see what happens when important variables are ignored.
\[\text{True: } y = b_0 + b_1 x_1 + b_2 x_2 + e \]
But instead you estimate the following:
\[\text{Estimated: }y = g_0 + g_1 x_1 + v \]
Unless stronger assumptions are imposed, \(g_1\) will be a biased estimate of \(b_1\) (below, \(\tilde x\) and \(\tilde y\) denote deviations from their sample means):
\[\begin{aligned} \hat g_1 &= \frac{\sum \tilde x_1 \tilde y}{\sum \tilde x_1^2} = \frac{\sum \tilde x_1 (b_1 \tilde x_1 + b_2 \tilde x_2 + e) }{\sum \tilde x_1^2} \\ &= \frac{b_1 \sum \tilde x_1^2}{\sum \tilde x_1^2} + b_2 \frac{\sum \tilde x_1\tilde x_2}{\sum \tilde x_1^2} +\frac{\sum \tilde x_1 e}{\sum \tilde x_1^2} \\ &= b_1+b_2 \delta_1 +\frac{\sum \tilde x_1 e}{\sum \tilde x_1^2} \end{aligned} \]
This implies that \(g_1\) is biased:
\[E(\hat g_1) = b_1+b_2 \delta_1 \]
Where \(\delta_1\) is the coefficient in \(x_2=\delta_0+\delta_1 x_1 + v\).
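In this two-variable case, a useful rule of thumb follows directly from the expression above:
\[\text{sign}\big(E(\hat g_1) - b_1\big) = \text{sign}(b_2) \times \text{sign}(\delta_1) \]
so, for instance, the bias is upward when the omitted \(x_2\) affects \(y\) positively and is positively correlated with \(x_1\).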
Implications:
Unless \(b_2=0\) (the omitted variable is irrelevant) or \(\delta_1=0\) (it is uncorrelated with \(x_1\)), ignoring \(x_2\) will generate biased (and inconsistent) estimates of \(b_1\).
In models with more controls, the direction of the bias will be harder to determine, but similar rules of thumb can be used.
Adding irrelevant controls will have no effect on bias and consistency.
if your model is:
\[y=b_0+b_1 x_1 +e \]
but you estimate:
\[y=g_0+g_1 x_1+g_2 x_2 +v \]
your estimates are still unbiased:
\[\begin{aligned} g &= (X'X)^{-1}X'(X \beta^+ + e) \\ \beta^+ &= [\beta \ ; 0] \\ g &= \beta^+ + (X'X)^{-1}X'e \rightarrow E(g) = \beta^+ \end{aligned} \]
The worst case, and the hardest to detect, is when you add “bad” controls, also known as colliders.
For example, controlling for a variable that is itself affected by \(X\) (a mediator or “channel”). In general, you want to avoid using channels as controls.
Omitting relevant variables that are correlated with the \(X's\)
We won't say more about this here: it violates A4 and creates endogeneity.
Omitting relevant variables that are uncorrelated with the \(X's\)
\[\begin{aligned} True: & y = b_0 + b_1 x_1 + b_2 x_2 + e \\ Estimated: & y = g_0 + g_1 x_1 + v \\ & Var(e)<Var(v) \rightarrow Var(b_1)<Var(g_1) \end{aligned} \]
Thus, adding controls in randomized experiments is still a good idea! (A small simulation sketch below illustrates the variance gain.)
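A small simulation sketch (all names and values made up): \(x_2\) is relevant but uncorrelated with \(x_1\), mimicking a control in a randomized experiment.
clear
set obs 1000
set seed 202
gen x1 = rnormal()
gen x2 = rnormal()
gen y  = 1 + 0.5*x1 + 0.5*x2 + rnormal()
* x2 left in the error: larger residual variance, larger SE on x1
reg y x1
* controlling for x2 reduces the residual variance and the SE on x1
reg y x1 x2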
Adding irrelevant controls (related to the \(X's\))
Coefficients remain unbiased, and \(\hat\sigma^2_e\) remains an unbiased estimator.
However, you may increase multicollinearity in the model, increasing \(R_j^2\) and \(VIF_j\).
The variance of the relevant coefficients will be larger, as the sketch after the next equation illustrates.
\[\begin{aligned} True: & y = b_0 + b_1 x_1 + e \\ Estimated: & y = g_0 + g_1 x_1 + g_2 x_2 + v \\ & Var(b_1)<Var(g_1) \end{aligned} \]
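A companion sketch (again with made-up names and values): here \(x_2\) has no effect on \(y\) but is strongly correlated with \(x_1\), so including it inflates the standard error on \(x_1\).
clear
set obs 1000
set seed 303
gen x1 = rnormal()
gen x2 = x1 + 0.5*rnormal()
gen y  = 1 + 0.5*x1 + rnormal()
* correct specification
reg y x1
* irrelevant but correlated control: still unbiased, larger SE on x1
reg y x1 x2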
You can use MLRM to obtain predictions of outcomes.
They will be subject to the model specification.
For prediction you do not need to worry about “endogeneity” as much; the focus is on predictive power (but how do we measure it?).
qui:frause oaxaca, clear
* wages in levels rather than logs
gen wage = exp(lnwage)
qui:reg wage educ female age agesq single married
* fitted (predicted) values from the estimated model
predict wage_hat
list wage wage_hat educ female age agesq single married in 1/5
(213 missing values generated)
(option xb assumed; fitted values)
+----------------------------------------------------------------------+
| wage wage_hat educ female age agesq single married |
|----------------------------------------------------------------------|
1. | 41.80602 25.61872 9 1 37 1369 1 0 |
2. | 36.63003 31.70813 9 0 62 3844 0 1 |
3. | 23.54788 30.71257 10.5 1 40 1600 0 1 |
4. | 29.76191 42.7976 12 0 55 3025 0 0 |
5. | 44.95504 35.76914 12 0 36 1296 0 1 |
+----------------------------------------------------------------------+
We could use MLRM to test theories, like the Efficient Market Theory.
For housing, the assessed price of a house should contain all the information needed to predict its market price (other amenities should not matter).
price_hat = 206.645 + 1.007 assess + 11.404 bdrms + 1.363 llotsize - 38.335 lsqrft + 9.297 colonial
N= 88 R2=0.831
lprice_hat = 0.210 + 1.036 lassess + 0.025 bdrms + 0.008 llotsize - 0.092 lsqrft + 0.045 colonial
N= 88 R2=0.777
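One way the hypothesis could be tested formally (a sketch; it assumes the Wooldridge hprice1 dataset, obtained for example via the user-written bcuse command, and uses a joint test of the kind covered next week): under the hypothesis, the coefficient on assess is 1 and the remaining slopes are 0.
bcuse hprice1, clear
reg price assess bdrms llotsize lsqrft colonial
test (assess = 1) (bdrms = 0) (llotsize = 0) (lsqrft = 0) (colonial = 0)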
We could test for discrimination: Unexplained differences in outcomes once other factors are kept fixed.
It does require that groups are similar in terms of unobservables.
lnwage_hat = 3.440 - 0.173 female
N= 1434 R2=0.027
lnwage_hat = 0.383 - 0.160 female + 0.064 educ + 0.113 age - 0.001 agesq - 0.072 single - 0.094 married - 0.000 exper + 0.007 tenure
N= 1434 R2=0.345
A similar example: evaluating a job training program, where adding controls for pre-program characteristics reverses the sign of the estimated effect.
earn98_hat = 10.610 - 2.050 train
N= 1130 R2=0.016
earn98_hat = 4.667 + 2.411 train + 0.373 earn96 + 0.363 educ - 0.181 age + 2.482 married
N= 1130 R2=0.405
Next Week: Inference and Asymptotics