Multiple Regression Analysis: Estimation

The first tool of Many

Fernando Rios-Avila

Why stay with 1 when you can use Many? … Why not?

  • The SLRM we covered last week is a powerful tool for understanding the mechanics behind regression analysis; however, it is too limited.
    • Use one control to fix everything?!
  • The natural alternative is to relax the assumptions and make things more flexible.
    • In other words… allow for adding more controls.

Thus, instead of:

\[y_i = \beta_0 + \beta_1 x_i + e_i \]

we have to consider:

\[y_i = \beta_0 + \beta_1 x_{1i} +\beta_2 x_{2i} + \dots + \beta_k x_{ki} + u_i \]

How many can we add? And why does it help?

The power of MLR: Why do more controls help?

  1. One can explicitly account for variables that were previously hidden in \(e_i\).
    We add \(x_{2i},x_{3i},\dots,x_{ki}\) to the model, so they are no longer in \(e_i\).

  2. Allows for richer model specifications and nonlinearities:

    Before: \(y_i = \beta_0 + \beta_1 x_{1i} + e_i\)

    Now: \(y_i = \beta_0 + \beta_1 x_{1i} +\beta_2 x^2_{1i} + \beta_3 x^{1/2}_{1i} + \beta_4 x^{-1}_{1i} + \beta_5 x_{2i}+\dots+e_i\)

Thus, we can get closer to the unknown population function, and explicitly handle some endogeneity problems (by controlling for them).
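
For instance, a minimal sketch using the wage1 data (which reappears in the examples below): adding a quadratic in experience is just one more column of \(X\).

Code
qui: frause wage1, clear
gen exper2 = exper^2              // squared term as an additional regressor
reg lwage educ exper exper2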

With great power…

Being able to add more controls is good, but:

  • It may make things worse (bad controls)
  • It might not be feasible (small sample)
  • It may be difficult to interpret (unless you know how)

Do assumptions change?

Not really, but let's make some notational changes:

\[y=\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} ; X=\begin{bmatrix}x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & \dots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{bmatrix}; \beta =\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}; e=\begin{bmatrix}e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \]

\[y=X\beta + e \]

Mostly the same

  1. Linear in Parameters: \(y = X\beta + e\) (and this is the population function)

  2. Random Sampling from the population of interest (so error \(e_i\) is independent of \(e_j\))

  3. No Perfect Collinearity:

    This is the MLRM counterpart of \(Var(x)>0\) in the SLRM, and it deserves more attention.

  • We want each variable in \(X\) to have some independent variation from all other variables in the model.
    • In the SLRM, the independent-variation idea was with respect to the constant.
  • If a variable is a linear combination of the others, the \(\beta\)'s cannot be identified. You need to choose what to keep (or let the software drop one for you, as the short check below shows):

\[\begin{aligned} y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1+X_2) + e \\ &= \beta_0 + (\beta_1+\beta_3) X_1 + (\beta_2+\beta_3) X_2 + e \end{aligned} \]
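
A minimal sketch of what this looks like in practice, using the wage1 data from the examples below: a regressor that is an exact linear combination of others gets dropped automatically.

Code
qui: frause wage1, clear
gen educ_exper = educ + exper      // a perfect linear combination of two regressors
reg lwage educ exper educ_exper    // Stata drops one of the collinear variables with a note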

  4. Zero Conditional Mean (Exogeneity): \(E(e_i|X)=0\)

    Requires that the errors and the explanatory variables are uncorrelated. This is “easier” to achieve, because we can now move variables from the error into the model.

    However, there could be things you can’t control for (and they remain lurking in your errors).

I call this the most important assumption, because it is the hardest to deal with.

If A1-A4 Hold, then your estimates will be unbiased!

  5. Homoskedasticity: Same as before. The dispersion of the errors does not change with respect to the \(X's\). \[Var(e|X)=c \]

Just as with the SLRM, this assumption will help with the estimation of standard errors.

MLRM estimation

As before, not much has changed. We are still interested in finding the \(\beta's\) that minimize the (squared) errors of the model when compared to the observed data:

\[\hat \beta = \arg\min_\beta \sum (y_i-X_i'\beta)^2 = \arg\min_\beta \sum (y_i-\beta_0-\beta_1 x_{1i}-\dots-\beta_k x_{ki})^2 \]

The corresponding FOCs generate \(k+1\) equations to identify the \(k+1\) parameters:

\[\begin{aligned} \sum (y_i-X_i'\beta) &= 0 \\ \sum x_{1i}(y_i-X_i'\beta) &= 0 \\ \sum x_{2i}(y_i-X_i'\beta) &= 0 \\ &\vdots \\ \sum x_{ki}(y_i-X_i'\beta) &= 0 \end{aligned} \rightarrow X'(y-X\beta) =0 \rightarrow \hat \beta = (X'X)^{-1}X'y \]

Mata Interlude (for those curious)

Code
frause gpa1, clear                      // load example data
gen one = 1                             // constant term
mata: y = st_data(.,"colgpa")           // outcome vector
mata: x = st_data(.,"hsgpa act one")    // regressor matrix (constant in last column)
mata: xx = x'x ; ixx = invsym(xx) ; xy = x'y 
mata: b = ixx * xy ; b                  // beta hat = (X'X)^{-1} X'y
                 1
    +---------------+
  1 |  .4534558853  |
  2 |  .0094260123  |
  3 |  1.286327767  |
    +---------------+
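
As a quick sanity check (a minimal sketch, assuming the gpa1 data and variables loaded above), the same coefficients should come out of Stata’s built-in command:

Code
reg colgpa hsgpa act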

You got the \(\beta's\), how do you interpret them?

Interpretation of the MLRM is similar to that of the SLRM. In most cases, you simply look at the coefficients and interpret effects in terms of changes:

\[\begin{aligned} y_i = \hat\beta_0 + \hat\beta_1 x_{1i} + \hat\beta_2 x_{2i} + e_i \\ \Delta y_i = \hat\beta_1 \Delta x_{1i} + \hat\beta_2 \Delta x_{2i} + \Delta e_i \end{aligned} \]

Under A1-A5, I can use the expressions above to make interpretations:

  1. \(\hat \beta_0\) has no effect on “changes” of \(y\), only on its level.
  2. \(\hat \beta_1\) indicates how much \(y_i\) will change if \(x_{1i}\) increases by 1 unit, provided both \(\Delta x_{2i}\) and \(\Delta e_i\) remain constant (ceteris paribus).

\(\Delta e_i=0\) by assumption, and \(\Delta x_{2i}=0\) because we are explicitly controlling for it (this change is imputed from the model, an extrapolation, rather than observed).

You could also analyze the effects of \(\Delta x_{1i}\) and \(\Delta x_{2i}\) simultaneously!

Example

Code
qui: frause wage1, clear
qui: reg lwage educ exper tenure
local b0:display %5.3f _b[_cons]    // store formatted coefficients for display
local b1:display %5.3f _b[educ]
local b2:display %5.3f _b[exper]
local b3:display %5.3f _b[tenure]
display "\$log(wage) = `b0' + `b1' educ + `b2' exper + `b3' tenure$"

\(log(wage) = 0.284 + 0.092 educ + 0.004 exper + 0.022 tenure\)

  • \(\beta_0\) has no effect on changes, only on levels.
    • If someone has no education, experience, or tenure, log(wages) will be 0.284. Why not wages? And does it make sense to assume 0 education, experience, and tenure?
  • \(\beta_1\): An additional year of education increases wages by 0.092 log points, or about 9.2%, if experience and tenure do not change (ceteris paribus).

Notes:

  1. Think of interpretations as counterfactuals: \(y_{post} - y_{pre}\)

  2. Assumption: Other factors (the unobserved \(e\)) remain fixed (is this always credible?)

  3. Effects can be combined. What if a person gains 1 year of education but loses 3 years of tenure? (See the worked example below.)
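
Using the estimated coefficients from the example above (0.092 for educ, 0.022 for tenure), the combined change is just the sum of the pieces:

\[\Delta \widehat{log(wage)} = 0.092(1) + 0.022(-3) = 0.092 - 0.066 = 0.026 \approx 2.6\% \]

So the predicted wage would still rise by about 2.6%, ceteris paribus.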

More on Interpretation

Under A1-A5, you can still interpret results as “counterfactuals” at the individual level. However, it is more common to do it based on conditional means:

\[\frac {\Delta E(y|X)}{\Delta X_k} \simeq E(y|X_{-k},X_k+1)-E(y|X) \]

Which mostly changes the language:

The expected effect of a one-unit increase in \(X_k\).
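
In Stata, this conditional-mean language maps to the margins command (a minimal sketch, reusing the wage1 example from before):

Code
qui: frause wage1, clear
qui: reg lwage educ exper tenure
margins, dydx(educ)

Because the model is linear in educ, the average marginal effect reported here equals the coefficient on educ (0.092).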

Alternative Interpretation: Partialling out

  • An alternative way of interpreting (and understanding) the MLRM is the partialling-out interpretation.

  • This interpretation is based on the Frisch-Waugh-Lovell theorem, which states that the following models should give you the SAME \(\beta_1\):

\[\begin{aligned} y &= \color{blue}{\beta_1 } X_1 + \beta_2 X_2 + e \\ (I-P_{X^c_2}) y &= \color{green}{\beta_1} (I-P_{X^c_2}) X_1 + e \\ P_{X^c_2} &= X^c_2 (X^{c\prime}_2 X^{c}_2)^{-1} X^{c\prime}_2 : \text{Projection Matrix} \end{aligned} \]

Partialling out

\(\beta_1\) can be interpreted as the effect of \(X_1\) on \(y\), after all variation related to \(X_2\) has been “eliminated”.

Thus \(\beta_1\) is the effect uniquely driven by \(X_1\).

Example

qui {
  frause oaxaca, clear
  drop if lnwage==.
  reg lnwage educ exper tenure       // full model
  est sto m1
  reg educ        exper tenure       // partial the controls out of educ
  predict r_educ , res
  reg lnwage      exper tenure       // partial the controls out of lnwage
  predict r_lnwage , res
  reg r_lnwage r_educ                // regression on residuals: same coefficient on educ
  est sto m2
  reg lnwage educ                    // simple regression, for comparison
  est sto m3
}
esttab m1 m2 m3, se  

------------------------------------------------------------
                      (1)             (2)             (3)   
                   lnwage        r_lnwage          lnwage   
------------------------------------------------------------
educ               0.0870***                       0.0800***
                (0.00516)                       (0.00539)   

exper              0.0113***                                
                (0.00154)                                   

tenure            0.00837***                                
                (0.00188)                                   

r_educ                             0.0870***                
                                (0.00516)                   

_cons               2.140***     8.93e-10           2.434***
                 (0.0650)        (0.0124)        (0.0636)   
------------------------------------------------------------
N                    1434            1434            1434   
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Estimator Properties: Unbiased

Recall the estimator of the \(\beta's\) when you have multiple explanatory variables:

\[\begin{aligned} 0 &: \hat \beta = (X'X)^{-1} X'y \\ A1 \text{ & } A2 &: \hat \beta = (X'X)^{-1} X'(X\beta + e) \\ 1 &: \hat \beta = (X'X)^{-1} X'X\beta + (X'X)^{-1} X'e \\ A3 &: det(X'X)\neq 0 \rightarrow (X'X)^{-1} \text{ exists} \\ 2 &: \hat \beta = \beta + (X'X)^{-1} X'e \\ A4 &: E(e|X)=0 \rightarrow E[(X'X)^{-1} X'e]=0 \\ 3 &: E(\hat\beta)= \beta \text{ unbiased} \end{aligned} \]

Estimator Properties: Variance under Homoskedasticity

Let's start with (2). The \(\hat\beta's\) are random functions of the errors; thus their variance will depend on \(e\).

\[\begin{aligned} 1 &: \hat \beta = \beta + (X'X)^{-1} X'e \\ 2 &:\hat \beta - \beta = (X'X)^{-1} X'e \\ 3 &: Var(\hat \beta - \beta) = Var((X'X)^{-1} X'e) \\ 4 &: Var(\hat \beta - \beta) = (X'X)^{-1} X' Var(e) X (X'X)^{-1} \\ \end{aligned} \]

\(Var(e)\) contains the variances and covariances of all the \(e_i\)'s.

By assumption A2, \(cov(e_i,e_j)=0\), and by assumption A5, \(Var(e_i)=Var(e_j)=\sigma^2_e\).

\[\begin{aligned} Var(\hat \beta - \beta) &= (X'X)^{-1} X' \sigma_e^2 I X (X'X)^{-1} \\ Var(\hat \beta - \beta) &= \sigma_e^2 (X'X)^{-1} \\ Var(\hat \beta_j - \beta_j) &= \frac{\sigma_e^2}{SST_j (1-R^2_j)} \end{aligned} \]

But we do not know \(\sigma^2_e\). Thus, we also “estimate” it:

\[\hat \sigma^2_e = \frac{\sum \hat e^2}{N-K-1} \]

This is an unbiased estimator of \(\sigma^2_e\) if A1-A5 hold.

\[\begin{aligned} Var(\hat \beta - \beta) &= \sigma_e^2 (X'X)^{-1} \\ Var(\hat \beta_j - \beta_j) &= \frac{\sigma_e^2}{SST_j (1-R^2_j)} \\ & = \frac{\sigma_e^2}{(N-1)Var(X_j) (1-R^2_j)} = \frac{\sigma_e^2}{(N-1)Var(X_j)}VIF_j \end{aligned} \]
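
Continuing the Mata interlude from before (a minimal sketch, assuming y, x, and b are still in memory), these formulas can be computed directly:

Code
mata: e  = y - x*b                      // residuals
mata: s2 = (e'e)/(rows(x)-cols(x))      // sigma^2_e hat = SSR/(N-K-1); the constant is counted in cols(x)
mata: V  = s2 * invsym(x'x)             // Var(beta hat) = sigma^2_e (X'X)^{-1}
mata: sqrt(diagonal(V))                 // standard errors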

To consider:

  • \(Var(\hat\beta)\) increases with \(\sigma_e^2\): more variation in the error, more variation in the coefficients.
  • \(Var(\hat\beta)\) decreases with the sample size \(N\).
  • \(Var(\hat\beta)\) also decreases with the variation in \(X\).
  • However, it increases if there is less unique variation (the multicollinearity problem and the VIF); see the sketch below.
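
A quick way to inspect this in Stata (a minimal sketch, reusing the wage1 example): after regress, estat vif reports \(VIF_j = 1/(1-R^2_j)\) for each regressor.

Code
qui: frause wage1, clear
qui: reg lwage educ exper tenure
estat vif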

Quick Note

  • \(R^2\) is the same as in the SLRM: how much of the variation is explained by the model.
    • Also, \(R^2 = corr(y,\hat y)^2\)
  • The fitted line passes through the “mean” of all variables.
  • The MLRM fits hyper-planes to the data.
  • Regression through the origin is still a bad idea.
  • Also, under A1-A5, OLS is the Best Linear Unbiased Estimator (BLUE).

Using Many controls is Fun!

  • You can add more stuff for a better fit (higher \(R^2\))
  • Making sure nothing remains in “\(e\)”
  • It also allows for very “flexible” models
  • But… (these things are not necessarily good)

Ignoring Variables

In the MLRM framework, it is easier to see what happens when important variables are ignored.

\[\text{True: } y = b_0 + b_1 x_1 + b_2 x_2 + e \]

But instead you estimate the following:

\[\text{Estimated: }y = g_0 + g_1 x_1 + v \]

Unless stronger assumptions are imposed, \(g_1\) will be a biased estimate of \(b_1\).

\[\begin{aligned} \hat g_1 &= \frac{\sum \tilde x_1 \tilde y}{\sum \tilde x_1^2} = \frac{\sum \tilde x_1 (b_1 \tilde x_1 + b_2 \tilde x_2 + e) }{\sum \tilde x_1^2} \\ &= \frac{b_1 \sum \tilde x_1^2}{\sum \tilde x_1^2} + b_2 \frac{\sum \tilde x_1\tilde x_2}{\sum \tilde x_1^2} +\frac{\sum \tilde x_1 e}{\sum \tilde x_1^2} \\ &= b_1+b_2 \delta_1 +\frac{\sum \tilde x_1 e}{\sum \tilde x_1^2} \end{aligned} \]

This implies that \(g_1\) is biased:

\[E(\hat g_1) = b_1+b_2 \delta_1 \]

Where \(\delta_1\) is the coefficient in \(x_2=\delta_0+\delta_1 x_1 + v\).

Implications:

  • Unless

    • \(\delta_1\) is zero (\(x_1\) and \(x_2\) are uncorrelated), or
    • \(b_2\) is zero (\(x_2\) was irrelevant),

    ignoring \(x_2\) will generate biased (and inconsistent) estimates of \(b_1\).

In models with more controls, the direction of the bias is harder to determine, but similar rules of thumb can be used.
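
Since the omitted-variable formula above holds exactly in-sample for OLS, it can be verified directly. A minimal sketch using the wage1 data from earlier, treating educ as \(x_1\) and exper as the omitted \(x_2\):

Code
qui: frause wage1, clear
qui: reg lwage educ exper            // "long" model: b1, b2
local b1 = _b[educ]
local b2 = _b[exper]
qui: reg exper educ                  // auxiliary regression: delta1
local d1 = _b[educ]
qui: reg lwage educ                  // "short" model: g1
display "g1 = " %7.4f _b[educ] "   b1 + b2*delta1 = " %7.4f (`b1' + `b2'*`d1')

The two displayed numbers should match, illustrating \(\hat g_1 = \hat b_1 + \hat b_2 \hat \delta_1\).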

Adding irrelevant controls

Adding irrelevant controls will have no effect on bias and consistency.

If your model is:

\[y=b_0+b_1 x_1 +e \]

but you estimate:

\[y=g_0+g_1 x_1+g_2 x_2 +v \]

your estimates are still unbiased:

\[\begin{aligned} g &= (X'X)^{-1}X'(X \beta^+ + e) \\ \beta^+ &= [\beta \ ; 0] \\ g &= \beta^+ + (X'X)^{-1}X'e \rightarrow E(g) = \beta^+ \end{aligned} \]
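
To see this in action, here is one simulated draw (a minimal sketch; the data-generating process and variable names are made up for illustration):

Code
clear
set seed 123
set obs 1000
gen x1 = rnormal()
gen x2 = rnormal()                  // irrelevant control, unrelated to y
gen y  = 1 + 0.5*x1 + rnormal()
reg y x1 x2                         // coefficient on x1 should be close to 0.5, on x2 close to 0

A single draw is only suggestive; repeating the simulation many times and averaging the estimates would illustrate unbiasedness more convincingly.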

Adding “bad” Controls

The worst case, and the hardest to see, is when you add “bad” controls, also known as colliders.

For example:

  • Say you want to analyze the effect of education on wages, and you control for occupation. Will this give an unbiased estimate for education?
    • No. Your education affects your occupation choice, so some of the effect of education will be “absorbed” by occupation.
  • Say you want to see the impact of health expenditure on health, and you control for “# visits to the doctor”.
    • This may also affect your estimates: expenditure and the number of visits are highly related (expenditure may change how many times you visit).

In general, you want to avoid using “channels” as Controls.

What about Standard Errors?

Omitting relevant variables that are correlated with the \(X's\)

We won’t talk about this here. It violates A4 and creates endogeneity (bias).

Omitting relevant variables that are uncorrelated with the \(X's\)

  • Omitted variables will remain in the error \(e\). Thus the variance of the coefficients will be larger:

\[\begin{aligned} True: & y = b_0 + b_1 x_1 + b_2 x_2 + e \\ Estimated: & y = g_0 + g_1 x_1 + v \\ & Var(e)<Var(v) \rightarrow Var(b_1)<Var(g_1) \end{aligned} \]

Thus, adding controls in randomized experiments is still a good idea!

Adding Irrelevant controls (related to X’s)

Coefficients are unbiased, and \(\hat\sigma^2_e\) will also be unbiased.

However, you may increase multicollinearity in the model, increasing \(R_j^2\) and \(VIF_j\).

The variance of the relevant coefficients will be larger:

\[\begin{aligned} True: & y = b_0 + b_1 x_1 + e \\ Estimated: & y = g_0 + g_1 x_1 + g_2 x_2 + v \\ & Var(b_1)<Var(g_1) \end{aligned} \]

Examples!

Prediction

  • You can use the MLRM to obtain predictions of outcomes.

  • They will be subject to the model specification.

  • For prediction you do not need to worry about “endogeneity” as much, just about predictive power (how??).

qui:frause oaxaca, clear
gen wage = exp(lnwage)
qui:reg wage educ female age agesq single married
predict wage_hat
list wage wage_hat educ female age agesq single married in 1/5
(213 missing values generated)
(option xb assumed; fitted values)

     +----------------------------------------------------------------------+
     |     wage   wage_hat   educ   female   age   agesq   single   married |
     |----------------------------------------------------------------------|
  1. | 41.80602   25.61872      9        1    37    1369        1         0 |
  2. | 36.63003   31.70813      9        0    62    3844        0         1 |
  3. | 23.54788   30.71257   10.5        1    40    1600        0         1 |
  4. | 29.76191    42.7976     12        0    55    3025        0         0 |
  5. | 44.95504   35.76914     12        0    36    1296        0         1 |
     +----------------------------------------------------------------------+

Efficient Market

  • We could use the MLRM to test theories, like the Efficient Market Theory.

  • For housing, the assessed value of a house should contain all the information needed to predict its price (other amenities should not matter).

frause hprice1, clear
qui:reg price assess bdrms llotsize lsqrft colonial
model_display

price_hat = 206.645 + 1.007 assess + 11.404 bdrms + 1.363 llotsize - 38.335 lsqrft + 9.297 colonial

N= 88 R2=0.831

qui:reg lprice lassess bdrms llotsize lsqrft colonial
model_display

lprice_hat = 0.210 + 1.036 lassess + 0.025 bdrms + 0.008 llotsize - 0.092 lsqrft + 0.045 colonial

N= 88 R2=0.777

Testing for Discrimination (CP)

  • We could test for discrimination: Unexplained differences in outcomes once other factors are kept fixed.

  • It does require that groups are similar in terms of unobservables.

qui: frause oaxaca, clear
qui:reg lnwage female 
model_display

lnwage_hat = 3.440 - 0.173 female

N= 1434 R2=0.027

qui:reg lnwage female educ age agesq single married exper tenure
model_display

lnwage_hat = 0.383 - 0.160 female + 0.064 educ + 0.113 age - 0.001 agesq - 0.072 single - 0.094 married - 0.000 exper + 0.007 tenure

N= 1434 R2=0.345

Treatment Evaluation

  • Under random assignment, the SLRM was enough to estimate ATTs.
  • But if assignment was conditionally random, a better approach is to use the MLRM:
frause jtrain98, clear
qui:reg earn98 train 
model_display

earn98_hat = 10.610 - 2.050 train

N= 1130 R2=0.016

qui:reg earn98 train earn96 educ age married
model_display

earn98_hat = 4.667 + 2.411 train + 0.373 earn96 + 0.363 educ - 0.181 age + 2.482 married

N= 1130 R2=0.405

That's all for today

Next Week: Inference and Asymptotics