When A5 Fails
Mathematically: \[Var(e|x=c_1)\neq Var(e|x=c_2)\]
This means the conditional variance of the errors is not constant across values of the controls \(x\).
What happens when you have heteroskedastic errors?
In terms of the \(\beta's\), \(R^2\), and \(R^2_{adj}\): nothing. Coefficients and goodness-of-fit measures are still unbiased and consistent.
But the coefficients' standard errors are based on the simplifying assumption of homoskedasticity. Thus, the estimated variances will be biased!
Setup:
\(y = e\) where \(e \sim N(0,\sigma_e^2h(x))\)
\(x \sim Uniform(-1,1)\)
/*capture program drop sim_het
program sim_het, eclass
clear
set obs 500
gen x = runiform(-1,1)
gen u = rnormal()
** Homoskedastic
gen y_1 = u*2
** variance decreasing first, then increasing (V-shaped in x)
gen y_4 = u*sqrt(9*abs(x))
replace x = x-2
reg y_1 x
matrix b=_b[x],_b[x]/_se[x]
reg y_4 x
matrix b=b,_b[x],_b[x]/_se[x]
matrix coleq b = h0 h0 h3 h3
matrix colname b = b t b t
ereturn post b
end
qui:simulate , reps(1000) dots(100):sim_het
save mdata/simulate.dta, replace*/
use mdata/simulate.dta, replace
two (kdensity h0_b_t) (kdensity h3_b_t) ///
(function y = normalden(x), range(-4 4) lw(2) color(gs5%50)), ///
legend(order(3 "Normal" 1 "With Homoskedasticity" 2 "With Heteroskedasticity"))
graph export images/fig6_1.png, replace height(1000)
So, if errors are heteroskedastic, then all default test statistics (t-stats, F-stats, chi2's) are wrong.
But there are solutions…many solutions.
Some of them are more involved than others.
Before trying them, though, let's first ask…do we have a problem?
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\beta_3 x_3 +e \]
\[Var(e|x)=f(x_1,x_2,\dots,x_k) \sim a_0+a_1x_1 + a_2 x_2 + \dots + a_k x_k+v\]
With this the Null hypothesis is: \[H_0: a_1 = a_2 = \dots = a_k=0 \text{ vs } H_1: H_0 \text{ is false} \]
Easy enough, but do we KNOW \(Var(e|x)\)? Can we model the equation?
We don't!
But we can use \(\hat e^2\) instead. The assumption is that \(\hat e^2\) is a good enough approximation for the conditional variance \(Var(e|x)\).
With this, tests for heteroskedasticity can be implemented using the following recipes: Breusch-Pagan (BP), White, and a modified (special) White test.
Breusch-Pagan (BP) test:
\[\begin{aligned} \text{Model}: y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e \\ \hat e & = y - (\hat \beta_0 + \hat\beta_1 x_1 +\hat \beta_2 x_2 +\hat \beta_3 x_3) \end{aligned} \]
\[\begin{aligned} \hat e^2 & = \gamma_0 + \gamma_1 x_1 +\gamma_2 x_2 +\gamma_3 x_3 + v \\ H_0 &: \gamma_1=\gamma_2=\gamma_3=0 \\ F &= \frac{R^2_{\hat e^2}/k}{(1-R^2_{\hat e^2})/(n-k-1)} \\ LM &=N R^2_{\hat e^2} \sim \chi^2(k) \leftarrow BP-test \end{aligned} \]
White test: the auxiliary regression also includes squares and cross-products of the regressors.
\[\begin{aligned} \text{Model}: y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e \\ \hat e & = y - (\hat \beta_0 + \hat\beta_1 x_1 +\hat \beta_2 x_2 +\hat \beta_3 x_3) \end{aligned} \]
\[\begin{aligned} \hat e^2 & = \gamma_0 + \sum \gamma_{1,k} x_k + \sum \gamma_{2,k} x_k^2 + \sum_k \sum_{j\neq k} \gamma_{3,j,k} x_j x_k + v \\ H_0 &: \text{ All } \gamma's =0 \\ F &= \frac{R^2_{\hat e^2}/q}{(1-R^2_{\hat e^2})/(n-q-1)} \\ LM &=N R^2_{\hat e^2} \sim \chi^2(q) \end{aligned} \]
\(q\) is the number of regressors in the auxiliary regression (not counting the intercept).
Modified (special) White test: use the fitted values \(\hat y\) and their powers instead of all the regressors.
\[\begin{aligned} \text{Model}: y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e \\ \hat e & = y - (\hat \beta_0 + \hat\beta_1 x_1 +\hat \beta_2 x_2 +\hat \beta_3 x_3) \end{aligned} \]
\[\begin{aligned} \hat y &= y - \hat e \\ \hat e^2 & = \gamma_0 + \gamma_1 \hat y + \gamma_2 \hat y^2 + \dots + v \\ H_0 &: \gamma_1 = \gamma_2 = \dots =0 \\ F &= \frac{R^2_{\hat e^2}/ h }{(1-R^2_{\hat e^2})/(n-h-1)} \\ LM &=N R^2_{\hat e^2} \sim \chi^2(h) \end{aligned} \]
\(h\) is the number of regressors in the auxiliary regression (not counting the intercept).
Housing prices:
\[\begin{aligned} price &= \beta_0 + \beta_1 lotsize + \beta_2 sqft + \beta_3 bdrms + e_1 \\ log(price) &= \beta_0 + \beta_1 log(lotsize) + \beta_2 log(sqft) + \beta_3 bdrms + e_2 \\ \end{aligned} \]
frause hprice1, clear
reg price lotsize sqrft bdrms
predict res, res
predict price_hat
gen res2=res^2
display "BP-test"
reg res2 lotsize sqrft bdrms, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "White Test"
reg res2 c.(lotsize sqrft bdrms)##c.(lotsize sqrft bdrms), notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "MWhite Test"
reg res2 price_hat c.price_hat#c.price_hat, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 57.46
Model | 617130.701 3 205710.234 Prob > F = 0.0000
Residual | 300723.805 84 3580.0453 R-squared = 0.6724
-------------+---------------------------------- Adj R-squared = 0.6607
Total | 917854.506 87 10550.0518 Root MSE = 59.833
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lotsize | .0020677 .0006421 3.22 0.002 .0007908 .0033446
sqrft | .1227782 .0132374 9.28 0.000 .0964541 .1491022
bdrms | 13.85252 9.010145 1.54 0.128 -4.065141 31.77018
_cons | -21.77031 29.47504 -0.74 0.462 -80.38466 36.84405
------------------------------------------------------------------------------
(option xb assumed; fitted values)
BP-test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 5.34
Model | 701213780 3 233737927 Prob > F = 0.0020
Residual | 3.6775e+09 84 43780003.5 R-squared = 0.1601
-------------+---------------------------------- Adj R-squared = 0.1301
Total | 4.3787e+09 87 50330276.7 Root MSE = 6616.6
nR^2: 14.092386
p(chi2) 0.003
White Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(9, 78) = 5.39
Model | 1.6784e+09 9 186492378 Prob > F = 0.0000
Residual | 2.7003e+09 78 34619265 R-squared = 0.3833
-------------+---------------------------------- Adj R-squared = 0.3122
Total | 4.3787e+09 87 50330276.7 Root MSE = 5883.8
nR^2: 33.731659
p(chi2) 0.000
MWhite Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(2, 85) = 9.64
Model | 809489395 2 404744697 Prob > F = 0.0002
Residual | 3.5692e+09 85 41991113.9 R-squared = 0.1849
-------------+---------------------------------- Adj R-squared = 0.1657
Total | 4.3787e+09 87 50330276.7 Root MSE = 6480.1
nR^2: 16.268416
p(chi2) 0.000
frause hprice1, clear
reg lprice llotsize lsqrft bdrms
predict res, res
predict price_hat
gen res2=res^2
display "BP-test"
reg res2 llotsize lsqrft bdrms, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "White Test"
reg res2 c.(llotsize lsqrft bdrms)##c.(llotsize lsqrft bdrms), notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "MWhite Test"
reg res2 price_hat c.price_hat#c.price_hat, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 50.42
Model | 5.15504028 3 1.71834676 Prob > F = 0.0000
Residual | 2.86256324 84 .034078134 R-squared = 0.6430
-------------+---------------------------------- Adj R-squared = 0.6302
Total | 8.01760352 87 .092156362 Root MSE = .1846
------------------------------------------------------------------------------
lprice | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
llotsize | .1679667 .0382812 4.39 0.000 .0918404 .244093
lsqrft | .7002324 .0928652 7.54 0.000 .5155597 .8849051
bdrms | .0369584 .0275313 1.34 0.183 -.0177906 .0917074
_cons | -1.297042 .6512836 -1.99 0.050 -2.592191 -.001893
------------------------------------------------------------------------------
(option xb assumed; fitted values)
BP-test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 1.41
Model | .022620168 3 .007540056 Prob > F = 0.2451
Residual | .448717194 84 .005341871 R-squared = 0.0480
-------------+---------------------------------- Adj R-squared = 0.0140
Total | .471337362 87 .005417671 Root MSE = .07309
nR^2: 4.2232484
p(chi2) 0.238
White Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(9, 78) = 1.05
Model | .051147864 9 .005683096 Prob > F = 0.4053
Residual | .420189497 78 .005387045 R-squared = 0.1085
-------------+---------------------------------- Adj R-squared = 0.0057
Total | .471337362 87 .005417671 Root MSE = .0734
nR^2: 9.5494489
p(chi2) 0.388
MWhite Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(2, 85) = 1.73
Model | .018464046 2 .009232023 Prob > F = 0.1830
Residual | .452873315 85 .005327921 R-squared = 0.0392
-------------+---------------------------------- Adj R-squared = 0.0166
Total | .471337362 87 .005417671 Root MSE = .07299
nR^2: 3.4472889
p(chi2) 0.178
Can you do this in Stata? Yes: estat hettest. But look into the options; there are many more options in that command (and in estat imtest).
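A minimal sketch of the built-in tests after the levels regression (standard estat post-estimation commands; see help estat hettest for the full set of options):
** built-in tests after the levels regression (minimal sketch)
qui: reg price lotsize sqrft bdrms
estat hettest              // Breusch-Pagan, based on fitted values
estat hettest, rhs iid     // BP on the regressors, N*R^2 form
estat imtest, white        // White test (levels, squares, cross-products)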
If we do have a problem, we need a fix!
So let's first learn how to fix it, assuming (for now) that we know the form of the heteroskedasticity:
\[Var(e|X)=h(x)\sigma^2_e \]
\[\begin{aligned} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 +e \\ Var(e|x) &= x_1 \sigma_e^2, \text{ i.e. } h(x)=x_1 \end{aligned} \]
You can correct heteroskedasticity in two ways (a sketch of both routes follows):
1. Transform the data: divide \(y\) and every regressor (including the constant) by \(\sqrt{h(x)}\), then run OLS on the transformed variables. The new error is homoskedastic (but the model has no constant)!
2. Run a weighted regression with weights \(1/h(x)\). Same solution as before, and there is no need to "transform" data, or keep track of a constant.
This is often called WLS (Weighted Least Squares) or GLS (Generalized Least Squares).
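A minimal sketch of the two routes, with hypothetical variables y, x1, x2 and assuming \(h(x)=x_1>0\) (both give the same coefficient estimates):
** Route 1: transform the data by 1/sqrt(h(x)) and run OLS without a constant
gen w      = 1/sqrt(x1)
gen y_t    = y*w
gen cons_t = w
gen x1_t   = x1*w
gen x2_t   = x2*w
reg y_t cons_t x1_t x2_t, nocons
** Route 2: weighted least squares with analytic weights 1/h(x)
reg y x1 x2 [aw=1/x1]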
If \(h(x)\) is unknown, we can model and estimate it, and then run WLS with \(\hat h(x)\): this is Feasible GLS (FGLS).
\[\begin{aligned} Var(e|x) &= \sigma^2 \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 +\dots) \\ \hat e^2 &= \sigma^2 \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 +\dots) \exp v \\ \log(\hat e^2) &= \delta_0 + \delta_1 x_1 + \delta_2 x_2 +\dots+ v \\ \log(\hat e^2) &= \delta_0 + \delta_1 \hat y + \delta_2 \hat y^2 + \dots+ v \\ \widehat{\log h(x)} &= \hat \delta_0 + \hat \delta_1 x_1 + \hat \delta_2 x_2 + \dots = x \hat \delta \\ \hat h(x) &= \exp (x \hat \delta) \text{ or } \hat h(x)=e^{x \hat \delta} \end{aligned} \]
Recall the "long" variance formula:
\[Var(\beta)=\color{brown}{(X'X)^{-1}}\color{green}{X}'\color{red}{Var(e|X)}\color{green}{X}\color{brown}{(X'X)^{-1}} \]
\[\begin{aligned} Var_{gls/fgls}(\beta)&=\sigma^2_{\tilde e} \color{brown}{(X'X)^{-1}}\color{green}{X}'\color{red}{ \Omega_h(x) }\color{green}{X}\color{brown}{(X'X)^{-1}} \\ \sigma^2_{\tilde e} &= \frac{1}{N-k-1} \sum \frac{\hat e^2}{h(x)} \\ \Omega_h(x) [i,j] &= h(x_i) & \text{ if } i=j \\ & = 0 & \text{ if } i\neq j \\ \end{aligned} \]
Let me present to you the Sandwich formula: \[Var(\beta)=c \color{brown}{(X'X)^{-1}}\color{green}{X}'\color{red}{\Omega}\color{green}{X}\color{brown}{(X'X)^{-1}} \]
\[\begin{aligned} \Omega [i,j] &= \hat e_i^2 & \text{ if } i=j \\ & = 0 & \text{ if } i\neq j \\ \end{aligned} \]
The best available approximation to the conditional variance of each observation is \(\hat e_i^2\) (plus the assumption of no correlation across observations).
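A minimal sketch of this sandwich computed by hand with hypothetical variables y, x1, x2 (this gives HC0, i.e. \(c=1\); Stata's robust option adds a small-sample correction):
qui: reg y x1 x2
predict ehat, res
gen ehat2 = ehat^2
matrix accum XX  = x1 x2                   // X'X (constant added automatically)
matrix accum XOX = x1 x2 [iw=ehat2]        // X' diag(ehat^2) X
matrix V_hc0 = invsym(XX)*XOX*invsym(XX)   // the sandwich
matrix list V_hc0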
This is valid in large samples, but can perform quite poorly in small ones.
There are other versions: see HC0, HC1, HC2, and HC3.
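In practice, you just ask regress for a robust VCE; a minimal sketch with hypothetical variables:
reg y x1 x2, vce(robust)   // HC1: Stata's default robust VCE
reg y x1 x2, vce(hc2)      // HC2
reg y x1 x2, vce(hc3)      // HC3, often recommended for smaller samples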
For hypothesis tests, instead…use the long formula:
\[\begin{aligned} H_0: & R_{q,k+1}\beta_{k+1,1}=c_{q,1} \\ \Sigma_R &= R_{q,k+1} V^r_\beta R'_{q,k+1} \\ F-stat &= \frac 1 q (R\beta-c)' \Sigma_R^{-1} (R\beta-c) \end{aligned} \]
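In Stata, test after a regression with vce(robust) builds this Wald statistic from the stored (robust) variance; a minimal sketch with hypothetical variables:
qui: reg y x1 x2 x3, vce(robust)
test x1 x2       // H0: _b[x1] = _b[x2] = 0, using the robust VCE
test x1 = x2     // a single linear restriction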
Prediction SE:
For Prediction with Logs
\[\hat y_i = \exp \left( \widehat{log y_i}+\hat \sigma_{\tilde e}^2 \hat h_i /2 \right) \]
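A minimal sketch of this retransformation after an FGLS regression of log(y), with hypothetical variables ly, x1, x2 and an estimated weight hx (sig2_t follows the \(\sigma^2_{\tilde e}\) formula above):
qui: reg ly x1 x2 [aw=1/hx]
predict ly_hat, xb
predict e_t, res
gen e2_h = e_t^2/hx               // ehat^2 / h(x)
qui: sum e2_h
scalar sig2_t = r(sum)/e(df_r)    // sigma^2 of the transformed error
gen y_hat = exp(ly_hat + sig2_t*hx/2)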
frause smoke, clear
gen age_40sq=(age-40)^2
** Default
qui:reg cigs lincome lcigpric educ age age_40sq restaurn
est sto m1
predict cig_hat
predict cig_res,res
** GLS: h(x)=lincome Weighted
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/lincome]
est sto m2
** FGLS: h(x) = f(cigs_hat)
gen lcres=log(cig_res^2)
qui:reg lcres c.cig_hat##c.cig_hat##c.cig_hat
predict aux
gen hx=exp(aux)
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/hx]
est sto m3
qui:reg cigs lincome lcigpric educ age age_40sq restaurn , robust
est sto m4
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/lincome], robust
est sto m5
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/hx], robust
est sto m6
set linesize 255
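The comparison table below was presumably produced with something like esttab from the community-contributed estout package (ssc install estout); a hypothetical sketch:
esttab m1 m2 m3 m4 m5 m6, mtitles(default GLS FGLS Rob GLS-Rob FGLS-Rob) ///
    cells(b(star fmt(3)) se(par fmt(3)) p(fmt(3))) stats(N)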
------------------------------------------------------------------------------------------------------------
default GLS FGLS Rob GLS-Rob FGLS-Rob
b/se/p b/se/p b/se/p b/se/p b/se/p b/se/p
------------------------------------------------------------------------------------------------------------
lincome 0.880 0.926 1.005** 0.880 0.926* 1.005
(0.728) (0.672) (0.422) (0.596) (0.559) (0.651)
[0.227] [0.169] [0.017] [0.140] [0.098] [0.123]
lcigpric -0.751 -1.525 -4.572 -0.751 -1.525 -4.572
(5.773) (5.696) (4.260) (6.035) (6.067) (9.651)
[0.897] [0.789] [0.284] [0.901] [0.802] [0.636]
educ -0.501*** -0.477*** -0.610*** -0.501*** -0.477*** -0.610***
(0.167) (0.166) (0.115) (0.162) (0.159) (0.115)
[0.003] [0.004] [0.000] [0.002] [0.003] [0.000]
age 0.049 0.048 0.041 0.049* 0.048 0.041
(0.034) (0.033) (0.026) (0.030) (0.029) (0.032)
[0.146] [0.147] [0.108] [0.099] [0.101] [0.197]
age_40sq -0.009*** -0.009*** -0.007*** -0.009*** -0.009*** -0.007***
(0.002) (0.002) (0.001) (0.001) (0.001) (0.002)
[0.000] [0.000] [0.000] [0.000] [0.000] [0.001]
restaurn -2.825** -2.776** -3.383*** -2.825*** -2.776*** -3.383***
(1.112) (1.108) (0.722) (1.008) (0.992) (0.696)
[0.011] [0.012] [0.000] [0.005] [0.005] [0.000]
_cons 10.797 13.184 25.712 10.797 13.184 25.712
(24.145) (23.656) (17.120) (25.401) (25.478) (41.539)
[0.655] [0.577] [0.134] [0.671] [0.605] [0.536]
------------------------------------------------------------------------------------------------------------
N 807 807 807 807 807 807
------------------------------------------------------------------------------------------------------------
frause gpa1, clear
** LPM
gen parcoll = (fathcoll | mothcoll)
reg pc hsgpa act parcoll
predict res_1, res
predict pchat
gen hx = pchat*(1-pchat)    // WLS weight for the LPM: h(x) = p(1-p)
sum pchat hx
reg pc hsgpa act parcoll [aw=1/hx]
predict res_2, res
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 1.98
Model | 1.40186813 3 .467289377 Prob > F = 0.1201
Residual | 32.3569971 137 .236182461 R-squared = 0.0415
-------------+---------------------------------- Adj R-squared = 0.0205
Total | 33.7588652 140 .241134752 Root MSE = .48599
------------------------------------------------------------------------------
pc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
hsgpa | .0653943 .1372576 0.48 0.635 -.2060231 .3368118
act | .0005645 .0154967 0.04 0.971 -.0300792 .0312082
parcoll | .2210541 .092957 2.38 0.019 .037238 .4048702
_cons | -.0004322 .4905358 -0.00 0.999 -.970433 .9695686
------------------------------------------------------------------------------
(option xb assumed; fitted values)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
pchat | 141 .3971631 .1000667 .1700624 .4974409
hx | 141 .2294822 .0309768 .1411412 .2499934
(analytic weights assumed)
(sum of wgt is 628.1830743667746)
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 2.22
Model | 1.54663033 3 .515543445 Prob > F = 0.0882
Residual | 31.7573194 137 .231805251 R-squared = 0.0464
-------------+---------------------------------- Adj R-squared = 0.0256
Total | 33.3039497 140 .237885355 Root MSE = .48146
------------------------------------------------------------------------------
pc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
hsgpa | .0327029 .1298817 0.25 0.802 -.2241292 .289535
act | .004272 .0154527 0.28 0.783 -.0262847 .0348286
parcoll | .2151862 .0862918 2.49 0.014 .04455 .3858224
_cons | .0262099 .4766498 0.05 0.956 -.9163323 .9687521
------------------------------------------------------------------------------
** Testing for Heteroskedasticity
replace res_1 = res_1^2
replace res_2 = res_2^2/hx
display "Default"
reg res_1 hsgpa act parcoll, notable
display "Weighted"
reg res_2 hsgpa act parcoll, notable
(141 real changes made)
(141 real changes made)
Default
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 2.82
Model | .133163365 3 .044387788 Prob > F = 0.0412
Residual | 2.15497574 137 .01572975 R-squared = 0.0582
-------------+---------------------------------- Adj R-squared = 0.0376
Total | 2.28813911 140 .016343851 Root MSE = .12542
Weighted
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 0.63
Model | .874194807 3 .291398269 Prob > F = 0.5980
Residual | 63.5472068 137 .463848225 R-squared = 0.0136
-------------+---------------------------------- Adj R-squared = -0.0080
Total | 64.4214016 140 .460152868 Root MSE = .68106
Next Week: Problems of Specification