When A5 Fails
Mathematically: \[Var(e|x=c_1)\neq Var(e|x=c_2)\]
This means the conditional variance of the errors is not constant across values of the explanatory variables.
What happens when you have heteroskedastic errors?
In terms of the \(\beta\)'s, \(R^2\), and \(R^2_{adj}\): nothing. Coefficients and goodness-of-fit measures are still unbiased and consistent.
But the usual formulas for the coefficients' standard errors rely on the simplifying assumption of homoskedasticity, so the estimated variances will be biased and test statistics will not have their usual distributions.
Setup:
\(y = e\) where \(e \sim N(0,\sigma_e^2h(x))\)
\(x \sim Uniform(-1,1)\)
/* Simulation (already run once; results saved to mdata/simulate.dta)
capture program drop sim_het
program sim_het, eclass
clear
set obs 500
gen x = runiform(-1,1)
gen u = rnormal()
** Homoskedastic: Var(y_1|x) = 4
gen y_1 = u*2
** Heteroskedastic: Var(y_4|x) = 9*abs(x), decreasing toward x=0, increasing afterwards
gen y_4 = u*sqrt(9*abs(x))
replace x = x-2
reg y_1 x
matrix b=_b[x],_b[x]/_se[x]
reg y_4 x
matrix b=b,_b[x],_b[x]/_se[x]
** h0 = homoskedastic case, h3 = heteroskedastic case
matrix coleq b = h0 h0 h3 h3
matrix colname b = b t b t
ereturn post b
end
qui:simulate , reps(1000) dots(100):sim_het
save mdata/simulate.dta, replace
*/
use mdata/simulate.dta, clear
two (kdensity h0_b_t) (kdensity h3_b_t) ///
(function y = normalden(x), range(-4 4) lw(2) color(gs5%50)), ///
legend(order(3 "Normal" 1 "With Homoskedasticity" 2 "With Heteroskedasticity"))
graph export images/fig6_1.png, replace height(1000)
So, if errors are heteroskedastic, then the usual test statistics (t-stats, F-stats, chi2's) are wrong.
But there are solutions…many solutions.
Some of them are more involved than others.
But before trying any of them, let's first ask…do we have a problem?
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\beta_3 x_3 +e \]
\[Var(e|x)=f(x_1,x_2,\dots,x_k) \sim a_0+a_1x_1 + a_2 x_2 + \dots + a_k x_k+v\]
With this, the null hypothesis is: \[H_0: a_1 = a_2 = \dots = a_k=0 \text{ vs } H_1: H_0 \text{ is false} \]
Easy enough, but do we KNOW \(Var(e|x)\)? Can we model that equation?
We don't!
But we can use \(\hat e^2\) instead. The assumption is that \(\hat e^2\) is a good enough approximation for the conditional variance \(Var(e|x)\).
With this, the test for heteroskedasticity can be implemented using the following recipe.
\[\begin{aligned} \text{Model}: y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e \\ \hat e & = y - (\hat \beta_0 + \hat\beta_1 x_1 +\hat \beta_2 x_2 +\hat \beta_3 x_3) \end{aligned} \]
\[\begin{aligned} \hat e^2 & = \gamma_0 + \gamma_1 x_1 +\gamma_2 x_2 +\gamma_3 x_3 + v \\ H_0 &: \gamma_1=\gamma_2=\gamma_3=0 \\ F &= \frac{R^2_{\hat e^2}/k}{(1-R^2_{\hat e^2})/(n-k-1)} \\ LM &=N R^2_{\hat e^2} \sim \chi^2(k) \leftarrow \text{BP test} \end{aligned} \]
The White test uses the same model and \(\hat e\), but adds the squares and cross products of the regressors to the auxiliary regression:
\[\begin{aligned} \hat e^2 & = \gamma_0 + \sum \gamma_{1,k} x_k + \sum \gamma_{2,k} x_k^2 + \sum_k \sum_{j\neq k} \gamma_{3,j,k} x_j x_k + v \\ H_0 &: \text{ All } \gamma's =0 \\ F &= \frac{R^2_{\hat e^2}/q}{(1-R^2_{\hat e^2})/(n-q-1)} \\ LM &=N R^2_{\hat e^2} \sim \chi^2(q) \end{aligned} \]
\(q\) is the total number of regressors in the auxiliary regression (not counting the intercept).
The special (modified) White test again uses the same model and \(\hat e\), but replaces the regressors with the fitted values \(\hat y\) and their powers:
\[\begin{aligned} \hat y &= y - \hat e \\ \hat e^2 & = \gamma_0 + \gamma_1 \hat y + \gamma_2 \hat y^2 + \dots + v \\ H_0 &: \gamma_1 = \gamma_2 = \dots =0 \\ F &= \frac{R^2_{\hat e^2}/ h }{(1-R^2_{\hat e^2})/(n-h-1)} \\ LM &=N R^2_{\hat e^2} \sim \chi^2(h) \end{aligned} \]
\(h\) is the total number of regressors in the auxiliary regression (not counting the intercept).
Housing prices:
\[\begin{aligned} price &= \beta_0 + \beta_1 lotsize + \beta_2 sqft + \beta_3 bdrms + e_1 \\ log(price) &= \beta_0 + \beta_1 log(lotsize) + \beta_2 log(sqft) + \beta_3 bdrms + e_2 \\ \end{aligned} \]
frause hprice1, clear
reg price lotsize sqrft bdrms
predict res, res
predict price_hat
gen res2=res^2
display "BP-test"
reg res2 lotsize sqrft bdrms, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "White Test"
reg res2 c.(lotsize sqrft bdrms)##c.(lotsize sqrft bdrms), notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "MWhite Test"
reg res2 price_hat c.price_hat#c.price_hat, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 57.46
Model | 617130.701 3 205710.234 Prob > F = 0.0000
Residual | 300723.805 84 3580.0453 R-squared = 0.6724
-------------+---------------------------------- Adj R-squared = 0.6607
Total | 917854.506 87 10550.0518 Root MSE = 59.833
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lotsize | .0020677 .0006421 3.22 0.002 .0007908 .0033446
sqrft | .1227782 .0132374 9.28 0.000 .0964541 .1491022
bdrms | 13.85252 9.010145 1.54 0.128 -4.065141 31.77018
_cons | -21.77031 29.47504 -0.74 0.462 -80.38466 36.84405
------------------------------------------------------------------------------
(option xb assumed; fitted values)
BP-test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 5.34
Model | 701213780 3 233737927 Prob > F = 0.0020
Residual | 3.6775e+09 84 43780003.5 R-squared = 0.1601
-------------+---------------------------------- Adj R-squared = 0.1301
Total | 4.3787e+09 87 50330276.7 Root MSE = 6616.6
nR^2: 14.092386
p(chi2) 0.003
White Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(9, 78) = 5.39
Model | 1.6784e+09 9 186492378 Prob > F = 0.0000
Residual | 2.7003e+09 78 34619265 R-squared = 0.3833
-------------+---------------------------------- Adj R-squared = 0.3122
Total | 4.3787e+09 87 50330276.7 Root MSE = 5883.8
nR^2: 33.731659
p(chi2) 0.000
MWhite Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(2, 85) = 9.64
Model | 809489395 2 404744697 Prob > F = 0.0002
Residual | 3.5692e+09 85 41991113.9 R-squared = 0.1849
-------------+---------------------------------- Adj R-squared = 0.1657
Total | 4.3787e+09 87 50330276.7 Root MSE = 6480.1
nR^2: 16.268416
p(chi2) 0.000
frause hprice1, clear
reg lprice llotsize lsqrft bdrms
predict res, res
predict price_hat
gen res2=res^2
display "BP-test"
reg res2 llotsize lsqrft bdrms, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "White Test"
reg res2 c.(llotsize lsqrft bdrms)##c.(llotsize lsqrft bdrms), notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
display "MWhite Test"
reg res2 price_hat c.price_hat#c.price_hat, notable
display "nR^2: " e(N)*e(r2)
display "p(chi2) " %5.3f chi2tail(e(df_m),e(N)*e(r2))
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 50.42
Model | 5.15504028 3 1.71834676 Prob > F = 0.0000
Residual | 2.86256324 84 .034078134 R-squared = 0.6430
-------------+---------------------------------- Adj R-squared = 0.6302
Total | 8.01760352 87 .092156362 Root MSE = .1846
------------------------------------------------------------------------------
lprice | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
llotsize | .1679667 .0382812 4.39 0.000 .0918404 .244093
lsqrft | .7002324 .0928652 7.54 0.000 .5155597 .8849051
bdrms | .0369584 .0275313 1.34 0.183 -.0177906 .0917074
_cons | -1.297042 .6512836 -1.99 0.050 -2.592191 -.001893
------------------------------------------------------------------------------
(option xb assumed; fitted values)
BP-test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(3, 84) = 1.41
Model | .022620168 3 .007540056 Prob > F = 0.2451
Residual | .448717194 84 .005341871 R-squared = 0.0480
-------------+---------------------------------- Adj R-squared = 0.0140
Total | .471337362 87 .005417671 Root MSE = .07309
nR^2: 4.2232484
p(chi2) 0.238
White Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(9, 78) = 1.05
Model | .051147864 9 .005683096 Prob > F = 0.4053
Residual | .420189497 78 .005387045 R-squared = 0.1085
-------------+---------------------------------- Adj R-squared = 0.0057
Total | .471337362 87 .005417671 Root MSE = .0734
nR^2: 9.5494489
p(chi2) 0.388
MWhite Test
Source | SS df MS Number of obs = 88
-------------+---------------------------------- F(2, 85) = 1.73
Model | .018464046 2 .009232023 Prob > F = 0.1830
Residual | .452873315 85 .005327921 R-squared = 0.0392
-------------+---------------------------------- Adj R-squared = 0.0166
Total | .471337362 87 .005417671 Root MSE = .07299
nR^2: 3.4472889
p(chi2) 0.178
Can you do this in Stata? Yes, with estat hettest. But look into the options; that command offers several more variations.
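For example, after fitting the housing-price model, the built-in tests can be requested like this (a quick sketch; estat hettest defaults to the Breusch–Pagan/Cook–Weisberg test based on the fitted values):
* Built-in heteroskedasticity tests (sketch)
frause hprice1, clear
qui: reg price lotsize sqrft bdrms
estat hettest                   // BP/Cook-Weisberg test using fitted values
estat hettest, rhs iid          // BP test on the regressors, N*R^2 (LM) version
estat imtest, white             // White's test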
And if we do have a problem, we need to fix it. So let's learn how to fix it first.
Suppose the form of the heteroskedasticity is known: \[Var(e|X)=h(x)\sigma^2_e \]
For example: \[\begin{aligned} y &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 +e \\ Var(e|x) &=x_1 \sigma_e^2, \text{ i.e. } h(x)=x_1 \end{aligned} \]
You can correct for heteroskedasticity in two ways (see the sketch below):
(1) Transform the data: divide the whole equation (including the constant) by \(\sqrt{h(x)}\) and run OLS on the transformed variables. The new error, \(e/\sqrt{h(x)}\), is homoskedastic (but the transformed model has no constant)!
(2) Run least squares with weights \(1/h(x)\). Same solution as before, and there is no need to "transform" data, or keep track of a constant.
This is often called WLS (weighted least squares) or GLS (generalized least squares).
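A minimal sketch of both fixes, using simulated data where \(h(x)=x_1\) (all variable names and the data-generating process here are illustrative, not from the example above):
* Simulated data with Var(e|x) = x1*sigma^2, i.e. h(x) = x1 (x1 > 0)
clear
set obs 500
set seed 123
gen x1 = runiform(1,2)
gen x2 = rnormal()
gen x3 = rnormal()
gen y  = 1 + 0.5*x1 - x2 + x3 + sqrt(x1)*rnormal()
* Way 1: transform everything (including the constant) by 1/sqrt(h(x)) and run OLS
gen sh   = sqrt(x1)
gen y_t  = y/sh
gen c_t  = 1/sh
gen x1_t = x1/sh
gen x2_t = x2/sh
gen x3_t = x3/sh
reg y_t c_t x1_t x2_t x3_t, noconstant
* Way 2: weighted least squares with analytic weights 1/h(x); same point estimates
reg y x1 x2 x3 [aw=1/x1]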
But usually we do not know \(h(x)\), so we estimate it. This is feasible GLS (FGLS):
\[\begin{aligned} Var(e|x) &= \sigma^2 \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 +\dots) \\ \hat e^2 &= \sigma^2 \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 +\dots) \exp(v) \\ \log(\hat e^2) &= \delta_0 + \delta_1 x_1 + \delta_2 x_2 +\dots+ v \\ \text{or } \log(\hat e^2) &= \delta_0 + \delta_1 \hat y + \delta_2 \hat y^2 + \dots+ v \\ \widehat{\log h(x)} &= \hat \delta_0 + \hat \delta_1 x_1 + \hat \delta_2 x_2 + \dots = x \hat \delta \\ \hat h(x) &= \exp (x \hat \delta) \text{ or } \hat h(x)=e^{x \hat \delta} \end{aligned} \]
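The cigarette-demand example further below follows exactly these steps. As a quick sketch of the recipe, applied to the level housing-price model that failed the tests above:
* FGLS sketch on the level housing-price model
frause hprice1, clear
qui: reg price lotsize sqrft bdrms          // 1. OLS
predict ehat, res                           // 2. residuals
predict yhat, xb
gen le2 = log(ehat^2)
qui: reg le2 c.yhat##c.yhat                 // 3. model log(e^2) with yhat and yhat^2
predict lhx, xb
gen hx = exp(lhx)                           // 4. estimated h(x)
reg price lotsize sqrft bdrms [aw=1/hx]     // 5. WLS with weights 1/h(x)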
Recall “Long” variance formula:
\[Var(\beta)=\color{brown}{(X'X)^{-1}}\color{green}{X}'\color{red}{Var(e|X)}\color{green}{X}\color{brown}{(X'X)^{-1}} \]
\[\begin{aligned} Var_{gls/fgls}(\beta)&=\sigma^2_{\tilde e} \color{brown}{(X'X)^{-1}}\color{green}{X}'\color{red}{ \Omega_h(x) }\color{green}{X}\color{brown}{(X'X)^{-1}} \\ \sigma^2_{\tilde e} &= \frac{1}{N-k-1} \sum \frac{\hat e^2}{h(x)} \\ \Omega_h(x) [i,j] &= h(x_i) & \text{ if } i=j \\ & = 0 & \text{ if } i\neq j \\ \end{aligned} \]
Let me present to you the Sandwich Formula: \[Var(\beta)=c \color{brown}{(X'X)^{-1}}\color{green}{X}'\color{red}{\Omega}\color{green}{X}\color{brown}{(X'X)^{-1}} \]
\[\begin{aligned} \Omega [i,j] &= \hat e_i^2 & \text{ if } i=j \\ & = 0 & \text{ if } i\neq j \\ \end{aligned} \]
The approximation used for each observation's conditional variance is simply \(\hat e_i^2\) (plus the assumption of no correlation across observations).
This is valid in large samples, but can be really bad in smaller ones.
There are other versions; see HC0, HC1, HC2, and HC3.
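In Stata, robust (sandwich) standard errors are requested through the vce() option; HC1 is what the usual robust option computes, and the small-sample variants are available as well. A quick sketch with the housing-price model:
* Heteroskedasticity-robust standard errors
frause hprice1, clear
reg price lotsize sqrft bdrms, vce(robust)   // HC1 (Stata's default robust correction)
reg price lotsize sqrft bdrms, vce(hc2)      // HC2
reg price lotsize sqrft bdrms, vce(hc3)      // HC3, often preferred in small samples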
And for hypothesis tests, instead of the usual F formula, use the long (robust) variance:
\[\begin{aligned} H_0: & R_{q,k+1}\beta_{k+1,1}=c_{q,1} \\ \Sigma_R &= R_{q,k+1} V^r_\beta R'_{q,k+1} \\ F-stat &= \frac 1 q (R\beta-c)' \Sigma_R^{-1} (R\beta-c) \end{aligned} \]
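In Stata this happens automatically: after estimating with a robust VCE, the test command builds the Wald/F statistic from the stored robust variance matrix. For instance (hypothesis chosen only for illustration):
* Robust Wald test of joint significance
frause hprice1, clear
qui: reg price lotsize sqrft bdrms, vce(robust)
test lotsize sqrft bdrms      // F-stat built from the robust e(V)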
Prediction standard errors are also affected. In particular, for prediction with logs, the retransformation now uses the observation-specific variance \(\hat \sigma_{\tilde e}^2 \hat h_i\):
\[\hat y_i = \exp \left( \widehat{log y_i}+\hat \sigma_{\tilde e}^2 \hat h_i /2 \right) \]
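A minimal sketch of that retransformation, applied (purely for illustration, since the tests above did not reject homoskedasticity for the log model) to an FGLS fit of the log housing-price equation; every variable name and step below is an assumption of this sketch, not part of the original code:
* FGLS on the log model, then retransform predictions back to levels
frause hprice1, clear
qui: reg lprice llotsize lsqrft bdrms
predict e0, res
gen le2 = log(e0^2)
qui: reg le2 llotsize lsqrft bdrms
predict lhx, xb
gen hx = exp(lhx)                               // estimated h(x)
qui: reg lprice llotsize lsqrft bdrms [aw=1/hx] // FGLS fit
predict lp_hat, xb                              // fitted log(price)
predict ehat, res
gen wres2 = ehat^2/hx
qui: sum wres2
local s2 = r(sum)/(e(N)-e(df_m)-1)              // sigma^2_tilde = sum(e^2/h)/(N-k-1)
gen price_hat_adj = exp(lp_hat + `s2'*hx/2)     // retransformed prediction of price
The full cigarette-demand example below compares the default, GLS, FGLS, and robust approaches: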
frause smoke, clear
gen age_40sq=(age-40)^2
** Default
qui:reg cigs lincome lcigpric educ age age_40sq restaurn
est sto m1
predict cig_hat
predict cig_res,res
** GLS: h(x)=lincome Weighted
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/lincome]
est sto m2
** FGLS: h(x) = f(cigs_hat)
gen lcres=log(cig_res^2)
qui:reg lcres c.cig_hat##c.cig_hat##c.cig_hat
predict aux
gen hx=exp(aux)
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/hx]
est sto m3
qui:reg cigs lincome lcigpric educ age age_40sq restaurn , robust
est sto m4
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/lincome], robust
est sto m5
qui:reg cigs lincome lcigpric educ age age_40sq restaurn [aw=1/hx], robust
est sto m6
set linesize 255
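The source does not show the command that produced the comparison table below; it was presumably an esttab/estout call over the six stored models, along these lines (the options here are a guess at the b/se/p layout):
* Hypothetical reconstruction of the table command (not in the original)
esttab m1 m2 m3 m4 m5 m6, mtitle(default GLS FGLS Rob GLS-Rob FGLS-Rob) ///
    cells(b(star fmt(3)) se(par fmt(3)) p(fmt(3)))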
------------------------------------------------------------------------------------------------------------
default GLS FGLS Rob GLS-Rob FGLS-Rob
b/se/p b/se/p b/se/p b/se/p b/se/p b/se/p
------------------------------------------------------------------------------------------------------------
lincome 0.880 0.926 1.005** 0.880 0.926* 1.005
(0.728) (0.672) (0.422) (0.596) (0.559) (0.651)
[0.227] [0.169] [0.017] [0.140] [0.098] [0.123]
lcigpric -0.751 -1.525 -4.572 -0.751 -1.525 -4.572
(5.773) (5.696) (4.260) (6.035) (6.067) (9.651)
[0.897] [0.789] [0.284] [0.901] [0.802] [0.636]
educ -0.501*** -0.477*** -0.610*** -0.501*** -0.477*** -0.610***
(0.167) (0.166) (0.115) (0.162) (0.159) (0.115)
[0.003] [0.004] [0.000] [0.002] [0.003] [0.000]
age 0.049 0.048 0.041 0.049* 0.048 0.041
(0.034) (0.033) (0.026) (0.030) (0.029) (0.032)
[0.146] [0.147] [0.108] [0.099] [0.101] [0.197]
age_40sq -0.009*** -0.009*** -0.007*** -0.009*** -0.009*** -0.007***
(0.002) (0.002) (0.001) (0.001) (0.001) (0.002)
[0.000] [0.000] [0.000] [0.000] [0.000] [0.001]
restaurn -2.825** -2.776** -3.383*** -2.825*** -2.776*** -3.383***
(1.112) (1.108) (0.722) (1.008) (0.992) (0.696)
[0.011] [0.012] [0.000] [0.005] [0.005] [0.000]
_cons 10.797 13.184 25.712 10.797 13.184 25.712
(24.145) (23.656) (17.120) (25.401) (25.478) (41.539)
[0.655] [0.577] [0.134] [0.671] [0.605] [0.536]
------------------------------------------------------------------------------------------------------------
N 807 807 807 807 807 807
------------------------------------------------------------------------------------------------------------
frause gpa1, clear
** LPM
gen parcoll = (fathcoll | mothcoll)
reg pc hsgpa act parcoll
predict res_1, res
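The listing omits the WLS step that creates pchat, hx, and res_2, which appear in the output and in the test code below. Based on that output (the fitted values, the summary of pchat and hx, and the analytic weights), the missing lines were presumably as follows; the LPM is heteroskedastic by construction, with \(Var(y|x)=p(x)(1-p(x))\):
* Presumed missing WLS step (reconstructed from the output below)
predict pchat                          // fitted probabilities
gen hx = pchat*(1-pchat)               // estimated variance function h(x)
sum pchat hx
reg pc hsgpa act parcoll [aw=1/hx]     // WLS using 1/h(x) as weights
predict res_2, res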
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 1.98
Model | 1.40186813 3 .467289377 Prob > F = 0.1201
Residual | 32.3569971 137 .236182461 R-squared = 0.0415
-------------+---------------------------------- Adj R-squared = 0.0205
Total | 33.7588652 140 .241134752 Root MSE = .48599
------------------------------------------------------------------------------
pc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
hsgpa | .0653943 .1372576 0.48 0.635 -.2060231 .3368118
act | .0005645 .0154967 0.04 0.971 -.0300792 .0312082
parcoll | .2210541 .092957 2.38 0.019 .037238 .4048702
_cons | -.0004322 .4905358 -0.00 0.999 -.970433 .9695686
------------------------------------------------------------------------------
(option xb assumed; fitted values)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
pchat | 141 .3971631 .1000667 .1700624 .4974409
hx | 141 .2294822 .0309768 .1411412 .2499934
(analytic weights assumed)
(sum of wgt is 628.1830743667746)
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 2.22
Model | 1.54663033 3 .515543445 Prob > F = 0.0882
Residual | 31.7573194 137 .231805251 R-squared = 0.0464
-------------+---------------------------------- Adj R-squared = 0.0256
Total | 33.3039497 140 .237885355 Root MSE = .48146
------------------------------------------------------------------------------
pc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
hsgpa | .0327029 .1298817 0.25 0.802 -.2241292 .289535
act | .004272 .0154527 0.28 0.783 -.0262847 .0348286
parcoll | .2151862 .0862918 2.49 0.014 .04455 .3858224
_cons | .0262099 .4766498 0.05 0.956 -.9163323 .9687521
------------------------------------------------------------------------------
** Testing for Heteroskedasticity
replace res_1 = res_1^2
replace res_2 = res_2^2/hx
display "Default"
reg res_1 hsgpa act parcoll, notable
display "Weighted"
reg res_2 hsgpa act parcoll, notable
(141 real changes made)
(141 real changes made)
Default
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 2.82
Model | .133163365 3 .044387788 Prob > F = 0.0412
Residual | 2.15497574 137 .01572975 R-squared = 0.0582
-------------+---------------------------------- Adj R-squared = 0.0376
Total | 2.28813911 140 .016343851 Root MSE = .12542
Weighted
Source | SS df MS Number of obs = 141
-------------+---------------------------------- F(3, 137) = 0.63
Model | .874194807 3 .291398269 Prob > F = 0.5980
Residual | 63.5472068 137 .463848225 R-squared = 0.0136
-------------+---------------------------------- Adj R-squared = -0.0080
Total | 64.4214016 140 .460152868 Root MSE = .68106
Next Week: Problems of Specification