Significance and Missing Data
Statistical significance is a way of determining whether an observed effect is due to chance.
To do this, we require assumptions about the distribution of the test statistic under the null hypothesis.
Just a couple of caveats:
Thus, finding no significance could mean the effect is noise or that the sample size is too small.
Recall that a hypothesis can be true or false; we only learn whether the data are consistent with it.
The power of a statistical test represents the probability of detecting an effect, given that the effect is real.
For example, say a drug has the effect of reducing the duration of the common cold by 1 day, but you do not know this.
Instead, you state and test your hypothesis. How likely is it that the effect will be found significant?
What is happening?
In general, demanding a more stringent significance level (a smaller \(\alpha\)) reduces the power of the test.
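To make this concrete, here is a small sketch using Python's standard library. The effect size, standard deviation, and sample size are made-up numbers; the formula is the textbook power of a one-sided z-test:

```python
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha):
    """Power of a one-sided z-test for a mean shift of `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # critical value at level alpha
    shift = effect * n ** 0.5 / sigma          # standardized (noncentrality) shift
    return 1 - NormalDist().cdf(z_crit - shift)

# A stricter (smaller) alpha lowers power, holding everything else fixed.
for a in (0.10, 0.05, 0.01):
    print(a, round(power_one_sided_z(effect=0.5, sigma=2.0, n=50, alpha=a), 3))
```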
Cartoon from xkcd, by Randall Munroe
Consider the following:
So you run the same experiment 100 times:
- How many times do you expect to find a mean greater than 0.196?
- What are the chances of finding a mean greater than 0.196 at least once?
So if you run the experiment enough times, you are almost certain to find a “significant” effect.
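The arithmetic behind this: with independent tests at level \(\alpha\) and every null true, the chance of at least one false positive is \(1-(1-\alpha)^n\). A quick sketch:

```python
# Probability of at least one false positive across n independent tests,
# each run at level alpha, when every null hypothesis is true.
def familywise_error(alpha, n):
    return 1 - (1 - alpha) ** n

print(familywise_error(0.05, 1))    # 0.05: a single test
print(familywise_error(0.05, 100))  # ~0.994: near-certain false discovery
```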
There are various ways to control for multiple hypothesis testing.
SIDAK = \(\alpha_{adj} = 1-(1-\alpha_{tg})^{1/n}\)
BONFERRONI = \(\alpha_{adj} = \alpha_{tg}/n\)
HOLM = \(\alpha_{adj,i} = \alpha_{tg}/(n-i+1)\), applied to the \(i\)-th smallest \(p\)-value
Where \(\alpha_{tg}\) is the target \(\alpha\) level (e.g., 0.05), \(n\) is the number of tests, and \(\alpha_{adj}\) (or \(\alpha_{adj,i}\)) is the adjusted \(\alpha\) level.
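These adjustments are a one-liner each; the sketch below uses the standard Šidák form \(1-(1-\alpha_{tg})^{1/n}\):

```python
def sidak(alpha, n):
    # Sidak adjustment: exact when the n tests are independent
    return 1 - (1 - alpha) ** (1 / n)

def bonferroni(alpha, n):
    # Bonferroni adjustment: conservative but always valid
    return alpha / n

def holm(alpha, n, i):
    # Holm step-down threshold for the i-th smallest p-value, i = 1..n
    return alpha / (n - i + 1)

print(bonferroni(0.05, 10))              # 0.005
print(round(sidak(0.05, 10), 6))         # slightly larger than Bonferroni
print(holm(0.05, 10, 1), holm(0.05, 10, 10))  # 0.005 up to 0.05
```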
There are also uniform confidence intervals (see here).
Consider the following example: \[\begin{aligned} \text{Pop}&: y = x\beta+ \epsilon \\ \text{Miss Mech }&: p(nmiss|x) = F(x\gamma) \\ \text{Miss Reg }&: m\times y = m\times x \beta + m\times \epsilon \end{aligned} \]
Where \(m\) is an indicator of being observed (\(m=1\) if \(y\) is non-missing), and \(F\) is the CDF driving the missingness mechanism (e.g., a probit or logit).
Define the Weights as \(w = \frac{1}{1-p(nmiss|x)}\)
Then, we could use WLS to estimate the model of interest:
\[w \times m\times y = w \times m\times x \beta + w \times m\times \epsilon\]
frause oaxaca, clear
drop if lnwage ==.
** Simulate a missingness mechanism that depends on the covariates
reg lnwage c.(educ exper tenure female age)##c.(educ exper tenure female age)
predict lxb
qui: sum lxb
replace lxb = normal((lxb - r(mean))/r(sd))
gen lnwage2 = lnwage if lxb < runiform()
gen dwage = lnwage2 != .
** Model the probability of being observed
logit dwage educ exper tenure female age
predict prw, pr
gen wgt = 1/prw
** Estimate the model: full sample, complete cases, and IPW
reg lnwage educ exper tenure female age
reg lnwage2 educ exper tenure female age
reg lnwage2 educ exper tenure female age [pw=wgt]
** Repeat the process 1000 times
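The idea can also be sketched in Python. This is a stylized version, not a port of the Stata code: the data, the quadratic term, and the missingness mechanism are all made up, and the true observation probabilities are used as weights (in practice they would be estimated with a logit, as above). The fitted model is deliberately misspecified, so selection on \(x\) shifts the complete-case slope away from the population value, and the weights recover it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Population: y depends on x and x^2, but we fit y on x alone.
# The full-population projection slope of y on x is 1 here.
x = rng.normal(size=n)
y = x + 0.5 * x**2 + rng.normal(size=n)

# Missingness mechanism: P(observed | x) is known in this simulation.
p_obs = 1 / (1 + np.exp(x))          # higher x -> more likely missing
obs = rng.uniform(size=n) < p_obs
w = 1 / p_obs[obs]                   # inverse-probability weights

def ols_slope(x, y, w=None):
    """Slope from (weighted) least squares of y on a constant and x."""
    if w is None:
        w = np.ones_like(x)
    X = np.column_stack([np.ones_like(x), x])
    WX = X * w[:, None]
    return np.linalg.solve(X.T @ WX, WX.T @ y)[1]

b_full = ols_slope(x, y)                  # ~1: no missing data
b_cc = ols_slope(x[obs], y[obs])          # complete cases: biased
b_ipw = ols_slope(x[obs], y[obs], w=w)    # weighted: ~1 again
```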
Consider the case of a single variable \(Z\) with missing values, and assume we have a model for \(Z\):
\[Z = X\beta + \epsilon \]
Two simple options are to impute the unconditional mean or the conditional mean:
\[\hat{Z} = \bar{Z} = \frac{1}{n}\sum_{i=1}^n Z_i \qquad \text{or} \qquad \hat{Z} = X\hat{\beta}\]
Neither is a good idea, even under MCAR. (Why?)
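One reason: any imputation that places every filled-in value exactly on a (conditional) mean mechanically shrinks the variance of \(Z\) and distorts covariances. A quick illustration with made-up data and unconditional-mean imputation:

```python
import random, statistics

random.seed(7)
z = [random.gauss(0, 1) for _ in range(2000)]
observed = z[:1500]                      # pretend the last 500 values are missing

# Unconditional-mean imputation: fill every hole with the observed mean.
zbar = statistics.mean(observed)
imputed = observed + [zbar] * 500

# The imputed series is too concentrated around the mean:
print(statistics.pstdev(observed), statistics.pstdev(imputed))
```

The shrinkage is exact here: with 1500 of 2000 values observed, the imputed standard deviation is \(\sqrt{1500/2000}\approx 0.87\) times the observed one, regardless of the data.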
A better approach to imputation is to use a model to reproduce not only the “known” variation (the conditional mean) but also the “unknown” variation (the conditional variance).
So, we can use the model to predict the missing values, but we add some noise to the prediction.
\[\tilde z = X\hat{\beta} + \tilde \epsilon, \qquad \tilde \epsilon \sim N(0,\hat{\sigma}^2)\]
Since the population model is \(z = X\beta + \epsilon\), and \(\hat{\beta}\) and \(\hat{\sigma}^2\) are themselves estimates, we can go further and draw them from their (approximate) sampling distribution:
\[ \begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \end{pmatrix} \sim N\left(\begin{bmatrix} \beta \\ \sigma^2 \end{bmatrix}, \begin{bmatrix} V_{\beta} & 0 \\ 0 & V_{\sigma^2} \end{bmatrix}\right) \]
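A sketch of one stochastic imputation draw, with made-up values standing in for the fitted \(\hat\beta\), \(V_\beta\), and \(\hat\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quantities from a fitted OLS of Z on X (not estimated here):
beta_hat = np.array([1.0, 0.5])          # intercept and slope
V_beta = np.array([[0.04, 0.0],          # sampling covariance of beta_hat
                   [0.0, 0.01]])
sigma_hat = 0.8                          # residual standard deviation
X_miss = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])

def draw_imputations(rng):
    # Step 1: propagate estimation uncertainty by drawing beta
    beta_tilde = rng.multivariate_normal(beta_hat, V_beta)
    # Step 2: add back the "unknown" residual variation
    eps_tilde = rng.normal(0, sigma_hat, size=len(X_miss))
    return X_miss @ beta_tilde + eps_tilde

# Different draws give different imputed values: the imputation is stochastic.
z1, z2 = draw_imputations(rng), draw_imputations(rng)
```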
The previous methods assumed that a single imputation is enough to solve the problem.
One imputation, however, may not account for the uncertainty in the imputation process.
So, we can repeat the imputation process multiple times, obtaining multiple imputed values for each missing observation.
With multiple imputed values, we can estimate the model of interest multiple times, and then combine the results using Rubin’s rules.
\[\hat\beta_{MI} = \frac{1}{M}\sum_{m=1}^M \hat\beta_m\]
\[V_{MI} = \bar V + \left(\frac{M+1}{M}\right)B, \qquad \bar V = \frac{1}{M}\sum_{m=1}^M V_m, \quad B = \frac{1}{M-1}\sum_{m=1}^M \left(\hat\beta_m - \hat\beta_{MI}\right)^2\]
\[df = (M-1) \left( 1 + \frac{M}{M+1}\frac{\bar V}{B}\right)^2\]
Where \(\bar V\) is the average within-imputation variance and \(B\) is the between-imputation variance of the point estimates.
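Rubin's rules are easy to implement directly. Here `W` is the average within-imputation variance and `B` the between-imputation variance of the point estimates; the inputs below are made-up numbers:

```python
from statistics import mean, variance

def rubin_pool(betas, variances):
    """Pool M point estimates and their variances with Rubin's rules."""
    M = len(betas)
    beta_mi = mean(betas)
    W = mean(variances)                  # within-imputation variance
    B = variance(betas)                  # between-imputation variance
    V_mi = W + (1 + 1 / M) * B
    df = (M - 1) * (1 + (M / (M + 1)) * (W / B)) ** 2
    return beta_mi, V_mi, df

print(rubin_pool([1.0, 2.0, 3.0], [0.1, 0.1, 0.1]))
```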
Stata
frause oaxaca, clear
drop if lnwage ==.
** Simulate missingness: set ~25% of each covariate to missing at random
foreach i in educ exper tenure age {
gen m_`i' = `i' if runiform()>.25
}
(Excerpt from the Swiss Labor Market Survey 1998)
(213 observations deleted)
(362 missing values generated)
(311 missing values generated)
(387 missing values generated)
(348 missing values generated)
Setting data for -mi- commands
Conditional models:
m_exper: regress m_exper m_age m_educ m_tenure lnwage single
female
m_age: regress m_age m_exper m_educ m_tenure lnwage single
female
m_educ: regress m_educ m_exper m_age m_tenure lnwage single
female
m_tenure: regress m_tenure m_exper m_age m_educ lnwage single
female
Performing chained iterations ...
Multivariate imputation Imputations = 10
Chained equations added = 10
Imputed: m=1 through m=10 updated = 0
Initialization: monotone Iterations = 100
burn-in = 10
m_educ: linear regression
m_exper: linear regression
m_tenure: linear regression
m_age: linear regression
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
m_educ | 1072 362 362 | 1434
m_exper | 1123 311 311 | 1434
m_tenure | 1047 387 387 | 1434
m_age | 1086 348 348 | 1434
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
of the number of filled-in observations.)
Estimating the model(s):
mi estimate, post: regress lnwage m_* single female
est sto m1
regress lnwage educ exper tenure age single female
Multiple-imputation estimates Imputations = 10
Linear regression Number of obs = 1,434
Average RVI = 0.3607
Largest FMI = 0.5511
Complete DF = 1427
DF adjustment: Small sample DF: min = 31.27
avg = 293.52
max = 1,277.02
Model F test: Equal FMI F( 6, 403.1) = 63.74
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
lnwage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
m_educ | .0780927 .0062465 12.50 0.000 .0656678 .0905176
m_exper | .0034867 .002416 1.44 0.154 -.0013503 .0083237
m_tenure | .0023203 .002857 0.81 0.423 -.0035046 .0081451
m_age | .0099868 .0023905 4.18 0.000 .0052212 .0147524
single | -.09733 .0309067 -3.15 0.002 -.1580571 -.036603
female | -.1226849 .0255255 -4.81 0.000 -.1727613 -.0726084
_cons | 2.102173 .1059333 19.84 0.000 1.889132 2.315213
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 1,434
-------------+---------------------------------- F(6, 1427) = 86.35
Model | 107.645555 6 17.9409258 Prob > F = 0.0000
Residual | 296.474249 1,427 .207760511 R-squared = 0.2664
-------------+---------------------------------- Adj R-squared = 0.2633
Total | 404.119804 1,433 .282009633 Root MSE = .45581
------------------------------------------------------------------------------
lnwage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .0753085 .005253 14.34 0.000 .0650041 .0856129
exper | .0026545 .0018941 1.40 0.161 -.001061 .0063701
tenure | .0022932 .0019725 1.16 0.245 -.0015761 .0061625
age | .0111437 .0019397 5.75 0.000 .0073388 .0149486
single | -.0918932 .0292993 -3.14 0.002 -.1493674 -.0344189
female | -.1289592 .0253754 -5.08 0.000 -.1787363 -.0791822
_cons | 2.100201 .0816356 25.73 0.000 1.940063 2.26034
------------------------------------------------------------------------------
The method sketched above (based on OLS) can also be extended to other models.
Consider logit models:
S1: Estimate Logit model: \(P(y=1|X) = F(X\beta)\)
S2: Draw \(\beta\) from the distribution, call it \(\tilde\beta\)
S3: Draw \(y\) from a Bernoulli distribution: \(y \sim Bernoulli(F(X\tilde\beta))\)
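A sketch of steps S2 and S3, taking the logit estimates from S1 as given (the coefficients, covariance, and covariates below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical logit fit of y on (1, x): coefficients and their covariance.
# Step S1 is assumed already done (e.g., by -logit- in Stata).
beta_hat = np.array([-0.5, 1.2])
V_beta = np.array([[0.02, 0.0],
                   [0.0, 0.05]])
X_miss = np.column_stack([np.ones(5), rng.normal(size=5)])

# S2: draw beta-tilde to reflect estimation uncertainty
beta_tilde = rng.multivariate_normal(beta_hat, V_beta)

# S3: draw y from Bernoulli(F(X beta-tilde)), with F the logistic CDF
p = 1 / (1 + np.exp(-X_miss @ beta_tilde))
y_imp = rng.binomial(1, p)
```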
Donors are identified as observations with characteristics similar to the one with missing data (i.e., “close” to the missing observation).
Finding potential donors is easy when there is low dimensional data, but it becomes more difficult as the number of variables increases.
Imputing data for wages in oaxaca
frause oaxaca, clear
** ID pool of donors based on age and gender
egen id_pool = group(age female)
** Now, for each missing observation select a "random" donor
gen misswage = missing(lnwage)
bysort id_pool (misswage): egen smp = sum(misswage==0)
bysort id_pool: gen draw = runiformint(1, smp)
bysort id_pool: replace lnwage = lnwage[draw] if misswage==1
sum lnwage if misswage==0
sum lnwage if misswage==1
sum lnwage
(Excerpt from the Swiss Labor Market Survey 1998)
(213 real changes made)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lnwage | 1,434 3.357604 .5310458 .507681 5.259097
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lnwage | 213 3.342183 .5482005 .507681 5.259097
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lnwage | 1,647 3.35561 .5331507 .507681 5.259097
With many variables, we often estimate a distance measure and/or use data reduction to identify “close” observations:
\[D(X,X_0) = abs(G(X) - G(X_0))\]
\[D(X,X_0) = \sqrt{(X-X_0)'\Sigma^{-1}(X-X_0)}\]
\[D(X,X_0) = \frac{1}{K}\sum_{i=1}^K \alpha_i f\left(\frac{X_i - X_{0i}}{h}\right)\]
Once distances are estimated, donor pools can be defined based on the distance measure.
And the donor can be selected randomly from the pool. (or weighted by distance)
These approaches can be very computationally intensive, because they require estimating \(N\times N\) distances.
Complexity may be reduced by using data reduction techniques, such as PCA, FA, or propensity scores.
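A minimal sketch of the Mahalanobis version with synthetic data: compute distances from the incomplete observation to every candidate, keep the \(k\) nearest as the donor pool, and draw one donor at random:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: rows are candidate donors, columns are matching variables
X = rng.normal(size=(200, 3))
x0 = np.array([0.5, -0.2, 1.0])          # observation with a missing outcome

# Mahalanobis distance from x0 to every candidate donor
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - x0
d = np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))

# Donor pool: the k nearest observations; pick one at random from the pool
k = 5
pool = np.argsort(d)[:k]
donor = rng.choice(pool)
```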
Stata
Using a single score (data reduction)
frause oaxaca, clear
drop if lnwage ==.
// 25% of data is missing
gen mlnwage = lnwage if runiform()>.25
gen misswage =missing(mlnwage)
qui:logit misswage educ age agesq female single married
predict psc, xb
qui:reg mlnwage educ age agesq female single married
predict lnwh, xb
qui:pca educ age agesq female single married
qui:predict pc1, score
(Excerpt from the Swiss Labor Market Survey 1998)
(213 observations deleted)
(351 missing values generated)
For imputation, let's do something simple: use data from the closest observation (the one with the next-lower score) as the donor.
foreach i in psc lnwh pc1 {
drop2 lnwage_`i'
gen lnwage_`i' = mlnwage
sort `i'
replace lnwage_`i'=lnwage_`i'[_n-1] if lnwage_`i'==. & lnwage_`i'[_n-1]!=.
*replace lnwage_`i'=lnwage_`i'[_n+1] if lnwage_`i'==. & lnwage_`i'[_n+1]!=.
qui:_regress lnwage_`i' educ age agesq female single married
matrix b`i' = e(b)
matrix coleq b`i'=`i'
}
variable lnwage_psc not found
(351 missing values generated)
(351 real changes made)
variable lnwage_lnwh not found
(351 missing values generated)
(351 real changes made)
variable lnwage_pc1 not found
(351 missing values generated)
(351 real changes made)
Estimate the models 1000 times, and let's see the results.
What happens when not some but ALL data is missing?