Research Methods II

Session 6: Imputation: Statistical Matching

Fernando Rios-Avila

Missing Data vs MISSING DATA

Missing Data

  • As we described before, missing data is a problem for microdata analysis.
    • It reduces sample size and statistical power, and may bias estimates (depending on the type of missingness).
  • We have also discussed that there are a few ways to deal with missing data:
    • Complete case analysis
    • Reweighting
    • Imputation: Prediction
    • Imputation: Hotdecking
  • These methods allow you to solve the missing-data problem if the data is MCAR or MAR.
    • With MNAR, dealing with missing data is difficult.
  • Nevertheless, you can deal with Missing Data because you still observe some data that can be used to impute what is missing.

Types of Missing data

MISSING DATA

  • What would happen if all of the data of interest is missing?

  • Example:

    • You are working with the CPS, but are interested in looking at the relationship between income and time use.
      • CPS does NOT have time use data.
  • We are going on the “lion hunt”:

    • You can’t impute time use data
    • You can’t use complete case analysis
    • You can’t use reweighting
    • You can’t use hotdecking
    • What do we do?

What do we do?

  • One option would be using a different data set.

    • In the US, the American Time Use Survey (ATUS) could be a good option.
    • But…The data has no income information!
  • What if we could combine the two data sets?

  • This changes the problem from one of missing all data to one of missing data by design.

    • Some segment of the population was asked about income, and some other segment was asked about time use.

Imputation and Statistical Matching

  • If you consider the idea of combining two data sets, you can treat the problem as one of imputation.

    • You now have two samples that represent the same population of interest.
    • We can reasonably assume the data is MCAR, but the variables of interest are not observed in the same file.
    • Then, we can use the combined data to impute the missing data, using many of the approaches we have discussed before.
  • And there is also another method that is more commonly used (at Levy) to deal with this problem.

    • Statistical Matching (aka Data Fusion).
  • What does this imply?

    • Match individuals across datasets (“Donor” and “Recipient”)
    • Transfer information based on the matching links

Official examples:

There is a lot of work on this topic. Many statistical agencies use this approach to combine survey data with administrative data.

  • Administrative data is usually more accurate, but it is not collected for the purpose of research.
  • Survey data is collected for research purposes, but may not have accurate data in some areas (e.g., income).
  • Unless the survey data was collected with the purpose of being linked with administrative data, one requires methods similar to statistical matching to combine both data sets.

In-house examples:

  • At Levy we have used this approach to produce relevant datasets:
    • LIMEW: Levy Institute Measure of Economic Well-Being. Combines time use, wealth, and survey data (in addition to other aggregate data).

    • LIMTIP: Levy Institute Measure of Time and Income Poverty. Combines the ATUS with income/consumption data.

Framework

What do we need?

  • Consider two data sets: \(A\) and \(B\).
  • \(A\) has information on \(X\) and \(Z\)
  • \(B\) has information on \(Y\) and \(Z\)
  • We want a file that has \(X\), \(Y\) and \(Z\).

Assumptions

  • (\(X,Y,Z\)) are multivariate random variables with joint distribution \(f(x,y,z)\), that represents the population of interest.

  • Both datasets are random samples from the same population of interest.

    \(\frac{P_w(D=A|X,Y,Z)}{P_w(D=B|X,Y,Z)} = \frac{P(D=A)}{P(D=B)} = 1\)

  • Conditional Independence Assumption (CIA):

    • \(X\) and \(Y\) are independent of each other given \(Z\). \[f(x,y|z) = f(x|z)f(y|z)\]
  • The goal is to combine the two data sets to produce a file that has data on \(X\), \(Y\) and \(Z\) by identifying \(f(x,y,z)\).
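
To see why the CIA makes this identification possible: file \(A\) identifies \(f(x|z)\) and \(f(z)\), file \(B\) identifies \(f(y|z)\), and under conditional independence these pieces are enough to recover the joint distribution:

\[f(x,y,z) = f(x,y|z)\,f(z) = f(x|z)\,f(y|z)\,f(z)\]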

Statistical Matching: Limitations

  • The quality of this identification will depend on how well the conditional independence assumption holds.

  • Because of this, the synthetic dataset cannot tell you much about the covariance between \(X\) and \(Y\) (shown in red below) or about causal relationships between them.

\[Cov(X,Y,Z) = \begin{pmatrix} V(X) & \color{red}{V(X,Y)} & V(X,Z) \\ \color{red}{V(X,Y)'} & V(Y) & V(Y,Z) \\ V(X,Z)' & V(Y,Z)' & V(Z) \end{pmatrix} \]

Although you can impose certain bounds on this covariance matrix.
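
As an illustration of such bounds (for scalar \(X\), \(Y\), \(Z\)): positive semidefiniteness of the correlation matrix restricts the unidentified correlation \(\rho_{XY}\) to

\[\rho_{XZ}\rho_{YZ} - \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)} \;\le\; \rho_{XY} \;\le\; \rho_{XZ}\rho_{YZ} + \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}\]

and, in the jointly normal scalar case, the CIA implicitly sets \(\rho_{XY} = \rho_{XZ}\rho_{YZ}\), the midpoint of this interval.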

Matching Approaches:

There are two types of statistical matching procedures: unconstrained and constrained matching.

Unconstrained matching

  • Records from \(A\) and \(B\) can be used multiple times (or not at all) in the matching.

    • Absurd case: one observation from \(A\) is matched with all observations from \(B\).
  • This is the most common approach in the literature for policy evaluation.

  • Does not necessarily require \(A\) and \(B\) to be from the same population (weighted size).

    Pros: Uses the “best” candidate for each match. Cons: It may not transfer the unconditional distribution of the data.

Constrained matching

  • All records from \(A\) and \(B\) are used once and only once in the matching (without replacement).
  • When using weighted samples, records are matched until the weights are exhausted.
    • Requires \(A\) and \(B\) to be from the same population (weighted size).

    Pros: It transfers the unconditional distribution of the data. Cons: It may not use the “best” candidate for each match.

Matching Records:

  • Matching records requires defining a measure of similarity between records.

  • These measures can vary depending on the data type and the dimensionality of the data.

\[\begin{aligned} \text{ Euclidean: } d(r^A,r^B) &= \sqrt{\sum_i^k(x^A_i-x^B_i)^2 } \\ \text{ Std. Euclidean: } d(r^A,r^B) &= \sqrt{\sum_i^k\left(\frac{x^A_i-x^B_i}{\sigma_i}\right)^2 } \\ \text{ Mahalanobis: } d(r^A,r^B) &= \sqrt{(x^A-x^B)'\Sigma_x^{-1}(x^A-x^B)} \\ \end{aligned} \]

  • All these measures are useful when one has high-dimensional data (see the sketch below).
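
As an illustration, a minimal Mata sketch of the Mahalanobis distance between one record from \(A\) and one from \(B\); the records and the covariance matrix \(\Sigma_x\) are made-up numbers, not part of the example used later.

mata:
    xA = (40, 12, 100)                        // hypothetical record from A (hours, educ, iq)
    xB = (35, 16,  95)                        // hypothetical record from B
    S  = (25, 0, 0 \ 0, 4, 0 \ 0, 0, 225)     // assumed covariance matrix of X (diagonal here)
    sqrt((xA - xB) * invsym(S) * (xA - xB)')  // Mahalanobis distance between the two records
end

Note that with a diagonal \(\Sigma_x\), as assumed here, this reduces to the standardized Euclidean distance above.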

Matching: Reducing Dimensionality

  • A second alternative is to reduce the data's dimensionality before estimating distances. Three common choices follow.

Predicted mean

  • Model \(x = z\beta + \epsilon\) using \(A\).
  • Make predictions \(z\hat\beta\) for both samples.
  • Match records based on \(z\hat\beta\).
  • Gives good results for matching individuals with a similar “predicted” outcome (e.g., income).
  • Puts more “weight” on the variables used to predict the outcome.

Propensity score

  • Model the likelihood of an observation being in \(A\) using \(Z\). \[P(D=A|Z) = G(Z\gamma)\]

  • Make predictions \(\hat P\) (or the index \(Z\hat\gamma\)) for both samples.

  • Match records based on \(\hat P\).

  • A general-purpose score.

  • May be problematic if \(A\) and \(B\) have very similar distributions of \(Z\) (the score then carries little information).

  • Puts more “weight” on the variables with different distributions between \(A\) and \(B\).

Principal components

  • Use PCA to reduce the dimensionality of \(Z\) into a single index.
    • Can use either a single dataset or both.
  • Make predictions of the first principal component (\(PC1\)) for both samples.
  • Match records based on \(PC1\).
  • Puts more weight on the variables that explain most of the variance in \(Z\).

Matching: Rank Matching

  • Most distance-based matching is feasible mainly with unconstrained matching.
    • Thus, the best records are always matched.
  • When considering constrained matching, distance-based matching may not be adequate.
    • While the first records get the best matches, the last records may be matched poorly.
  • A compromise, therefore, is to use rank matching:
    • Rank observations based on a single variable (propensity score, predicted mean, etc.)
    • Match records based on rank.
  • No match will be the “best”, but this reduces the chances of poor matches.

Levy Matching Algorithm

  • At Levy, we use a constrained matching algorithm, with stratification and rank matching.

1. Data Harmonization

  • Because the data files come from different sources, they may have different variable names, coding schemes, or definitions.
  • We need to define the \(Z\) variables as identically as possible in both files.
  • Beyond definition harmonization, one must also be mindful of the distribution of the variables in both files.
    • If the distribution of \(Z\) differs between the two files, the matching may not be adequate.
  • The weighting schemes in both files should be adjusted to add up to the same population size (typically the “recipient” totals).
    • Weight adjustment could be done by selected strata (see the sketch below).
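
A minimal sketch of the weight adjustment, assuming the two harmonized files have already been appended into one dataset and use hypothetical variable names wgt (sampling weight), strata, and donor (1 = donor file, 0 = recipient file):

* population totals by stratum in each file
bysort strata: egen double tot_rec = total(cond(donor==0, wgt, 0))
bysort strata: egen double tot_don = total(cond(donor==1, wgt, 0))
* rescale donor weights so they add up to the recipient population within each stratum
gen double wgt_adj = cond(donor==1, wgt*tot_rec/tot_don, wgt)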

2. Estimation of Matching Score

  • Using either the full sample or sub-samples (strata), estimate a matching score (see the sketch below).
    • This could be a propensity score, a predicted mean, or the first principal component.
  • You may want to create “finer cells” to improve matching (without necessarily re-estimating the matching score).
    • For example, you could use gender as strata (two scores), but further create cells by age (5 groups).
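
A minimal sketch of estimating one propensity score per stratum and then building finer cells, continuing the hypothetical names from above (donor) and adding hypothetical variables female (strata), age, and harmonized covariates z1-z3:

gen double score = .
forvalues g = 0/1 {
    * one score model per stratum
    qui logit donor z1 z2 z3 if female==`g'
    qui predict double xb`g' if female==`g', xb
    replace score = xb`g' if female==`g'
}
* finer cells: strata x 5 age groups (no re-estimation of the score)
egen agegrp = cut(age), group(5)
egen cell   = group(female agegrp)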

3. Perform the match

  • Using the finest definition of “cells”, rank observations based on the matching scores.
  • Using the ranks, match observations until all weights are exhausted (from either sample); a minimal sketch follows this list.
  • “Unmatched” observations are left for later rounds that use coarser definitions of cells.
  • Matching continues until all units (recipients) are matched.
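
A minimal unweighted sketch of the ranking-and-pairing step, continuing the hypothetical names from above (cell, score, donor) and assuming y is the variable to transfer. The actual Levy algorithm additionally splits records so that the sampling weights on both sides are exhausted:

* rank observations within each cell, separately for donors and recipients
bysort cell donor (score): gen rnk = _n

* keep the donor outcomes in a file keyed by (cell, rank)
preserve
keep if donor==1
keep cell rnk y
rename y y_imputed
tempfile donors
save `donors'
restore

* attach the donor outcome with the same (cell, rank) to each recipient;
* recipients without a same-rank donor stay unmatched for a later, coarser round
merge m:1 cell rnk using `donors', keep(master match) nogen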

4. Assessing the quality of the match

  • The idea is to compare the distribution of the “transferred/imputed” data with the distribution from the “donor” data.

    • The overall distribution of the data will be the same by construction.
  • Compare distributions by strata, smaller cells, or specific variables of interest (see the sketch below).

  • Rule of thumb: differences within +/- 10% are acceptable (mean, median, standard error).

    • But it may depend on the variable of interest.
  • One may also use other approaches, like “regression”, to compare all variables at once.

  • If the distribution of the data is not adequate, one may want to re-do the matching with different “cell” definitions or matching scores.
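
A minimal sketch of the distributional check, continuing the hypothetical names from above (y, y_imputed, wgt, strata, donor):

* donor (observed) vs. recipient (imputed) distributions, by stratum
tabstat y         [aw=wgt] if donor==1, by(strata) stat(mean p50 sd)
tabstat y_imputed [aw=wgt] if donor==0, by(strata) stat(mean p50 sd)
* rule of thumb: stratum means/medians should differ by no more than about 10%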

Example

* Load the wage2 example dataset (frause is a user-written loader for example data)
frause wage2, clear
set seed 312
* Randomly split the sample in half: smp==1 (donor) and smp==0 (recipient)
xtile smp = runiform()
replace smp = smp==1
* wage is "observed" only in the donor half
gen wage_s = wage if smp==1
** Three matching scores
** 1. Predicted mean (wage model estimated on the donor half)
reg wage_s hours iq kww educ exper tenure age married black south urban sibs 
predict wageh
** 2. Propensity score (probability of being in the donor half)
logit smp hours iq kww educ exper tenure age married black south urban sibs 
predict pscore, xb
** 3. First principal component of the common variables
pca hours iq kww educ exper tenure age married black south urban sibs , comp(1)
predict pc1

* Standardize the three scores
foreach i in wageh pscore pc1 {
    qui:sum `i'
    replace `i' = (`i'-r(mean))/r(sd)
}
(467 real changes made)
(467 missing values generated)

      Source |       SS           df       MS      Number of obs   =       468
-------------+----------------------------------   F(12, 455)      =     14.83
       Model |  24043666.1        12  2003638.84   Prob > F        =    0.0000
    Residual |  61493527.5       455   135150.61   R-squared       =    0.2811
-------------+----------------------------------   Adj R-squared   =    0.2621
       Total |  85537193.6       467  183163.155   Root MSE        =    367.63

------------------------------------------------------------------------------
      wage_s | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |  -4.080234   2.204689    -1.85   0.065    -8.412869    .2524015
          iq |   3.090831   1.459171     2.12   0.035     .2232806    5.958381
         kww |   5.637987   2.888128     1.95   0.052    -.0377362    11.31371
        educ |   53.43222   10.41882     5.13   0.000     32.95724     73.9072
       exper |   8.816439    5.30953     1.66   0.098    -1.617804    19.25068
      tenure |   6.327233   3.534248     1.79   0.074    -.6182417    13.27271
         age |   10.92113   7.195943     1.52   0.130    -3.220278    25.06253
     married |    143.306   54.35023     2.64   0.009     36.49735    250.1146
       black |  -144.0597   60.51784    -2.38   0.018    -262.9889    -25.1306
       south |  -37.37316    37.7745    -0.99   0.323    -111.6073    36.86096
       urban |   200.3258   38.93558     5.15   0.000       123.81    276.8417
        sibs |   1.881618   8.468635     0.22   0.824    -14.76087    18.52411
       _cons |  -842.6706     273.28    -3.08   0.002    -1379.718   -305.6231
------------------------------------------------------------------------------
(option xb assumed; fitted values)

Iteration 0:  Log likelihood = -648.09208  
Iteration 1:  Log likelihood = -642.21625  
Iteration 2:  Log likelihood = -642.21522  
Iteration 3:  Log likelihood = -642.21522  

Logistic regression                                     Number of obs =    935
                                                        LR chi2(12)   =  11.75
                                                        Prob > chi2   = 0.4657
Log likelihood = -642.21522                             Pseudo R2     = 0.0091

------------------------------------------------------------------------------
         smp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |   .0153544   .0093201     1.65   0.099    -.0029127    .0336215
          iq |   .0024677   .0057346     0.43   0.667    -.0087719    .0137072
         kww |  -.0095774   .0113247    -0.85   0.398    -.0317735    .0126186
        educ |  -.0255069   .0409702    -0.62   0.534     -.105807    .0547932
       exper |   .0256009   .0205087     1.25   0.212    -.0145955    .0657973
      tenure |   .0131078   .0137404     0.95   0.340     -.013823    .0400385
         age |  -.0033883   .0281317    -0.12   0.904    -.0585256    .0517489
     married |  -.2382151   .2167632    -1.10   0.272    -.6630632    .1866331
       black |  -.1214461   .2276112    -0.53   0.594    -.5675559    .3246637
       south |  -.0370746   .1457381    -0.25   0.799    -.3227161    .2485669
       urban |  -.0371736   .1495076    -0.25   0.804    -.3302031     .255856
        sibs |   -.041735   .0313008    -1.33   0.182    -.1030835    .0196135
       _cons |  -.1246922   1.056515    -0.12   0.906    -2.195424    1.946039
------------------------------------------------------------------------------

Principal components/correlation                 Number of obs    =        935
                                                 Number of comp.  =          1
                                                 Trace            =         12
    Rotation: (unrotated = principal)            Rho              =     0.2112

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |       2.5348      .622213             0.2112       0.2112
           Comp2 |      1.91258      .821455             0.1594       0.3706
           Comp3 |      1.09113     .0297471             0.0909       0.4615
           Comp4 |      1.06138     .0497587             0.0884       0.5500
           Comp5 |      1.01162     .0867533             0.0843       0.6343
           Comp6 |      .924871     .0639678             0.0771       0.7114
           Comp7 |      .860903     .0907871             0.0717       0.7831
           Comp8 |      .770116      .144885             0.0642       0.8473
           Comp9 |      .625231      .119791             0.0521       0.8994
          Comp10 |       .50544     .0919858             0.0421       0.9415
          Comp11 |      .413454      .124988             0.0345       0.9760
          Comp12 |      .288466            .             0.0240       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors) 

    --------------------------------------
        Variable |    Comp1 | Unexplained 
    -------------+----------+-------------
           hours |   0.1320 |       .9558 
              iq |   0.4921 |       .3863 
             kww |   0.4220 |       .5486 
            educ |   0.4555 |        .474 
           exper |  -0.2161 |       .8817 
          tenure |   0.0463 |       .9946 
             age |   0.0527 |        .993 
         married |   0.0053 |       .9999 
           black |  -0.3722 |       .6488 
           south |  -0.2076 |       .8908 
           urban |   0.0723 |       .9868 
            sibs |  -0.3411 |       .7051 
    --------------------------------------
(score assumed)

Scoring coefficients 
    sum of squares(column-loading) = 1

    ------------------------
        Variable |    Comp1 
    -------------+----------
           hours |   0.1320 
              iq |   0.4921 
             kww |   0.4220 
            educ |   0.4555 
           exper |  -0.2161 
          tenure |   0.0463 
             age |   0.0527 
         married |   0.0053 
           black |  -0.3722 
           south |  -0.2076 
           urban |   0.0723 
            sibs |  -0.3411 
    ------------------------
(935 real changes made)
(935 real changes made)
(935 real changes made)

Next we create ranks for each observation, assuming no stratification.

bysort smp (wageh) :gen rnk1=_n
bysort smp (pscore):gen rnk2=_n
bysort smp (pc1)   :gen rnk3=_n

Finally, the imputation: simply “transfer” information from the donor to the recipient by pairing records with the same rank.

* Imputation: one imputed wage per matching score
clonevar wage1 = wage_s
clonevar wage2 = wage_s
clonevar wage3 = wage_s

* Sort donors (smp==1) first, ordered by rank, so that row k holds the donor
* with rank k; wage_s[rnk1] then picks that donor's wage for each recipient
gsort -smp rnk1
replace wage1 = wage_s[rnk1] if smp==0

gsort -smp rnk2
replace wage2 = wage_s[rnk2] if smp==0

gsort -smp rnk3
replace wage3 = wage_s[rnk3] if smp==0
(467 missing values generated)
(467 missing values generated)
(467 missing values generated)
(467 real changes made)
(467 real changes made)
(467 real changes made)

A simple quality assessment: in the recipient sample, compare regressions that use the true wage with regressions that use each imputed wage.

qui:reg wage hours iq kww educ exper tenure age married black south if smp==0
est sto m1
qui:reg wage1 hours iq kww educ exper tenure age married black south if smp==0
est sto m2
qui:reg wage2 hours iq kww educ exper tenure age married black south if smp==0
est sto m3
qui:reg wage3 hours iq kww educ exper tenure age married black south if smp==0
est sto m4
esttab m1 m2 m3 m4 , se mtitle(True Wageh pscore pca)

----------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)   
                     True           Wageh          pscore             pca   
----------------------------------------------------------------------------
hours              -3.076          -5.508*          0.208          -0.971   
                  (2.482)         (2.678)         (3.109)         (2.825)   

iq                  2.579           0.343          -2.775           3.653*  
                  (1.411)         (1.522)         (1.767)         (1.605)   

kww                 5.410*          8.750**         1.954           7.545*  
                  (2.745)         (2.961)         (3.438)         (3.124)   

educ                46.14***        77.82***        8.331           44.53***
                  (10.18)         (10.98)         (12.75)         (11.58)   

exper               11.66*          15.17**        -2.006           5.108   
                  (4.980)         (5.372)         (6.237)         (5.667)   

tenure              3.844           5.238          -0.511          0.0757   
                  (3.332)         (3.595)         (4.174)         (3.792)   

age                -0.945          -9.667          -10.91          -5.380   
                  (6.905)         (7.449)         (8.648)         (7.858)   

married             187.2***        165.6**         89.99           38.58   
                  (54.26)         (58.54)         (67.96)         (61.75)   

black              -64.09          -99.07           46.48          -39.87   
                  (52.43)         (56.57)         (65.67)         (59.67)   

south              -80.79*         -74.14           9.854          -120.9** 
                  (35.19)         (37.96)         (44.07)         (40.04)   

_cons              -254.2          -199.6          1349.4***       -104.9   
                  (251.2)         (271.0)         (314.6)         (285.9)   
----------------------------------------------------------------------------
N                     467             467             467             467   
----------------------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Next Class: Micro Simulation

Just more imputations.