Research Methods II

Session 6: Imputation: Statistical Matching

Fernando Rios-Avila

Missing Data vs MISSING DATA

Missing Data

  • As we described before, missing data is a problem for microdata analysis.
    • It reduces sample size and statistical power, and may bias estimates (depending on the type of missingness).
  • We have also discussed that there are a few ways to deal with missing data:
    • Complete case analysis
    • Reweighting
    • Imputation: Prediction
    • Imputation: Hotdecking
  • These methods allow you to solve the missing-data problem if the data is MCAR or MAR.
    • With MNAR, dealing with missing data is difficult.
  • Nevertheless, you can deal with Missing Data because you still observe some data that can be used to impute what is missing.

Types of Missing data

MISSING DATA

  • What would happen if all of the data of interest is missing?

  • Example:

    • You are working with the CPS, but are interested in looking at the relationship between income and time use.
      • CPS does NOT have time use data.
  • We are going on the “lion hunt”:

    • You can’t impute time use data
    • You can’t use complete case analysis
    • You can’t use reweighting
    • You can’t use hotdecking
    • What do we do?

What do we do?

  • One option would be using a different data set.

    • In the US, the American Time Use Survey (ATUS) could be a good option.
    • But…The data has no income information!
  • What if we could combine the two data sets?

  • This changes the problem from one of missing all data to one of missing data by design.

    • Some segment of the population was asked about income, and some other segment was asked about time use.

Imputation and Statistical Matching

  • If you consider the idea of combining two data sets, you can treat the problem as one of imputation.

    • You now have two samples that represent the same population of interest.
    • We can reasonably assume the data is MCAR, but the variables of interest are not observed in the same file.
    • Then, we can use the combined data to impute the missing data, using many of the approaches we have discussed before.
  • And there is also another method that is more commonly used (at Levy) to deal with this problem.

    • Statistical Matching (aka Data Fusion).
  • What does this imply?

    • Match individuals across datasets (“Donor” and “Recipient”)
    • Transfer information based on the matching links

Official examples:

There is a lot of work on this topic. Many statistical agencies use this approach to combine survey data with administrative data.

  • Administrative data is usually more accurate, but it is not collected for the purpose of research.
  • Survey data is collected for research purposes, but may not have accurate data in some areas (e.g., income).
  • Unless the survey data was collected with the purpose of being linked with administrative data, one requires methods similar to statistical matching to combine both data sets.

In-house examples:

  • At Levy we have used this approach to produce relevant datasets:
    • LIMEW: Levy Institute Measure of Economic Well-Being. Combines time use, wealth, and survey data (in addition to other aggregate data).

    • LIMTIP: Levy Institute Measure of Time and Income Poverty. Combines the ATUS with income/consumption data.

Framework

What do we need?

  • Consider two data sets: \(A\) and \(B\).
  • \(A\) has information on \(X\) and \(Z\)
  • \(B\) has information on \(Y\) and \(Z\)
  • We want a file that has \(X\), \(Y\) and \(Z\).

Assumptions

  • (\(X,Y,Z\)) are multivariate random variables with joint distribution \(f(x,y,z)\), that represents the population of interest.

  • Both datasets are random samples from the same population of interest.

    \(\frac{P_w(D=A|X,Y,Z)}{P_w(D=B|X,Y,Z)} = \frac{P(D=A)}{P(D=B)} = 1\)

  • Conditional Independence Assumption (CIA):

    • \(X\) and \(Y\) are independent of each other given \(Z\). \[f(x,y|z) = f(x|z)f(y|z)\]
  • The goal is to combine the two data sets to produce a file that has data on \(X\), \(Y\) and \(Z\) by identifying \(f(x,y,z)\).
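
To see why the CIA makes this identification possible: file \(A\) identifies \(f(x|z)\) and \(f(z)\), file \(B\) identifies \(f(y|z)\), and under conditional independence these pieces are enough to recover the joint distribution:

\[f(x,y,z) = f(x,y|z)\,f(z) = f(x|z)\,f(y|z)\,f(z)\]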

Statistical Matching: Limitations

  • The quality of this identification will depend on how well the conditional independence assumption holds.

  • Because of this, the synthetic dataset cannot tell you much about the covariance between \(X\) and \(Y\) (shown in red below) or about causal relationships between them.

\[Cov(X,Y,Z) = \begin{pmatrix} V(X) & \color{red}{V(X,Y)} & V(X,Z) \\ \color{red}{V(X,Y)'} & V(Y) & V(Y,Z) \\ V(X,Z)' & V(Y,Z)' & V(Z) \end{pmatrix} \]

Although you can impose certain bounds on this covariance matrix.
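
As an illustration of such bounds (for scalar \(X\), \(Y\), \(Z\)): positive semidefiniteness of the correlation matrix restricts the unidentified correlation \(\rho_{XY}\) to

\[\rho_{XZ}\rho_{YZ} - \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)} \;\le\; \rho_{XY} \;\le\; \rho_{XZ}\rho_{YZ} + \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}\]

and, in the jointly normal scalar case, the CIA implicitly sets \(\rho_{XY} = \rho_{XZ}\rho_{YZ}\), the midpoint of this interval.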

Matching Approaches:

There are two types of statistical matching procedures: unconstrained and constrained matching.

Unconstrained matching

  • Records from \(A\) and \(B\) can be used multiple times (or not at all) in the matching.

    • Absurd case: one observation from \(A\) is matched with all observations from \(B\).
  • This is the most common approach in the literature for policy evaluation.

  • Does not necessarily require \(A\) and \(B\) to be from the same population (weighted size).

    Pros: Uses the “best” candidate for each match. Cons: It may not transfer the unconditional distribution of the data.

Constrained matching

  • All records from \(A\) and \(B\) are used once and only once in the matching (without replacement).
  • When using weighted samples, records are matched until the weights are exhausted.
    • Requires \(A\) and \(B\) to be from the same population (weighted size).

    Pros: It transfers the unconditional distribution of the data. Cons: It may not use the “best” candidate for each match.

Matching Records:

  • Matching records requires defining a measure of similarity between records.

  • These measures can vary depending on the data type and the dimensionality of the data.

\[\begin{aligned} \text{ Euclidean: } d(r^A,r^B) &= \sqrt{\sum_i^k(x^A_i-x^B_i)^2 } \\ \text{ Std. Euclidean: } d(r^A,r^B) &= \sqrt{\sum_i^k\left(\frac{x^A_i-x^B_i}{\sigma_i}\right)^2 } \\ \text{ Mahalanobis: } d(r^A,r^B) &= \sqrt{(x^A-x^B)'\Sigma_x^{-1}(x^A-x^B)} \\ \end{aligned} \]

  • All these measures are useful when one has high-dimensional data (see the sketch below).
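
As an illustration, a minimal Mata sketch of the Mahalanobis distance between one record from \(A\) and one from \(B\); the records and the covariance matrix \(\Sigma_x\) are made-up numbers, not part of the example used later.

mata:
    xA = (40, 12, 100)                        // hypothetical record from A (hours, educ, iq)
    xB = (35, 16,  95)                        // hypothetical record from B
    S  = (25, 0, 0 \ 0, 4, 0 \ 0, 0, 225)     // assumed covariance matrix of X (diagonal here)
    sqrt((xA - xB) * invsym(S) * (xA - xB)')  // Mahalanobis distance between the two records
end

Note that with a diagonal \(\Sigma_x\), as assumed here, this reduces to the standardized Euclidean distance above.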

Matching: Reducing Dimensionality

  • A second alternative is to reduce the data's dimensionality before estimating distances. Three common choices follow.

Predicted mean

  • Model \(x = z\beta + \epsilon\) using \(A\).
  • Make predictions \(z\hat\beta\) for both samples.
  • Match records based on \(z\hat\beta\).
  • Gives good results for matching individuals with a similar “predicted” outcome (e.g., income).
  • Puts more “weight” on the variables used to predict the outcome.

Propensity score

  • Model the likelihood of an observation being in \(A\) using \(Z\). \[P(D=A|Z) = G(Z\gamma)\]

  • Make predictions \(\hat P\) (or the index \(Z\hat\gamma\)) for both samples.

  • Match records based on \(\hat P\).

  • A general-purpose score.

  • May be problematic if \(A\) and \(B\) have very similar distributions of \(Z\) (the score then carries little information).

  • Puts more “weight” on the variables with different distributions between \(A\) and \(B\).

Principal components

  • Use PCA to reduce the dimensionality of \(Z\) into a single index.
    • Can use either a single dataset or both.
  • Make predictions of the first principal component (\(PC1\)) for both samples.
  • Match records based on \(PC1\).
  • Puts more weight on the variables that explain most of the variance in \(Z\).

Matching: Rank Matching

  • Most distance-based matching is feasible mainly with unconstrained matching.
    • Thus, the best records are always matched.
  • When considering constrained matching, distance-based matching may not be adequate.
    • While the first records get the best matches, the last records may be matched poorly.
  • A compromise, therefore, is to use rank matching:
    • Rank observations based on a single variable (propensity score, predicted mean, etc.)
    • Match records based on rank.
  • No match will be the “best”, but this reduces the chances of poor matches.

Levy Matching Algorithm

  • At Levy, we use a constrained matching algorithm, with stratification and rank matching.

1. Data Harmonization

  • Because the data files come from different sources, they may have different variable names, coding schemes, or definitions.
  • We need to define the \(Z\) variables as identically as possible in both files.
  • Beyond definition harmonization, one must also be mindful of the distribution of the variables in both files.
    • If the distribution of \(Z\) differs between the two files, the matching may not be adequate.
  • The weighting schemes in both files should be adjusted to add up to the same population size (typically the “recipient” totals).
    • Weight adjustment could be done by selected strata (see the sketch below).
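
A minimal sketch of the weight adjustment, assuming the two harmonized files have already been appended into one dataset and use hypothetical variable names wgt (sampling weight), strata, and donor (1 = donor file, 0 = recipient file):

* population totals by stratum in each file
bysort strata: egen double tot_rec = total(cond(donor==0, wgt, 0))
bysort strata: egen double tot_don = total(cond(donor==1, wgt, 0))
* rescale donor weights so they add up to the recipient population within each stratum
gen double wgt_adj = cond(donor==1, wgt*tot_rec/tot_don, wgt)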

2. Estimation of Matching Score

  • Using either the full sample or sub-samples (strata), estimate a matching score (see the sketch below).
    • This could be a propensity score, a predicted mean, or the first principal component.
  • You may want to create “finer cells” to improve matching (without necessarily re-estimating the matching score).
    • For example, you could use gender as strata (two scores), but further create cells by age (5 groups).
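
A minimal sketch of estimating one propensity score per stratum and then building finer cells, continuing the hypothetical names from above (donor) and adding hypothetical variables female (strata), age, and harmonized covariates z1-z3:

gen double score = .
forvalues g = 0/1 {
    * one score model per stratum
    qui logit donor z1 z2 z3 if female==`g'
    qui predict double xb`g' if female==`g', xb
    replace score = xb`g' if female==`g'
}
* finer cells: strata x 5 age groups (no re-estimation of the score)
egen agegrp = cut(age), group(5)
egen cell   = group(female agegrp)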

3. Perform the match

  • Using the finest definition of “cells”, rank observations based on the matching scores.
  • Using the ranks, match observations until all weights are exhausted (from either sample); a minimal sketch follows this list.
  • “Unmatched” observations are left for later rounds that use coarser definitions of cells.
  • Matching continues until all units (recipients) are matched.
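
A minimal unweighted sketch of the ranking-and-pairing step, continuing the hypothetical names from above (cell, score, donor) and assuming y is the variable to transfer. The actual Levy algorithm additionally splits records so that the sampling weights on both sides are exhausted:

* rank observations within each cell, separately for donors and recipients
bysort cell donor (score): gen rnk = _n

* keep the donor outcomes in a file keyed by (cell, rank)
preserve
keep if donor==1
keep cell rnk y
rename y y_imputed
tempfile donors
save `donors'
restore

* attach the donor outcome with the same (cell, rank) to each recipient;
* recipients without a same-rank donor stay unmatched for a later, coarser round
merge m:1 cell rnk using `donors', keep(master match) nogen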

4. Assessing the quality of the match

  • The idea is to compare the distribution of the “transferred/imputed” data with the distribution from the “donor” data.

    • The overall distribution of the data will be the same by construction.
  • Compare distributions by strata, smaller cells, or specific variables of interest (see the sketch below).

  • Rule of thumb: differences within +/- 10% are acceptable (mean, median, standard error).

    • But it may depend on the variable of interest.
  • One may also use other approaches, like “regression”, to compare all variables at once.

  • If the distribution of the data is not adequate, one may want to re-do the matching with different “cell” definitions or matching scores.
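
A minimal sketch of the distributional check, continuing the hypothetical names from above (y, y_imputed, wgt, strata, donor):

* donor (observed) vs. recipient (imputed) distributions, by stratum
tabstat y         [aw=wgt] if donor==1, by(strata) stat(mean p50 sd)
tabstat y_imputed [aw=wgt] if donor==0, by(strata) stat(mean p50 sd)
* rule of thumb: stratum means/medians should differ by no more than about 10%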

Example

* Load the wage2 example dataset (frause is a user-written loader for example data)
frause wage2, clear
set seed 312
* Randomly split the sample in half: smp==1 (donor) and smp==0 (recipient)
xtile smp = runiform()
replace smp = smp==1
* wage is "observed" only in the donor half
gen wage_s = wage if smp==1
** Three matching scores
** 1. Predicted mean (wage model estimated on the donor half)
reg wage_s hours iq kww educ exper tenure age married black south urban sibs 
predict wageh
** 2. Propensity score (probability of being in the donor half)
logit smp hours iq kww educ exper tenure age married black south urban sibs 
predict pscore, xb
** 3. First principal component of the common variables
pca hours iq kww educ exper tenure age married black south urban sibs , comp(1)
predict pc1

* Standardize the three scores
foreach i in wageh pscore pc1 {
    qui:sum `i'
    replace `i' = (`i'-r(mean))/r(sd)
}
(467 real changes made)
(467 missing values generated)

      Source |       SS           df       MS      Number of obs   =       468
-------------+----------------------------------   F(12, 455)      =     14.83
       Model |  24043666.1        12  2003638.84   Prob > F        =    0.0000
    Residual |  61493527.5       455   135150.61   R-squared       =    0.2811
-------------+----------------------------------   Adj R-squared   =    0.2621
       Total |  85537193.6       467  183163.155   Root MSE        =    367.63

------------------------------------------------------------------------------
      wage_s | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |  -4.080234   2.204689    -1.85   0.065    -8.412869    .2524015
          iq |   3.090831   1.459171     2.12   0.035     .2232806    5.958381
         kww |   5.637987   2.888128     1.95   0.052    -.0377362    11.31371
        educ |   53.43222   10.41882     5.13   0.000     32.95724     73.9072
       exper |   8.816439    5.30953     1.66   0.098    -1.617804    19.25068
      tenure |   6.327233   3.534248     1.79   0.074    -.6182417    13.27271
         age |   10.92113   7.195943     1.52   0.130    -3.220278    25.06253
     married |    143.306   54.35023     2.64   0.009     36.49735    250.1146
       black |  -144.0597   60.51784    -2.38   0.018    -262.9889    -25.1306
       south |  -37.37316    37.7745    -0.99   0.323    -111.6073    36.86096
       urban |   200.3258   38.93558     5.15   0.000       123.81    276.8417
        sibs |   1.881618   8.468635     0.22   0.824    -14.76087    18.52411
       _cons |  -842.6706     273.28    -3.08   0.002    -1379.718   -305.6231
------------------------------------------------------------------------------
(option xb assumed; fitted values)

Iteration 0:  Log likelihood = -648.09208  
Iteration 1:  Log likelihood = -642.21625  
Iteration 2:  Log likelihood = -642.21522  
Iteration 3:  Log likelihood = -642.21522  

Logistic regression                                     Number of obs =    935
                                                        LR chi2(12)   =  11.75
                                                        Prob > chi2   = 0.4657
Log likelihood = -642.21522                             Pseudo R2     = 0.0091

------------------------------------------------------------------------------
         smp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |   .0153544   .0093201     1.65   0.099    -.0029127    .0336215
          iq |   .0024677   .0057346     0.43   0.667    -.0087719    .0137072
         kww |  -.0095774   .0113247    -0.85   0.398    -.0317735    .0126186
        educ |  -.0255069   .0409702    -0.62   0.534     -.105807    .0547932
       exper |   .0256009   .0205087     1.25   0.212    -.0145955    .0657973
      tenure |   .0131078   .0137404     0.95   0.340     -.013823    .0400385
         age |  -.0033883   .0281317    -0.12   0.904    -.0585256    .0517489
     married |  -.2382151   .2167632    -1.10   0.272    -.6630632    .1866331
       black |  -.1214461   .2276112    -0.53   0.594    -.5675559    .3246637
       south |  -.0370746   .1457381    -0.25   0.799    -.3227161    .2485669
       urban |  -.0371736   .1495076    -0.25   0.804    -.3302031     .255856
        sibs |   -.041735   .0313008    -1.33   0.182    -.1030835    .0196135
       _cons |  -.1246922   1.056515    -0.12   0.906    -2.195424    1.946039
------------------------------------------------------------------------------

Principal components/correlation                 Number of obs    =        935
                                                 Number of comp.  =          1
                                                 Trace            =         12
    Rotation: (unrotated = principal)            Rho              =     0.2112

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |       2.5348      .622213             0.2112       0.2112
           Comp2 |      1.91258      .821455             0.1594       0.3706
           Comp3 |      1.09113     .0297471             0.0909       0.4615
           Comp4 |      1.06138     .0497587             0.0884       0.5500
           Comp5 |      1.01162     .0867533             0.0843       0.6343
           Comp6 |      .924871     .0639678             0.0771       0.7114
           Comp7 |      .860903     .0907871             0.0717       0.7831
           Comp8 |      .770116      .144885             0.0642       0.8473
           Comp9 |      .625231      .119791             0.0521       0.8994
          Comp10 |       .50544     .0919858             0.0421       0.9415
          Comp11 |      .413454      .124988             0.0345       0.9760
          Comp12 |      .288466            .             0.0240       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors) 

    --------------------------------------
        Variable |    Comp1 | Unexplained 
    -------------+----------+-------------
           hours |   0.1320 |       .9558 
              iq |   0.4921 |       .3863 
             kww |   0.4220 |       .5486 
            educ |   0.4555 |        .474 
           exper |  -0.2161 |       .8817 
          tenure |   0.0463 |       .9946 
             age |   0.0527 |        .993 
         married |   0.0053 |       .9999 
           black |  -0.3722 |       .6488 
           south |  -0.2076 |       .8908 
           urban |   0.0723 |       .9868 
            sibs |  -0.3411 |       .7051 
    --------------------------------------
(score assumed)

Scoring coefficients 
    sum of squares(column-loading) = 1

    ------------------------
        Variable |    Comp1 
    -------------+----------
           hours |   0.1320 
              iq |   0.4921 
             kww |   0.4220 
            educ |   0.4555 
           exper |  -0.2161 
          tenure |   0.0463 
             age |   0.0527 
         married |   0.0053 
           black |  -0.3722 
           south |  -0.2076 
           urban |   0.0723 
            sibs |  -0.3411 
    ------------------------
(935 real changes made)
(935 real changes made)
(935 real changes made)

Next we create ranks for each observation, assuming no stratification.

bysort smp (wageh) :gen rnk1=_n
bysort smp (pscore):gen rnk2=_n
bysort smp (pc1)   :gen rnk3=_n

Finally, the imputation: simply “transfer” information from the donor to the recipient by pairing records with the same rank.

* Imputation: one imputed wage per matching score
clonevar wage1 = wage_s
clonevar wage2 = wage_s
clonevar wage3 = wage_s

* Sort donors (smp==1) first, ordered by rank, so that row k holds the donor
* with rank k; wage_s[rnk1] then picks that donor's wage for each recipient
gsort -smp rnk1
replace wage1 = wage_s[rnk1] if smp==0

gsort -smp rnk2
replace wage2 = wage_s[rnk2] if smp==0

gsort -smp rnk3
replace wage3 = wage_s[rnk3] if smp==0
(467 missing values generated)
(467 missing values generated)
(467 missing values generated)
(467 real changes made)
(467 real changes made)
(467 real changes made)

A simple quality assessment: in the recipient sample, compare regressions that use the true wage with regressions that use each imputed wage.

qui:reg wage hours iq kww educ exper tenure age married black south if smp==0
est sto m1
qui:reg wage1 hours iq kww educ exper tenure age married black south if smp==0
est sto m2
qui:reg wage2 hours iq kww educ exper tenure age married black south if smp==0
est sto m3
qui:reg wage3 hours iq kww educ exper tenure age married black south if smp==0
est sto m4
esttab m1 m2 m3 m4 , se mtitle(True Wageh pscore pca)

----------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)   
                     True           Wageh          pscore             pca   
----------------------------------------------------------------------------
hours              -3.076          -5.508*          0.208          -0.971   
                  (2.482)         (2.678)         (3.109)         (2.825)   

iq                  2.579           0.343          -2.775           3.653*  
                  (1.411)         (1.522)         (1.767)         (1.605)   

kww                 5.410*          8.750**         1.954           7.545*  
                  (2.745)         (2.961)         (3.438)         (3.124)   

educ                46.14***        77.82***        8.331           44.53***
                  (10.18)         (10.98)         (12.75)         (11.58)   

exper               11.66*          15.17**        -2.006           5.108   
                  (4.980)         (5.372)         (6.237)         (5.667)   

tenure              3.844           5.238          -0.511          0.0757   
                  (3.332)         (3.595)         (4.174)         (3.792)   

age                -0.945          -9.667          -10.91          -5.380   
                  (6.905)         (7.449)         (8.648)         (7.858)   

married             187.2***        165.6**         89.99           38.58   
                  (54.26)         (58.54)         (67.96)         (61.75)   

black              -64.09          -99.07           46.48          -39.87   
                  (52.43)         (56.57)         (65.67)         (59.67)   

south              -80.79*         -74.14           9.854          -120.9** 
                  (35.19)         (37.96)         (44.07)         (40.04)   

_cons              -254.2          -199.6          1349.4***       -104.9   
                  (251.2)         (271.0)         (314.6)         (285.9)   
----------------------------------------------------------------------------
N                     467             467             467             467   
----------------------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Next Class: Micro Simulation

Just more imputations.