Pooling Data Together: Cross-Sections and Panel Data
Up to this point, we have covered the analysis of cross-sectional data.
Many individuals at a single point in time.
Towards the end of the semester, we will also cover the analysis of time series data.
A single individual across time.
Today, we will cover the analysis of panel data and repeated cross-sections: many individuals across time.
This type of data, also known as longitudinal data, has advantages over a single cross-section, as it provides more information that helps deal with the unobserved components of \(e\).
And it is often the only way to answer certain questions.
Pooling independent cross-sections
We first consider the case of independent cross-sections.
We have access to surveys that may be collected regularly (household budget surveys, for example).
We assume that individuals across these surveys are independent from each other (no panel structure).
This scenario is typically used to increase sample sizes and thus the power of the analysis (larger \(N\), smaller SEs).
Only minor considerations are needed when analyzing this type of data.
We need to account for the fact that the data come from different years. This can be done by including year dummies.
We may need to standardize variables to make them comparable across years (inflation adjustments, etc.).
Example
Let's use the dataset fertil1 to estimate the changes in fertility rates across time. This data comes from the General Social Survey.
Code
frause fertil1, clear
regress kids educ age agesq black east northcen west farm othrural town smcity i.year, robust
The Chow test can be used to test whether the coefficients of a regression model are the same across two groups.
We saw this test back when we were discussing dummy variables.
We can also use this test to check if coefficients of a regression model are the same across two time periods. (Has the wage structure changed across time?)
This is the case of interest here.
Not much changes compared with before, although it can be a bit more tedious to code.
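As a sketch, one way to run this test is to interact a period dummy with the regressors and jointly test the interaction terms. The example below uses the fertil1 data from above; the sample split and the choice of regressors are purely illustrative.

```stata
* Chow-type test across two periods: interact a "post" dummy with the
* regressors, then jointly test the interaction terms.
frause fertil1, clear
gen post = year >= 80                        // illustrative sample split
regress kids i.post##(c.educ c.age c.agesq), robust
* H0: the coefficients are the same in both periods
testparm i.post#(c.educ c.age c.agesq)
```

Rejecting the null suggests the coefficients differ across the two periods.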
One advantage of pooling cross-section data is that it can be used to estimate causal effects using a method known as Difference-in-Differences (DD).
Consider the following case:
There was a project to construct an incinerator in a city. You are asked to evaluate the impact of this project on the prices of houses around the area.
You have access to data for two years: 1978 and 1981.
In 1978, there was no information about the project. In 1981, the project was announced, but it only began operations in 1985.
We could start by estimating the effect using the simple model: \[rprice = \beta_0 + \beta_1 nearinc + e\]
using only 1981 data. But this would not be a good idea. Why?
One advantage of DD is that it can control for unobserved factors that may be correlated with the outcome.
Without controls, however, estimates may not have enough precision.
But, we could add controls!
\[ y = \beta_0 + X \gamma + \beta_1 post + \beta_2 treat + \beta_3 post \times treat + e\]
But it's not as easy as it may seem! (just adding regressors is not a good approach)
This method requires additional assumptions (e.g., that \(\gamma\) is fixed across groups and periods), which may be very strong.
Note: For DD to work, you need to assume the two groups would follow the same path in the absence of the treatment (parallel trends assumption).
Otherwise, you are just using trend differences!
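A minimal sketch of the DD regression for the incinerator example, assuming Wooldridge's kielmc dataset (with rprice, nearinc, a 1981 dummy y81, and house characteristics) is available via frause:

```stata
* Difference-in-Differences for the incinerator example
frause kielmc, clear
* The coefficient on the interaction (post x treat) is the DD estimate
regress rprice i.y81##i.nearinc, robust
* With controls (assumes their coefficients are fixed across groups/periods)
regress rprice i.y81##i.nearinc age agesq rooms baths, robust
```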
Diff in Diff in Diff
An alternative approach is to use a triple-difference model.
Setup:
You still have two groups: Control and Treatment (which are easily identifiable)
You have two time periods: Pre and Post (which are also easily identifiable)
You have a different sample, where you can identify controls and treatment, as well as the pre- and post- periods. This sample was not treated!
Estimation:
Estimate the DD for the original sample and for the new, untreated sample.
The difference between these two estimates gives you the triple difference.
Example: a smoking-ban analysis based on age (DD), but using both treated and untreated states (DDD).
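A sketch of how the triple difference could be run in a single regression. All variable names here are hypothetical: smokes for the outcome, treatstate for states that passed a ban, young for the targeted age group, post for the period after the ban, and state for clustering.

```stata
* Triple difference (DDD): the coefficient on the three-way interaction
* post#treatstate#young is the DDD estimate.
regress smokes i.post##i.treatstate##i.young, cluster(state)
```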
General Framework and Pseudo Panels
One general structure for policy analysis is the use of pseudo panels.
Pseudo panels are a way to use repeated cross-section data while controlling for some unobserved heterogeneity across specific groups (the pseudo panels).
For Pseudo-panels, we need to identify a group that could be followed across time.
This cannot be a group of specific individuals (the cross-sections are repeated, so the same people are not followed).
But we could use groups of states, cohorts (year of birth), etc.
In this case, the model would look like this: \[y_{igt} = \lambda_t + \alpha_g + \beta x_{gt} + z_{igt}\gamma + e_{igt}\]
Where \(g\) is the group, \(t\) is the time, and \(i\) is the individual.
This model can be estimated by using dummies. (one dummy for each group and time-period)
And \(\beta\) is the coefficient of interest (the impact of the policy \(x_{gt}\)).
This may only work if we assume \(\beta\) is constant across time and groups.
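As a sketch, the group-and-time-dummy specification above could be estimated as follows. The variable names are hypothetical: y is the outcome, x the group-level policy, z1 and z2 individual controls, g the group, and t the survey year.

```stata
* Pseudo-panel estimation: one dummy per group and per time period.
* Clustering at the group level accounts for within-group correlation.
regress y i.g i.t x z1 z2, cluster(g)
```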
Alternative
We could also use a more general model: \[y_{igt} = \lambda_{gt}+ \beta x_{gt} + z_{igt}\gamma + e_{igt}\]
where \(\lambda_{gt}\) is a group-time fixed effect. (Dummy for each group-time combination)
Nevertheless, while more flexible, this also imposes other types of assumptions, and it might even be infeasible if we have a large number of groups and time periods.
Still, we require \(\beta\) to be homogeneous. If that is not the case, you may still suffer from contamination bias.
Panel data
Baby steps: 2 period panel data
Panel Data, or longitudinal data, is a type of data that has information about the same individual across time.
The simplest structure is one where individuals are followed for only 2 periods.
The main advantage of panel data (even the two-period version) is that it allows us to control for unobserved heterogeneity across individuals.
But only if we are willing to assume the fixed effects are constant across time.
So how is this reflected in the model specification? \[y_{it} = \beta_0 + x_{it}\beta + z_t \theta + w_i \delta + e_i + e_t + e_{it}\]
Where \(i\) refers to individuals or panel units, and \(t\) refers to time periods.
Also, the \(X's\), \(Z's\), and \(W's\) are variables that vary across individuals and time, across time only, or across individuals only.
There are also three types of errors: those that contain unobservables that vary across individuals (\(e_i\)), across time (\(e_t\)), and across both individuals and time (\(e_{it}\), the idiosyncratic error).
\(e_i\) is usually referred to as the individual fixed effect, and \(e_t\) as the time fixed effect.
In a 2-period panel, controlling for time effects is not much of a concern (it's just one dummy).
What is more concerning is the unobserved individual fixed effect.
This is pretty similar to the generalized Pooling model we saw before.
How estimation changes
For time effects, we assume we can control for them with a single dummy.
Pooled OLS requires \(e_i\) to be uncorrelated with \(x_{it}\) (otherwise estimates are biased), and standard errors need to be clustered at the individual level.
You can aim to estimate all individual fixed effects using dummies (FE estimator). \[y_{it} = \beta_0 + \beta_1 x_{it} + \delta t + \sum_i \alpha_i D_i + e_{it}\]
Time-invariant variables can no longer be estimated.
You can also estimate the model in differences (FD estimator).
Now you have only 1 observation per panel instead of 2, and with two periods the result is identical to the FE estimator.
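To see why, take first differences of the FE specification above: the individual effects \(\alpha_i\) drop out, \[\Delta y_i = y_{i2} - y_{i1} = \delta + \beta_1 \Delta x_i + \Delta e_{i}\] so regressing \(\Delta y_i\) on \(\Delta x_i\) (with a constant) recovers the same \(\beta_1\).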
Example
Code
** This data is in wide format
frause slp75_81, clear
** Let's reshape it so it's in standard long format
gen id = _n
reshape long educ gdhlth marr slpnap totwrk yngkid, i(id) j(year)
gen time = year==81
xtset id time
** Regression as Pooled Cross-section
qui: reg slpnap time totwrk educ marr yngkid gdhlth male, cluster(id)
est sto m1
** Using FE
qui: areg slpnap time totwrk educ marr yngkid gdhlth male, absorb(id) cluster(id)
est sto m2
** Using FD
qui: reg d.slpnap d.time d.totwrk d.educ d.marr d.yngkid d.gdhlth d.male, robust
est sto m3
(j = 75 81)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 239 -> 478
Number of variables 21 -> 16
j variable (2 values) -> year
xij variables:
educ75 educ81 -> educ
gdhlth75 gdhlth81 -> gdhlth
marr75 marr81 -> marr
slpnap75 slpnap81 -> slpnap
totwrk75 totwrk81 -> totwrk
yngkid75 yngkid81 -> yngkid
-----------------------------------------------------------------------------
Panel variable: id (strongly balanced)
Time variable: time, 0 to 1
Delta: 1 unit