Pooling Data Together: Cross-Sections and Panel Data
Up to this point, we have covered the analysis of cross-sectional data.
Many individuals at a single point in time.
Towards the end of the semester, we will also cover the analysis of time series data.
A single individual across time.
Today, we will cover the analysis of panel data and repeated cross-sections: many individuals across time.
This type of data, also known as longitudinal data, has advantages over a single cross-section, as it provides more information that helps deal with the unobserved components of \(e\).
And it is often the only way to answer certain questions.
Pooling independent cross-sections
We first consider the case of independent cross-sections.
We have access to surveys that may be collected regularly (household budget surveys, for example).
We assume that individuals across these surveys are independent from each other (no panel structure).
This scenario is typically used to increase sample sizes and thus the power of the analysis (larger \(N\), smaller SEs).
Only minor considerations are needed when analyzing this type of data.
We need to account for the fact that the data come from different years. This can be done by including year dummies.
We may need to standardize variables to make them comparable across years (inflation adjustments, etc.).
Example
Let's use the dataset fertil1 to estimate the changes in fertility rates across time. This data comes from the General Social Survey.
Code
frause fertil1, clear
regress kids educ age agesq black east northcen west farm othrural town smcity i.year, robust
The Chow test can be used to test whether the coefficients of a regression model are the same across two groups.
We saw this test back when we were discussing dummy variables.
We can also use this test to check if coefficients of a regression model are the same across two time periods. (Has the wage structure changed across time?)
This is the case of interest here.
Not much changes compared with before, although it can be a bit more tedious to code.
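As a sketch, one way to run this test is to interact a period dummy with the regressors and jointly test the interaction terms. The example below uses the fertil1 data from above; the sample split and the choice of regressors are purely illustrative.

```stata
* Chow-type test across two periods: interact a "post" dummy with the
* regressors, then jointly test the interaction terms.
frause fertil1, clear
gen post = year >= 80                        // illustrative sample split
regress kids i.post##(c.educ c.age c.agesq), robust
* H0: the coefficients are the same in both periods
testparm i.post#(c.educ c.age c.agesq)
```

Rejecting the null suggests the coefficients differ across the two periods.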
One advantage of pooling cross-section data is that it can be used to estimate causal effects using a method known as Difference-in-Differences (DD).
Consider the following case:
There was a project to construct an incinerator in a city. You are asked to evaluate the impact of this project on the prices of houses around the area.
You have access to data for two years: 1978 and 1981.
In 1978, there was no information about the project. In 1981, the project was announced, but it only began operations in 1985.
We could start by estimating the effect using the simple model: \[rprice = \beta_0 + \beta_1 nearinc + e\]
using only 1981 data. But this would not be a good idea. Why?
One advantage of DD is that it can control for unobserved factors that may be correlated with the outcome.
Without controls, however, estimates may not have enough precision.
But, we could add controls!
\[ y = \beta_0 + X \gamma + \beta_1 post + \beta_2 treat + \beta_3 post \times treat + e\]
But it's not as easy as it may seem! (just adding regressors is not a good approach)
This method requires additional assumptions (e.g., that \(\gamma\) is fixed across groups and periods), which may be very strong.
Note: For DD to work, you need to assume the two groups would follow the same path in the absence of the treatment (parallel trends assumption).
Otherwise, you are just using trend differences!
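A minimal sketch of the DD regression for the incinerator example, assuming Wooldridge's kielmc dataset (with rprice, nearinc, a 1981 dummy y81, and house characteristics) is available via frause:

```stata
* Difference-in-Differences for the incinerator example
frause kielmc, clear
* The coefficient on the interaction (post x treat) is the DD estimate
regress rprice i.y81##i.nearinc, robust
* With controls (assumes their coefficients are fixed across groups/periods)
regress rprice i.y81##i.nearinc age agesq rooms baths, robust
```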
Diff in Diff in Diff
An alternative approach is to use a triple-difference model.
Setup:
You still have two groups: Control and Treatment (which are easily identifiable)
You have two time periods: Pre and Post (which are also easily identifiable)
You have a different sample, where you can identify controls and treatment, as well as the pre- and post- periods. This sample was not treated!
Estimation:
Estimate the DD for the original sample and for the new, untreated sample.
The difference between these two estimates gives you the triple difference.
Example: a smoking-ban analysis based on age (DD), but using both treated and untreated states (DDD).
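A sketch of how the triple difference could be run in a single regression. All variable names here are hypothetical: smokes for the outcome, treatstate for states that passed a ban, young for the targeted age group, post for the period after the ban, and state for clustering.

```stata
* Triple difference (DDD): the coefficient on the three-way interaction
* post#treatstate#young is the DDD estimate.
regress smokes i.post##i.treatstate##i.young, cluster(state)
```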
General Framework and Pseudo Panels
One general structure for policy analysis is the use of pseudo panels.
Pseudo panels are a way to use repeated cross-section data while controlling for some unobserved heterogeneity across specific groups (the pseudo panels).
For Pseudo-panels, we need to identify a group that could be followed across time.
This cannot be a group of specific individuals (the cross-sections are repeated, so the same people are not followed).
But we could use groups of states, cohorts (year of birth), etc.
In this case, the model would look like this: \[y_{igt} = \lambda_t + \alpha_g + \beta x_{gt} + z_{igt}\gamma + e_{igt}\]
Where \(g\) is the group, \(t\) is the time, and \(i\) is the individual.
This model can be estimated by using dummies. (one dummy for each group and time-period)
And \(\beta\) is the coefficient of interest (the impact of the policy \(x_{gt}\)).
This may only work if we assume \(\beta\) is constant across time and groups.
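As a sketch, the group-and-time-dummy specification above could be estimated as follows. The variable names are hypothetical: y is the outcome, x the group-level policy, z1 and z2 individual controls, g the group, and t the survey year.

```stata
* Pseudo-panel estimation: one dummy per group and per time period.
* Clustering at the group level accounts for within-group correlation.
regress y i.g i.t x z1 z2, cluster(g)
```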
Alternative
We could also use a more general model: \[y_{igt} = \lambda_{gt}+ \beta x_{gt} + z_{igt}\gamma + e_{igt}\]
where \(\lambda_{gt}\) is a group-time fixed effect. (Dummy for each group-time combination)
Nevertheless, while more flexible, this also imposes other types of assumptions, and it might even be infeasible if we have a large number of groups and time periods.
Still, we require \(\beta\) to be homogeneous. If that is not the case, you may still suffer from contamination bias.
Panel data
Baby steps: 2 period panel data
Panel Data, or longitudinal data, is a type of data that has information about the same individual across time.
The simplest structure is one where individuals are followed for only 2 periods.
The main advantage of panel data (even the two-period version) is that it allows us to control for unobserved heterogeneity across individuals.
But only if we are willing to assume the fixed effects are constant across time.
So how is this reflected in the model specification? \[y_{it} = \beta_0 + x_{it}\beta + z_t \theta + w_i \delta + e_i + e_t + e_{it}\]
Where \(i\) refers to individuals or panel units, and \(t\) refers to time periods.
Also, the \(X's\), \(Z's\), and \(W's\) are variables that vary across individuals and time, across time only, or across individuals only.
There are also three types of errors: those that contain unobservables that vary across individuals (\(e_i\)), across time (\(e_t\)), and across both individuals and time (\(e_{it}\), the idiosyncratic error).
\(e_i\) is usually referred to as the individual fixed effect, and \(e_t\) as the time fixed effect.
In a 2-period panel, controlling for time effects is not much of a concern (it's just one dummy).
What is more concerning is the unobserved individual fixed effect.
This is pretty similar to the generalized Pooling model we saw before.
How estimation changes
For time effects, we assume we can control for them with a single dummy.
Pooled OLS requires \(e_i\) to be uncorrelated with \(x_{it}\) (otherwise estimates are biased), and standard errors need to be clustered at the individual level.
You can aim to estimate all individual fixed effects using dummies (FE estimator). \[y_{it} = \beta_0 + \beta_1 x_{it} + \delta t + \sum_i \alpha_i D_i + e_{it}\]
Time-invariant variables can no longer be estimated.
You can also estimate the model in differences (FD estimator).
Now you have only 1 observation per panel instead of 2, and with two periods the result is identical to the FE estimator.
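To see why, take first differences of the FE specification above: the individual effects \(\alpha_i\) drop out, \[\Delta y_i = y_{i2} - y_{i1} = \delta + \beta_1 \Delta x_i + \Delta e_{i}\] so regressing \(\Delta y_i\) on \(\Delta x_i\) (with a constant) recovers the same \(\beta_1\).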
Example
Code
** This data is in wide format
frause slp75_81, clear
** Let's reshape it so it's in standard long format
gen id = _n
reshape long educ gdhlth marr slpnap totwrk yngkid, i(id) j(year)
gen time = year==81
xtset id time
** Regression as Pooled Cross-section
qui: reg slpnap time totwrk educ marr yngkid gdhlth male, cluster(id)
est sto m1
** Using FE
qui: areg slpnap time totwrk educ marr yngkid gdhlth male, absorb(id) cluster(id)
est sto m2
** Using FD
qui: reg d.slpnap d.time d.totwrk d.educ d.marr d.yngkid d.gdhlth d.male, robust
est sto m3
(j = 75 81)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 239 -> 478
Number of variables 21 -> 16
j variable (2 values) -> year
xij variables:
educ75 educ81 -> educ
gdhlth75 gdhlth81 -> gdhlth
marr75 marr81 -> marr
slpnap75 slpnap81 -> slpnap
totwrk75 totwrk81 -> totwrk
yngkid75 yngkid81 -> yngkid
-----------------------------------------------------------------------------
Panel variable: id (strongly balanced)
Time variable: time, 0 to 1
Delta: 1 unit