Basic Regression Analysis
Time series “works” differently from repeated cross-sections.
You do not have access to a random sample. (The window of time is fixed.)
You have access to a single “random” time line.
And in time series, one needs to be quite aware that data has baggage… What you see today is the product of everything that happened in the past.
This is what we call path dependence, or simply serial correlation.
Data cannot be arbitrarily reordered. (The past affects the future.)
Randomness of the data comes from the uncertainty of shocks that affect a variable over time, not from sampling.
Your “Sample” is one realized path that you observe in a narrow window of time.
Because observations are no longer independent, we will need to worry about correlation across time.
In fact, because data may be strongly correlated across time (say your age), it may generate some problems when applying OLS.
So, we must learn “new” tools to deal with this problem.
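A minimal sketch of the point above (Python, all numbers made up): one realized AR(1) path, where each observation inherits part of yesterday's shock, so adjacent observations are correlated rather than i.i.d. draws.

```python
import numpy as np

rng = np.random.default_rng(42)
T, rho = 500, 0.8
u = np.empty(T)
u[0] = rng.normal()
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal()  # today inherits part of yesterday

# First-order sample autocorrelation: close to rho = 0.8, far from 0
r1 = np.corrcoef(u[:-1], u[1:])[0, 1]
print(round(r1, 2))
```

With independent draws this correlation would be near zero; here it is not, which is exactly why the usual OLS machinery needs revisiting.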
\[GDP_t = a_0 + a_1 educ_t + a_2 Invest_t + a_3 Unemp_t + u_t\]
Education, investment, and the unemployment rate are assumed to affect GDP contemporaneously, but lags or leads of the data have no effect on GDP.
\(fr_t\): Fertility Rate; \(te_t\): Tax exemption
This is an FDL model with 2 lags: \[fr_t = a_0 + \delta_0 te_t + \delta_1 te_{t-1} + \delta_2 te_{t-2} + e_t\]
\[y_t = a_0 + \sum_{k=0}^q \delta_k z_{t-k} + e_t\]
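A sketch of how an FDL is estimated in practice (Python rather than Stata; all coefficients are made-up illustrations): simulate data from an FDL(2), then regress \(y_t\) on \(z_t, z_{t-1}, z_{t-2}\) by OLS and recover the \(\delta_k\)'s.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
z = rng.normal(size=T)
a0, d = 1.0, np.array([0.5, 0.3, 0.1])   # hypothetical lag coefficients

# y_t = a0 + d0*z_t + d1*z_{t-1} + d2*z_{t-2} + e_t (loses 2 observations)
y = a0 + d[0] * z[2:] + d[1] * z[1:-1] + d[2] * z[:-2] \
    + 0.1 * rng.normal(size=T - 2)

# OLS on the contemporaneous value and the two lags
X = np.column_stack([np.ones(T - 2), z[2:], z[1:-1], z[:-2]])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta.round(2))  # close to [1.0, 0.5, 0.3, 0.1]
```

Note that each added lag costs one usable observation at the start of the sample.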
\[ \text{Wrong: } y_t = a_0 + \sum_{k=0}^{\infty} \delta_k z_{t-k} + e_t\] \[ \text{Better: } y_t = a_0 + \sum_{k=0}^{\infty} \gamma \rho^k z_{t-k} + e_t\]
\[\begin{aligned} y_t &= a_0 + \gamma z_t + \gamma \rho z_{t-1} + \dots + e_t \\ y_{t-1} &= a_0 + \gamma z_{t-1} + \gamma \rho z_{t-2} + \dots + e_{t-1} \end{aligned} \]
\[y_t = \rho y_{t-1} + a_0 (1-\rho) + \gamma z_t + v_{t}, \quad v_t = e_t - \rho e_{t-1}\]
Which requires really strong assumptions!
\[\text{Short: }\frac{\partial y_t}{\partial z_{t-k}}=\gamma \rho^k \quad \text{and} \quad \text{Long: }\frac{\partial y_t}{\partial z}=\frac{\gamma}{1-\rho}\]
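A quick numerical check of the IDL multipliers (with illustrative values of \(\gamma\) and \(\rho\)): the long-run effect is the sum of all short-run effects \(\gamma\rho^k\), a geometric series that sums to \(\gamma/(1-\rho)\).

```python
gamma, rho = 0.5, 0.8                             # made-up values
short_run = [gamma * rho**k for k in range(200)]  # effect of z_{t-k} on y_t
long_run = sum(short_run)
print(round(long_run, 4), round(gamma / (1 - rho), 4))  # both ~ 2.5
```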
Because the IDL model imposes strong assumptions on the coefficients, we can relax them by allowing for additional lags. This is called the rational distributed lag (RDL) model. \[y_t = a_0 + \gamma_0 z_t + \gamma_1 z_{t-1} + \rho y_{t-1} + e_t - \rho e_{t-1}\]
Which has the following short and long effects:
\[\text{ Short:}\frac{\partial y_t}{\partial z_t} = \gamma_0 \] \[\text{ Short:}\frac{\partial y_t}{\partial z_{t-k}} = \rho^{k-1}(\rho \gamma_0 + \gamma_1) \] \[\text{ Long:}\frac{\partial y_t}{\partial z} = \frac{\gamma_0 + \gamma_1}{1-\rho} \]
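The same numerical check for the RDL effects (illustrative \(\gamma_0, \gamma_1, \rho\)): the impact effect \(\gamma_0\) plus all later effects \(\rho^{k-1}(\rho\gamma_0 + \gamma_1)\) should accumulate to the long-run effect \((\gamma_0+\gamma_1)/(1-\rho)\).

```python
g0, g1, rho = 0.4, 0.2, 0.7   # made-up values
effects = [g0] + [rho**(k - 1) * (rho * g0 + g1) for k in range(1, 200)]
long_run = sum(effects)
print(round(long_run, 4), round((g0 + g1) / (1 - rho), 4))  # both ~ 2.0
```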
At least for M1 and M2
A1. Linear in Parameters: Same old, same old, \(y_t = \beta_0 + \beta_1 x_{1t} + \dots + \beta_k x_{kt} + u_t\)
A2. No Perfect Collinearity: Also Same old, same old
\[X=\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} \\ x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{T1} & x_{T2} & \dots & x_{Tk} \end{pmatrix} \]
A3. Zero Conditional Mean
\[E(u_t|X)=0 \]
So that \(X\) is strictly exogenous (across all time periods).
Not only should \(x_t\) be unaffected by \(u_t\); so should \(x_{t-1}\) and \(x_{t+1}\).
A1-A3 will guarantee that OLS is unbiased.
A4: Strong Homoskedasticity
\[Var(u_t|X)=\sigma^2 \]
A5: No Serial Correlation (Correlation across time of the errors)
\[Corr(u_t,u_s|X)=0 \text{ for all } t\neq s \]
Also difficult to fulfill, because unobservables may have inertia and depend on their own past values.
Nevertheless, under A1-A5, standard errors can be estimated using the usual formulas:
\[\begin{aligned} \hat{Var}(\hat{\beta}) &= \hat{\sigma}^2(X'X)^{-1} \\ \hat{Var}(\hat{\beta_k}) &= \frac{\hat{\sigma}^2}{SST_k(1-R^2_k)} \\ \hat \sigma^2 &= \frac{1}{T-k-1}\sum_{t=1}^T \hat{u}_t^2 \end{aligned} \]
And the OLS estimators are BLUE! (Best Linear Unbiased Estimators)
A6: Normality. The \(\beta\)'s are normally distributed, and F-tests and t-tests are valid.
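The two variance formulas above can be checked against each other; a minimal Python sketch with made-up data (the notes use Stata), where the \(k\)-th diagonal of \(\hat\sigma^2(X'X)^{-1}\) equals \(\hat\sigma^2 / (SST_k(1-R^2_k))\), with \(R^2_k\) from regressing \(x_k\) on the other regressors.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
x1 = rng.normal(size=T)
x2 = 0.5 * x1 + rng.normal(size=T)        # deliberately correlated regressors
X = np.column_stack([np.ones(T), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=T)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
u = y - X @ beta
sigma2 = (u @ u) / (T - 2 - 1)            # T - k - 1 with k = 2 slopes
var_matrix = sigma2 * np.linalg.inv(X.T @ X)

# Per-coefficient formula for x1: partial out x2 (Frisch-Waugh logic)
Z = np.column_stack([np.ones(T), x2])
x1_hat = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
sst1 = ((x1 - x1.mean())**2).sum()
r2_1 = 1 - ((x1 - x1_hat)**2).sum() / sst1
var_b1 = sigma2 / (sst1 * (1 - r2_1))

print(np.isclose(var_matrix[1, 1], var_b1))  # True: same number
```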
Model: \(i_t = \beta_0 + \beta_1 inf_t + \beta_2 def_t + u_t\)
A1: \(\checkmark\) (but questionable)
A2: \(\checkmark\) (almost never a problem)
A3: NO! Deficits and inflation today may affect adjustments in the future (\(u_{t+1}\)); similarly, \(u_t\) may force future adjustments of deficits and inflation.
A4: Perhaps? Usually there is a direct relationship between deficit and uncertainty, which will generate heteroskedasticity.
A5: NO! There could be many things in \(u_t\) that are correlated across time. (taxes?)
A6: NO…the errors are almost never normal
Source | SS df MS Number of obs = 56
-------------+---------------------------------- F(2, 53) = 40.09
Model | 272.420338 2 136.210169 Prob > F = 0.0000
Residual | 180.054275 53 3.39725047 R-squared = 0.6021
-------------+---------------------------------- Adj R-squared = 0.5871
Total | 452.474612 55 8.22681113 Root MSE = 1.8432
------------------------------------------------------------------------------
i3 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
inf | .6058659 .0821348 7.38 0.000 .4411243 .7706074
def | .5130579 .1183841 4.33 0.000 .2756095 .7505062
_cons | 1.733266 .431967 4.01 0.000 .8668497 2.599682
------------------------------------------------------------------------------
\[FRate_t = 98.7 + 0.08 PE_t - 24.24 WW2_t - 31.6 Pill_t + e_t\]
Very Similar to what was done in Cross Sectional Models.
Using Logs of the Dep variable changes the interpretation of the coefficients.
\[\Delta log(x)\simeq \%\Delta x\]
reg log_gdp year
The coefficient on year should give you the average growth rate of GDP.
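A small simulation of this idea (Python rather than Stata; a made-up economy growing 3% per year): regressing log GDP on year returns a slope near 0.03.

```python
import numpy as np

rng = np.random.default_rng(7)
years = np.arange(1960, 2020)
# Hypothetical log GDP: 3% trend growth plus small noise
log_gdp = 8.0 + 0.03 * (years - 1960) + 0.01 * rng.normal(size=years.size)

X = np.column_stack([np.ones(years.size), years])
slope = np.linalg.lstsq(X, log_gdp, rcond=None)[0][1]
print(round(slope, 3))  # ~ 0.03, the average growth rate
```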
Trends are very common in time series data.
Ignoring this may cause problems, as one may identify spurious relationships (things that look like significant effects even though the variables are not related).
Consider the following model (investment on housing, and housing prices):
Source | SS df MS Number of obs = 42
-------------+---------------------------------- F(1, 40) = 10.53
Model | .254364468 1 .254364468 Prob > F = 0.0024
Residual | .966255566 40 .024156389 R-squared = 0.2084
-------------+---------------------------------- Adj R-squared = 0.1886
Total | 1.22062003 41 .02977122 Root MSE = .15542
------------------------------------------------------------------------------
linvpc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lprice | 1.240943 .3824192 3.24 0.002 .4680452 2.013841
_cons | -.5502345 .0430266 -12.79 0.000 -.6371945 -.4632746
------------------------------------------------------------------------------
\[E(log(invpc_t)|x) = -20.04 -0.38 log(price) + 0.009 year \]
Source | SS df MS Number of obs = 42
-------------+---------------------------------- F(2, 39) = 10.08
Model | .415945108 2 .207972554 Prob > F = 0.0003
Residual | .804674927 39 .02063269 R-squared = 0.3408
-------------+---------------------------------- Adj R-squared = 0.3070
Total | 1.22062003 41 .02977122 Root MSE = .14364
------------------------------------------------------------------------------
linvpc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lprice | -.3809612 .6788352 -0.56 0.578 -1.754035 .9921125
year | .0098287 .0035122 2.80 0.008 .0027246 .0169328
_cons | -20.03976 6.964526 -2.88 0.006 -34.12684 -5.952675
------------------------------------------------------------------------------
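The pattern in the two tables above can be reproduced in a small simulation (Python, all numbers made up): two unrelated series that share a time trend look strongly related until the trend is added as a control.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
t = np.arange(T)
x = 0.05 * t + rng.normal(size=T)   # trending, unrelated to y's shocks
y = 0.05 * t + rng.normal(size=T)   # trending, unrelated to x's shocks

# Naive regression of y on x: a large, "significant-looking" slope
X1 = np.column_stack([np.ones(T), x])
b_naive = np.linalg.lstsq(X1, y, rcond=None)[0][1]

# Adding the trend as a control drives the slope toward zero
X2 = np.column_stack([np.ones(T), x, t])
b_trend = np.linalg.lstsq(X2, y, rcond=None)[0][1]
print(round(b_naive, 2), round(b_trend, 2))
```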
One of the consequences of spurious regressions is that the \(R^2\) will be inflated (capturing the common trend or seasonality).
Even if we add trends or seasonalities, the default \(R^2\) will be too large, because it still describes ALL the variation in the dependent variable.
A better approach to understanding the true explanatory power of the model is to use an \(R^2\) that adjusts for trends and seasonality.
\[y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \theta \times t + \sum \gamma_k \times D_k + u_t\]
Where \(D_k\) are dummies for seasonality, and \(\theta \times t\) is a trend.
\[\tilde w_t = w_t - E(w_t \mid t, D_1, D_2, \dots, D_k) \quad \forall w \in \{y, x_1, x_2\}\]
\[\tilde y_t = \beta_1 \tilde x_{1t} + \beta_2 \tilde x_{2t} + u_t\]
\[aR^2 = 1-\frac{\sum \hat u^2_t}{\sum \tilde y^2_t}\]
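A sketch of this detrended \(R^2\) in Python (made-up data, a linear trend only, no seasonal dummies): detrend \(y\) and \(x\), rerun OLS on the residuals, and compare with the inflated \(R^2\) from the regression in levels.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 120
t = np.arange(T)
x = 0.05 * t + rng.normal(size=T)
y = 0.05 * t + 0.2 * x + rng.normal(size=T)   # small true effect of x

def detrend(w):
    # Residual from regressing w on a constant and the trend
    Z = np.column_stack([np.ones(T), t])
    return w - Z @ np.linalg.lstsq(Z, w, rcond=None)[0]

def r2(dep, reg):
    X = np.column_stack([np.ones(T), reg])
    u = dep - X @ np.linalg.lstsq(X, dep, rcond=None)[0]
    return 1 - (u @ u) / ((dep - dep.mean())**2).sum()

r2_levels = r2(y, x)                         # inflated by the shared trend
r2_detr = r2(detrend(y), detrend(x))         # the honest explanatory power
print(round(r2_levels, 2), round(r2_detr, 2))
```

The levels \(R^2\) is large mostly because both series trend upward; the detrended \(R^2\) is much smaller and reflects only what \(x\) explains beyond the trend.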
Next week, Advanced Time series