Levy Economics Institute
August 23, 2024
Once upon some data
5 reasons to do EDA:
All in all, EDA should help you identify the key features of the data and how they relate to each other.
Look at key variables:
Describe what you see:
The frequency, or more precisely the absolute frequency or count, of a value of a variable is simply the number of observations with that particular value.
The relative frequency is the frequency expressed in relative, or percentage, terms: the proportion of observations with that particular value among all observations.
We can also use probabilities: the relative likelihood of a value of a variable.
How to? `tabulate [variable]`, or better yet `fre [variable]` (install `fre` from SSC).
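As a quick numeric illustration of absolute versus relative frequency, here is a minimal Python sketch (the ratings data is made up for the example):

```python
from collections import Counter

# Hypothetical data: star ratings of a handful of hotels
stars = [3, 4, 3, 5, 4, 3, 4, 4]

counts = Counter(stars)                      # absolute frequency (count)
n = len(stars)
rel = {k: v / n for k, v in counts.items()}  # relative frequency (proportion)

print(counts[4])   # → 4   (absolute frequency of the value 4)
print(rel[4])      # → 0.5 (relative frequency of the value 4)
```

Relative frequencies always sum to 1 across all values, which is what lets them be read as empirical probabilities.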
A key part of EDA is to look at the (empirical) distribution of the most important variables.
Histogram reveals important properties of a distribution:
Histograms in Stata are created with the `histogram` command:
histogram [variable] [if] [in] [fweight], [bin(#) width(#) discrete] ///
[density] [frequency] [fraction]
You can only create a histogram of one variable at a time (unless graphs are combined), and you can determine how "fine" or "coarse" the histogram is (a bit of an art).
Kernel density estimates are produced with the `kdensity` command in Stata.

Case study: tabulate `city_actual` and realize a few hotels are not actually in Vienna. Drop them (N=207).

Central tendency
\[\bar{x} = \frac{\sum x_i}{n}\]
where \(x_i\) is the value of variable \(x\) for observation \(i\) in the dataset that has \(n\) observations in total. Two key features:
The median is another statistic of central tendency. It indicates the middle value of the distribution. It's a special case of a quantile.
Quantiles: a quantile is the value that divides the observations in the dataset into two parts in specific proportions.
\[Q_\tau(y): \quad \frac{1}{N}\sum_i I(y_i<Q_\tau) = \tau \]
The median and 25th and 75th percentiles are the most common quantiles used in EDA.
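The quantile definition can be checked numerically; here is a minimal Python sketch with a made-up sample, using the standard library's `statistics` module:

```python
import statistics

# Hypothetical sample of nine observations
y = [1, 2, 3, 4, 5, 6, 7, 8, 9]

med = statistics.median(y)   # 50th percentile: middle value
# 25th, 50th, and 75th percentiles (linear interpolation over the sample)
q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")

print(med)       # → 5
print(q1, q3)    # → 3.0 7.0
```

Roughly a quarter of the observations lie below `q1` and three quarters below `q3`, matching the definition above.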
Spread and Shape
There are three common measures of range:
The most widely used measure of spread is the standard deviation; its square is the variance.
\[ \begin{aligned} Var[x] &= \frac{\sum (x_i - \bar{x})^2}{n}=S^2_x \\ Std[x] &= \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}=S_x \end{aligned} \]
A unit-free alternative is the coefficient of variation:
\[CV = \frac{Std[x]}{\bar{x}} \]
The SD is often used to rescale differences between values in order to express them as typical distances.
\[x_{standardized} = \frac{(x - \bar{x})}{Std[x]} \]
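A minimal Python sketch of the coefficient of variation and standardization, with made-up numbers (the population SD is used to match the 1/n variance formula above):

```python
import statistics

# Hypothetical data
x = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(x)
sd = statistics.pstdev(x)            # population SD (divides by n)

cv = sd / mean                       # coefficient of variation: unit free
z = [(xi - mean) / sd for xi in x]   # standardized values: distance from the mean in SDs

print(mean, sd)   # → 5.0 2.0
print(z[0])       # → -1.5  (first value is 1.5 SDs below the mean)
```

After standardization the data has mean 0 and SD 1, so values from different variables become comparable.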
There are two common measures of skewness:
\[ Sk^1 = \frac{\bar{x} - median(x)}{Std[x]} \text{ and } Sk^2 = \frac{\frac{1}{n}\sum_i(x_i-\bar x)^3}{Std[x]^3} \]
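Both skewness measures can be computed directly; a Python sketch with a made-up right-skewed sample (the third-moment measure here includes a 1/n factor, one common convention):

```python
import statistics

# Hypothetical right-skewed data (one large outlier pulls the mean up)
x = [1, 2, 2, 3, 3, 3, 4, 10]

n = len(x)
mean = statistics.fmean(x)
med = statistics.median(x)
sd = statistics.pstdev(x)   # population SD, matching the 1/n variance above

sk1 = (mean - med) / sd                                 # mean-median measure
sk2 = sum((xi - mean) ** 3 for xi in x) / (n * sd**3)   # third standardized moment

print(sk1 > 0 and sk2 > 0)   # → True (both detect the right skew)
```

With right skew the mean exceeds the median, so both measures come out positive; for symmetric data both are near zero.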
Two basic options to get summary statistics in Stata:

- `summarize`: provides basic statistics for all variables in the dataset. Include the `detail` option for more statistics.
- `tabstat`: provides more flexibility; you can choose which statistics to show and for which variables.
- Use `estpost` to store the results and create well-formatted tables.

Theoretical distributions are distributions of variables with idealized properties.
Theoretical distributions are fully captured by a few parameters: statistics that determine the whole distribution.
For example, the normal distribution is fully captured by two parameters: the mean and the standard deviation.
They may not accommodate empirical data.
Theoretical distributions can be helpful:
Data Viz
The `scale()` option resizes text, markers, and line widths. See DataViz for a guide on how to create graphs in Stata.
AI is very good at describing the data if you give it the tools (the data itself). It is pretty good with Python, but less proficient with Stata for complex graphs. Still, it is good to have someone to ask without judgement.
Answering this question may help in benchmarking management practices in a specific company, assessing the value of a company, or estimating the potential benefits of a merger between two companies.
To answer this question you downloaded data from the World Management Survey.
qui: use "data_slides/hotels-vienna-london", clear
drop if price > 1000
set scheme white2
color_style tableau
two (kdensity price) ///
(kdensity price if city=="Vienna") ///
(kdensity price if city=="London"), ///
legend(order(1 "All" 2 "Vienna" 3 "London")) ///
xtitle("Hotel Prices") xsize(9) ysize(6)
(395 observations deleted)
If there is a Conditional distribution, there is a conditional statistic.
Interviews with CEOs/senior managers; based on these, a score is given and averaged across different domains.
Normalized (standardized) score.
Firm size: consider three bins, small (100–199), medium (200–999), and large (1000+).
| | mean | p50 | sd |
|---|---|---|---|
| Small | 2.68 | 2.78 | 0.51 |
| Medium | 2.94 | 3.00 | 0.62 |
| Large | 3.19 | 3.08 | 0.55 |
| Total | 2.94 | 2.94 | 0.60 |
| Observations | 300 | | |
| Score | Small | Medium | Large | Total |
|---|---|---|---|---|
| 1 | 19.44 | 8.33 | 6.94 | 10.67 |
| 2 | 37.50 | 28.85 | 26.39 | 30.33 |
| 3 | 31.94 | 35.90 | 30.56 | 33.67 |
| 4 | 11.11 | 21.79 | 27.78 | 20.67 |
| 5 | 0.00 | 5.13 | 8.33 | 4.67 |
| Total | 100.00 | 100.00 | 100.00 | 100.00 |
| N | 300 | | | |
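Conditional statistics like the group means above can be sketched in pure Python (the size labels and scores below are made up for illustration):

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (size bin, management score) pairs
data = [("Small", 2.0), ("Small", 3.0), ("Medium", 3.0),
        ("Medium", 4.0), ("Large", 3.0), ("Large", 4.0)]

# Group scores by size bin, then compute the mean within each group:
# the conditional mean E[score | size]
groups = defaultdict(list)
for size, score in data:
    groups[size].append(score)

cond_mean = {size: fmean(scores) for size, scores in groups.items()}
print(cond_mean["Small"])   # → 2.5
```

This is exactly what `tabstat management, by(size) stat(mean)` does in Stata: one statistic per value of the conditioning variable.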
A `scatter` plot is a two-dimensional graph with the values of each of the two variables measured on its two axes.

sort emp_firm
qui: drop2 emp_firm_bin emp_mean_bin
xtile emp_firm_bin = _n, n(20)
bysort emp_firm_bin: egen emp_mean_bin = mean(emp_firm)
bysort emp_firm_bin: egen mean_mng = mean(management)
scatter mean_mng emp_mean_bin, xtitle("Firm size") ytitle("Management score") ///
scale(1.5) legend(off) ylabel(1/5) ///
note("Using 20 bins")
Correlation, NOT causation
\[E[y|X=x_1] \neq E[y|X=x_2]\]
The formula for the covariance between two variables \(x\) and \(y\) with n observations is:
\[Cov[x, y] = \frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y}) \]
\[Corr[x, y] = \rho_{xy}= \frac{Cov[x, y]}{Std[x]Std[y]}\]
\[-1 \leq Corr[x, y] \leq 1\]
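A quick numeric check of the covariance and correlation formulas in Python (made-up data where y is an exact linear function of x, so the correlation should be 1):

```python
from statistics import fmean, pstdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x: a perfect positive linear association

n = len(x)
mx, my = fmean(x), fmean(y)
# Cov[x, y] = (1/n) * sum of products of deviations from the means
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
# Corr[x, y] = Cov[x, y] / (Std[x] * Std[y]), always between -1 and 1
corr = cov / (pstdev(x) * pstdev(y))

print(cov)             # → 4.0
print(round(corr, 10)) # → 1.0
```

Rescaling either variable changes the covariance but not the correlation, which is why correlation is the measure usually reported.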
Note
If two variables are independent, they are also mean-independent; thus \(E[y|x] = E[y]\) for any value of x.
Heteroskedasticity
| Industry | Correlation | Observations |
|---|---|---|
| Auto | 0.50 | 26 |
| Chemicals | 0.05 | 69 |
| Electronics | 0.33 | 24 |
| Food, drinks, tobacco | 0.05 | 34 |
| Materials, metals | 0.32 | 50 |
| Textile, apparel | 0.29 | 43 |
| Wood, furniture, paper | 0.28 | 29 |
| Other | 0.44 | 25 |
| All | 0.30 | 300 |
Sight beyond Sight
Alternatives:
\[\bar{z}_i = \frac{1}{k}\sum_{j=1}^k z_i^j \text{ or } \bar{z}_i = \frac{\sum_{j=1}^k w_j \times z_i^j}{\sum_{j=1}^k w_j} \]
All components should be measured on the same scale. Simple, with a natural interpretation.
You can also use weights to give more importance to some variables than others.
Or you can use sub-group indices to create a composite index.
Sometimes you may need other methods to combine variables: machine learning methods!
Correlation analysis could also be useful to compare the two measures.
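The simple and weighted averages above can be sketched in Python (scores and weights are made up for illustration):

```python
from statistics import fmean

# Hypothetical standardized scores on k=3 dimensions for one firm
z = [1.0, 2.0, 3.0]
w = [1, 1, 2]   # weights: the last dimension counts double

# Simple average: every component gets equal weight
simple = fmean(z)
# Weighted average: divide by the sum of weights, not by k
weighted = sum(wj * zj for wj, zj in zip(w, z)) / sum(w)

print(simple)    # → 2.0
print(weighted)  # → 2.25
```

Because every component is standardized first, a one-unit change means the same thing ("one SD") in each dimension, which is what makes averaging them defensible.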
Not all variables are created equal
Variation in the conditioning variable is necessary to make comparisons.
Example: to uncover the effect of price changes on sales you need many observations with different price values.
Generalization: the more variation there is in the conditioning variable, the better the chances for comparison.
However, there are (advanced) methods that can help identify causal relationships in observational data (Advanced Econometrics).
The LOG transformation
\[ln(x + \Delta x) - ln(x) \approx \frac{\Delta x}{x}\]
\[\begin{aligned} ln(a) - ln(b) &\approx \frac{a-b}{0.5(a+b)} \\ ln(1.01)-ln(1) &= 0.0099 \approx 0.01 \\ ln(1.1)-ln(1) &= 0.095 \approx 0.1 \end{aligned} \]
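The log-difference approximation can be verified numerically with a short Python check:

```python
import math

# For small relative changes, the difference in logs approximates the
# proportional change: ln(x + dx) - ln(x) ~ dx / x
for dx in (0.01, 0.1, 0.5):
    exact = math.log(1 + dx) - math.log(1)
    print(dx, round(exact, 4))
# → 0.01 0.01
# → 0.1 0.0953
# → 0.5 0.4055
```

The approximation is excellent for a 1% change, still close for 10%, and noticeably off by 50%, which is why log differences are read as percentage changes only for small changes.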
Rios-Avila and Cia