Main takeaway
- Know your data
- How it was born
- What its main advantages are
- What its main disadvantages are
- Data quality determines the results of your analysis
- Data quality is determined by how the data was born, and how you are planning to use it
Levy Economics Institute
September 18, 2024
As economists, we are more familiar with a specific data structure:
Aside from “format”, data can be structured in different ways:
Multi-dimensional: Panel data is of particular interest in economics:
Not all data is created equal
- README.txt: describes where the dataset comes from
- VARIABLES.xls: provides basic information on your variables (codebook)

How was the data born?
If you ask (a few), they may answer:
Perhaps one can collect data on all observations we want (the population)
but, more often we don’t because it’s impractical or prohibitively expensive
Sampling is when we purposefully collect data on a subset/sample (\(<100\%\) coverage) of the population
Sampling is the process that selects that subset (How do we select the sample?)
Different:
Same:
During data collection, be aware of ethical and legal constraints; take special care with sensitive information.
Even more so with web scraping…
Always communicate with the source owner(s) and/or with a legal professional if you are planning to use seemingly sensitive data (names, addresses, etc.)
Lesson: AI is a tool, not a replacement
From raw to tidy
Does immunization of infants against measles save lives in poor countries? Use data on immunization rates in various countries in various years from the World Bank. How should you store, organize and use the data to have all relevant information in an accessible format that lends itself to meaningful analysis?
You want to know who has been the best manager in the top English football league. You have downloaded data on football games and on managers. To answer your question you need to combine these data. How should you do that? And are there issues with the data that you need to address?
Binary (0 or 1) variables: 0 for no, 1 for yes.
Examples: is_female, is_head_of_household, is_pregnant, is_employed.
Flags: missing_age, missing_income, in_sample.
Taking care of your data
Data wrangling is the process of transforming raw data to a set of data tables that can be used for a variety of downstream purposes such as data analysis.
A useful concept of organizing and cleaning data is called the tidy data approach:
Advantages:
hotel_id | price | distance |
---|---|---|
21897 | 81 | 1.7 |
21901 | 85 | 1.4 |
21902 | 83 | 1.7 |
Source: hotels-vienna data. Vienna, November 2017 weekend.
Each row is an observation; each column is a variable.
Reshape xt data so that one row is one it observation (cross-section unit i observed at time t): the long format. (The same idea extends to ijt data.)

Country | imm2015 | imm2016 | imm2017 | gdppc2015 | gdppc2016 | gdppc2017 |
---|---|---|---|---|---|---|
India | 87 | 88 | 88 | 5743 | 6145 | 6516 |
Pakistan | 75 | 75 | 76 | 4459 | 4608 | 4771 |
Source: world-bank-vaccination data.
Wide format of country-year panel data: each row is one country; different years are different variables.
imm: rate of immunization against measles among 12–13-month-old infants.
gdppc: GDP per capita, PPP, constant 2011 USD.
Country | Year | imm | gdppc |
---|---|---|---|
India | 2015 | 87 | 5743 |
India | 2016 | 88 | 6145 |
India | 2017 | 88 | 6516 |
Pakistan | 2015 | 75 | 4459 |
Pakistan | 2016 | 75 | 4608 |
Pakistan | 2017 | 76 | 4771 |
Note: Tidy (long) format of country-year panel data; each row is one country in one year.
imm: rate of immunization against measles among 12–13-month-old infants.
gdppc: GDP per capita, PPP, constant 2011 USD. Source: world-bank-vaccination data.
reshape
* From wide to long
ren *, low // <- Make sure your variables are all lower case
reshape long imm gdppc, /// <- stubs: the variables to "make" long
    i(country) j(year) // <- j(): the dimension that was previously "wide" (year)
* From long to wide
reshape wide imm gdppc, /// <- stubs: the variables to "make" wide
    i(country) j(year) // <- j(): the dimension that becomes "wide" (year)
You can reshape only one variable and keep the rest as they are.
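As a sketch with toy data (hypothetical values), this reshapes only imm to long while the gdppc* columns stay wide:

```stata
* Toy example (hypothetical values): reshape only imm, keep gdppc* wide
clear
input str8 country imm2015 imm2016 gdppc2015 gdppc2016
"India"    87 88 5743 6145
"Pakistan" 75 75 4459 4608
end
reshape long imm, i(country) j(year)   // imm goes long; gdppc2015, gdppc2016 stay wide
list, sepby(country)
```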
merge

bysort hid: egen head_educ = max(educ*(is_head==1))

merge/join/link/match tables as needed. Review the example, especially if you are interested in football.
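A minimal, self-contained sketch of the egen trick above (toy household data; variable names and values are made up):

```stata
* Toy household data: give every member the head's education
clear
input hid is_head educ
1 1 16
1 0 12
2 1 10
2 0  8
end
bysort hid: egen head_educ = max(educ * (is_head == 1))   // works because educ >= 0
list, sepby(hid)
```

The max() over educ*(is_head==1) picks out the head's education within each household, since the product is zero for non-heads.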
In short, data can have different structures (all tidy).
Some structures are more useful than others.
Understanding those structures will allow you to work with the data.
The ATUS files are identified by different variables: atus-act by tucaseid and tuactivity_n; atus-rost and atus-cps by tucaseid and tulineno; atus-resp by tucaseid.
There are four (really three) types of merging, depending on the master and using datasets:

- 1:1 merging: both master and using datasets are uniquely identified by the same variables.
  use atus-rost
  merge 1:1 tucaseid tulineno using atus-cps
- 1:m merging: each observation in the master file will be merged with many units in the using dataset. Master has a unique ID.
  use atus-resp
  merge 1:m tucaseid using atus-act
- m:1 merging: many observations in the master will be merged with one unit in the using. Using has a unique ID.
- m:m merging: it's wrong, don't do it. Perhaps think of joinby instead.
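A hedged sketch of a 1:m merge with toy in-memory files (file and variable names are made up to mimic the ATUS example):

```stata
* Toy 1:m merge, mimicking: use atus-resp / merge 1:m tucaseid using atus-act
clear
input tucaseid tuactivity_n actvar
1 1 100
1 2 101
2 1 200
end
tempfile act
save `act'

clear
input tucaseid respvar
1 10
2 20
3 30
end

merge 1:m tucaseid using `act'   // master (resp) has a unique ID; using (act) has many rows per ID
tab _merge                       // 3 = matched; 1 = master only; 2 = using only
```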
Important: merge creates a variable _merge that will tell you what happened in the merge.

Tidying up your tidy data
With most data, in addition to understanding it, you need to "clean" it before using it (very important).
- duplicates report in Stata, or bysort ID: gen dup = _n
- isid to check if a variable (or set of variables) is a unique identifier

Team ID | Unified name | Original name |
---|---|---|
19 | Man City | Manchester City |
19 | Man City | Man City |
19 | Man City | Man. City |
19 | Man City | Manchester City F.C. |
20 | Man United | Manchester United |
20 | Man United | Manchester United F.C. |
20 | Man United | Manchester United Football Club |
20 | Man United | Man United |
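One way to build such a unified name and then check for duplicates (a sketch; the variable names team_id and original_name, and the cleaning rules, are illustrative assumptions):

```stata
* Illustrative cleaning rules; team_id and original_name are assumed variable names
gen team_unified = original_name
replace team_unified = "Man City"   if inlist(original_name, ///
    "Manchester City", "Man. City", "Manchester City F.C.")
replace team_unified = "Man United" if inlist(original_name, ///
    "Manchester United", "Manchester United F.C.", ///
    "Manchester United Football Club")

duplicates report team_id team_unified   // any remaining spelling variants?
```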
In Stata, the dot "." (for numeric variables) and the empty string "" (for string variables) mark missing values.

How to handle missing values depends on: Scope (how much is missing?) and Reason (why is it missing?).
Two basic options:
Consider the data quality, and the data collection process.
Understand the data generating process.
This is an iterative process, and you may need to go back and forth between data cleaning and data analysis.
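A small sketch of inspecting and flagging missing values in Stata (income is a hypothetical variable name):

```stata
* Inspect and flag missing values (income is a hypothetical variable)
misstable summarize                       // overview of missings per variable
gen byte miss_income = missing(income)    // 1 if income is missing
tab miss_income
* Option A: drop observations with missing income
* drop if missing(income)
* Option B: keep them, flag them, and handle them explicitly in the analysis
```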
More on this with EDA and Data Visualization
(summarize, tabulate, edit, browse)
If given the right instructions, and information, AI can help you with data wrangling:
But it is not perfect. You need to understand the data and the process: review and control its output.
Rios-Avila and Cia