Research Methods I

Data Analysis for Economics and Policy

Instructor Information

Instructor: Fernando Rios-Avila
Email: friosavi@levy.org
Office Hours: Wednesdays 1:30pm to 4:00pm. Or by appointment. Other times can be arranged, but will be done remotely.
Course Website:
Class Time: Wednesday, 9:30 am - 12:45 am

Course Description

This course focuses on providing students with the tools and skills necessary to conduct data analysis for economics and policy research. Students will be exposed to the entire process of data analysis, from formulating questions and collecting data to cleaning, exploring, analyzing, and presenting results. The course covers exploratory data analysis, regression analysis, and introduces topics on prediction with machine learning. Students will gain hands-on experience using Stata, with Quarto for reproducible reporting, and GitHub for version control and collaboration.

Course Objectives

By the end of this course, students will be able to:

Apply advanced data analysis techniques to economic and policy questions.
Use modern tools such as GitHub and Quarto for research collaboration and reproducibility.
Formulate research questions and design appropriate data collection methods.
Clean, organize, and explore data using various techniques and visualizations.
Apply regression analysis techniques to analyze relationships between variables.
Use machine learning methods for prediction and classification tasks.
Implement data analysis techniques using Stata.
Effectively communicate research findings through written reports and oral presentations.

Required Textbook

Békés, G., & Kézdi, G. (2021). Data Analysis for Business, Economics, and Policy. Cambridge University Press.

Software Requirements

Stata: A student license will be provided.
Quarto: Free and open-source software for reproducible research.
VSCode: Free and open-source code editor.
GitHub/GitHub-Desktop: Free platform for version control and collaboration.
zotero: Free reference manager.

Important

All homework assignments are required to be submitted in Quarto format, using GitHub to submit the assignments.

Course Outline

1: Introduction to Modern Research Tools

Course overview and expectations.
Introduction to GitHub/Github-Desktop for version control and collaboration.
Getting started with Quarto for reproducible research: RStudio and VSCode.
Other Tools: Overleaf, Zotero
Data organization and management
Lab: Setting up GitHub, Quarto, VSCode, Zotero, and Overleaf

2. Introduction to Data Analysis

Introduction to data analysis: The Process
What is Data? What types of Data are there?
Data collection methods
Preparing data for analysis
Tidy data principles
Data cleaning: Missing values, outliers, and errors
Readings: Chapter 1 and 2

3: Data Exploration

Type of data vs type of analysis
Frequencies, distributions, and summary statistics
Exploratory data analysis techniques and visualizations
Theoretical distributions
Comparisons, correlations and conditional distributions
Latent and observed variables
Readings: Chapter 3 and 4

4: Generalization: From Sample to Population

Sampling and generalization
Repetition and sampling variability
Confidence intervals and standard errors: The Bootstrap method
External validity
Hypothesis testing principles
Type I and Type II errors
Multiple Hypothesis Testing and p-hacking
Readings: Chapters 5 and 6

5 and 6: Regression Analysis I: Simple Regression

Linear and non-linear relationships
Linear Regression: Estimation and interpretation
Correlations and coefficients: Searching for causality
Properties and assumptions of the linear regression model
Transformations and Semiparametric models
Extreme values, Influential observations and measurement error
Generalizing Results: SE and CI
Testing Hypotheses
Readings: Chapters 7, 8, and 9

7: Regression Analysis II: Multiple Regression

Multiple regression basics: Estimation and Inference
Problems with multiple regression
Non-linearities, interactions and qualitative variables
Readings: Chapters 10

8: Regression Analysis III: Modeling Probabilities

Linear Probability Model
Non-linear models: Logit and Probit
Interpretation and marginal effects
Goodness of Fit and Predictive Power
Readings: Chapter 11

9: Time Series Analysis

Introduction to time series data
Trend and seasonality
Stationarity and autocorrelation
Serial correlation
Readings: Chapter 12

10: Prediction

Introduction to Prediction
R2 vs AR2 and other measures of fit
Overfitting and Cross-validation: Finding the right model
External validity and generalization
Readings: Chapter 13

11: Model Building for Prediction: LASSO

The Process of Prediction
How to choose \(g(y)\)
Working with \(X's\)
Introduction to LASSO: Prediction and Diagnosis
Readings: Chapter 14

12: Predicting Probabilities and Classification

Predicting Probabilities and Classification
Classification, confusion matrices, and ROC curves
Finding the right threshold
Readings: Chapter 17

13: Forecasting Data

Introduction to Forecasting: Predicting the future
Training and Testing Data
Trends, Seasonality, and Cycles
Forecasting with ARIMA
VAR and External validity
Readings: Chapter 18

Grading Policy

Weekly Quizzes: 10%
- Short quiz at the end of each class to test material covered.
Weekly Problem Sets: 30%
- Small problem-sets of one or two questions to be completed and submitted in Quarto format using GitHub. This may include, data exploration and visualization, regressions and interpretation, and prediction. It also includes the peer review of another student’s work.
Term Paper: 60%
- Multi-part research project applying techniques learned in class to real data.

Term Paper (60% of final grade)

Throughout the semester, students will work on a multi-part research project that applies the techniques learned in class to real-world data. This project will require students to propose a research question, collect and clean data, conduct exploratory and regression analyses, and present their findings.

Part I: Research Proposal (5%, due Week 2)

Introduction (including research question and motivation)
Proposed data sources: Consider data from the textbook, Kaggle, or other sources
Expected findings and relevance
Conclusion
References (if any)
Specify which software (if other than Stata) you plan to use for your analysis
Create a GitHub repository for your project and submit the link with your proposal

Part II: Data Collection and Cleaning (10%, due Week 4)

Write a report that includes the following:

Description of data sources, collection methods, and potential biases
Data overview and preprocessing steps
Discussion of issues encountered and solutions
Preliminary analysis with descriptive statistics and visualizations
Include a data dictionary in your GitHub repository

Peer Review Report (due Week 5)

Review another student’s work and provide feedback on their data collection and cleaning process. Provide suggestions for improvement and identify any potential issues.

Interim Progress Report (5%, due Week 7)

Brief update on progress, challenges faced, and next steps
Additional visualizations or analyses
It should include a draft of literature review and methodology
- For the literature review include summaries for 3-5 papers related to your research question.
- It should also include a draft of the methodology section, including the model specification.
Address any feedback received from the peer review in Part II

Part III: Data Analysis (15%, due Week 10)

Present a draft for data analysis that includes the following:

Model specification discussion
Regression results presentation
Interpretation of results
Discuss Limitations and robustness checks

Peer Review Report (due week 11)

Review another student’s work and provide feedback on their data analysis. Provide suggestions for improvement and identify any potential issues.

Part IV: Final Report (20%, due Week 13)

Complete research paper should include:

Introduction
Literature Review
Data and Methodology
Robustness Checks or Sensitivity/Sub-group Analysis
Conclusion
References
Appendices (if any)

Presentaton (5%, Week 14)

15-minute presentation of your research to the class
See here for an example for the kind of report expected at each stage, based on the first research proposal on the impact of remote work on urban housing prices.

Additional Requirements

All work should be submitted in Quarto format using GitHub.
Unless Data used is confidential or too large, you should include all data in your GitHub repository.
Your GitHub repository should include with all code, data (if possible), and the Quarto document for your report.
It should also include all papers used for the literature review.
At each stage, you should submit an email with the PDF and QMD files to the instructor, at or before the deadline. You should also push your work to GitHub to follow the progress of your project.

Resources

Textbook: Békés, G., & Kézdi, G. (2021). Data Analysis for Business, Economics, and Policy. Cambridge University Press. There are additional resources available on the book’s website: Data Analysis
Software: There are several statistical packages for analyzing data. In this course, we will be using the software Stata to cover all materials in class. Slides are self-replicable, thus you can copy and paste almost all code provided to replicate the results seen in class. The Institute will be providing you with licenses for Stata/BE for the length of the course.

Stata offers many free short webinars and video tutorials that may be useful if you never used Stata before, or even if you have some experience with it. Please see the resources page for more information.

If you decide to, you can also use R, Julia, or Python to study and work on the course materials and homework. The book for the class has a repository with all the code in Stata, R and Python. It could be of great advantage to you to learn other languages, as they are widely used in the industry and academia.

As with many other skills, the best way to learn is to simply work with the software, work on the book exercises, and ask any questions to me or your classmates when you find a problem you could not find a solution for.

For the additional software, please look into Quarto, GitHub, Zotero and VSCode.

Course Policies

Attendance: Attendance is highly recommended. Classess will not be recorded, and except for exceptional cases, there will be no online classes.
Late Assignments: Late assignments will not be accepted unless prior arrangements have been made with the instructor.
Academic Integrity: All work submitted must be your own. Plagiarism will not be tolerated and will result in a failing grade for the assignment or course.
AI usage: The use of AI in the class is allowed. However, you must disclose any AI tools used in your assignments. AI is a tool you can use to generate ideas, edit your text, provide help with coding, etc. However, it is completely unacceptable to use AI to generate the entire assignment. You will have to be able to explain and defend your work in class.