Final Project Information (optional)
The final project is a data analysis project about whatever data excites you. No matter the topic, you will formulate a key research question, find data on that question, answer the question using the tools of the course, and present those results for public consumption.
The following milestones will keep you on track:
- Milestone 1: Find a data source and write a short proposal on your research question (4/9)
- Milestone 2: Add one polished data visualization and results from one set of analyses (4/16)
- Milestone 3: Final report (5/7)
Milestone 1: Finding data and writing a proposal (due 4/9)
Finding a data source
The biggest part of the final project is finding a data source. If you want to use data from TPD, feel free to do so; I would be happy to help tidy up and analyze the data! You can also find data that are already cleaned in other sources. Here are some examples:
- List of links to political science data sets
- Harvard Dataverse - Social Science
- Data.gov - Data sets released by the US government
- Data published by FiveThirtyEight
- Pew Research Center Data Sets
If you find a data set that you think is interesting but have problems loading it into R, I am happy to help you load it. R can read almost any common data format.
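As a minimal sketch of what loading looks like, the example below reads a CSV file with base R's `read.csv()`. The file and its contents (`id`, `x`, `y`) are made up for illustration; in your project, `path` would point to the file you downloaded. Other formats generally need an add-on package, for example `haven` for Stata or SPSS files and `readxl` for Excel files.

```r
# Made-up example: write a tiny CSV to a temporary file, then read it back.
# In your project, replace `path` with the location of your downloaded file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,x,y",
             "a,1.5,10",
             "b,2.0,12",
             "c,3.5,9"), path)

dat <- read.csv(path, stringsAsFactors = FALSE)

str(dat)    # variable names and types
head(dat)   # first few rows
```

Running `str()` and `head()` right after loading is a quick way to confirm that the variables were read with the names and types you expect.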
General advice for choosing data sources
- If you want to analyze the relationship between X and Y, make sure that these two variables are included in the data set. If you want to look at effects for subgroups, make sure there is a variable that you can use for subsetting.
- Try to look for a ‘codebook’ or some other document that explains what the variables mean and how they are coded.
- For most projects, preparing the data for analysis takes longer than the analysis itself. Try to find a data set that you do not need to recode or clean extensively before running your analyses; this makes the final project easier.
- In a similar vein, if the data set is larger than about 50 MB (this is not a hard cutoff), R commands and analyses tend to take longer.
- Data from experiments is usually simple to analyze, since the analysis commonly involves simple comparisons of group means.
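The checks in this list can be sketched in a few lines of base R. The example uses `mtcars`, a data set that ships with R, as a stand-in for whatever you download; the choice of `wt`, `mpg`, and `am` as X, Y, and the subgroup variable is purely illustrative. Note that `mtcars` is not experimental data, so the group means at the end only show the mechanics of the comparison.

```r
# mtcars ships with R; it stands in for a data set you have downloaded.
dat <- mtcars

# 1. Are the variables for X, Y, and the subgroup all present?
needed <- c("wt", "mpg", "am")   # illustrative choice of X, Y, subgroup
missing_vars <- setdiff(needed, names(dat))
if (length(missing_vars) > 0) {
  stop("Missing variables: ", paste(missing_vars, collapse = ", "))
}

# 2. How many missing values does each of these variables have?
colSums(is.na(dat[needed]))

# 3. How many observations fall into each subgroup?
table(dat$am)

# 4. How much memory does the data set use once loaded?
format(object.size(dat), units = "MB")

# 5. Experiment-style comparison: mean outcome by group.
group_means <- tapply(dat$mpg, dat$am, mean)
group_means
```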
Writing a proposal
You should write a one-paragraph note describing the data set you will use and your tentative research question. Your research question should ask how one dependent variable is related to one or more independent variables. That is, it should be answerable with a regression analysis. In this paragraph, you should do the following:
- State your research question.
- Formulate a hypothesis related to the research question. This hypothesis should be rooted in some sort of theory. In other words, you need to present a plausible story for why the hypothesis might be true. Often, this takes the form of a behavioral or institutional explanation. As social scientists, we are not interested in idiosyncratic explanations; we want to understand systematic patterns and relationships!
- Describe your explanatory variable(s) of interest and how they are measured. Importantly, we need to observe variation in a variable in order to study it!
- Describe your outcome variable of interest and how it is measured.
- State what observed pattern in the data would provide support for your hypothesis. More importantly, state what observed pattern would disprove it.
Note that you are not fully committing to any specific question or data in this exercise. If you want to change data or questions later, that is fine. This is just a milestone to keep you on track.
Milestone 2: Data visualization and analyses (due 4/16)
Load the data you have selected and produce one interesting and polished data visualization. It can show either the distribution of one variable or the relationship between two variables. Also produce one analysis that attempts to answer your research question.
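As a rough sketch of what this milestone could look like, the example below draws a labeled scatterplot and fits one regression in base R. It again uses the built-in `mtcars` data as a stand-in, with `wt` and `mpg` as illustrative variables; your own plot would use your variables, and you are free to use whichever plotting tools we covered in the course.

```r
dat <- mtcars  # built-in stand-in for your own data

# One analysis: regress the outcome (mpg) on the explanatory variable (wt).
fit <- lm(mpg ~ wt, data = dat)

# One polished visualization: informative axis labels, a title that states
# the takeaway, and the fitted regression line drawn over the points.
plot(dat$wt, dat$mpg,
     xlab = "Weight (1,000 lbs)",
     ylab = "Miles per gallon",
     main = "Heavier cars get fewer miles per gallon",
     pch = 19)
abline(fit, lwd = 2)

coef(fit)  # intercept and slope of the fitted line
```

A polished plot, in this sense, has readable axis labels with units and a title a reader can understand without seeing the code.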
Milestone 3: Writing the final report (due 5/7)
The final report will include the following sections:
- (1) An introduction where you introduce the research question and hypothesis and briefly describe why it is interesting.
- (2) A data section that briefly describes the data source, describes how the key dependent and independent variables are measured (e.g., a survey, statistical model, or expert coding), and also produces a plot that summarizes the dependent variable.
- (3) A results section that contains a scatterplot, barplot, or boxplot of the main relationship of interest and output for the main regression of interest.
- (4) A brief (one paragraph) concluding section that summarizes your results, assesses the extent to which you find support for your hypothesis, describes limitations of your analysis and threats to inference, and states how your analysis could be improved (e.g., improved data that would be useful to collect).
For the data section, you should note if your research design is cross-sectional (most projects will be of this type) or one of the other designs we discussed (randomized experiment, before-and-after, differences-in-differences). For the results section, you should interpret (in plain English) the main coefficient of interest in your regression. You should also comment on the statistical significance of the estimated coefficient and whether or not you believe the coefficient to represent a causal effect.
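To make the interpretation step concrete, here is a minimal sketch that pulls the main coefficient, its p-value, and a 95% confidence interval out of a fitted model and prints a plain-English sentence. The model (`mpg ~ wt` on the built-in `mtcars` data) is only an illustration. Because this is a cross-sectional comparison, the sentence is worded as an association; whether a causal reading is warranted depends on your research design, not on the regression output.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table: columns are Estimate, Std. Error, t value, Pr(>|t|).
est <- summary(fit)$coefficients
slope   <- est["wt", "Estimate"]
p_value <- est["wt", "Pr(>|t|)"]

# 95% confidence interval for the coefficient on wt.
ci <- confint(fit, "wt", level = 0.95)

cat(sprintf("A one-unit increase in wt is associated with a %.2f change in mpg.\n",
            slope))
cat(sprintf("95%% CI: [%.2f, %.2f]; p-value: %.3g\n", ci[1], ci[2], p_value))
```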