The final paper is designed to be the culmination of your learning throughout the semester. You will write an original research paper centered around a multiple regression analysis. In this paper, you will:
● Ask a research question
● Construct a short but dense literature review
● Narrow your research question down into a specific testable hypothesis or hypotheses
● Find and clean an appropriate data set
● Run a regression analysis
● Explain the results of the regression, including which variables are statistically significant, which have the largest effects, and any limitations in extrapolating from the results.
The organization of your final paper should follow the outline below.
Introduction
300 to 500 words (1.5 pages max)
The short introduction frames your paper, providing the focus and rationale for your research. The introduction needs to pique the reader's interest, and – even more importantly! – provide big-picture context for why we care about your results. It must be short, a page or page and a half (double-spaced) at most.
In the introduction’s final paragraph or two, mention the data set you used and the specific hypothesis or hypotheses you tested. Your concluding sentences to the introduction should -- very briefly! -- tease the results of your analysis. Conclude the intro with either a one or two sentence summary of the results, or with a sentence that hints at but doesn’t fully reveal what you found.
Literature review
700 to 1000 words (3 pages max)
The literature review is a crucial part of the paper. Like those of other regression analysis papers, your lit review should be short -- only a few pages in length -- but quite dense.
The goal of the lit review is to explain existing scholarly knowledge on your chosen topic. It helps give us a fuller picture of why we care and what you are adding to our understanding. Just as importantly, the lit review helps build informed expectations about the results of your regression analysis.
20 to 25 academic references are typical. However, do not spend an equal amount of time on all of these recent references!
Identify the 4-6 pieces of research that are either foundational to the work that you’re trying to do, or the recent studies most similar in methods and data. Spend the most time and space on these most important studies.
Other scholarly references can be summed up in a sentence or two, often in sentences and paragraphs that group them with similar studies. For example: “Several recent studies using field experiments have found that social pressure mail increases voter turnout (Smith 2012, Jones 2019, Wu and Hahn 2020).”
Google Scholar is your friend in constructing this literature review efficiently and comprehensively. Protip: copying references directly from Google Scholar into your reference list is just fine, and will make this much faster.
Hypotheses
300 words (1 page max, possibly shorter)
Explain the relationship(s) you expect to find with your core regression analysis. Remember: a hypothesis is a directional relationship between two or more variables!
Here's an example:
H1: Controlling for other factors, I expect that individuals who live in urban areas will have a larger increase in turnout after receiving the mail treatment than individuals who live in rural areas.
Make sure that your hypotheses are connected to the findings from the literature review! The literature review should flow naturally into your hypotheses.
Data and methodology
Length varies
From here, the format of the final paper follows the regression template closely.
Explain the data set you used. Tell us where you found the data, who collected it, when the survey was in the field (if applicable), etc. etc.
Tell us briefly about any data cleaning and/or variable creation you had to do.
Focus particularly on your key variables of interest. That includes the dependent variable, and the explanatory variables you hope to examine with your hypothesis above. Make sure we understand how each variable is coded, after you have done any data wrangling or data cleaning necessary.
Finally, run the summ command on the list of all of the variables you include in at least one of your regressions. Copy those results directly into your paper.
Results
Length varies
Now, perform the regressions. Copy and paste the results from the regression directly into your final paper (resized and in a fixed-width font like Courier).
Your interpretation of the regression is just as important as getting the Stata command right! So spend the time to carefully tell us what these results mean. Which variables in your regression are statistically significant? Which variables are in your judgment substantively significant, with X having a big effect on Y?
Gentle reminder: in considering which variables are substantively significant, you have to consider the range of possible X values. For example, the coefficient for age may be small, but adults can range in age from 20 to 100.
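One way to put a number on this is to multiply a coefficient by the observed range of its variable. The lines below are a minimal sketch, assuming Stata's bundled nlsw88 example data and its variable names (wage, grade, tenure, age) as stand-ins for your own data set:

```stata
* Assumption: nlsw88 ships with Stata and stands in for your own data
sysuse nlsw88, clear
regress wage grade tenure age

* Observed range of grade among the observations used in the regression
summarize grade if e(sample)

* Predicted change in wage when moving from the lowest to the highest grade
display _b[grade] * (r(max) - r(min))
```

Comparing this quantity across variables is usually more informative than comparing the raw coefficients, because it accounts for how far each X can actually move.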
For students who are replicating previous work, make sure that you try out several different regression models, explaining if the results differ when including slightly different sets of variables.
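Comparing several specifications side by side can be sketched as follows. This again assumes the bundled nlsw88 data; the model names m1 and m2 are hypothetical labels chosen for illustration:

```stata
* Assumption: nlsw88 and these variables stand in for your own data
sysuse nlsw88, clear

* Model 1: a small baseline specification
regress wage grade tenure
estimates store m1

* Model 2: the same model with additional controls
regress wage grade tenure i.race union
estimates store m2

* One table with coefficients, standard errors, N, R-squared, and root MSE
estimates table m1 m2, b(%9.3f) se stats(N r2 rmse)
```

Check the N row when reading the table: adding a variable with missing values shrinks the estimation sample, so coefficient changes can reflect a different sample as well as a different specification.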
Conclusion
300-400 words
In about a page, sum up what we can learn from your regression. Emphasize the most important findings and strongest relationships in the regression. Tell us whether the evidence was consistent with your hypotheses.
The conclusion is also the place to explain the limitations of your results, or any worries we should have about generalizing from them. Remember the discussions of regression mistakes in the Allison text and the Wheelan book -- are any of these likely to be a worry for your analysis? For example, is there a potentially important variable you would have liked to include in this analysis that was not present in the data set? Only discuss those you think are relevant.
If you choose, this is also the place to discuss where you think future work on the topic needs to focus.
A final caution: don’t oversell or undersell your results! A successful final project is not one that answers every question with finality. Instead, good papers explain both what a data set can tell us, and what it CANNOT tell us. It’s ok to leave key questions for different studies and data sets.
References
We’re looking for 20 to 25 references here.
No specific format is required for the references, but APA format and Chicago format are easiest.
Appendix: .do file
Include your full .do file as an appendix. Copying and pasting it
Template for Basic Regression Analysis
Version 1.1
STEP ONE:
Good research starts with good questions—preferably questions that you, yourself find interesting!
In this class, often the research question will be given to you. In other classes or other contexts, figuring out good research questions usually starts with a literature review. If you are NOT given a research question, start a lit review by examining 2-3 high-profile, recent articles on the topic, paying special attention to their discussion and framing of recent research. Look at the most cited pieces in your area of interest, using tools such as Google Scholar or the Social Science Citation Index. Build outward from there.
You need to express your research questions in terms of hypotheses. A hypothesis is the expected directional relationship between two or more variables. For example: “I expect that more years of education will be associated with higher income, even controlling for other factors.”
In two or three sentences, before you start looking at the data, write down what you expect to find:
● Explain what your hypotheses are.
● Tell us which of the explanatory variables are most important to test your hypotheses.
● Explain which control variables are likely to be most important, to prevent other effects from confounding the relationship. (NOTE: Think especially about common demographic and socioeconomic variables: age, income, education, race/ethnicity, gender, locality, etc.)
STEP TWO:
Data Check: Look At Your Data—Directly!
The first thing you should always do after opening up a data set for the first time is to look at your data! This is very important. Open the data editor in browse mode; it should look like a spreadsheet. Scroll through, both up and down, and left to right.
Look at the data both with the value labels (if any) shown, and with them hidden (Data Editor > Value Labels). Remember that Stata denotes missing numeric data with a period (“.”).
OUTPUTS FOR STEP TWO Answer the following questions:
● Do you see any missing values?
● Do the values of the variables make sense?
● How many observations do you have?
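The same checks can also be run from the command line. This is a minimal sketch, assuming the bundled nlsw88 example data in place of your own file:

```stata
* Assumption: nlsw88 ships with Stata and stands in for your own data
sysuse nlsw88, clear

* Variable names, storage types, and labels
describe

* Number of observations
count

* Open the Data Editor in browse mode (interactive sessions only)
browse

* List the variables that contain missing values
misstable summarize

* Range, number of unique values, and missing count for one variable
codebook union
```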
STEP THREE:
Data Cleaning and Creating New Variables (If Needed)
You began by listing the variables you needed, above. Does the data set provide all of these in usable form? Often you will need to create one or more new variables.
In creating new variables, think especially about:
● Turning categorical data into dummy variables -- or use the i. or ib. prefixes to do this automatically. If you do, make sure that you make the right choice about the base category.
DO NOT make unnecessary dummy variables! Don’t take an income variable and turn it into a rich / poor dummy. Don’t take an age variable and turn it into a young / old dummy, etc. This throws away information and makes your regression perform worse.
● Rescaling variables that are non-normal or highly skewed (e.g. taking the log of income -- discussed later in the course)
● Creating interaction variables (discussed later in the course)
● Adding squared versions of variables if you expect nonlinear relationships
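The four kinds of variable creation listed above might look like this in practice. This is a sketch, assuming the bundled nlsw88 data; ln_wage is a hypothetical variable name:

```stata
* Assumption: nlsw88 and these variables stand in for your own data
sysuse nlsw88, clear

* i. creates the dummies automatically, using the lowest category as the base
regress wage i.race grade

* ib2. does the same but makes category 2 of race the base category
regress wage ib2.race grade

* Log of a skewed, strictly positive variable
generate ln_wage = ln(wage)

* Squared term: c. marks tenure as continuous, ## adds tenure and tenure squared
regress ln_wage c.tenure##c.tenure grade

* Interaction between a dummy and a continuous variable
regress ln_wage i.union##c.grade
```

Using factor-variable notation instead of hand-made squared or interaction variables keeps the data set smaller and lets post-estimation commands such as margins treat the terms correctly.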
In some data sets, you may have so-called “string” variables, where the answer is actually text instead of a numeric value. These have to go! The gen, replace, and recode commands are your friends. Also: the encode command is a good way to turn a string variable holding text categories into a numeric one, and the destring command (with the force option if needed) converts strings that hold numbers.
For example, you might have a simple party preference variable that has three categories (Democrat | Independent | Republican). Usually this is a label for an underlying numerical value. Turn off the labels using the options on the data viewer to see the underlying data. Or alternatively, using the command line, you can use the nolabel option to show the underlying numerical value. For example:
● tab partypref
● tab partypref, nolabel
In the most common case, the real variable is stored as a number. So for example, a variable might have three categories: Democrat = 1, Republican = 2, and Independent = 3. (NOTE: It’s actually better to do a FIVE category scale with partisan leaners, but for this example we’ll keep it simple.)
● gen partypref2 = 0
● replace partypref2 = -1 if partypref == 1
● replace partypref2 = 1 if partypref == 2
● summ partypref2
● tab partypref2
More rarely, you will find cases where there is no underlying number, and the real “value” is just a text string. In this case you might want to convert it to a number variable like so:
● gen partypref2 = 0
● replace partypref2 = 1 if partypref == "Republican"
● replace partypref2 = -1 if partypref == "Democrat"
● summ partypref2
● tab partypref2
Even easier:
● encode partypref, generate(partypref2)
... and then replace the values so that -1 = Democrat, 0 = Independent, etc.
In both cases above, partypref2 is now a -1 | 0 | 1 variable, and can be included in the regression. Make sure to note the double equal signs == and the quotation marks (plain straight quotes, not curly ones). Make sure the capitalization matches.
Explain and write down very briefly what you’ve done. Before moving on, check that the variable creation worked by both 1.) browsing with the data editor and 2.) using the summ and tab (for discrete data) commands.
● Note any variables you had to clean or recode.
● Did you need to create any variables?
● Copy and paste the Stata code you used for cleaning and variable creation.
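For the string case, the encode route can be tried on a toy data set before touching real data. This sketch types in four made-up rows; the names party_code and party_scale are hypothetical:

```stata
* Toy data, entered by hand purely for illustration
clear
input str12 partypref
"Democrat"
"Republican"
"Independent"
"Democrat"
end

* encode numbers the categories alphabetically:
* Democrat = 1, Independent = 2, Republican = 3
encode partypref, generate(party_code)
tab party_code, nolabel

* Map those codes onto the -1 | 0 | 1 scale in a new variable
recode party_code (1 = -1) (2 = 0) (3 = 1), generate(party_scale)

* Cross-tabulate old against new to confirm every category landed where intended
tab partypref party_scale
```

Because encode assigns codes in alphabetical order, check the codes with the nolabel option before recoding; a data set with different category names will get different numbers.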
Summarize the Key Variables
Next, use the summ command to summarize the characteristics of all of the variables. What are their means and standard deviations? Their maximum and minimum values? This is especially important for your dependent variables, and for important explanatory variables. But you should do this for all of the control variables, too! As always, cut and paste the results into your assignment. Make sure that we know the mean and s.d. for all of the variables used in the regression.
summ works great with both continuous variables, and also with dummy (0|1) variables. But for variables with discrete categories (e.g. if the value is 0, 1, 2, or 3) it can be misleading. For these variables, use the tab command to show us how the data is distributed (how many 1s, how many 2s, how many 3s, etc.)
● Copy and paste the output of the summ command for all variables used in the regression.
● If any of your key explanatory variable(s) have just a few values, copy and paste the output of the tab command as well.
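A short sketch of both commands, assuming the bundled nlsw88 data and its variable names in place of your own:

```stata
* Assumption: nlsw88 and these variables stand in for your own data
sysuse nlsw88, clear

* Mean, standard deviation, minimum, and maximum for continuous and dummy variables
summarize wage grade tenure age union

* Frequency table for a variable with a few discrete categories
tabulate race

* Same table, showing the underlying numeric codes instead of the labels
tabulate race, nolabel

* Percentiles and skewness for a key variable
summarize wage, detail
```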
Present the Full Regression Model (or Models)
Next: run the regression, and then present the regression model, with both the theoretically interesting explanatory variables, and the full set of control variables.
Note that in most assignments you want 8-12 variables or sets of variables (counting sets of dummies as one variable). Too few and you aren’t getting the advantages of multiple regression.
In writing up your results, you always need to mention three things:
● Are the coefficients in the expected direction (positive, negative)? For example, if you expected those with more education to have higher income, is that what you find? Compare with your initial expectations above.
● Which variables are statistically significant?
● Which variables are substantively significant—i.e., which have the largest estimated effects?
Note that in many cases, it makes sense to present multiple versions of the same model. This is especially true if:
● You have several different measures of a variable of interest. You can’t put them all in at once, since they are likely highly correlated: use them one at a time in different models.
● You have different dependent variables to examine. (e.g. If you’re interested in Internet use, you might have one regression on email use, one on facebook use, one on twitter use, etc.)
Explain: what portion of the total variance does the model explain (hint: R-squared)? What is the “average miss” of the model on the in-sample data (i.e. root-MSE)? We’re less concerned about R-squared and RMSE with hypothesis testing than when we’re trying to make accurate predictions, but these stats should always be reported.
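Both statistics appear in the header of the regression output, and Stata also stores them after estimation. This is a minimal sketch, assuming the bundled nlsw88 data and an illustrative set of regressors:

```stata
* Assumption: nlsw88 and these variables stand in for your own data
sysuse nlsw88, clear
regress wage grade tenure age i.race union

* The same values shown in the output header, retrieved from stored results
display "R-squared = " %5.3f e(r2)
display "Root MSE  = " %6.3f e(rmse)
```

Root MSE is in the units of the dependent variable, which makes it easier to explain in plain language than R-squared.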
● Run the regression, and copy and paste the regression tables.
○ Unless the assignment specifies otherwise, use a logit for dummy/binary dependent variables.
● Explain how the regression performed overall, with reference to the R-squared and the RMSE.
● For the regression, explain the regression results. For your key variables only, discuss whether:
○ the coefficients were in the expected direction (positive, negative)
○ the coefficients were statistically significant
○ the coefficients are substantively significant. What counts as a large effect is up to your own judgment! Just show that you understand what the results are telling you.
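For a binary dependent variable, a logit might be run as sketched here. The example assumes the bundled nlsw88 data, with the 0/1 variable union as the outcome:

```stata
* Assumption: nlsw88 and these variables stand in for your own data
sysuse nlsw88, clear

* Logit coefficients are in log-odds, so only sign and significance read off directly
logit union grade tenure age i.race south

* Average marginal effects: change in predicted probability per unit of each X
margins, dydx(*)
```

Note that logit output reports a pseudo R-squared rather than R-squared and root MSE, so the discussion of overall fit has to be worded accordingly.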
Explain the Limits of the Regression Analysis
Conclude by explaining the limits of the analysis. What are the potential concerns? Mentally walk through the list of regression problems featured in Ch. 2 of the Allison text. List—and explain—concerns that might be relevant here. Especially think about:
● Possible missing explanatory variables
● Reverse causation or selection bias
● Measurement error, in both the dependent and explanatory variables
● Sample bias. Is the sample a good reflection of the relevant population? If not, how might the relationship between the variables be skewed?
● Sample size issues (either too big, where trivial effects may be statistically significant, or too small, which requires caution on several fronts)
● Strong correlation among two or more explanatory variables.
● Mediating variables, which may present problems of interpretation. These could be between two or more explanatory variables (the family income -> GPA -> SAT scores example). It could even be a third variable that affects both x and y.
Only discuss factors that you think are relevant.
● Discuss and explain any of the above issues that you have reason to be concerned about, ignoring those you don’t think are relevant.
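Strong correlation among explanatory variables is one concern from the list above that can be checked directly. This is a sketch, assuming the bundled nlsw88 data; tenure and ttl_exp (total work experience) are used because they plausibly overlap:

```stata
* Assumption: nlsw88 and these variables stand in for your own data
sysuse nlsw88, clear

* Pairwise correlations among the explanatory variables
correlate grade tenure ttl_exp age

* Variance inflation factors, available after regress
regress wage grade tenure ttl_exp age
estat vif
```

Most of the other concerns, such as omitted variables or reverse causation, cannot be tested this way and have to be argued from what you know about the data and the topic.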