Name:
You must work alone.
You must submit a SINGLE and READABLE file that includes your work, e.g., important figures/tables/code/other outputs, to Canvas.
You must submit your work no later than Wednesday, April 29, 11:59pm, to Canvas. Make sure to complete your submission ahead of time. The submission link will be closed automatically after the deadline. No submission or extension will be allowed after the deadline.
For full credit on each question, make sure you follow exactly what is asked, and answer each prompt. The total is 70 points (plus 5 bonus points).
You may only ask me clarifying questions; you may not ask questions seeking help. This is a final exam, not a lab assignment.
This should go without saying: do not cheat. It ends poorly for everyone.
1a (4 points). Simulate 1000 pairs \((X_1,X_2)\) according to \[
X_1 \sim N(0,2^2) \quad \text{and} \quad X_2 \sim N(0,1), \quad \text{independently}.
\] (Here \(N(\mu,\sigma^2)\) denotes the normal distribution with mean \(\mu\) and variance \(\sigma^2\).) Call this data “group A”, and save the data points in a matrix called a.mat, of dimension 1000 x 2. Simulate 1000 pairs \((X_1,X_2)\) according to \[
X_1 \sim N(2.5,1) \quad \text{and} \quad X_2 \sim N(3,2^2), \quad \text{independently}.
\] Call this data “group B”, and save the data points in a matrix called b.mat, also of dimension 1000 x 2. Compute the column means and column standard deviations of a.mat and b.mat, and show that they’re close to what you would expect.
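A minimal sketch of one way to carry out this simulation (note that rnorm() is parameterized by the standard deviation, not the variance):

# Group A: X1 ~ N(0, 2^2), X2 ~ N(0, 1), independently
a.mat = cbind(rnorm(1000, mean = 0, sd = 2), rnorm(1000, mean = 0, sd = 1))
# Group B: X1 ~ N(2.5, 1), X2 ~ N(3, 2^2), independently
b.mat = cbind(rnorm(1000, mean = 2.5, sd = 1), rnorm(1000, mean = 3, sd = 2))
# Column means should be near (0, 0) and (2.5, 3); column sds near (2, 1) and (1, 2)
colMeans(a.mat); apply(a.mat, 2, sd)
colMeans(b.mat); apply(b.mat, 2, sd)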
1b (4 points). On a single plot, plot \(X_2\) versus \(X_1\), using all the points from groups A and B. Color the points from group A in black, and those from group B in red. Set the x label to “X1”, the y label to “X2”, and leave the title blank. Make sure the axes limits are just big enough to show all the points on the plot, and set these programmatically (do not set these using hard constants). Add a legend to explain the color coding. Do you see significant overlap in the points from the two groups?
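One possible sketch of this plot, with the axis limits computed programmatically via range():

plot(a.mat[, 1], a.mat[, 2], col = "black", pch = 16, xlab = "X1", ylab = "X2", main = "",
     xlim = range(c(a.mat[, 1], b.mat[, 1])), ylim = range(c(a.mat[, 2], b.mat[, 2])))
points(b.mat[, 1], b.mat[, 2], col = "red", pch = 16)
legend("topleft", legend = c("Group A", "Group B"), col = c("black", "red"), pch = 16)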
1c (3 points). Define a data frame dat of dimension 2000 x 3, as follows. First, stack the rows of a.mat and b.mat together, to get a data frame of dimension 2000 x 2. Then, append a column of 0s and 1s that indicates whether the points (rows) come from group A or B (with 0 for A and 1 for B). Finally, name the columns of dat to be X1, X2, Y, respectively. Show the first 3 and last 3 rows of dat.
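A sketch of one way to build dat, assuming a.mat and b.mat from Q1a:

dat = data.frame(rbind(a.mat, b.mat), rep(c(0, 1), each = 1000))  # group A rows first, then group B
colnames(dat) = c("X1", "X2", "Y")
head(dat, 3); tail(dat, 3)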
1d (2 points). Perform a logistic regression of Y (response) on X1 and X2 (variables), using the data dat. Use the function glm() with family="binomial", and save the output object as log.mod. Allow for an intercept in the model (the default). Report the intercept, and the coefficients for X1 and X2.
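A minimal sketch of the model fit:

log.mod = glm(Y ~ X1 + X2, data = dat, family = "binomial")
coef(log.mod)  # intercept, then the X1 and X2 coefficients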
1e (5 points). Let \(\beta_1,\beta_2\) be the coefficients of \(X_1,X_2\) that you computed in the previous logistic regression, and let \(\beta_0\) be the intercept. Consider the line defined by the equation \[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0. \] Via simple algebra, rewrite this equation as \[ X_2 = a + b X_1, \] for particular values \(a,b\) that you will compute. Then reproduce your plot from Q1b, and plot the line \(X_2 = a + b X_1\) on top, as a thick blue dashed line. (In case you’re curious, this is called the decision boundary from your logistic regression.) Modify your legend so that your line is explained. Does your line look like a reasonable separator of groups A and B?
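Solving \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\) for \(X_2\) gives \(X_2 = -\beta_0/\beta_2 - (\beta_1/\beta_2) X_1\), so \(a = -\beta_0/\beta_2\) and \(b = -\beta_1/\beta_2\). A minimal sketch of adding the boundary, after reproducing the Q1b plot:

cf = coef(log.mod)
a = -cf[1] / cf[3]  # boundary intercept
b = -cf[2] / cf[3]  # boundary slope
abline(a = a, b = b, col = "blue", lwd = 3, lty = 2)  # thick blue dashed line
legend("topleft", legend = c("Group A", "Group B", "Decision boundary"),
       col = c("black", "red", "blue"), pch = c(16, 16, NA), lty = c(NA, NA, 2), lwd = c(NA, NA, 3))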
2a (2 points). Refer back to the logistic regression object you computed in Q1d. Using summary(), form a vector of the standard errors of the intercept and coefficients. Display this vector (it should be of length 3).
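A sketch of pulling the standard errors out of the summary:

se.vec = summary(log.mod)$coefficients[, "Std. Error"]
se.vec  # length 3: intercept, X1, X2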
2b. Consider the following iterative strategy: sample 2000 rows of dat with replacement, refit the logistic regression from Q1d on the resampled data set, and record the intercept and coefficients; repeat this many times, and then compute the standard deviation of each recorded quantity across the repetitions. Perform this iterative strategy (in case you’re curious, this is known as bootstrapping) on your data set from Q1c. Hint: a for() loop is probably simplest. What are the standard deviations you get for the intercept and \(X_1\) and \(X_2\) coefficients? Are they close to the standard errors you found in Q2a? (They should be.)
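A minimal sketch of the bootstrap loop; the number of repetitions below (500) is an assumed value, not one specified by the question:

n.rep = 500  # assumed number of bootstrap repetitions
boot.coefs = matrix(0, n.rep, 3)
for (r in 1:n.rep) {
  idx = sample(nrow(dat), replace = TRUE)  # resample the rows with replacement
  boot.mod = glm(Y ~ X1 + X2, data = dat[idx, ], family = "binomial")
  boot.coefs[r, ] = coef(boot.mod)
}
apply(boot.coefs, 2, sd)  # bootstrap standard deviations of the intercept, X1, X2 coefficients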
2c (2 points). Rerun your code from Q2b, but now sampling without replacement (before you were sampling with replacement). What are the standard errors you estimate in the end? And does this make sense to you?
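The only change needed relative to the Q2b sketch is in the sample() call:

idx = sample(nrow(dat), replace = FALSE)  # now a permutation of the rows: each row appears exactly once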
3a. Read the horse colic data set into R using read.table(); hint: make sure to have your data/working directory correctly specified, and also to set stringsAsFactors=FALSE; save the resulting data frame as horse.df; check that it has dimension 300 x 28; and set its column names to the string vector nms below. Print out the first 5 rows (and all columns) of horse.df to the console. To read more about the data set, visit https://archive.ics.uci.edu/ml/datasets/Horse+Colic.

nms = c("Surgery", "Age", "Hosp_ID", "Temp_Rect", "Pulse", "Resp", "Temp_Extr",
        "Pulse_Peri", "Mucous", "Cap_Time", "Pain", "Peris", "Ab_Dist", "Tube",
        "Reflux", "Reflux_PH", "Feces", "Abdomen", "Cell_Vol", "Total_Protein",
        "Ab_Appear", "Ab_Total_Prot", "Outcome", "Lesion", "Lesion_Site",
        "Lesion_Type", "Lesion_Subt", "Lesion_Code")
3b (3 points). How many columns in horse.df are of character type, and how many of integer type? How many columns have missing values (represented by “?” in this data set), and what are their column names? The UCI website reports that about 30% of this data set is missing. Is this accurate? What do you calculate the missingness percentage to be?
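Minimal sketches for the two steps above, reading the data in and then summarizing column types and missingness; the file name "horse.txt" is an assumption (use whatever name the provided file actually has):

horse.df = read.table("horse.txt", stringsAsFactors = FALSE)
dim(horse.df)  # should be 300 x 28
colnames(horse.df) = nms
horse.df[1:5, ]

sum(sapply(horse.df, is.character))  # number of character columns
sum(sapply(horse.df, is.integer))    # number of integer columns
has.missing = sapply(horse.df, function(x) any(x == "?"))
sum(has.missing); names(horse.df)[has.missing]  # how many, and which, columns have "?"
mean(as.matrix(horse.df) == "?")  # overall fraction of entries that are missing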
3c (3 points). Convert all of the “?” values in horse.df to NA values. After this, convert all of the character columns to numeric, but make sure that the horse.df object is still a data frame (just explicitly cast it to one, if you need to). When done, print the first 5 rows (and all columns).
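A sketch of the conversion, assuming that once the "?" entries are replaced the character columns contain only numbers:

horse.df[horse.df == "?"] = NA
horse.df = as.data.frame(lapply(horse.df, as.numeric))  # convert columns, cast back to a data frame
horse.df[1:5, ]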
3d (3 points). Arrange the horses (the rows in horse.df) by Surgery (1 means yes surgery, 2 means no surgery), and then by descending values of Pulse. Report the Surgery, Pulse, Temp_Rect (rectal temperature), and Ab_Dist (abdominal distention: 1 being best, 4 being worst) variables, for all horses with Pulse at least 130.
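A sketch using dplyr; the availability of the dplyr package is an assumption here (base R order() and subsetting would work just as well):

library(dplyr)
horse.df %>%
  arrange(Surgery, desc(Pulse)) %>%
  filter(Pulse >= 130) %>%
  select(Surgery, Pulse, Temp_Rect, Ab_Dist)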
3e (3 points). For each Pain level (1 being best, 5 being worst), compute the average Pulse (and make sure to exclude NA values in this calculation), then report the Pain and average Pulse in decreasing order of average Pulse.
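A sketch in the same (assumed) dplyr style:

horse.df %>%
  group_by(Pain) %>%
  summarize(ave_pulse = mean(Pulse, na.rm = TRUE)) %>%
  arrange(desc(ave_pulse))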
apply functions (12 points). We’re going to use the same data set as in Lab 3, from the 2016 Summer Olympics in Rio de Janeiro, taken from https://github.com/flother/rio2016 (itself put together by scraping the official Summer Olympics website for information about the athletes). Below we read in the data and store it as rio.
rio = read.csv("rio.csv")
4a (3 points). Create a new data frame called sports, which we’ll populate with information about each sporting event at the Summer Olympics. Initially, define sports to contain a single variable called sport, which contains the names of the sporting events in alphabetical order. Then, add a column called n_participants, which contains the number of participants in each sport. Use one of the apply functions to determine the number of gold medals given out for each sport, and add this as a column called n_gold. Using your newly created sports data frame, calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest ratio? Which has the lowest?
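A sketch of one approach; the column names sport and gold are assumed to match the rio2016 data set:

sports = data.frame(sport = sort(unique(rio$sport)))
sports$n_participants = as.numeric(table(rio$sport))  # counts, in alphabetical order of sport
sports$n_gold = tapply(rio$gold, rio$sport, sum)      # gold medals given out per sport
sports$gold_ratio = sports$n_gold / sports$n_participants
sports$sport[which.max(sports$gold_ratio)]  # highest ratio
sports$sport[which.min(sports$gold_ratio)]  # lowest ratio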
4b (3 points). Use one of the apply functions to compute the average weight of the participants in each sport, and add this as a column to sports called ave_weight. Important: there are missing weights in the data set coded as NA, but your column ave_weight should ignore these, i.e., it should be itself free of NA values. You will have to pass an additional argument to your apply call in order to achieve this. Hint: look at the help file for the mean() function; what argument can you set to ignore NA values? Once computed, display the average weights along with corresponding sport names, in decreasing order of average weight.
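A sketch, again assuming the rio columns weight and sport:

sports$ave_weight = tapply(rio$weight, rio$sport, mean, na.rm = TRUE)
sports[order(sports$ave_weight, decreasing = TRUE), c("sport", "ave_weight")]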
4c (3 points). As in the last part, compute the average weight of athletes in each sport, but now separately for men and women. You should therefore add two new columns, called ave_weight_men and ave_weight_women, to sports. Once computed, display the average weights along with corresponding sports, for men and women, each list sorted in decreasing order of average weight. Are the orderings roughly similar?
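A sketch; the sex labels "male" and "female" are assumed from the rio2016 data set, and the factor() calls keep the sport levels aligned with the sports data frame:

men = rio$sex == "male"; women = rio$sex == "female"
sports$ave_weight_men = tapply(rio$weight[men],
                               factor(rio$sport[men], levels = sports$sport), mean, na.rm = TRUE)
sports$ave_weight_women = tapply(rio$weight[women],
                                 factor(rio$sport[women], levels = sports$sport), mean, na.rm = TRUE)
sports[order(sports$ave_weight_men, decreasing = TRUE), c("sport", "ave_weight_men")]
sports[order(sports$ave_weight_women, decreasing = TRUE), c("sport", "ave_weight_women")]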
4d (3 points). Repeat the calculation as in the last part, but with BMI (body mass index) replacing weight.
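A short sketch for the BMI version, reusing men and women from the Q4c sketch, and assuming height is recorded in meters and weight in kilograms in the rio data (so BMI = weight / height^2):

rio$bmi = rio$weight / rio$height^2
sports$ave_bmi_men = tapply(rio$bmi[men],
                            factor(rio$sport[men], levels = sports$sport), mean, na.rm = TRUE)
sports$ave_bmi_women = tapply(rio$bmi[women],
                              factor(rio$sport[women], levels = sports$sport), mean, na.rm = TRUE)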
Bonus question (5 points). Use one of the apply functions to compute the proportion of women among participating athletes in each sport. Use these proportions to recompute the average weight (over all athletes in each sport) from the ave_weight_men and ave_weight_women columns, and define a new column ave_weight2 accordingly. Does ave_weight2 differ from ave_weight? It should. Explain why. Then show how to recompute the average weight from ave_weight_men and ave_weight_women in a way that exactly recreates ave_weight.
We need the same wage dataset that we used in Lab 11, Question 2; so load it in.
5a (4 points). Install the gam package, if you haven’t already, and load it into your R session with library(gam). Fit a generalized additive model, using gam() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year, age, and education; as in Question 4 in Lab 11, only use observations corresponding to the complete education levels. Also, in the call to gam(), allow for age to have a nonlinear effect by using s() (leave year and education alone, and they will have the default linear effects). Call the result wage.gam. Display a summary with summary(). Is the age variable more or less significant, in terms of its p-value, compared to what you saw in the logistic regression model fitted in the last question? Also, plot the fitted effect for each predictor, using plot(). Comment on the plots: does each fitted effect make sense to you? In particular, is there a strong nonlinearity associated with the effect of age, and does this make sense?
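A sketch of the fit; the name of the data frame (wage, with columns wage, year, age, education) and the excluded education label ("1. < HS Grad") are assumptions carried over from the ISLR-style Wage data used in Lab 11, and may need adjusting:

library(gam)
wage.sub = subset(wage, education != "1. < HS Grad")  # keep the complete education levels (assumed label)
wage.sub$education = factor(wage.sub$education)       # drop the unused level
wage.gam = gam(I(wage > 250) ~ year + s(age) + education,
               family = "binomial", data = wage.sub)
summary(wage.gam)
par(mfrow = c(1, 3))
plot(wage.gam, se = TRUE)  # fitted effect for each predictor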
5b (3 points). Using wage.gam, predict the probability that a 30 year old person, who earned a Ph.D., will make over $250,000 in 2018.
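A sketch of the prediction; "5. Advanced Degree" is the assumed label for the Ph.D.-level education category:

new.obs = data.frame(year = 2018, age = 30, education = "5. Advanced Degree")
predict(wage.gam, newdata = new.obs, type = "response")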
5c (4 points). For a 32 year old person who earned a Ph.D., how long does he/she have to wait until there is a predicted probability of at least 20% that he/she makes over $250,000 in that year? Plot his/her probability of earning at least $250,000 over the future years—is this strictly increasing?
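A sketch of scanning over future years; the starting year of 2018 and the education label are assumptions, and the person's age is incremented along with the year:

years = 2018:2060
probs = sapply(years, function(yr) {
  new.obs = data.frame(year = yr, age = 32 + (yr - 2018), education = "5. Advanced Degree")
  predict(wage.gam, newdata = new.obs, type = "response")
})
wait.year = years[min(which(probs >= 0.20))]
wait.year - 2018  # how many years until the predicted probability first reaches 20%
plot(years, probs, type = "l", xlab = "Year", ylab = "Predicted probability of wage > 250")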