logo Hurry, Grab up to 30% discount on the entire course
Order Now logo

Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

expert
George RoperData mining
(5/5)

658 Answers

Hire Me
expert
Jyotika DasguptaStatistics
(/5)

873 Answers

Hire Me
expert
Alan DuderMarketing
(5/5)

984 Answers

Hire Me
expert
StatAnalytica ExpertEnglish
(5/5)

534 Answers

Hire Me
R Programming
(5/5)

Define a data frame dat of dimension 2000 x 3 First, stack the rows of a.mat and b.mat

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Applied Statistics and Data Science

Simulation and plotting (18 points)

1a (4 points). Simulate 1000 pairs (X1,X2) according to

X1∼N(0,22)andX2∼N(0,1),independently.

(here N(μ,σ2) denotes the normal distribution with mean μ and variance σ2.) Call this data “group A”, and save the data points in matrix called a.mat, of dimension 1000 x 2. Simulate 1000 pairs (X1,X2) according to

X1∼N(2.5,1)andX2∼N(3,22),independently.

Call this data “group B”, and save the data points in a matrix b.mat, of dimension 1000 x 2. Compute the column means and column standard deviations of a.mat and b.mat, and show that they’re close to what you would expect.

1b (4 points). On a single plot, plot X2 versus X1, using all the points from groups A and B. Color the points from group A in black, and those from group B in red. Set the x label to “X1”, the y label to “X2”, and leave the title blank. Make sure the axes limits are just big enough to show all the points on the plot, and set these programmatically (do not set these using hard constants). Add a legend to explain the color coding. Do you see significant overlap in the points from the two groups?

1c (3 points). Define a data frame dat of dimension 2000 x 3, as follows. First, stack the rows of a.mat and b.mat together, to get a data frame of dimension 2000 x 2. Then, append a column of 0s and 1s, that indicates whether the points (rows) come from group A or B (with 0 for A and 1 for B). Finally, name the columns of dat to be X1, X2, Y respectively. Show the first 3 and last 3 rows of dat.

1d (2 points). Perform a logistic regression of Y (response) on X1 and X2 (variables), using the data dat. Use the function glm() with family="binomial", and save the output object as log.mod. Allow for an intercept in the model (the default). Report the intercept, and the coefficients for X1 and X2.

1e (5 points). Let β1,β2 be the coefficients of X1,X2 that you computed in the previous logisitic regression, and let β0 be the intercept. Consider the line defined by the equation

β0+β1X1+β2X2=0.

Via simple algebra, rewrite this equation as

X2=a+bX1,

for a particular a,b that you will compute. Then reproduce your plot from Q1b, and plot the line X2=a+bX1 on top, as a thick blue dashed line. (In case you’re curious, this is called the decision boundary from your logistic regression.) Modify your legend so that your line explained. Does your line look like a reasonable separator of groups A and B?

Iteration (14 points)

2a (2 points). Refer back to the logistic regression object you computed in Q1d. Using summary(), form a vector of the standard errors of the intercept and coefficients. Display this vector (it should be of length 3).

2b (10 points). Now we’re going perform an iterative strategy to estimate these standard errors ourselves. It works as follows:

Sample 2000 (X1,X2,Y) triplets with replacement from your original data set of 2000 (X1,X2,Y) triplets (the data set with groups A and B together).

Run a logistic regression of Y on X1,X2 (including intercept), using the data constructed in from Step i. Store the coefficient vector.

Repeat Steps i and ii 100 times. You will have 100 coefficient vectors. Report the standard deviation of the intercept coefficients, and the X1 coefficients, and the X2 coefficients.

Perform this iterative strategy (in case you’re curious, this is known as bootstrapping) on your data set from Q1c. Hint: a for() loop is probably simplest. What are the standard deviations you get for the intercept and X1 and X2 coefficients? Are they close to the standard errors you found in Q2a? (They should be.)

2c (2 points). Rerun your code from Q2b, but now sampling without replacement (before you were sampling with replacement). What are the standard errors you estimate in the end? And does this make sense to you?

Data cleaning and wrangling (15 points)

3a (3 points). A dataset on horse colic is available at https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data. Download and save it locally. Then load it into your R session using read.table(); hint: make sure to have your data/working directory correctly specified, and also to set stringsAsFactors=FALSE; save the resulting data frame as horse.df; check that it has dimension 300 x 28; and set its column names to the string vector nms below. Print out the first 5 rows and (all columns) of horse.df to the console. To read more about the data set, visit https://archive.ics.uci.edu/ml/datasets/Horse+Colic.

nms = c("Surgery", "Age", "Hosp_ID", "Temp_Rect", "Pulse", "Resp", "Temp_Extr",

"Pulse_Peri", "Mucous", "Cap_Time", "Pain", "Peris", "Ab_Dist", "Tube",

"Reflux", "Reflux_PH", "Feces", "Abdomen", "Cell_Vol", "Total_Protein",

"Ab_Appear", "Ab_Total_Prot", "Outcome", "Lesion", "Lesion_Site",

"Lesion_Type", "Lesion_Subt", "Lesion_Code")

3b (3 points). How many columns in horse.df are of character type, and of integer type? How many columns have missing values (represented by “?” in this data set), and what are they (what are their columns names)? The UCI website reports that about 30% of this data set is missing. Is this accurate? What do you calculate the missingness percentage to be?

3c (3 points). Convert all of the “?” values in horse.df to NA values. After this, convert all of the character columns to numeric, but make sure that the horse.df object is still a data frame (just explicitly cast it to one, if you need to). When done, print the first 5 rows (and all columns).

3d (3 points). Arrange the horses (the rows in horse.df) by Surgery (1 means yes surgery, 2 means no surgery), and then by descending values of Pulse. Report the Surgery, Pulse, Temp_Rect (rectal temperature), and Ab_Dist (abdominal distention: 1 being best, 4 being worst) variables, for all horses with Pulse at least 130.

3e (3 points). For each Pain level (1 being best, 5 being worst), compute the average Pulse (and make sure to exclude NA values in this calculation), then report the Pain and average Pulse in decreasing order of average Pulse.

Data visulization and apply (12 points)

We’re going to use the same dataset (in Lab 3) from the 2016 Summer Olympics in Rio de Janeiro, taken from https://github.com/flother/rio2016 (itself put together by scraping the official Summer Olympics website for information about the athletes). Below we read in the data and store it as rio. rio = read.csv("rio.csv")

4a (3 points). Create a new data frame called sports, which we’ll populate with information about each sporting event at the Summer Olympics. Initially, define sports to contain a single variable called sport which contains the names of the sporting events in alphabetical order. Then, add a column called n_participants which contains the number of participants in each sport. Use one of the apply functions to determine the number of gold medals given out for each sport, and add this as a column called n_gold. Using your newly created sports data frame, calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest ratio? Which has the lowest?

4b (3 points). Use one of the apply functions to compute the average weight of the participants in each sport, and add this as a column to sports called ave_weight. Important: there are missing weights in the data set coded as NA, but your column ave_weight should ignore these, i.e., it should be itself free of NA values. You will have to pass an additional argument to your apply call in order to achieve this. Hint: look at the help file for the mean() function; what argument can you set to ignore NA values? Once computed, display the average weights along with corresponding sport names, in decreasing order of average weight.

4c (3 points). As in the last part, compute the average weight of atheletes in each sport, but now separately for men and women. You should therefore add two new columns, called ave_weight_men and ave_weight_women, to sports. Once computed, display the average weights along with corresponding sports, for men and women, each list sorted in decreasing order of average weight. Are the orderings roughly similar?

4d (3 points). Repeat the calculation as in the last part, but with BMI (body mass index) replacing weight.

bonus question (5 points). Use one of the apply functions to compute the proportion of women among participating atheletes in each sport. Use these proportions to recompute the average weight (over all athletes in each sport) from the ave_weight_men and average_weight_women columns, and define a new column ave_weight2 accordingly. Does ave_weight2 differ from ave_weight? It should. Explain why. Then show how to recompute the average weight from ave_weight_men and average_weight_women in a way that exactly recreates average_weight.

(5/5)
Attachments:

Related Questions

. The fundamental operations of create, read, update, and delete (CRUD) in either Python or Java

CS 340 Milestone One Guidelines and Rubric  Overview: For this assignment, you will implement the fundamental operations of create, read, update,

. Develop a program to emulate a purchase transaction at a retail store. This  program will have two classes, a LineItem class and a Transaction class

Retail Transaction Programming Project  Project Requirements:  Develop a program to emulate a purchase transaction at a retail store. This

. The following program contains five errors. Identify the errors and fix them

7COM1028   Secure Systems Programming   Referral Coursework: Secure

. Accepts the following from a user: Item Name Item Quantity Item Price Allows the user to create a file to store the sales receipt contents

Create a GUI program that:Accepts the following from a user:Item NameItem QuantityItem PriceAllows the user to create a file to store the sales receip

. The final project will encompass developing a web service using a software stack and implementing an industry-standard interface. Regardless of whether you choose to pursue application development goals as a pure developer or as a software engineer

CS 340 Final Project Guidelines and Rubric  Overview The final project will encompass developing a web service using a software stack and impleme