What are all the things that you could imagine that could go wrong with your data collection and how are you going to find them?

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Cleaning Up Survival Data.

What are all the things that you could imagine that could go wrong with your data collection and how are you going to find them? 

•Load the “MessedUpBeyondAllReason.csv” into a data frame called “mubar”.  NOTE, this is a “.csv” file and you will want to treat it appropriately in order to get it to load properly.  Show the full data set here because it is only 8 rows.  This data set should have the data for 2 different DGRP lines (LINE 1 and LINE 2).  There should be a single replicate vial for Low Sugar (low) and High Sugar (High) for each line and each replicate vial should have a row for MALE and FEMALE flies.

•How many rows and columns are there?
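
As a rough sketch of the loading step, the example below writes a tiny stand-in CSV to a temporary file and reads it back with read.csv(). The file contents, the name "toy", the TREAT codes "C" and "HS", and the exact column set are assumptions made for illustration; they are not the real “MessedUpBeyondAllReason.csv”.

```r
# Hypothetical stand-in for the real file: a tiny CSV written to a temp path.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("LINE,TREAT,SEX,REP,DAY_1",
             "1,C,MALE,1,10",
             "1,C,FEMALE,1,9",
             "2,HS,MALE,1,8"),
           toy_path)

# read.csv() expects a header row and comma-separated fields by default.
toy <- read.csv(toy_path)

print(toy)   # small enough to show in full
dim(toy)     # number of rows, then number of columns
nrow(toy)
ncol(toy)
```

The same calls should apply to the real file once its path is substituted for toy_path.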

•This data should look very similar to the format that will be present in your google sheets data set. Look, by eye, at the data and tell me all the errors you find. EVERY ROW HAS SOME ISSUE, SOME ARE HARD TO SPOT. DON’T READ ON UNTIL YOU FIND ALL THE PROBLEMS. “R” is very particular about stuff. A is not the same as a, for example.

How can you find and remove all the wrong data?  This is tricky but if you think about it carefully, it can be done AND it needs to be done if you want to have high quality data.

Several of the issues (MaLE vs MALE, c instead of C, wrong line numbers…) can be identified just by counting the number of entries in each of the different columns (you know the function to do this).

BEFORE YOU DO ANYTHING, I WOULD MAKE A NEW DATA FRAME CALLED “mubar.b” which stands for mubar backup.  You can go back to this if you think there are any problems and you don’t have to start over from scratch if you make a mistake that really messes things up.  As you progress, I will “encourage” you to continue to make these backups.

•Make a new data frame called “mubar.b” and store the “mubar” data in it.  If you make a mistake, you can take the data that is in “mubar.b” and put it back into “mubar”.
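
A minimal sketch of the backup idea, using a made-up data frame called "toy" (an assumption, not the assignment data). Assigning a data frame to a new name gives an independent copy, so a later mistake in the working copy does not touch the backup.

```r
# Hypothetical data, for illustration only.
toy <- data.frame(LINE = c(1, 2, 26),
                  SEX  = c("MALE", "FEMALE", "MALE"))

toy.b <- toy          # make the backup

toy$LINE[3] <- 999    # a deliberate mistake in the working copy
toy.b$LINE[3]         # the backup still holds the earlier value

toy <- toy.b          # restore the working copy from the backup
```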

•How many different DGRP LINES are there and how many different entries are there for each LINE?
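
One way to count entries, sketched on a made-up LINE column (the values are assumptions chosen to mimic a stray "26"):

```r
# Hypothetical data, for illustration only.
toy <- data.frame(LINE = c(1, 1, 1, 1, 2, 2, 2, 26))

line_counts <- table(toy$LINE)   # how many entries there are for each LINE
line_counts

length(unique(toy$LINE))         # how many different LINE values there are
```

A value that shows up only once where four entries are expected is a good candidate for a typo.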

•Is this what the data is supposed to look like? Do you think there are 3 LINES? Based on what the data is supposed to look like is it fair to change the value of “26” to “2”? 

•Change the LINE “26” to read “2” and print the data to the screen to verify the change.
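
A sketch of the replacement using logical indexing, on a made-up data frame (the name "toy" and its values are assumptions):

```r
# Hypothetical data, for illustration only.
toy <- data.frame(LINE = c(1, 1, 2, 26),
                  SEX  = c("MALE", "FEMALE", "MALE", "FEMALE"))

# The comparison returns TRUE/FALSE for each row; only the TRUE rows are overwritten.
toy$LINE[toy$LINE == 26] <- 2

print(toy)        # look at the data to confirm the change
table(toy$LINE)   # "26" should no longer appear
```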

Now work through similar logic for mubar$TREAT and mubar$SEX and mubar$REP and find the errors and fix them if you are absolutely certain that you can determine things that are just typos. If you are not absolutely certain about typos, delete the entire entry. 

•Identify and fix any problems with mubar$TREAT. Use the approach that you already learned that DOES NOT require the which() function. 

•Verify that the error has been corrected using the table() function.

•Identify and fix any problems with mubar$SEX. Again use the approach that DOES NOT require the which() function.

•Verify that the change has been made using the table() function.
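
The same logical-indexing pattern can be sketched for text columns. The data below is made up, and the TREAT codes "C" and "HS" are assumptions; only the "c" and "MaLE" typos come from the examples mentioned earlier in this assignment.

```r
# Hypothetical data, for illustration only.
toy <- data.frame(TREAT = c("C", "c", "HS", "HS"),
                  SEX   = c("MALE", "MaLE", "FEMALE", "FEMALE"),
                  stringsAsFactors = FALSE)

table(toy$TREAT)   # before: "c" and "C" are counted separately
table(toy$SEX)     # before: "MaLE" and "MALE" are counted separately

# No which() needed: the logical test picks out the rows to overwrite.
toy$TREAT[toy$TREAT == "c"] <- "C"
toy$SEX[toy$SEX == "MaLE"]  <- "MALE"

table(toy$TREAT)   # after
table(toy$SEX)     # after
```

If the columns were read in as factors (the default for read.csv() before R 4.0), table() would still list the old level with a count of 0; droplevels() removes it.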

•Identify any problems with mubar$REP.  Are there any?

Now we can start investigating the values in the DAY columns but before we do that let’s save “mubar” in “mubar.b”.

•Verify that “mubar” is a data frame and if it is, save “mubar” as “mubar.b” so you have a backup that is correct up to this point.
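
A small sketch of checking the type before backing up, on a made-up data frame ("toy" and "toy.b" are hypothetical names):

```r
# Hypothetical data, for illustration only.
toy <- data.frame(LINE = c(1, 2), DAY_1 = c(10, 8))

is.data.frame(toy)   # TRUE for a data frame
class(toy)

# Only overwrite the backup if the object really is a data frame.
if (is.data.frame(toy)) {
  toy.b <- toy
}
```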

Is there any missing data and if so, where in the data set? We found out how to do this in previous assignments, but there is a cool argument that can be changed for the which() function.  Check out the “arr.ind” argument.  The default setting is “FALSE” but it might be helpful to change it to “TRUE”.

•Use the which() function to determine where the missing data is. Use the default setting for “arr.ind” and also change it to “TRUE”
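
To see the difference between the two settings, here is a sketch on a made-up data frame with two missing values. The name "toy", its values, and the DAY_2 column are assumptions.

```r
# Hypothetical data, for illustration only: row 2 is missing both DAY values.
toy <- data.frame(LINE  = c(1, 1, 2, 2),
                  DAY_1 = c(10, NA, 8, 7),
                  DAY_2 = c(9, NA, 8, 6))

# Default (arr.ind = FALSE): positions counted down the columns, one after another.
na_linear <- which(is.na(toy))
na_linear

# arr.ind = TRUE: one line per missing value, giving its row and its column.
na_pos <- which(is.na(toy), arr.ind = TRUE)
na_pos
```

With arr.ind = TRUE the output has a "row" column and a "col" column, which is usually easier to read than a single position number.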

•How many rows of data have missing data and will need to be removed? 

oYou don’t need to run any additional R commands here; base your answer on the output of the commands above.  I am asking this question because a single row of data can have more than 1 missing value, so it may appear that you need to remove multiple rows when in reality you only need to remove a single row of data. Note:  it is not always required to remove replicates with missing data from the data set prior to analysis because most R functions have ways to deal with it. BUT, for simplicity and consistency and ease of grading for the TAs, we are going to do it in this class.

•Remove any row that has missing data and save the resulting data frame back into the current “mubar” data frame using the na.omit() function.

As is the case with many things in R, there are many ways to remove missing data.  Another useful function is the complete.cases() function.  The command below says: replace the data frame “mubar” with only those rows of “mubar” that have complete data (i.e. no NA). Here is the script for that.
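
A sketch of na.omit() on a made-up data frame (the name "toy", its values, and the DAY_2 column are assumptions):

```r
# Hypothetical data, for illustration only: row 2 has two missing values.
toy <- data.frame(LINE  = c(1, 1, 2, 2),
                  DAY_1 = c(10, NA, 8, 7),
                  DAY_2 = c(9, NA, 8, 6))

rows_before <- nrow(toy)

# na.omit() drops every row that contains at least one NA.
toy <- na.omit(toy)

rows_after <- nrow(toy)
rows_before - rows_after   # number of rows that were dropped
```

Two missing values in the same row cost only one row.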

mubar<-mubar[complete.cases(mubar),]

OR, take the information from the which() function (with arr.ind = TRUE) and delete the row that has missing data by hand (note, it is rare that you would want to do it this way, but it is good to know how).  Here is the script for that, noting that you previously identified row 2 as the row with the missing data.

mubar<-mubar[-c(2),]
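
To check that the three approaches agree, the sketch below applies each one to the same made-up data frame and compares which rows survive. The name "toy", its values, and the DAY_2 column are assumptions.

```r
# Hypothetical data, for illustration only: row 2 is the one with missing data.
toy <- data.frame(LINE  = c(1, 1, 2, 2),
                  DAY_1 = c(10, NA, 8, 7),
                  DAY_2 = c(9, NA, 8, 6))

by_na_omit  <- na.omit(toy)                  # approach 1
by_complete <- toy[complete.cases(toy), ]    # approach 2
by_hand     <- toy[-c(2), ]                  # approach 3: drop row 2 by position

# The row names record which original rows were kept.
rownames(by_na_omit)
identical(rownames(by_na_omit), rownames(by_complete))
identical(rownames(by_complete), rownames(by_hand))
```

The by-hand version only works because row 2 was already known to be the bad row; the other two find it on their own.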

•Verify that the data set is of the correct size after you have removed the row with the missing value in it.

•And just for good measure, verify that there is no missing data.
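
A sketch of both checks on a made-up data frame ("toy", "cleaned", and the values are assumptions):

```r
# Hypothetical data, for illustration only.
toy <- data.frame(LINE  = c(1, 1, 2, 2),
                  DAY_1 = c(10, NA, 8, 7))

cleaned <- na.omit(toy)

dim(toy)       # size before
dim(cleaned)   # size after: one fewer row, same number of columns

anyNA(cleaned)        # FALSE when nothing is missing
sum(is.na(cleaned))   # count of missing values; 0 when nothing is missing
```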

•If you have done everything correctly, I would resave this data frame as a new “backup” called “mubar.b2”.  Now you have the original backup and another backup that is a bit farther along (just in case you realize you messed up).

Another type of data that we are going to remove are the cases where there were no flies of a given sex in the original, DAY_1 vial.  The subset() function is a great way to do this.  Note, there are different ways to form the correct logical operator to do this.  For example, you can include specific data or exclude specific data and some ways are better than other ways.  It is usually better to include only the data that you want because it is often easier to know all the things you want than to know all the things you don’t want.
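
To illustrate the include-versus-exclude point, here is a sketch on a made-up data frame. The name "toy", its values, and the stray -1 (standing in for an unexpected typo) are assumptions.

```r
# Hypothetical data, for illustration only.
# Row 2 had no flies on DAY_1; row 4 holds an impossible count (a made-up typo).
toy <- data.frame(SEX   = c("MALE", "FEMALE", "MALE", "FEMALE"),
                  DAY_1 = c(10, 0, 8, -1))

# Include what you want: vials that started with at least one fly.
kept_include <- subset(toy, DAY_1 > 0)

# Exclude what you don't want: vials that started with exactly zero flies.
kept_exclude <- subset(toy, DAY_1 != 0)

nrow(kept_include)
nrow(kept_exclude)
```

The exclusion rule lets the -1 row through because it only names one bad value, while the inclusion rule drops it. subset() also drops rows where the condition evaluates to NA.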
