Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

Drop Files Here Or Click to Upload

Or Get Complete Course Help

Jesse RossResume writing

(5/5)

573 Answers

Hire Me

Arun SainiPolitical science

(5/5)

886 Answers

Hire Me

Cristopher WebbTechnical writing

(5/5)

524 Answers

Hire Me

Heena AliManagement

(5/5)

696 Answers

Hire Me

R Programming

(5/5)

Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything

INSTRUCTIONS TO CANDIDATES

ANSWER ALL QUESTIONS

Questions

1. Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything but lower case letters and dots. For example, Bottle Volume (ml) variable must become bottle.volume.ml . Important: your imported sales data must contain exactly 2,417,909 rows. If you don’t see this number, it means your file has been downloaded incompletely or imported incorrectly.

Suggested functions to use (read their help files):

• colnames() — to access current column names and write back necessary changes into them.

• tolower() — to transform all characters in column names to lowercase.

• gsub() — to replace spaces " " with dots "." and parenthesis "()" with empty string "".

2. Creating store registry. Currently the data has full store info (name, address, etc.) stored for every liquor order, which means there is a lot of duplicate entries — the same store information is repeated multiple times, which is very memory inefficient. Instead, we would like to move all store data into a separate data registry, leaving only store number as a reference in our sales data.

2.1. Create a new dataframe stores that contains unique combinations of the following 7 vari- ables from the sales data: store.number , store.name , address , store.location ,

city , zip.code , county .

• There should be a total of 5207 rows in dataframe stores , but only 2278 unique values in stores$store.number variable.

• As long as those two numbers are not equal, you have duplicate entries in your store registry that need to be accounted for.

• Remember that any subset of sales with several columns is also a dataframe, so all you need is to select necessary columns from sales and assign those into new object

stores .

• Use unique() function to remove duplicates from newly created dataframe stores .

2.2. Having saved all relevant store info in a separate dataframe stores , we no longer need to have these variables in our main dataframe sales . Drop the following variables from

sales : store.name , address , store.location , city , zip.code , county .

2.3. Extract GPS coordinates from store.location variable and store them in new variables

latitude and longitude inside stores dataframe.

• A suggested way to do it is to apply strsplit() function twice: first to get the part of store.location inside the parenthesis and then to split that part into two coordinates (see class code for example on how to use strsplit() ).

• Since you will need to average GPS coordinates in the next step, use as.numeric() function to ensure that stores$latitude and stores$lontitude as stored as num- bers, not as character strings (you can use the command str(stores) to check the types of variables in your data).

2.4. Drop store.location variable from the stores dataframe.

2.5. Many records in our stores dataframe have slightly varying GPS coordinates (different in, say, 5th digit after the decimal point), but the same store.number value. That’s because GPS readings are subject to a certain variation even when taken at the same exact location, as lots of factors contribute to precision of those readings. One way to filter out such discrepancies is by replacing all varying GPS coordinates in longitude and latitude with their averages for all observations with the same store.number .

• A suggested way to do it is to first use aggregate() function to create a temporary dataframe coordinates containing only average values of stores$latitude and

stores$longitude per store.number .

• Then use merge() function to replace coordinates in stores with their average values from coordinates , using store.number in both datasets to match records in stores with the ones in coordinates .

• Because you are replacing all existing coordinates with their averages, it is recom- mended to drop variables longitude and latitude from stores before running

merge() command.

2.6. Remove duplicate rows from stores . You should now have 5192 unique rows in stores . Removed duplicates were due to the GPS coordinates now being the same for each unique value of store.number .

2.7. Another source of duplicates is the way values in address , city and county variables are recorded. From R’s perspective all of them are character variables, and thus values such as "Cedar Falls" and "CEDAR FALLS" are treated as different, and so are values such as "128 South St." and "128 South St".

• One could fix that by making all those letters lower case using R’s base tolower() function, but a better solution is to use str_to_title() function from stringr package — it makes every word start with a capital letter. Apply this function to

address , city and county .

• Also remove all dots and commas from address variable using gsub() command.

2.8. Remove duplicate rows from stores . You should now have 2877 unique rows in stores . Removed duplicates were due to the same address being recorded slightly differently for the same store.number .

2.9. Some of the stores with the same store.number are recorded twice because one of the records is missing county value. Fill in those missing values based on the observations for the same store.number that have the county name in county .

• Start with selecting unique pairs of store.number and non-missing county values from stores and storing them in a temporary dataframe counties . Use is.na() function to select rows from stores where the variable stores$county is not missing and then unique() function to extract unique pairs of store.number and county from those rows.

• Drop county variable from stores .

• Then use merge() function to add values of county$county back to stores based on the matching values of store.number in each dataset (similarly to how we added customer names into customers dataframe in class).

2.10. Remove duplicate rows from stores . You should now have 2792 unique rows in stores . Removed duplicates were due to the same store having county name recorded in one case and missing in another.

2.11. Remaining duplicates are due to typos in either store names, cities, counties or zip codes. Given that there’s no way to filter those errors out automatically, we need to manually review each duplicate case. To simplify the task, create a new variable dup that equals one if that store had more than one record in stores and 0 otherwise. There should be a total of 905 rows with dup = 1 .

• A suggested way to do it is to apply duplicated() function to store$store.number twice, with and without fromlast = TRUE argument.

• duplicated() function returns TRUE or FALSE values, which you can convert to 1s and 0s via the as.integer() function.

2.12. Import "iowa.geographies.csv" data into geo dataframe using read.csv() command. Create a new variable stores$match that equals 1 if a store record has a combination of zip/city/county that matches with a combination in geo and 0 otherwise.

• First create a new variable match in geo that is equal to 1 for all observations

• Then add the values of geo$match into stores by using merge() command with city, zipcode and county variables in both datasets as matching variables.

• After the merge the records in stores that did not have a matching record in geo

will have the values of stores$match equal to NA . You can replace those NA s with

zeros by filtering those rows using is.na() command and replacing the values of

stores$match with zeros only for those rows.

You should have 382 zeros in stores$match variable.

3. There are two major issues with our remaining sales data — prices recorded as characters because of the $ sign and liquor categories being too disaggregated.

3.1. Remove $ character from state.bottle.retail and sale.dollars variables using

gsub() function and convert them to numeric type using as.numeric() .

3.2. Create a new variable subcategory inside sales equal to current category.name vari- able. Drop category.name variable from sales .

3.3. Create a new variable category that will be equal to one of the three character strings ("Tequila", "Gin" or "Brandy") depending on what is specified in the subcategory vari- able.

• Start with creating an empty ( NA ) variable category inside sales dataframe. Then replace those empty values with "Tequila", "Gin" or "Brandy" depending on the value of subcategory variable.

• A suggested way to do it is to use grepl() function to subset rows in sales that have matching liquor category name in subcategory and then assign corresponding value to category .

• Make sure your code correctly identifies category from subcategory for values such as " Ginny Brandy" or "American Tequilarian Gin " (note the leading and trailing spaces). Check this cheatsheet for regular expressions to figure out how to do it. Eventually, you should have the following number of records per category:

(5/5)

Hurry, Grab up to 30% discount on the entire course

Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

Our Experts

Jesse RossResume writing

Arun SainiPolitical science

Cristopher WebbTechnical writing

Heena AliManagement

R Programming

Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything

ANSWER ALL QUESTIONS

Attachments:

Instructions Files

Related Questions

. The fundamental operations of create, read, update, and delete (CRUD) in either Python or Java

. Develop a program to emulate a purchase transaction at a retail store. This program will have two classes, a LineItem class and a Transaction class

. The following program contains five errors. Identify the errors and fix them

. Accepts the following from a user: Item Name Item Quantity Item Price Allows the user to create a file to store the sales receipt contents

. The final project will encompass developing a web service using a software stack and implementing an industry-standard interface. Regardless of whether you choose to pursue application development goals as a pure developer or as a software engineer

Other Services

Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

Our Experts

Jesse RossResume writing

Arun SainiPolitical science

Cristopher WebbTechnical writing

Heena AliManagement

R Programming

Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything

ANSWER ALL QUESTIONS

Attachments:

Instructions Files

Related Questions

. The fundamental operations of create, read, update, and delete (CRUD) in either Python or Java

. Develop a program to emulate a purchase transaction at a retail store. This program will have two classes, a LineItem class and a Transaction class

. The following program contains five errors. Identify the errors and fix them

. Accepts the following from a user: Item Name Item Quantity Item Price Allows the user to create a file to store the sales receipt contents

. The final project will encompass developing a web service using a software stack and implementing an industry-standard interface. Regardless of whether you choose to pursue application development goals as a pure developer or as a software engineer