logo Use SA10RAM to get 10%* Discount.
Order Now logo

Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

expert
Basudev RoyMathematics
(/5)

996 Answers

Hire Me
expert
Claire BattershilllAccounting
(5/5)

699 Answers

Hire Me
expert
Douglas RichardsPsychology
(5/5)

674 Answers

Hire Me
expert
Pushpinder SinghData mining
(5/5)

871 Answers

Hire Me
R Programming
(5/5)

Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Questions

1. Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything but lower case letters and dots. For example, Bottle Volume (ml) variable must become bottle.volume.ml . Important: your imported sales data must contain exactly 2,417,909 rows. If you don’t see this number, it means your file has been downloaded incompletely or imported incorrectly.

Suggested functions to use (read their help files):

 

colnames() — to access current column names and write back necessary changes into them.

tolower() — to transform all characters in column names to lowercase.

gsub()  — to replace spaces " " with dots "." and parenthesis "()" with empty string "".

 

2. Creating store registry. Currently the data has full store info (name, address, etc.) stored for every liquor order, which means there is a lot of duplicate entries — the same store information is repeated multiple times, which is very memory inefficient. Instead, we would like to move all store data into a separate data registry, leaving only store number as a reference in our sales data.

2.1. Create a new dataframe stores that contains unique combinations of the following 7 vari- ables from the sales data: store.number , store.name , address , store.location ,

 city , zip.code ,  county .

There should be a total of 5207 rows in dataframe stores , but only 2278 unique values in stores$store.number variable.

As long as those two numbers are not equal, you have duplicate entries in your store registry that need to be accounted for.

Remember that any subset of sales with several columns is also a dataframe, so all you need is to select necessary columns from sales and assign those into new object

 stores .

Use  unique() function to remove duplicates from newly created dataframe  stores .

2.2. Having saved all relevant store info in a separate dataframe stores , we no longer need to have these variables in our main dataframe sales . Drop the following variables from

 sales :  store.name ,  address ,  store.location ,  city ,  zip.code ,  county .

2.3. Extract GPS coordinates from  store.location  variable and store them in new variables

 latitude and longitude inside  stores dataframe.

 

A suggested way to do it is to apply strsplit() function twice: first to get the part of store.location inside the parenthesis and then to split that part into two coordinates (see class code for example on how to use strsplit() ).

Since you will need to average GPS coordinates in the next step, use as.numeric() function to ensure that stores$latitude and stores$lontitude as stored as num- bers, not as character strings (you can use the command str(stores) to check the types of variables in your data).

2.4. Drop store.location variable from the stores dataframe.

2.5. Many records in our  stores  dataframe have slightly varying GPS coordinates (different in, say, 5th digit after the decimal point), but the same store.number value.  That’s because GPS readings are subject to a certain variation even when taken at the same exact location,  as lots of factors contribute to precision of those readings.  One way to filter out such discrepancies is by replacing all varying GPS coordinates in longitude and latitude with their averages for all observations with the same store.number .

A suggested way to do it is to first use aggregate() function to create a temporary dataframe coordinates containing only average values of stores$latitude and

 stores$longitude  per  store.number .

Then use merge()  function to replace coordinates in  stores  with their average values from  coordinates ,  using  store.number  in both datasets to match records in stores with the ones in coordinates .

Because you are replacing all existing coordinates with their averages, it is recom- mended to drop variables longitude and latitude from stores before running

 merge() command.

2.6. Remove duplicate rows from stores . You should now have 5192 unique rows in stores . Removed duplicates were due to the GPS coordinates now being the same for each unique value of store.number .

2.7. Another source of duplicates is the way values in  address ,  city and  county variables are recorded.  From R’s perspective all of them are character variables, and thus values such as "Cedar Falls" and "CEDAR FALLS" are treated as different,  and so are values such as "128 South St." and "128 South St".

One could fix that by making all those letters lower case using R’s base tolower() function, but a better solution is to use str_to_title() function from stringr package — it makes every word start with a capital letter. Apply this function to

 address , city  and  county .

Also remove all dots and commas from address variable using  gsub() command.

 

2.8. Remove duplicate rows from stores . You should now have 2877 unique rows in stores . Removed duplicates were due to the same address being recorded slightly differently for the same store.number .

2.9. Some of the stores with the same store.number are recorded twice because one of the records is missing county value. Fill in those missing values based on the observations for the same store.number that have the county name in county .

Start with selecting unique pairs of store.number and non-missing county values from stores and storing them in a temporary dataframe counties . Use is.na() function to select rows from stores where the variable stores$county is not missing and then unique() function to extract unique pairs of store.number and  county from those rows.

Drop county variable from  stores .

Then use merge() function to add values of  county$county back to  stores based on the matching values of store.number in each dataset (similarly to how we added customer names into customers dataframe in class).

2.10. Remove duplicate rows from stores . You should now have 2792 unique rows in stores . Removed duplicates were due to the same store having county name recorded in one case and missing in another.

2.11. Remaining duplicates are due to typos in either store names, cities, counties or zip codes. Given that there’s no way to filter those errors out automatically, we need to manually review each duplicate case.  To simplify the task, create a new variable  dup that equals one if that store had more than one record in stores and 0 otherwise. There should be a total of 905 rows with dup = 1 .

A suggested way to do it is to apply duplicated() function to store$store.number twice, with and without fromlast = TRUE argument.

duplicated() function returns TRUE or FALSE values, which you can convert to 1s and 0s via the as.integer() function.

2.12. Import "iowa.geographies.csv" data into geo dataframe using read.csv() command. Create a new variable stores$match that equals 1 if a store record has a combination of zip/city/county that matches with a combination in geo and 0 otherwise.

First create a new variable match in  geo that is equal to 1 for all observations

Then add the values of geo$match into stores by using merge() command with city, zipcode and county variables in both datasets as matching variables.

After the merge the records in stores that did not have a matching record in  geo 

will have the values of stores$match equal to NA . You can replace those NA s with

 

zeros by filtering those rows using is.na()  command and replacing the values of

 stores$match with zeros only for those rows.

You should have 382 zeros in stores$match variable.

 

3. There are two major issues with our remaining sales data — prices recorded as characters because of the $ sign and liquor categories being too disaggregated.

3.1. Remove  $  character  from   state.bottle.retail   and   sale.dollars   variables  using

 gsub() function and convert them to numeric type using as.numeric() .

3.2. Create a new variable subcategory inside sales equal to current category.name vari- able. Drop category.name variable from sales .

3.3. Create a new variable category that will be equal to one of the three character strings ("Tequila", "Gin" or "Brandy") depending on what is specified in the subcategory vari- able.

Start with creating an empty ( NA ) variable category inside sales dataframe. Then replace those empty values with "Tequila", "Gin" or "Brandy" depending on the value of subcategory variable.

A suggested way to do it is to use grepl() function to subset rows in sales that have matching liquor category name in subcategory and then assign corresponding value to category .

Make sure your code correctly identifies category from subcategory for values such as " Ginny Brandy" or "American Tequilarian Gin " (note the leading and trailing spaces). Check this cheatsheet for regular expressions to figure out how to do it. Eventually, you should have the following number of records per category:

 

(5/5)
Attachments:

Related Questions

. The fundamental operations of create, read, update, and delete (CRUD) in either Python or Java

CS 340 Milestone One Guidelines and Rubric  Overview: For this assignment, you will implement the fundamental operations of create, read, update,

. Develop a program to emulate a purchase transaction at a retail store. This  program will have two classes, a LineItem class and a Transaction class

Retail Transaction Programming Project  Project Requirements:  Develop a program to emulate a purchase transaction at a retail store. This

. The following program contains five errors. Identify the errors and fix them

7COM1028   Secure Systems Programming   Referral Coursework: Secure

. Accepts the following from a user: Item Name Item Quantity Item Price Allows the user to create a file to store the sales receipt contents

Create a GUI program that:Accepts the following from a user:Item NameItem QuantityItem PriceAllows the user to create a file to store the sales receip

. The final project will encompass developing a web service using a software stack and implementing an industry-standard interface. Regardless of whether you choose to pursue application development goals as a pure developer or as a software engineer

CS 340 Final Project Guidelines and Rubric  Overview The final project will encompass developing a web service using a software stack and impleme

Get Free Quote!

326 Experts Online