Questions
1. Importing data. Import r.hw02.csv data and rename all the variables in the data so that they do not contain anything but lower case letters and dots. For example, Bottle Volume (ml) variable must become bottle.volume.ml . Important: your imported sales data must contain exactly 2,417,909 rows. If you don’t see this number, it means your file has been downloaded incompletely or imported incorrectly.
Suggested functions to use (read their help files):
• colnames() — to access current column names and write back necessary changes into them.
• tolower() — to transform all characters in column names to lowercase.
• gsub() — to replace spaces " " with dots "." and parenthesis "()" with empty string "".
2. Creating store registry. Currently the data has full store info (name, address, etc.) stored for every liquor order, which means there is a lot of duplicate entries — the same store information is repeated multiple times, which is very memory inefficient. Instead, we would like to move all store data into a separate data registry, leaving only store number as a reference in our sales data.
2.1. Create a new dataframe stores that contains unique combinations of the following 7 vari- ables from the sales data: store.number , store.name , address , store.location ,
city , zip.code , county .
• There should be a total of 5207 rows in dataframe stores , but only 2278 unique values in stores$store.number variable.
• As long as those two numbers are not equal, you have duplicate entries in your store registry that need to be accounted for.
• Remember that any subset of sales with several columns is also a dataframe, so all you need is to select necessary columns from sales and assign those into new object
stores .
• Use unique() function to remove duplicates from newly created dataframe stores .
2.2. Having saved all relevant store info in a separate dataframe stores , we no longer need to have these variables in our main dataframe sales . Drop the following variables from
sales : store.name , address , store.location , city , zip.code , county .
2.3. Extract GPS coordinates from store.location variable and store them in new variables
latitude and longitude inside stores dataframe.
• A suggested way to do it is to apply strsplit() function twice: first to get the part of store.location inside the parenthesis and then to split that part into two coordinates (see class code for example on how to use strsplit() ).
• Since you will need to average GPS coordinates in the next step, use as.numeric() function to ensure that stores$latitude and stores$lontitude as stored as num- bers, not as character strings (you can use the command str(stores) to check the types of variables in your data).
2.4. Drop store.location variable from the stores dataframe.
2.5. Many records in our stores dataframe have slightly varying GPS coordinates (different in, say, 5th digit after the decimal point), but the same store.number value. That’s because GPS readings are subject to a certain variation even when taken at the same exact location, as lots of factors contribute to precision of those readings. One way to filter out such discrepancies is by replacing all varying GPS coordinates in longitude and latitude with their averages for all observations with the same store.number .
• A suggested way to do it is to first use aggregate() function to create a temporary dataframe coordinates containing only average values of stores$latitude and
stores$longitude per store.number .
• Then use merge() function to replace coordinates in stores with their average values from coordinates , using store.number in both datasets to match records in stores with the ones in coordinates .
• Because you are replacing all existing coordinates with their averages, it is recom- mended to drop variables longitude and latitude from stores before running
merge() command.
2.6. Remove duplicate rows from stores . You should now have 5192 unique rows in stores . Removed duplicates were due to the GPS coordinates now being the same for each unique value of store.number .
2.7. Another source of duplicates is the way values in address , city and county variables are recorded. From R’s perspective all of them are character variables, and thus values such as "Cedar Falls" and "CEDAR FALLS" are treated as different, and so are values such as "128 South St." and "128 South St".
• One could fix that by making all those letters lower case using R’s base tolower() function, but a better solution is to use str_to_title() function from stringr package — it makes every word start with a capital letter. Apply this function to
address , city and county .
• Also remove all dots and commas from address variable using gsub() command.
2.8. Remove duplicate rows from stores . You should now have 2877 unique rows in stores . Removed duplicates were due to the same address being recorded slightly differently for the same store.number .
2.9. Some of the stores with the same store.number are recorded twice because one of the records is missing county value. Fill in those missing values based on the observations for the same store.number that have the county name in county .
• Start with selecting unique pairs of store.number and non-missing county values from stores and storing them in a temporary dataframe counties . Use is.na() function to select rows from stores where the variable stores$county is not missing and then unique() function to extract unique pairs of store.number and county from those rows.
• Drop county variable from stores .
• Then use merge() function to add values of county$county back to stores based on the matching values of store.number in each dataset (similarly to how we added customer names into customers dataframe in class).
2.10. Remove duplicate rows from stores . You should now have 2792 unique rows in stores . Removed duplicates were due to the same store having county name recorded in one case and missing in another.
2.11. Remaining duplicates are due to typos in either store names, cities, counties or zip codes. Given that there’s no way to filter those errors out automatically, we need to manually review each duplicate case. To simplify the task, create a new variable dup that equals one if that store had more than one record in stores and 0 otherwise. There should be a total of 905 rows with dup = 1 .
• A suggested way to do it is to apply duplicated() function to store$store.number twice, with and without fromlast = TRUE argument.
• duplicated() function returns TRUE or FALSE values, which you can convert to 1s and 0s via the as.integer() function.
2.12. Import "iowa.geographies.csv" data into geo dataframe using read.csv() command. Create a new variable stores$match that equals 1 if a store record has a combination of zip/city/county that matches with a combination in geo and 0 otherwise.
• First create a new variable match in geo that is equal to 1 for all observations
• Then add the values of geo$match into stores by using merge() command with city, zipcode and county variables in both datasets as matching variables.
• After the merge the records in stores that did not have a matching record in geo
will have the values of stores$match equal to NA . You can replace those NA s with
zeros by filtering those rows using is.na() command and replacing the values of
stores$match with zeros only for those rows.
You should have 382 zeros in stores$match variable.
3. There are two major issues with our remaining sales data — prices recorded as characters because of the $ sign and liquor categories being too disaggregated.
3.1. Remove $ character from state.bottle.retail and sale.dollars variables using
gsub() function and convert them to numeric type using as.numeric() .
3.2. Create a new variable subcategory inside sales equal to current category.name vari- able. Drop category.name variable from sales .
3.3. Create a new variable category that will be equal to one of the three character strings ("Tequila", "Gin" or "Brandy") depending on what is specified in the subcategory vari- able.
• Start with creating an empty ( NA ) variable category inside sales dataframe. Then replace those empty values with "Tequila", "Gin" or "Brandy" depending on the value of subcategory variable.
• A suggested way to do it is to use grepl() function to subset rows in sales that have matching liquor category name in subcategory and then assign corresponding value to category .
• Make sure your code correctly identifies category from subcategory for values such as " Ginny Brandy" or "American Tequilarian Gin " (note the leading and trailing spaces). Check this cheatsheet for regular expressions to figure out how to do it. Eventually, you should have the following number of records per category:
CS 340 Milestone One Guidelines and Rubric Overview: For this assignment, you will implement the fundamental operations of create, read, update,
Retail Transaction Programming Project Project Requirements: Develop a program to emulate a purchase transaction at a retail store. This
7COM1028 Secure Systems Programming Referral Coursework: Secure
Create a GUI program that:Accepts the following from a user:Item NameItem QuantityItem PriceAllows the user to create a file to store the sales receip
CS 340 Final Project Guidelines and Rubric Overview The final project will encompass developing a web service using a software stack and impleme