Assignment introductions
In the lectures, we discussed how household surveys could be used to measure poverty and income levels. In this exercise, you will do independent work on a subset of the Indian National Sample Survey, downloaded from IPUMS.
You should submit an archived zip file for your assignment. The file should be named FirstName_LastName_A1.zip. It should contain:
If you have trouble creating zip files, google or ask.
You are responsible for naming your file appropriately and for making sure that the code runs without errors. Also organize your code and formatted text in a clear and easy-to-read manner.
The exercises use R.
Note that this assignment is a re-write of Chapter 5 in “R for Data Science” using other data examples. The authors of the original book are not responsible for the content of this assignment. You can find the original source at http://r4ds.had.co.nz/transform.html. You can find the license of use at https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode.
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a dataset which contains a representative sample of Indian workers with recorded wage data from 2004.
In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data on India taken from IPUMS. To download the IPUMS dataset, you need to take the following steps:
You should now be able to view the data for the exercise. Do the following steps to get it into R:
Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
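As a minimal sketch of what this masking means in practice, the snippet below calls the base versions by their full names after dplyr is loaded. The vector x is a made-up example and is not part of the assignment data.

```r
# Loading dplyr masks stats::filter() and stats::lag().
suppressPackageStartupMessages(library(dplyr))

# Hypothetical numeric series, invented for illustration only.
x <- c(1, 2, 3, 4, 5)

# Base R's filter() applies a linear filter; here a 3-point moving average.
# The first and last values are NA because the window is centred.
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr's lag() shifts the values one position and pads with NA.
shifted_values <- dplyr::lag(x)

print(moving_avg)
print(shifted_values)
```

Writing the package name before :: always makes explicit which version is called, regardless of the order in which packages were loaded.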
To explore the basic data manipulation verbs of dplyr, we’ll use india04. This data frame contains a representative sample of all Indian individuals between 15 and 75 years of age who were employed with a positive registered wage in 2004. People with zero wages or missing wages have their income coded as missing. The data comes from IPUMS, and you can access documentation about the variables on the IPUMS International website. To do so, open the browse data page on the IPUMS website and find your variables. Click on a variable to obtain information about that variable. For more on finding variable documentation, check the recorded film on Canvas.
india04
## # A tibble: 167,079 x 9
##     year urban   age sex    geolev2 occisco             edattain incwage  perwt
##    <int> <chr> <int> <chr>    <dbl> <chr>               <chr>      <dbl>  <dbl>
##  1  2004 Rural    27 Male    3.56e8 Skilled agricultur~ <primary      NA  2467.
##  2  2004 Rural    25 Female  3.56e8 NIU (not in univer~ <primary      NA  2467.
##  3  2004 Rural    25 Male    3.56e8 Skilled agricultur~ <primary      NA  2467.
##  4  2004 Rural    22 Female  3.56e8 NIU (not in univer~ <primary      NA  2467.
##  5  2004 Rural    45 Male    3.56e8 Service workers an~ <primary     240  1850.
##  6  2004 Rural    42 Female  3.56e8 Skilled agricultur~ <primary      NA  1850.
##  7  2004 Rural    16 Male    3.56e8 NIU (not in univer~ primary       NA  1850.
##  8  2004 Rural    32 Male    3.56e8 NIU (not in univer~ <primary      NA  1850.
##  9  2004 Rural    30 Female  3.56e8 NIU (not in univer~ <primary      NA  1850.
## 10  2004 Rural    55 Male    3.56e8 Skilled agricultur~ <primary      NA 16187.
## # ... with 167,069 more rows
The data frame will display slightly differently in your Jupyter Notebook, but the first 10 rows will be almost the same, up to some rounding of numbers.
You might also have noticed the row of three (or four) letter abbreviations under the column names above. These describe the type of each variable:
int stands for integers.
chr stands for character vectors, or strings.
dbl stands for doubles, or real numbers.
There are four other common types of variables that aren’t used in this dataset but you’ll encounter later in the book:
lgl stands for logical, vectors that contain only TRUE or FALSE.
fct stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates.
dttm stands for date-times (a date + a time).
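To get the type of each variable in your Jupyter Notebook session, you can either run str(india04), or check one variable at a time by running typeof() on the column itself, for example typeof(india04$year). Note that typeof('year'), with the name in quotes, returns the type of the string 'year' rather than the type of the column. Try it out.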
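The type checks can be tried on a small stand-in data frame. The sketch below uses a hypothetical tibble called toy that borrows three column names from india04; its values are invented for illustration.

```r
library(tibble)

# Hypothetical toy data mimicking three india04 columns (values are made up).
toy <- tibble(
  year    = c(2004L, 2004L, 2004L),
  urban   = c("Rural", "Urban", "Rural"),
  incwage = c(240, NA, 500)
)

typeof(toy$year)     # "integer"
typeof(toy$urban)    # "character"
typeof(toy$incwage)  # "double"

# A quoted name is just a string, so this reports the type of the string itself.
typeof("year")       # "character"

# Types of all columns at once, as a named character vector.
column_types <- sapply(toy, typeof)
print(column_types)
```

The same calls should work on india04 once it is loaded, with toy replaced by india04.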
In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group by group. These six functions provide the verbs for a language of data manipulation.
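To illustrate how group_by() changes the scope of a verb, here is a minimal sketch on a hypothetical four-row tibble. The column names follow india04, but the values are invented, and summarise() is used as the example verb.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data (values are made up for illustration).
toy <- tibble::tibble(
  urban   = c("Rural", "Rural", "Urban", "Urban"),
  incwage = c(200, 400, 600, 1000)
)

# Ungrouped: one summary row for the entire data frame.
overall <- toy %>%
  summarise(mean_wage = mean(incwage))

# Grouped: one summary row per value of urban.
by_area <- toy %>%
  group_by(urban) %>%
  summarise(mean_wage = mean(incwage))

print(overall)
print(by_area)
```

The call to summarise() is identical in both cases; only the grouping differs, which is what is meant by changing the scope of the function.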
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
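The three properties above can be seen in a short chain. This is a sketch on a hypothetical tibble with invented values, using the dplyr verbs filter(), mutate(), arrange() and select() together with the %>% pipe.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data (values are made up for illustration).
toy <- tibble::tibble(
  age     = c(27L, 45L, 16L, 55L),
  sex     = c("Male", "Male", "Female", "Female"),
  incwage = c(NA, 240, 100, 500)
)

# Each verb takes a data frame first, refers to columns without quotes,
# and returns a new data frame, so the steps can be chained.
result <- toy %>%
  filter(!is.na(incwage)) %>%           # keep rows with a recorded wage
  mutate(log_wage = log(incwage)) %>%   # add a new column
  arrange(desc(incwage)) %>%            # sort from highest to lowest wage
  select(age, sex, incwage, log_wage)   # choose and order the columns

# The pipe only passes the data frame as the first argument,
# so this call is equivalent to the first step of the chain.
same_first_step <- filter(toy, !is.na(incwage))

print(result)
```

The original toy data frame is left unchanged; each step returns a new data frame, and only the final one is stored in result.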