how household surveys could be used to measure poverty and income levels.

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

In the lectures, we discussed how household surveys could be used to measure poverty and income levels. In this exercise, you will do independent work on a subset of the Indian National Sample Survey, downloaded from IPUMS.

You should submit an archived zip file for your assignment. The file should be named FirstName_LastName_A1.zip. It should contain:

  1. The Jupyter Notebook file, named FirstName_LastName_A1.ipynb, that contains:
    • The code that answers the questions that require code answers
    • Formatted Markdown text that answers the questions that require written (verbal) answers
  2. The IPUMS.dat and IPUMS.xml files that you will create in the prerequisites section below.

If you have trouble creating zip files, google or ask.

You are responsible for naming your file appropriately and for making sure that the code runs without errors. Make sure to also organize your code and formatted text in a clear and easy-to-read manner.

The exercises use R.

Note that this assignment is a re-write of Chapter 5 in “R for Data Science” using other data examples. The authors of the original book are not responsible for the content of this assignment. You can find the original source at http://r4ds.had.co.nz/transform.html. You can find the license of use at https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode.

Data transformation

Introduction

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a dataset which contains a representative sample of Indian workers with recorded wage data from 2004.

Prerequisites

In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data on India taken from IPUMS. To download the IPUMS dataset, you need to take the following steps:

  1. Go to the IPUMS International homepage: https://international.ipums.org/international/ and log in with your user-name and password.
  2. Click on “Account” at the top, and then “Join a class” using the class code RA5PHL.
  3. Click on “My Data” and then on “Class Extracts.”

You should now be able to view the data for the exercise. Do the following steps to get it into R:

  1. Download the file Assignment2_Start.ipynb from Canvas and place it in your preferred working directory (project folder).
  2. On the IPUMS page, double click on Download.DAT to download a dat.gz file. Put the .gz file in your project folder, and extract it to your working directory. Rename it IPUMS.dat.
  3. Again on IPUMS, right-click on “DDI” (Data Documentation Initiative) to download an .xml file using save link as. Put it in your project folder and rename it IPUMS.xml.
  4. Open the file Assignment2_Start.ipynb and get started with your assignment. The file provides the code that loads the data and key packages. You should type your answers to the assignment in this jupyter notebook file in the provided cells, and rename it FirstName_LastName_A1.ipynb.
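The starter notebook already contains the loading code, so the following is only a rough sketch of what such a step might look like. It assumes the ipumsr package is installed and that IPUMS.dat and IPUMS.xml sit in the working directory. The object names ddi and raw are hypothetical, and the notebook's own code may differ.

```r
# Sketch only: assumes the ipumsr package and the two renamed files
# (IPUMS.xml, IPUMS.dat) are available in the working directory.
ddi_file  <- "IPUMS.xml"
data_file <- "IPUMS.dat"

# Check the prerequisites before trying to read anything.
files_ready <- file.exists(ddi_file) &&
  file.exists(data_file) &&
  requireNamespace("ipumsr", quietly = TRUE)

if (files_ready) {
  ddi <- ipumsr::read_ipums_ddi(ddi_file)                      # variable metadata
  raw <- ipumsr::read_ipums_micro(ddi, data_file = data_file)  # the microdata
  print(dim(raw))
} else {
  message("IPUMS.xml, IPUMS.dat or the ipumsr package is missing; ",
          "check the download steps above.")
}
```

Running the check first gives a readable message if a file was not renamed or extracted as described, instead of a lower-level read error.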

Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
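As a small illustration of the masking described above, the sketch below calls both versions of each function by their full names. The vector x is made up for the example and is not part of the assignment data.

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# Base R (stats) versions, reached through their full names.
moving_avg <- stats::filter(x, rep(1 / 3, 3))  # centred 3-point moving average
shifted_ts <- stats::lag(ts(x), k = 1)         # shifts the time index of a ts object

# dplyr versions, which plain filter() and lag() refer to after loading dplyr.
lagged <- dplyr::lag(x)                            # previous value of each element
kept   <- dplyr::filter(data.frame(x = x), x > 3)  # rows where x > 3

print(as.numeric(moving_avg))
print(lagged)
print(kept)
```

The two filter() functions do unrelated things (a linear filter on a series versus row selection in a data frame), which is why the conflicts message is worth reading.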

india04

To explore the basic data manipulation verbs of dplyr, we’ll use india04. This data frame contains a representative sample of all Indian individuals between 15 and 75 years of age who were employed with a positive registered wage in 2004. People with zero or missing wages have their income coded as missing. The data come from IPUMS, and you can access documentation about the variables on the IPUMS International website. To do so, open Browse Data on the IPUMS webpage and find your variables. Click on a variable to obtain information about that variable. For information on finding variable documentation, see the recorded video on Canvas.

india04
## # A tibble: 167,079 x 9
##     year urban   age sex     geolev2 occisco             edattain incwage  perwt
##    <int> <chr> <int> <chr>     <dbl> <chr>               <chr>      <dbl>  <dbl>
##  1  2004 Rural    27 Male     3.56e8 Skilled agricultur~ <primary      NA  2467.
##  2  2004 Rural    25 Female   3.56e8 NIU (not in univer~ <primary      NA  2467.
##  3  2004 Rural    25 Male     3.56e8 Skilled agricultur~ <primary      NA  2467.
##  4  2004 Rural    22 Female   3.56e8 NIU (not in univer~ <primary      NA  2467.
##  5  2004 Rural    45 Male     3.56e8 Service workers an~ <primary     240  1850.
##  6  2004 Rural    42 Female   3.56e8 Skilled agricultur~ <primary      NA  1850.
##  7  2004 Rural    16 Male     3.56e8 NIU (not in univer~ primary       NA  1850.
##  8  2004 Rural    32 Male     3.56e8 NIU (not in univer~ <primary      NA  1850.
##  9  2004 Rural    30 Female   3.56e8 NIU (not in univer~ <primary      NA  1850.
## 10  2004 Rural    55 Male     3.56e8 Skilled agricultur~ <primary      NA 16187.
## # ... with 167,069 more rows

The data frame will be displayed slightly differently in your Jupyter Notebook, but the first 10 rows will be almost the same, up to some rounding of the numbers.

You might also have noticed the row of three (or four) letter abbreviations under the column names above. These describe the type of each variable:

  • int stands for integers.

  • chr stands for character vectors, or strings.

  • dbl stands for doubles, or real numbers.

There are four other common types of variables that aren’t used in this dataset but you’ll encounter later in the book:

  • lgl stands for logical, vectors that contain only TRUE or FALSE.

  • fctr (printed as fct in recent versions of tibble) stands for factors, which R uses to represent categorical variables with fixed possible values.

  • date stands for dates.

  • dttm stands for date-times (a date + a time).

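To get the type of each variable in your Jupyter Notebook session, you can either run str(india04), or check one variable at a time by running typeof() on a column extracted with $, for example typeof(india04$year). Note that typeof('year') would instead return the type of the string 'year' itself, which is always "character". Try it out.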
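The type checks can be tried on a small stand-in before running them on the real data. The tibble toy below is hypothetical: it only mimics four of india04's columns with made-up values.

```r
library(dplyr)

# Hypothetical stand-in with a few of india04's columns (values invented).
toy <- tibble(
  year    = c(2004L, 2004L),
  age     = c(27L, 45L),
  sex     = c("Male", "Male"),
  incwage = c(NA, 240)
)

str(toy)                             # compact overview of every column

column_types <- sapply(toy, typeof)  # storage type of each column
print(column_types)

print(typeof(toy$year))  # type of the column itself
print(typeof("year"))    # type of a quoted name, which is just a string
```

Comparing the last two lines shows why the column has to be extracted from the data frame rather than passed as a quoted name.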

dplyr basics

In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
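To make the six verbs concrete, here is a minimal sketch that applies each one once. It uses a hypothetical four-row tibble, toy, whose column names follow india04 but whose values are invented. The result names (older, by_age and so on) are likewise only for illustration.

```r
library(dplyr)

# Hypothetical miniature of india04 (values invented for the example).
toy <- tibble(
  urban   = c("Rural", "Rural", "Urban", "Urban"),
  age     = c(45L, 27L, 38L, 52L),
  incwage = c(240, NA, 900, 600)
)

older   <- filter(toy, age > 30)                     # pick rows by value
by_age  <- arrange(toy, age)                         # reorder rows
narrow  <- select(toy, urban, incwage)               # pick columns by name
scaled  <- mutate(toy, wage_100s = incwage / 100)    # add a derived column
overall <- summarise(toy, mean_wage = mean(incwage, na.rm = TRUE))

# group_by() changes the scope: one summary row per value of urban.
by_area <- summarise(group_by(toy, urban),
                     mean_wage = mean(incwage, na.rm = TRUE))

print(overall)
print(by_area)
```

Without group_by(), summarise() collapses the whole table to one row; with it, the same call returns one row per group.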

All verbs work similarly:

  1. The first argument is a data frame.

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
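A short sketch of chaining, again on a hypothetical toy tibble with invented values: because every verb takes a data frame first and returns a new one, the steps can be written one at a time or joined with the pipe operator %>% that dplyr makes available. The pipe is used here as one convenient option, not as something the text above prescribes.

```r
library(dplyr)

# Hypothetical miniature of india04 (values invented for the example).
toy <- tibble(
  urban   = c("Rural", "Rural", "Urban", "Urban"),
  age     = c(45L, 27L, 38L, 52L),
  incwage = c(240, NA, 900, 600)
)

# Step by step: each result is a new data frame fed to the next verb.
with_wage   <- filter(toy, !is.na(incwage))
grouped     <- group_by(with_wage, urban)
step_result <- summarise(grouped, mean_wage = mean(incwage), n = n())

# The same chain written with the pipe.
pipe_result <- toy %>%
  filter(!is.na(incwage)) %>%
  group_by(urban) %>%
  summarise(mean_wage = mean(incwage), n = n())

print(pipe_result)
print(nrow(toy))  # the original data frame is left unchanged
```

Both versions give the same summary, and toy still has all of its rows afterwards, since the verbs never modify their input in place.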
