Java Programming

Write a MapReduce program that takes review.json as input and, for each business_id, outputs the number of reviews given to that business together with its average stars.


Assignment 1-MapReduce Warm-up (20 points)

This assignment has three short coding problems. Please read the instructions carefully and submit all the required files on the blackboard.

Problem 1- Processing Yelp Review Dataset (10 pts)

For this lab, you take a sample of the yelp review dataset, explained below, and for each business id find the number of reviews and the average stars given to that business id.

The data is made available to the public by Yelp and is in JSON format. Please go to https://www.yelp.com/dataset and click on “Download Dataset”, then enter your name, email address, and your initials, check the box agreeing to the dataset license, and click on Download. You will be redirected to another page. Click on “Download JSON” to download the dataset. You can read the documentation of the JSON dataset here: https://www.yelp.com/dataset/documentation/json. The file that you will be using for this assignment is called review.json and it is about 3.6 GB compressed. This file contains a sample of reviews given by users to each business and includes the business and user ids.

A JSON object is very similar to an XML tag in that it consists of a set of attributes and their values. Below is an example of a JSON object in the review.json file. Each line of review.json contains a single JSON object (each attribute is shown on a separate line here for readability, but in the dataset they are all on one line).

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",
}

 

To process JSON in Java, you can add the following dependency to your pom.xml:

<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20180130</version>
</dependency>

Then in your map function, you can create a json object and extract the attributes you want by calling the method “get” on the json object. For example, you can extract the attribute “stars” in your map function as follows (suppose that “value” is the value passed as an argument to your map function):

JSONObject jsn = new JSONObject(value.toString());
int stars = (Integer) jsn.get("stars");

You can extract the other attributes you need in a similar fashion.

What you need to do:

You need to write a MapReduce program which takes review.json as input and for each business_id it outputs the number of reviews that are given to that business id together with its average stars.

business_id   average_stars review_count
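Before running on Hadoop, it can help to have a small reference result to compare against. The following is a minimal single-process sketch of the same computation, not the MapReduce solution itself: it "maps" each line to a (business_id, stars) pair, groups by key, and "reduces" each group to a count and an average. The class name LocalReviewStats, its method names, the tab-separated row layout, and the regex-based field extraction are all assumptions made so the sketch runs with the JDK alone; in the actual mapper you would use JSONObject as described above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local stand-in for the job: extract (business_id, stars) per line,
// accumulate per business, then format count and average.
class LocalReviewStats {
    // Hypothetical regex extraction, used only to keep this sketch dependency-free.
    private static final Pattern BUSINESS_ID =
        Pattern.compile("\"business_id\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern STARS =
        Pattern.compile("\"stars\"\\s*:\\s*(\\d+)");

    /** Returns business_id -> {sum of stars, review count}. */
    static Map<String, long[]> aggregate(List<String> jsonLines) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String line : jsonLines) {
            Matcher id = BUSINESS_ID.matcher(line);
            Matcher stars = STARS.matcher(line);
            if (!id.find() || !stars.find()) {
                continue; // skip malformed lines instead of failing the whole run
            }
            long[] t = totals.computeIfAbsent(id.group(1), k -> new long[2]);
            t[0] += Long.parseLong(stars.group(1)); // running sum of stars
            t[1] += 1;                              // running review count
        }
        return totals;
    }

    /** Formats one row as: business_id TAB average_stars TAB review_count. */
    static String formatRow(String businessId, long[] sumAndCount) {
        double average = (double) sumAndCount[0] / sumAndCount[1];
        return businessId + "\t" + average + "\t" + sumAndCount[1];
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "{\"review_id\":\"r1\",\"business_id\":\"b1\",\"stars\":4}",
            "{\"review_id\":\"r2\",\"business_id\":\"b1\",\"stars\":5}",
            "{\"review_id\":\"r3\",\"business_id\":\"b2\",\"stars\":3}");
        aggregate(sample).forEach((id, t) -> System.out.println(formatRow(id, t)));
    }
}
```

Running this on a few hand-written lines gives you known-good values to compare with your job's output for the same input.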

You can first create a smaller sample of the review.json file and test your program on this sample. For example, you can use the following Unix shell command to copy the first 100K lines of review.json into another file called review_sample.json and run your program locally on this sample:

head -n 100000 review.json > review_sample.json

Once you are confident that your program works correctly on a smaller sample, right-click on your project folder in Eclipse, click on Export, and click on “runnable jar” to export your program as a runnable jar. The reason we are exporting it as a runnable jar is that we would like to package the JSON library we used with the jar file, so that all nodes running the map function will have access to that library.

Attention: You do not need to specify the name of your driver class in the hadoop jar command if your MapReduce program is exported as a runnable jar. In that case you only need to specify the paths to your input and output files. That is,

Hadoop jar

Hint:

You can emit multiple attributes as a key from your map or reduce function by appending them together and sending them as a Text object. For example, if you want to send both A1 and A2 as the key from your reducer, you can emit new Text(A1 + "," + A2) as the key.
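The hint above can be sketched in plain Java as a pair of helpers that pack two attributes into one delimited string and unpack them again. The class and method names (CompositeValue, pack, unpack) are hypothetical, and the sketch assumes the first attribute never contains a comma.

```java
// Sketch of the hint: pack two attributes into one delimited string, then unpack them.
class CompositeValue {
    private static final String SEP = ",";

    /** Joins the two fields; in a Hadoop job the result could be wrapped as new Text(pack(a1, a2)). */
    static String pack(String a1, String a2) {
        return a1 + SEP + a2;
    }

    /** Splits on the first separator only, so the second field may itself contain commas. */
    static String[] unpack(String packed) {
        return packed.split(SEP, 2);
    }

    public static void main(String[] args) {
        String packed = pack("4.5", "2");
        String[] parts = unpack(packed);
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

The same trick works for values as well as keys, which is useful when a single record has to carry more than one number.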

What you need to submit:

  1. The source code for your mapper, reducer, driver, and combiner. Please name your driver class as YelpAverageStar.java

Problem 2- Increase the performance of your program for Problem 1 with a custom combiner (10 pts)

Modify your solution to Problem 1 and use a custom combiner to increase the performance of your MapReduce program (please refer to the lectures “more on MapReduce” slides 29-44). Run and debug your program on a smaller dataset. Once you are sure that your program works correctly, copy the Yelp review data to HDFS, create a jar file, and run the program on your three-node YARN cluster. Once your job is completed, record the job elapsed time and the reduce shuffle size (the reduce shuffle size is printed on the terminal once the job is completed; you can find the job elapsed time in the YARN GUI, under the application history). Then go back to your program, comment out the line that sets the combiner, and run your program again on the cluster without the combiner. Record the job elapsed time and the reduce shuffle size again. Does your program run faster when using a combiner? What is the reduce shuffle size with and without the combiner?
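One reason the combiner has to be custom here is that an average cannot be combined by averaging again: the combiner's output must have the same form as the mapper's output, and Hadoop may apply the combiner zero, one, or several times. A common approach, sketched below under the assumption that values travel as "sum,count" strings (a hypothetical encoding, as are the names PartialAverage, combine, and average), is to carry partial sums and counts and divide only in the final reduce step.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: combine partial (sum, count) pairs and compute the average only at the end.
class PartialAverage {
    /** Merges "sum,count" partials into one "sum,count" partial (what a combiner could emit). */
    static String combine(List<String> partials) {
        long sum = 0;
        long count = 0;
        for (String p : partials) {
            String[] parts = p.split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        return sum + "," + count;
    }

    /** Final reduce step: turn a merged partial into the average. */
    static double average(String partial) {
        String[] parts = partial.split(",");
        return Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
    }

    public static void main(String[] args) {
        // Reviews for one business split across two map tasks: stars 5, 5, 5 and 1.
        String fromTask1 = combine(Arrays.asList("5,1", "5,1", "5,1")); // "15,3"
        String fromTask2 = combine(Arrays.asList("1,1"));               // "1,1"
        String merged = combine(Arrays.asList(fromTask1, fromTask2));   // "16,4"
        System.out.println(average(merged)); // prints 4.0
        // Averaging the per-task averages would give (5.0 + 1.0) / 2 = 3.0, which is wrong.
    }
}
```

Because combine takes and returns the same "sum,count" form, applying it any number of times leaves the final average unchanged.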

 

What you need to submit:

  1. The source code for your mapper, reducer, driver, and combiner. Please name your driver class as YelpAverageReviewWithCombiner.java

  2. A document that compares the shuffle size and elapsed time of the job with and without the combiner.

Problem 3- Finding the pair of flights with the maximum number of cancellations per carrier (Optional, +5 pts)

 Description

If you want to get more practice with MapReduce and some extra points then this problem is for you. You are given a large dataset of flight arrival and departure details for all commercial flights within the USA between the years 1987-2000. The original dataset is about 5.5 GB extracted and is available at stat-computing.org/dataexpo/2009/the-data.html. Go to that page, download the files for 1987 to 2000, and copy them into a folder on your virtual machine. You can name the folder anything you want, for example, flightdata. The goal is to write a simple MapReduce program to find which origin/destination pair had the largest number of cancelled flights for each unique carrier.
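The goal above can be sketched as two aggregation steps: first count cancellations per (carrier, origin, destination), then keep the highest-count pair for each carrier. The following single-process sketch assumes each record has already been parsed into {carrier, origin, dest, cancelled} with cancelled equal to "1" for a cancelled flight; that record layout and the names MaxCancelledPair and worstPairPerCarrier are hypothetical, and extracting these fields from the CSV files is left to your mapper.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per carrier, find the origin-destination pair with the most cancelled flights.
class MaxCancelledPair {
    /** Each record is a hypothetical, already-parsed row: {carrier, origin, dest, cancelled}. */
    static Map<String, String> worstPairPerCarrier(String[][] records) {
        // Step 1: count cancellations per "carrier,origin-dest" composite key.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : records) {
            if ("1".equals(r[3])) {
                counts.merge(r[0] + "," + r[1] + "-" + r[2], 1, Integer::sum);
            }
        }
        // Step 2: for each carrier, keep the pair with the highest count.
        Map<String, String> bestPair = new HashMap<>();
        Map<String, Integer> bestCount = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] key = e.getKey().split(",", 2); // {carrier, "origin-dest"}
            if (e.getValue() > bestCount.getOrDefault(key[0], 0)) {
                bestCount.put(key[0], e.getValue());
                bestPair.put(key[0], key[1]);
            }
        }
        return bestPair;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"AA", "JFK", "LAX", "1"},
            {"AA", "JFK", "LAX", "1"},
            {"AA", "ORD", "DFW", "1"},
            {"UA", "SFO", "SEA", "1"},
        };
        System.out.println(worstPairPerCarrier(rows));
    }
}
```

When two pairs tie for a carrier, this sketch keeps whichever it encounters first, so decide on a tie-breaking rule if your output needs to be deterministic.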

What is the format of the dataset and how can you access it?

Each file contains flight information for a particular year. For example 2000.csv contains all flight information for year 2000.
