
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
  1. Hadoop MapReduce – Sampling a dataset (50 points)

Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test with that dataset. Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow.

To solve this problem you look to sampling, which is a statistical methodology for extracting a relevant subset of a population. In the context of MapReduce, sampling provides an opportunity to work with large datasets without the overhead of having to wait for the entire dataset to be read and processed.

TO-DO – In Hadoop MapReduce code:

  • Write an input format that wraps the actual input format used to read data. Your code should work with TextInputFormat.

  • The input format that you’ll write should be configured, via arguments, with the number of samples that should be extracted from the wrapped input (one possible wrapper is sketched after this list).

  • The input test data for this problem is the same as homework 1: cookbook_text/*.txt.
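The sketch below is one possible shape for such a wrapper, not the required solution. It assumes the sample count is passed to the job as a command-line argument and copied into the job Configuration; the key name sampling.records.per.split is illustrative, and capping the number of records per split is a simplification of "number of samples extracted from the wrapped input." Split calculation and record reading are delegated to TextInputFormat; the reader simply stops emitting records once the cap is reached.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Wraps TextInputFormat and caps how many records each split emits.
// The configuration key is a made-up name for illustration; a driver would
// set it from a command-line argument before submitting the job.
public class SamplingTextInputFormat extends InputFormat<LongWritable, Text> {

    public static final String SAMPLES_PER_SPLIT = "sampling.records.per.split";

    private final TextInputFormat wrapped = new TextInputFormat();

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Delegate split calculation entirely to the wrapped format.
        return wrapped.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        long limit = context.getConfiguration().getLong(SAMPLES_PER_SPLIT, 1000);
        return new SamplingRecordReader(wrapped.createRecordReader(split, context), limit);
    }

    // Delegates to the wrapped reader but stops after `limit` records.
    private static class SamplingRecordReader extends RecordReader<LongWritable, Text> {
        private final RecordReader<LongWritable, Text> inner;
        private final long limit;
        private long emitted;

        SamplingRecordReader(RecordReader<LongWritable, Text> inner, long limit) {
            this.inner = inner;
            this.limit = limit;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            inner.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (emitted >= limit) {
                return false; // sample quota reached: pretend the input is exhausted
            }
            boolean hasNext = inner.nextKeyValue();
            if (hasNext) {
                emitted++;
            }
            return hasNext;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return inner.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return inner.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return limit == 0 ? 1.0f : Math.min(1.0f, (float) emitted / limit);
        }

        @Override
        public void close() throws IOException {
            inner.close();
        }
    }
}

A driver would then call job.setInputFormatClass(SamplingTextInputFormat.class), copy the sample-count argument into the configuration, and point the job at cookbook_text/*.txt.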

  2. Spark DataFrames – Spark Scala, Spark Java, or PySpark (3 points each). A Spark Java sketch of one possible solution follows the list.

 

  1. Create a single dataframe from all CSV files in the zip, with header information

  2. Show the dataframe columns

  3. Show the first 20 rows, sorted by (capacity descending, model ascending)

  4. Count the total number of rows

  5. Count the total number of rows, grouped by capacity

  6. Get the dataframe summary statistics

  7. Select the following columns: date, model, capacity

  8. Select the number of distinct models

  9. Calculate the pairwise frequency of these two columns (i.e. a crosstab): capacity, smart_1_normalized

  10. Find the mean value of column capacity
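A minimal Spark Java sketch of items 1–10, assuming the CSV files from the zip have been extracted to a local data_Q1_2019/ directory (an assumed path). The column names date, model, capacity, and smart_1_normalized follow the assignment wording; adjust them to the actual CSV header if it differs.

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

public class DriveStatsDataFrames {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("DriveStatsDataFrames").getOrCreate();

        // 1. One dataframe from all CSV files, keeping header information.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data_Q1_2019/*.csv");                               // assumed extraction path

        System.out.println(Arrays.toString(df.columns()));                // 2. columns
        df.orderBy(col("capacity").desc(), col("model").asc()).show(20);  // 3. first 20 rows, sorted
        System.out.println(df.count());                                   // 4. total row count
        df.groupBy("capacity").count().show();                            // 5. row count per capacity
        df.describe().show();                                             // 6. summary statistics
        df.select("date", "model", "capacity").show();                    // 7. column projection
        System.out.println(df.select("model").distinct().count());        // 8. number of distinct models
        df.stat().crosstab("capacity", "smart_1_normalized").show();      // 9. pairwise frequency (crosstab)
        df.select(avg("capacity")).show();                                // 10. mean capacity

        spark.stop();
    }
}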

 

  3. Spark Anomaly Detection – Hard Drive Failures (60 points)

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers.

Anomalies can be broadly categorized as:

  1. Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent”.

  2. Contextual anomalies: The abnormality is context-specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

  3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is unexpectedly trying to copy data from a remote machine to a local host, an anomaly that would be flagged as a potential cyber attack.

TO-DO, in Spark Scala, Spark Java, or PySpark (** no Pandas **):

 Given the hard drive logs for 2019 Q1, implement a K-Nearest Neighbors (KNN) point anomaly detector for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

ETL: Computations/Transformations and data labeling (20 points): For an explanation of hard drive SMART attributes, refer to:

For an explanation of scripts required for these computations, refer to docs_Q1_2019.zip.

ANOMALY DETECTION: Machine Learning training, testing (40 points)

Implement KNN (supervised training) for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

For generating labels, use a threshold of 100 for a) and 2% for b).

Anomaly Detection Reference: https://www.datascience.com/blog/python-anomaly-detection
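Spark MLlib does not ship a KNN estimator, so the detector itself has to be hand-rolled or taken from a third-party package; the hedged Spark Java sketch below covers only the ETL/labeling step, applying the thresholds given above. It assumes the logs sit in data_Q1_2019/ (an assumed path), that each row is one drive-day, and that AFR is computed as failures / drive-days * 365 * 100; whether SMART 1 anomalies lie above or below the cutoff of 100 depends on how the attribute is interpreted, so the direction of that comparison is an assumption.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.sum;
import static org.apache.spark.sql.functions.when;

public class DriveFailureLabels {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("DriveFailureLabels").getOrCreate();
        Dataset<Row> logs = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data_Q1_2019/*.csv");   // assumed extraction path

        // a) Label SMART 1 point anomalies with the threshold of 100 from the assignment.
        //    The direction of the comparison (below 100 = anomalous) is an assumption.
        Dataset<Row> smart1Labeled = logs.withColumn("label",
                when(col("smart_1_normalized").lt(100), 1).otherwise(0));

        // b) Annualized failure rate per model. Each row is treated as one drive-day,
        //    so AFR = failures / drive_days * 365 * 100; models above 2% are labeled anomalous.
        Dataset<Row> afrByModel = logs.groupBy("model")
                .agg(count(lit(1)).alias("drive_days"),
                     sum("failure").alias("failures"))
                .withColumn("afr_percent",
                        col("failures").divide(col("drive_days")).multiply(365).multiply(100))
                .withColumn("label", when(col("afr_percent").gt(2.0), 1).otherwise(0));

        smart1Labeled.select("date", "model", "smart_1_normalized", "label").show();
        afrByModel.orderBy(col("afr_percent").desc()).show();
        spark.stop();
    }
}

These labeled dataframes would then feed the supervised KNN training and testing step.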

  4. Spark Bloom Filters and Broadcast Joins (50 points)

Suppose you are interested in records in one dataset, Table A, based on values in another dataset, Table B. Generally, an inner join is used as a form of filtering.

Consider, for example, a case where Table A has hundreds of millions of rows, while Table B has only a few thousand.

In cases like this, you might want to avoid the shuffle that the join operation introduces, especially if the dataset you want to use for filtering is significantly smaller than the main dataset on which you will perform your further computation.
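As a hedged illustration of both ideas, the Spark Java sketch below shows a plain broadcast join and then a Bloom-filter prefilter built from Table B’s keys. The parquet paths, the join column name key, and the sizing parameters (100,000 expected items, 1% false-positive rate) are placeholders; note that the Bloom-filter route only approximates the join, since false positives can let a few unmatched rows through.

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.sketch.BloomFilter;
import static org.apache.spark.sql.functions.broadcast;

public class FilterWithoutShuffle {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("FilterWithoutShuffle").getOrCreate();

        Dataset<Row> tableA = spark.read().parquet("table_a.parquet"); // large table (placeholder path)
        Dataset<Row> tableB = spark.read().parquet("table_b.parquet"); // small table (placeholder path)

        // Option 1: broadcast join. The small table is shipped to every executor,
        // so the large table is filtered where it sits and is never shuffled.
        Dataset<Row> joined = tableA.join(broadcast(tableB), "key");
        joined.count();

        // Option 2: Bloom filter. Build a compact membership sketch over Table B's keys,
        // broadcast it, and keep only the Table A rows that might have a match.
        BloomFilter keys = tableB.stat().bloomFilter("key", 100_000, 0.01);
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Broadcast<BloomFilter> keysBc = jsc.broadcast(keys);

        Dataset<Row> prefiltered = tableA.filter(
                (FilterFunction<Row>) row -> keysBc.value().mightContain(row.getAs("key")));
        prefiltered.count();

        spark.stop();
    }
}

The broadcast join already avoids shuffling Table A; the Bloom-filter variant is mainly useful when the later computation is not a join at all but still benefits from shrinking Table A first.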
