
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
  1. Hadoop MapReduce – Sampling a dataset (50 points)

Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test with that dataset. Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow.

To solve this problem you look to sampling, which is a statistical methodology for extracting a relevant subset of a population. In the context of MapReduce, sampling provides an opportunity to work with large datasets without the overhead of having to wait for the entire dataset to be read and processed.

TO-DO – In Hadoop MapReduce code:

  • Write an input format that wraps the actual input format used to read data. Your code should work with TextInputFormat.

  • The input format that you’ll write should be configured, via arguments, with the number of samples that should be extracted from the wrapped input (one possible wrapper is sketched after this list).

  • The input test data for this problem is the same as homework 1: cookbook_text/*.txt.
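The sketch below is one possible shape for such a wrapper, not the required solution. It assumes the sample count is passed to the job as a command-line argument and copied into the job Configuration; the key name sampling.records.per.split is illustrative, and capping the number of records per split is a simplification of "number of samples extracted from the wrapped input." Split calculation and record reading are delegated to TextInputFormat; the reader simply stops emitting records once the cap is reached.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Wraps TextInputFormat and caps how many records each split emits.
// The configuration key is a made-up name for illustration; a driver would
// set it from a command-line argument before submitting the job.
public class SamplingTextInputFormat extends InputFormat<LongWritable, Text> {

    public static final String SAMPLES_PER_SPLIT = "sampling.records.per.split";

    private final TextInputFormat wrapped = new TextInputFormat();

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Delegate split calculation entirely to the wrapped format.
        return wrapped.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        long limit = context.getConfiguration().getLong(SAMPLES_PER_SPLIT, 1000);
        return new SamplingRecordReader(wrapped.createRecordReader(split, context), limit);
    }

    // Delegates to the wrapped reader but stops after `limit` records.
    private static class SamplingRecordReader extends RecordReader<LongWritable, Text> {
        private final RecordReader<LongWritable, Text> inner;
        private final long limit;
        private long emitted;

        SamplingRecordReader(RecordReader<LongWritable, Text> inner, long limit) {
            this.inner = inner;
            this.limit = limit;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            inner.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (emitted >= limit) {
                return false; // sample quota reached: pretend the input is exhausted
            }
            boolean hasNext = inner.nextKeyValue();
            if (hasNext) {
                emitted++;
            }
            return hasNext;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return inner.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return inner.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return limit == 0 ? 1.0f : Math.min(1.0f, (float) emitted / limit);
        }

        @Override
        public void close() throws IOException {
            inner.close();
        }
    }
}

A driver would then call job.setInputFormatClass(SamplingTextInputFormat.class), copy the sample-count argument into the configuration, and point the job at cookbook_text/*.txt.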

  2. Spark DataFrames – Spark Scala, Spark Java, or PySpark (3 points each). A Spark Java sketch of one possible solution follows the list.

 

  1. Create a single dataframe from all CSV files in the zip, with header information

  2. Show the dataframe columns

  3. Show the first 20 rows, sorted by (capacity descending, model ascending)

  4. Count the total number of rows

  5. Count the total number of rows, grouped by capacity

  6. Get the dataframe summary statistics

  7. Select the following columns: date, model, capacity

  8. Select the number of distinct models

  9. Calculate the pairwise frequency of these two columns (i.e. a crosstab): capacity, smart_1_normalized

  10. Find the mean value of column capacity
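A minimal Spark Java sketch of items 1–10, assuming the CSV files from the zip have been extracted to a local data_Q1_2019/ directory (an assumed path). The column names date, model, capacity, and smart_1_normalized follow the assignment wording; adjust them to the actual CSV header if it differs.

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

public class DriveStatsDataFrames {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("DriveStatsDataFrames").getOrCreate();

        // 1. One dataframe from all CSV files, keeping header information.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data_Q1_2019/*.csv");                               // assumed extraction path

        System.out.println(Arrays.toString(df.columns()));                // 2. columns
        df.orderBy(col("capacity").desc(), col("model").asc()).show(20);  // 3. first 20 rows, sorted
        System.out.println(df.count());                                   // 4. total row count
        df.groupBy("capacity").count().show();                            // 5. row count per capacity
        df.describe().show();                                             // 6. summary statistics
        df.select("date", "model", "capacity").show();                    // 7. column projection
        System.out.println(df.select("model").distinct().count());        // 8. number of distinct models
        df.stat().crosstab("capacity", "smart_1_normalized").show();      // 9. pairwise frequency (crosstab)
        df.select(avg("capacity")).show();                                // 10. mean capacity

        spark.stop();
    }
}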

 

  3. Spark Anomaly Detection – Hard Drive Failures (60 points)

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers.

Anomalies can be broadly categorized as:

  1. Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent”.

  2. Contextual anomalies: The abnormality is context-specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

  3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is unexpectedly trying to copy data from a remote machine to a local host, an anomaly that would be flagged as a potential cyber attack.

TO-DO, in Spark Scala, Spark Java, or PySpark (** no Pandas **):

 Given the hard drive logs for 2019 Q1, implement a K-Nearest Neighbors (KNN) point anomaly detector for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

ETL: Computations/Transformations and data labeling (20 points): For an explanation of hard drive SMART attributes, refer to:

For an explanation of scripts required for these computations, refer to docs_Q1_2019.zip.

ANOMALY DETECTION: Machine Learning training, testing (40 points)

Implement KNN (supervised training) for:

a)      Normalized Read Error Rate, SMART attribute 1.

b)      Annualized Failure Rate (by model)

For generating labels, use a threshold of 100 for a) and 2% for b).

Anomaly Detection Reference: https://www.datascience.com/blog/python-anomaly-detection
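Spark MLlib does not ship a KNN estimator, so the detector itself has to be hand-rolled or taken from a third-party package; the hedged Spark Java sketch below covers only the ETL/labeling step, applying the thresholds given above. It assumes the logs sit in data_Q1_2019/ (an assumed path), that each row is one drive-day, and that AFR is computed as failures / drive-days * 365 * 100; whether SMART 1 anomalies lie above or below the cutoff of 100 depends on how the attribute is interpreted, so the direction of that comparison is an assumption.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.sum;
import static org.apache.spark.sql.functions.when;

public class DriveFailureLabels {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("DriveFailureLabels").getOrCreate();
        Dataset<Row> logs = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data_Q1_2019/*.csv");   // assumed extraction path

        // a) Label SMART 1 point anomalies with the threshold of 100 from the assignment.
        //    The direction of the comparison (below 100 = anomalous) is an assumption.
        Dataset<Row> smart1Labeled = logs.withColumn("label",
                when(col("smart_1_normalized").lt(100), 1).otherwise(0));

        // b) Annualized failure rate per model. Each row is treated as one drive-day,
        //    so AFR = failures / drive_days * 365 * 100; models above 2% are labeled anomalous.
        Dataset<Row> afrByModel = logs.groupBy("model")
                .agg(count(lit(1)).alias("drive_days"),
                     sum("failure").alias("failures"))
                .withColumn("afr_percent",
                        col("failures").divide(col("drive_days")).multiply(365).multiply(100))
                .withColumn("label", when(col("afr_percent").gt(2.0), 1).otherwise(0));

        smart1Labeled.select("date", "model", "smart_1_normalized", "label").show();
        afrByModel.orderBy(col("afr_percent").desc()).show();
        spark.stop();
    }
}

These labeled dataframes would then feed the supervised KNN training and testing step.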

  4. Spark Bloom Filters and Broadcast Joins (50 points)

Suppose you are interested in records in one dataset, Table A, based on values in another dataset, Table B. Generally, an inner join is used as a form of filtering.

Consider, for example, a case where Table A has hundreds of millions of rows, while Table B has only a few thousand.

In cases like this, you might want to avoid the shuffle that the join operation introduces, especially if the dataset you want to use for filtering is significantly smaller than the main dataset on which you will perform your further computation.
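As a hedged illustration of both ideas, the Spark Java sketch below shows a plain broadcast join and then a Bloom-filter prefilter built from Table B’s keys. The parquet paths, the join column name key, and the sizing parameters (100,000 expected items, 1% false-positive rate) are placeholders; note that the Bloom-filter route only approximates the join, since false positives can let a few unmatched rows through.

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.sketch.BloomFilter;
import static org.apache.spark.sql.functions.broadcast;

public class FilterWithoutShuffle {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("FilterWithoutShuffle").getOrCreate();

        Dataset<Row> tableA = spark.read().parquet("table_a.parquet"); // large table (placeholder path)
        Dataset<Row> tableB = spark.read().parquet("table_b.parquet"); // small table (placeholder path)

        // Option 1: broadcast join. The small table is shipped to every executor,
        // so the large table is filtered where it sits and is never shuffled.
        Dataset<Row> joined = tableA.join(broadcast(tableB), "key");
        joined.count();

        // Option 2: Bloom filter. Build a compact membership sketch over Table B's keys,
        // broadcast it, and keep only the Table A rows that might have a match.
        BloomFilter keys = tableB.stat().bloomFilter("key", 100_000, 0.01);
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Broadcast<BloomFilter> keysBc = jsc.broadcast(keys);

        Dataset<Row> prefiltered = tableA.filter(
                (FilterFunction<Row>) row -> keysBc.value().mightContain(row.getAs("key")));
        prefiltered.count();

        spark.stop();
    }
}

The broadcast join already avoids shuffling Table A; the Bloom-filter variant is mainly useful when the later computation is not a join at all but still benefits from shrinking Table A first.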
