Python Programming

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Instructions:

This lab is designed to teach you Spark MLlib with Python on the Databricks platform step by step, covering tools for data preprocessing, Machine Learning (ML) pipelines, several different machine learning algorithms, and corresponding ML applications.

This lab includes two tasks/mini-projects: 1) market basket analysis and recommendation, and 2) binary classification using logistic regression. Both are very helpful for your final project. More specifically, in this lab you will learn:

How to preprocess data with Spark MLlib

How to use machine learning pipelines for optimization with Spark MLlib

How to conduct advanced data analytics with Spark MLlib:

o Association rules

o Classification

o Hyperparameter tuning and cross-validation

o Market basket analysis

o Product recommendation

How to embed SQL queries in Spark on Databricks

In this lab, Databricks Runtime 7.3 ML or above is required for creating a new cluster to run Spark MLlib application programs.

 

Task 1 (40 points + Extra Credit 20 points): Market Basket Analysis and Recommendation

The goal of this task is to use Spark MLlib to build a model that generates association rules, run a market basket analysis to uncover associations between different items, and then provide purchase recommendations on a distributed platform.

Please follow the step-by-step instructions below to write the PySpark code with Spark MLlib to complete Task 1:

1.1 (2 points) Create a new notebook on Databricks, AIT614-Sec#_Lab5.1_YourName.ipynb, and write your full name and course# with section# on the notebook.

1.2 (3 points) Please read the Spark MLlib Main Guide - Frequent Pattern Mining, then copy the FP-Growth Python code example into the notebook and run it.
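For reference, the FP-Growth Python example from that guide is reproduced below (on Databricks the spark session is already created for you):

    from pyspark.ml.fpm import FPGrowth

    # Toy transactions: each row is an id plus a list of item ids.
    df = spark.createDataFrame([
        (0, [1, 2, 5]),
        (1, [1, 2, 3, 5]),
        (2, [1, 2])
    ], ["id", "items"])

    fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)

    # Display frequent itemsets.
    model.freqItemsets.show()

    # Display generated association rules.
    model.associationRules.show()

    # transform checks the input items against all association rules and
    # summarizes the consequents as the prediction.
    model.transform(df).show()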

Hint: The outputs should be:

 

1.3 (10 points) Create a training set. Create your own dataset for training; for example, it can contain different kinds of food items. Then display this dataset. The dataset should have at least 15 records/rows.
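A minimal sketch of such a training set (the food items below are only illustrative; invent your own and extend the list to at least 15 rows):

    # Each row is one transaction: an id plus the list of items bought together.
    # Note: FP-Growth requires that a basket contain no duplicate items.
    train_df = spark.createDataFrame([
        (0, ["bread", "milk", "eggs"]),
        (1, ["bread", "butter"]),
        (2, ["milk", "eggs", "cheese"]),
        (3, ["bread", "milk", "butter"]),
        (4, ["eggs", "cheese", "yogurt"]),
        # ... continue until the dataset has at least 15 transactions
    ], ["id", "items"])

    train_df.show(truncate=False)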

Hint: For example, the output of some rows could be:

 

1.4 (5 points) Use FP-Growth to build a model. You can set different minSupport and minConfidence values to test the model.
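A sketch of the model-building step, assuming the train_df from step 1.3 (the threshold values are illustrative; try several and compare the results):

    from pyspark.ml.fpm import FPGrowth

    # Lower minSupport keeps rarer itemsets; lower minConfidence keeps weaker rules.
    fp_growth = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.5)
    model = fp_growth.fit(train_df)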

1.5 (2 points) Display frequent itemsets.
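Displaying the itemsets is a one-liner on the fitted model:

    # Frequent itemsets found by FP-Growth, with their occurrence counts.
    model.freqItemsets.show(truncate=False)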

Hint: For example, the output could be:

1.6 (3 points) Display generated association rules.
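Likewise for the rules:

    # Rules of the form antecedent => consequent, with confidence, lift, and support.
    model.associationRules.show(truncate=False)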

Hint: For example, the output could be:

1.7 (5 points) Create a test set. Create your own dataset for testing; it should have at least 5 rows.
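A sketch of a matching test set (again, the items are illustrative; the schema must match the training set):

    # At least 5 baskets in the same (id, items) schema as the training set.
    test_df = spark.createDataFrame([
        (0, ["bread"]),
        (1, ["milk", "eggs"]),
        (2, ["cheese"]),
        (3, ["bread", "butter"]),
        (4, ["milk"])
    ], ["id", "items"])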

Hint: For example, the output could be:

1.8 (5 points) Make recommendations/predictions.
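Predictions come from transform(), which matches each basket against the association rules:

    # The prediction column lists the consequents of every rule whose
    # antecedent is contained in the basket, i.e. the recommended items.
    model.transform(test_df).show(truncate=False)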

Hint: For example, the output could be:

 

1.9 (5 points) List citations/references to acknowledge others' work in a professional manner.

1.10 (Extra Credit 20 points) Load a real-world dataset, split it into training and test sets, and then generate the corresponding outputs.
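One possible approach, assuming a text file with one comma-separated basket per line (the path and parsing below are hypothetical; adapt them to your dataset):

    from pyspark.sql.functions import array_distinct, monotonically_increasing_id, split

    # Hypothetical path; replace with the location of your uploaded file.
    raw_df = spark.read.text("/FileStore/tables/groceries.csv")

    # Parse each line into an item array; array_distinct removes duplicate
    # items, which FP-Growth does not allow within a basket.
    basket_df = raw_df.select(
        monotonically_increasing_id().alias("id"),
        array_distinct(split("value", ",")).alias("items")
    )

    # Random 80/20 split into training and test sets.
    train_real, test_real = basket_df.randomSplit([0.8, 0.2], seed=42)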

 

Task 2 (60 points): Binary Classification using Logistic Regression

 

The goal of this task is to build a machine learning pipeline, including a classification model that predicts Attrition (Yes or No) from the features included in the dataset we used in Labs 3 and 4 (income, work years, education level, marital status, job role, and so on).

Please follow the step-by-step instructions below and the Databricks Spark MLlib tutorial and code examples, Getting started with MLlib, to write the PySpark code with MLlib to complete Task 2:

2.1 (2 points) Create a new notebook on Databricks, AIT614-Sec#_Lab5.2_YourName.ipynb, and write your full name and course# with section# on the notebook.

2.2 (5 points) Please follow the Lab 4 instructions to read the data file EmployeeAttrition.csv into the notebook and display the dataset schema.

Hint: Please add inferSchema="true" when reading the file. The code snippet should be similar to:
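A sketch of the read, assuming the file location from Lab 4 (the path below is illustrative):

    # inferSchema lets Spark detect numeric vs. string column types.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/FileStore/tables/EmployeeAttrition.csv"))

    df.printSchema()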

2.3 (5 points) Randomly split the data into training and test sets. Display the distribution of HourlyRate using summary() and of Education using groupBy() in the training set. Hint: The outputs are similar to:
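A sketch of the split and the two displays (the 70/30 ratio and the seed are illustrative):

    # Random split; fixing a seed makes the split reproducible.
    trainDF, testDF = df.randomSplit([0.7, 0.3], seed=42)

    # count, mean, stddev, min, quartiles, and max of HourlyRate.
    trainDF.select("HourlyRate").summary().show()

    # Number of training rows per Education level.
    trainDF.groupBy("Education").count().show()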

2.4 (12 points) Feature preprocessing.

2.4.1 (5 points) In the displayed schema, you can find the categorical features with string data type. Select the five categorical features "Department", "EducationField", "Gender", "JobRole", and "MaritalStatus" and convert them to index vectors using StringIndexer() only.
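A sketch of the indexing step; Spark 3.0+ (Databricks Runtime 7.3 ML) lets one StringIndexer handle all five columns:

    from pyspark.ml.feature import StringIndexer

    categorical_cols = ["Department", "EducationField", "Gender",
                        "JobRole", "MaritalStatus"]

    # Map each string category to a numeric index, e.g. JobRole -> JobRoleIndex.
    string_indexer = StringIndexer(
        inputCols=categorical_cols,
        outputCols=[c + "Index" for c in categorical_cols]
    )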

 

 

2.4.2 (2 points) Set the Attrition feature (Yes or No) as the label.
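The label can be indexed the same way:

    # Map the Attrition strings (Yes/No) to a numeric "label" column
    # (indices are assigned by frequency).
    label_indexer = StringIndexer(inputCol="Attrition", outputCol="label")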

 

2.4.3 (3 points) Select these numeric features: "Age", "DailyRate", "Education", "DistanceFromHome", "HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager", "NumCompaniesWorked", "PerformanceRating", "EnvironmentSatisfaction".

2.4.4 (2 points) Combine all feature columns into a single feature vector.
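A sketch combining the indexed categorical columns with the numeric columns from 2.4.3 into one vector:

    from pyspark.ml.feature import VectorAssembler

    numeric_cols = ["Age", "DailyRate", "Education", "DistanceFromHome",
                    "HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction",
                    "MonthlyIncome", "YearsAtCompany", "YearsInCurrentRole",
                    "YearsWithCurrManager", "NumCompaniesWorked",
                    "PerformanceRating", "EnvironmentSatisfaction"]

    # All model inputs end up in a single "features" vector column.
    assembler = VectorAssembler(
        inputCols=[c + "Index" for c in categorical_cols] + numeric_cols,
        outputCol="features"
    )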

2.5 (2 points) Define the model. Use a logistic regression model, LogisticRegression().
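With default hyperparameters this is a single line:

    from pyspark.ml.classification import LogisticRegression

    # Binary classifier over the assembled features and the indexed label.
    lr = LogisticRegression(featuresCol="features", labelCol="label")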

2.6 (8 points) Build the pipeline. Define a pipeline using Pipeline(), fit it on the training set, and then apply the fitted pipeline model to the test dataset. Note that this step may take a few minutes. Then display the predictions from the model.
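A sketch of the pipeline, reusing the stages defined above (the four stages are the categorical indexer, the label indexer, the assembler, and the logistic regression):

    from pyspark.ml import Pipeline

    pipeline = Pipeline(stages=[string_indexer, label_indexer, assembler, lr])

    # Fit on the training set, then score the test set.
    pipeline_model = pipeline.fit(trainDF)
    predictions = pipeline_model.transform(testDF)

    predictions.select("label", "prediction", "probability").show(5, truncate=False)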

Hint: Note that there are only FOUR stages in the pipeline. The output is similar to:

 
