Instructions:
This lab is designed to teach you Spark MLlib with Python on the Databricks platform step by step, covering tools for data preprocessing, machine learning (ML) pipelines, several different machine learning algorithms, and corresponding ML applications.
This lab includes two tasks/mini-projects: 1) market basket analysis and recommendation, and 2) binary classification using logistic regression. Both are very helpful for your final project. More specifically, in this lab, you will learn:
• How to preprocess data with Spark MLlib
• How to use machine learning pipelines for optimization with Spark MLlib
• How to conduct advanced data analytics with Spark MLlib
o Association rules
o Classification
o Hyperparameter tuning and cross validation
o Market basket analysis
o Product recommendation
• How to embed SQL queries in Spark on Databricks
In this lab, Databricks Runtime 7.3 ML or above is required for creating a new cluster to run Spark MLlib application programs.
Task 1 (40 points + Extra Credit 20 points): Market Basket Analysis and Recommendation
The goal of this task is to use Spark MLlib to build a model that generates association rules for market basket analysis, uncovering associations between different items, and then to provide purchase recommendations on a distributed platform.
Please follow the step-by-step instructions below to write the PySpark code with Spark MLlib to complete Task 1:
1.1 (2 points) Create a new notebook on Databricks, AIT614-Sec#_Lab5.1_YourName.ipynb, and write your full name and course# with section# on the notebook.
1.2 (3 points) Please read Spark MLlib Main Guide - Frequent Pattern Mining, copy the FP-Growth Python code example into the notebook, and run it.
Hint: The outputs should be:
1.3 (10 points) Create a training set. Build your own dataset for training; for example, you can create different kinds of food items. Then display this dataset. The dataset should have >= 15 records/rows.
Hint: For example, the output of some rows could be:
1.4 (5 points) Use FP-Growth to build a model. Try different minSupport and minConfidence values to test the model.
1.5 (2 points) Display frequent itemsets.
Hint: For example, the output could be:
1.6 (3 points) Display generated association rules.
Hint: For example, the output could be:
1.7 (5 points) Create a test set. Create your own dataset for testing. The rows of the dataset should be >= 5.
Hint: For example, the output could be:
1.8 (5 points) Make recommendation/prediction.
Hint: For example, the output could be:
1.9 (5 points) List citations/references to credit others' work in a professional manner.
1.10 (Extra Credit 20 points) Load a real-world dataset, split it into training and test sets, and then generate the corresponding outputs.
Task 2 (60 points): Binary Classification using Logistic Regression
The goal of this task is to build a machine learning pipeline including a classification model that predicts `Attrition` (Yes or No) from the features included in the dataset (income, working years, education level, marital status, job role, and so on), which we used in Labs 3 and 4.
Please follow the step-by-step instructions below and the Databricks Spark MLlib tutorials and code examples (Getting started with MLlib) to write the PySpark code with MLlib to complete Task 2:
2.1 (2 points) Create a new notebook on Databricks, AIT614-Sec#_Lab5.2_YourName.ipynb, and write your full name and course# with section# on the notebook.
2.2 (5 points) Please follow the Lab 4 instruction to read the data file EmployeeAttrition.csv into the notebook and display the dataset schema.
Hint: Please add inferSchema="true" when reading the file. The code snippet should be similar to:
2.3 (5 points) Randomly split the data into training and test sets. Display the distribution of HourlyRate using summary() and of Education using groupBy() in the training set. Hint: The outputs are similar to:
2.4 (12 points) Feature preprocessing.
2.4.1 (5 points) In the displayed schema, you can find the categorical features with string data type. Select five categorical features "Department", "EducationField", "Gender", "JobRole", and "MaritalStatus" and convert them to index values using StringIndexer() only.
2.4.2 (2 points) Set the Attrition feature (Yes or No) as the label.
2.4.3 (3 points) Select these numeric features: "Age", "DailyRate", "Education", "DistanceFromHome", "HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager", "NumCompaniesWorked", "PerformanceRating", "EnvironmentSatisfaction"
2.4.4 (2 points) Combine all feature columns into a single feature vector.
2.5 (2 points) Define the model. Use a logistic regression model LogisticRegression().
2.6 (8 points) Build the pipeline. Define a pipeline using Pipeline(), fit it on the training set, and then apply the pipeline model to the test dataset. Note that this step may take a few minutes. Then display the predictions from the model.
Hint: Note that there are only FOUR stages in the pipeline. The output is similar to: