This question is inspired by Exercise 3 in Chapter 4 of the textbook. On Canvas, you will see a CSV file named "A2_Q1.csv". You will use this dataset for this question. In our version of the dataset, the levels of the annual income target variable have been renamed as low, mid, and high.
For this question, you will build a simple decision tree with depth 1 using this dataset for predicting the annual income target feature using the Gini Index split criterion. You will present your results as Pandas data frames.
You are allowed to use any Python code available on our website here. In fact, we recommend that you use some of this code.
Compute the impurity of the target feature.
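As a rough illustration of the impurity calculation, the sketch below computes the Gini index, 1 minus the sum of squared class proportions, for a categorical column. The helper name `gini_index` and the toy values are assumptions made for illustration only; they are not taken from A2_Q1.csv or from the website code.

```python
import pandas as pd


def gini_index(target: pd.Series) -> float:
    """Gini index of a categorical series: 1 - sum(p_i ** 2)."""
    # Proportion of each level in the column.
    probs = target.value_counts(normalize=True)
    return float(1.0 - (probs ** 2).sum())


# Toy target column (made-up values, NOT the A2_Q1.csv data).
toy_target = pd.Series(["low", "low", "mid", "high"])
toy_impurity = gini_index(toy_target)
print(round(toy_impurity, 4))
```

For the toy column the proportions are 0.5, 0.25, and 0.25, so the result is 1 - (0.25 + 0.0625 + 0.0625) = 0.625.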
In this part, you will determine the root node for your decision tree. Please refer to the Chapter 4 slides on Canvas and Part (c) of this exercise question in the textbook for handling the continuous Age feature.
Your answer to this part needs to be a Pandas data frame and it needs to be called "df_splits". Also, it needs to have the following columns:
You can populate this data frame line by line by referring to Cell 6 in our Pandas tutorial.
You will populate and display your df_splits data frame. As an example for your "df_splits" data frame, consider the spam prediction example in Table 4.2 in the textbook on page 121, which was also covered in lectures. The df_splits data frame would look something like the table below.
Split | Remainder | Information_Gain | Is_Optimal |
---|---|---|---|
suspicious words | ? | ? | True |
unknown sender | ? | ? | False |
contains images | ? | ? | False |
PART B CLARIFICATION: In your "df_splits" data frame, please do not bundle all age splits together and call it "Age". Rather, please have a separate row for each age threshold value that qualifies as a split candidate. Please name these splits as "Age_{YY}" where {YY} represents a numerical age threshold (in years). For example, if there are only two threshold values that qualify as split candidates, say 20 and 30, you would add two rows to "df_splits" with Split values of "Age_20" and "Age_30". Naturally, you will need to populate the rest of the columns with correct values for these two rows.
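One possible way to assemble such a table is sketched below on a tiny made-up dataset. Several things here are assumptions rather than requirements: the column names `age`, `education`, and `income`, the helper names, and the rule used for threshold candidates (midpoints between adjacent sorted age values whose target levels differ). Please check the Chapter 4 slides for the exact rule you are expected to follow.

```python
import pandas as pd


def gini_index(target: pd.Series) -> float:
    """Gini index of a categorical series: 1 - sum(p_i ** 2)."""
    probs = target.value_counts(normalize=True)
    return float(1.0 - (probs ** 2).sum())


def remainder(target: pd.Series, partition: pd.Series) -> float:
    """Weighted average impurity of the partitions induced by `partition`."""
    weights = partition.value_counts(normalize=True)
    return float(sum(w * gini_index(target[partition == level])
                     for level, w in weights.items()))


def candidate_thresholds(feature: pd.Series, target: pd.Series) -> list:
    """Midpoints between adjacent sorted values whose target levels differ
    (assumed rule; ties in the feature are simply skipped)."""
    ordered = pd.DataFrame({"x": feature, "y": target}).sort_values("x")
    xs, ys = ordered["x"].tolist(), ordered["y"].tolist()
    found = set()
    for i in range(1, len(xs)):
        if ys[i] != ys[i - 1] and xs[i] != xs[i - 1]:
            found.add((xs[i] + xs[i - 1]) / 2)
    return sorted(found)


# Toy data with hypothetical column names (NOT the A2_Q1.csv data).
toy = pd.DataFrame({
    "age": [20, 24, 30, 40],
    "education": ["hs", "uni", "hs", "uni"],
    "income": ["low", "low", "high", "high"],
})

impurity = gini_index(toy["income"])
rows = []

# Categorical feature: one row for the whole feature.
rem = remainder(toy["income"], toy["education"])
rows.append({"Split": "education", "Remainder": rem,
             "Information_Gain": impurity - rem})

# Continuous feature: one row per candidate threshold, named Age_{YY}.
for t in candidate_thresholds(toy["age"], toy["income"]):
    rem = remainder(toy["income"], toy["age"] >= t)
    rows.append({"Split": f"Age_{t:g}", "Remainder": rem,
                 "Information_Gain": impurity - rem})

df_splits_toy = pd.DataFrame(rows)
# Flag the split(s) with the largest information gain.
best_gain = df_splits_toy["Information_Gain"].max()
df_splits_toy["Is_Optimal"] = df_splits_toy["Information_Gain"] == best_gain
print(df_splits_toy)
```

Building a list of dictionaries and converting it once at the end is only one option; adding rows one at a time, as in the Pandas tutorial, works equally well.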
In this part, you will assume the Education descriptive feature is at the root node (NOTE: This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the annual income target variable.
Your answer to this part needs to be a Pandas data frame and it needs to be called "df_prediction". Also, it needs to have the following columns:
Assuming that Education is the root node, you will populate and display your df_prediction data frame. As an example, continuing the spam prediction problem, assume the suspicious words descriptive feature is at the root node. The df_prediction data frame would look something like the table below.
Leaf_Condition | Spam_Prob | Ham_Prob | Leaf_Prediction |
---|---|---|---|
suspicious words == true | ? | ? | ? |
suspicious words == false | ? | ? | ? |
HINT: Your df_prediction data frame should have only 3 rows.
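The leaf probabilities are simply the relative frequencies of each target level within a partition, and the leaf prediction is the most frequent level. A minimal sketch on made-up data follows; the column names `education` and `income`, the probability column names (`Low_Prob` and so on), and the two-level toy feature are all assumptions for illustration.

```python
import pandas as pd

# Toy data with hypothetical column names (NOT the A2_Q1.csv data).
toy = pd.DataFrame({
    "education": ["hs", "hs", "hs", "uni", "uni", "uni"],
    "income": ["low", "low", "mid", "high", "high", "low"],
})

levels = ["low", "mid", "high"]
rows = []
for value, part in toy.groupby("education"):
    # Relative frequency of every target level within this leaf;
    # levels that do not occur in the leaf get probability 0.
    probs = (part["income"].value_counts(normalize=True)
             .reindex(levels, fill_value=0.0))
    row = {"Leaf_Condition": f"education == '{value}'"}
    for level in levels:
        row[f"{level.capitalize()}_Prob"] = probs[level]
    # Majority level; idxmax picks the first level in `levels` on ties.
    row["Leaf_Prediction"] = probs.idxmax()
    rows.append(row)

df_prediction_toy = pd.DataFrame(rows)
print(df_prediction_toy)
```

The toy feature has only two levels, so this table has two rows.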
For marking purposes, please display your "df_splits" and "df_prediction" data frames (in separate cells!) as the last thing in your notebook for this question. Thank you.
This question is inspired by Exercise 6 in Chapter 8 of the textbook. On Canvas, you will see a CSV file named "A2_Q2.csv". You will use this dataset for this question.
Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.
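A hedged sketch of the thresholding and cross-tabulation step is given below. The column names `target` and `score`, the string encoding of the levels, and the convention that a score greater than or equal to the threshold counts as a positive prediction are assumptions; please check them against A2_Q2.csv and the lecture material.

```python
import pandas as pd

# Toy data with hypothetical column names (NOT the A2_Q2.csv data).
toy = pd.DataFrame({
    "target": ["true", "true", "false", "false", "true"],
    "score": [0.9, 0.4, 0.6, 0.2, 0.5],
})

threshold = 0.5
# Assumed convention: score >= threshold is predicted as "true".
toy["prediction"] = (toy["score"] >= threshold).map({True: "true", False: "false"})

confusion = pd.crosstab(toy["target"], toy["prediction"],
                        rownames=["Actual"], colnames=["Predicted"])
# Put the positive level first and keep any empty cells as 0.
order = ["true", "false"]
confusion = confusion.reindex(index=order, columns=order, fill_value=0)
print(confusion)
```

With the toy values, the table has 2 true positives, 1 false negative, 1 false positive, and 1 true negative.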
Compute the following 5 metrics:
You will need to display your answers as a Pandas data frame called "df_metrics" (with 5 rows, one row for each metric) with the following 2 columns:
Marking Note for Part B: If your confusion matrix is incorrect, you will not get full credit for a correct follow-through.
By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. You will need to display your answers as a Pandas data frame called "df_roc" with the following columns:
HINT: Your df_roc data frame should have 9 rows. For this part and the next, you might find Cells #20 and #21 useful in our SK4 Tutorial here.
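The threshold sweep can be sketched as follows on made-up data. The column names `Threshold`, `TPR`, and `FPR`, the input column names, and the `score >= threshold` convention are assumptions; use the column names required by the assignment.

```python
import pandas as pd


def tpr_fpr(target: pd.Series, score: pd.Series, threshold: float,
            positive: str = "true") -> tuple:
    """True positive rate and false positive rate at one score threshold."""
    predicted_pos = score >= threshold   # assumed convention
    actual_pos = target == positive
    tp = (predicted_pos & actual_pos).sum()
    fp = (predicted_pos & ~actual_pos).sum()
    return float(tp / actual_pos.sum()), float(fp / (~actual_pos).sum())


# Toy data with hypothetical column names (NOT the A2_Q2.csv data).
toy = pd.DataFrame({
    "target": ["true", "true", "false", "false", "true"],
    "score": [0.9, 0.4, 0.6, 0.2, 0.5],
})

# 0.1, 0.2, ..., 0.9; built from integers to avoid floating-point drift.
thresholds = [round(0.1 * k, 1) for k in range(1, 10)]

rows = []
for t in thresholds:
    tpr, fpr = tpr_fpr(toy["target"], toy["score"], t)
    rows.append({"Threshold": t, "TPR": tpr, "FPR": fpr})

df_roc_toy = pd.DataFrame(rows)
print(df_roc_toy)
```

Generating the thresholds from integers keeps the values exactly at one decimal place, which matters when a score sits right on a threshold.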
Using your answer in the above part, display an ROC curve with appropriate axes labels and a title.
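A minimal plotting sketch with Matplotlib is shown below. The helper name `plot_roc`, the column names `FPR` and `TPR`, and the toy values are assumptions for illustration; the dashed diagonal is an optional reference line for a random classifier.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_roc(df_roc: pd.DataFrame, fpr_col: str = "FPR", tpr_col: str = "TPR"):
    """Plot TPR against FPR with labelled axes and a title."""
    # Sort so the line is drawn from left to right.
    ordered = df_roc.sort_values([fpr_col, tpr_col])
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.plot(ordered[fpr_col], ordered[tpr_col], marker="o", label="Classifier")
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Random guess")
    ax.set_xlabel("False Positive Rate (FPR)")
    ax.set_ylabel("True Positive Rate (TPR)")
    ax.set_title("ROC Curve")
    ax.legend(loc="lower right")
    return ax


# Toy values (made up, NOT results from A2_Q2.csv).
toy_roc = pd.DataFrame({
    "Threshold": [0.2, 0.5, 0.8],
    "FPR": [1.0, 0.5, 0.0],
    "TPR": [1.0, 0.75, 0.25],
})
ax = plot_roc(toy_roc)
```

In a Jupyter notebook the figure is displayed when the cell finishes; in a plain script you would call `plt.show()` afterwards.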
For marking purposes, please display your "df_metrics" and "df_roc" data frames (in separate cells!) as the last thing in your notebook right before your plot for Part D. Thank you.