- {StudentID}_A2_Q1.html
- {StudentID}_A2_Q1.ipynb
- {StudentID}_A2_Q2.html
- {StudentID}_A2_Q2.ipynb
For this assignment, please follow the additional instructions below:
- Please keep your answers concise and to-the-point. For instance, please do not copy & paste excessive material from Excel (if that is what you are using).
- You must write all your narrative in your Jupyter notebook and explain every step in clear English.
- You must pay attention to good presentation practices with section headers, correct spelling, etc.
- At the end, for each question part, you must present your solutions as either appropriate text, or as the requested Pandas data frames in your Jupyter notebook. If you are using Excel, you can populate these Pandas data frames from Excel output.
This question is inspired by Exercise 3 in Chapter 4 of the textbook. On Canvas, you will see a CSV file named "A2_Q1.csv". You will use this dataset for this question. In our version of the dataset, the levels of the annual income target variable have been renamed low, mid, and high.
For this question, you will use this dataset to build a simple decision tree of depth 1 that predicts the annual income target feature, using the Gini Index as the split criterion. You will present your results as Pandas data frames.
You may use any Python code available on our website. In fact, you are encouraged to use some of this code.
Compute the impurity of the target feature.
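As a minimal sketch of the calculation, the Gini index of a set of target values is one minus the sum of the squared level proportions. The function name `gini_index` and the toy target values below are illustrative assumptions, not the contents of "A2_Q1.csv".

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a sequence of target levels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical target values for illustration only (not the assignment data).
toy_target = ["low", "low", "mid", "high"]
impurity = gini_index(toy_target)
print(round(impurity, 4))
```

A pure set (all instances sharing one level) has an impurity of 0, which is a quick sanity check for any implementation.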
In this part, you will determine the root node for your decision tree. Please refer to the Chapter 4 slides on Canvas and Part (c) of this exercise question in the textbook for handling the Age continuous feature.
Your answer to this part needs to be a Pandas data frame and it needs to be called "df_splits". Also, it needs to have the following columns:
You can populate this data frame line by line by referring to Cell 6 in our Pandas tutorial.
You will populate and display your df_splits data frame. As an example for your "df_splits" data frame, consider the spam prediction example in Table 4.2 in the textbook on page 121, which was also covered in lectures. The df_splits data frame would look something like the table below.
Split | Remainder | Information_Gain | Is_Optimal |
---|---|---|---|
suspicious words | ? | ? | True |
unknown sender | ? | ? | False |
contains images | ? | ? | False |
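
One possible way to fill such a table is sketched below: for each candidate feature, compute the remainder (the weighted average Gini index of the partitions it induces), subtract it from the impurity of the full dataset to get the information gain, and flag the row with the largest gain. The toy data, the feature names `f1` and `f2`, and the helper names are all hypothetical; they are not the values from Table 4.2 or from the assignment data.

```python
import pandas as pd


def gini_index(labels):
    """Gini index of a pandas Series of target levels."""
    proportions = labels.value_counts(normalize=True)
    return 1.0 - float((proportions ** 2).sum())


def remainder(df, feature, target):
    """Weighted average Gini index of the partitions induced by `feature`."""
    total = 0.0
    for _, part in df.groupby(feature):
        total += len(part) / len(df) * gini_index(part[target])
    return total


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "f1": ["yes", "yes", "no", "no"],
    "f2": ["yes", "no", "yes", "no"],
    "target": ["spam", "spam", "ham", "ham"],
})

overall_impurity = gini_index(toy["target"])

# Collect one row per candidate split, then build the data frame in one go.
rows = []
for feature in ["f1", "f2"]:
    rem = remainder(toy, feature, "target")
    rows.append({
        "Split": feature,
        "Remainder": rem,
        "Information_Gain": overall_impurity - rem,
    })

df_splits = pd.DataFrame(rows)
best_gain = df_splits["Information_Gain"].max()
df_splits["Is_Optimal"] = df_splits["Information_Gain"] == best_gain
print(df_splits)
```

Collecting rows as dictionaries and constructing the data frame once is only one option; appending rows one at a time is equally valid.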
PART B CLARIFICATION: In your "df_splits" data frame, please do not bundle all age splits together and call it "Age". Rather, please have a separate row for each age threshold value that qualifies as a split candidate. Please name these splits "Age_{YY}", where {YY} represents a numerical age threshold (in years). For example, if only two threshold values qualify as split candidates, say 20 and 30, you would add two rows to "df_splits" with Split values of "Age_20" and "Age_30". Naturally, you will need to populate the rest of the columns with the correct values for these two rows.
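A common way to find candidate thresholds for a continuous feature, sketched here as an assumption rather than as the required method, is to sort the instances by that feature and take the midpoint between adjacent instances whose target levels differ. The function name `candidate_thresholds` and the toy ages below are hypothetical; check the Chapter 4 slides and the textbook for the exact rule you are expected to follow.

```python
import pandas as pd


def candidate_thresholds(df, feature, target):
    """Midpoints between adjacent sorted instances with different target levels."""
    ordered = df.sort_values(feature)
    values = ordered[feature].tolist()
    labels = ordered[target].tolist()
    thresholds = []
    for i in range(1, len(values)):
        # A boundary only exists where both the target and the value change.
        if labels[i] != labels[i - 1] and values[i] != values[i - 1]:
            midpoint = (values[i - 1] + values[i]) / 2
            if midpoint not in thresholds:
                thresholds.append(midpoint)
    return thresholds


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Age": [40, 18, 50, 22, 30],
    "target": ["mid", "low", "high", "low", "mid"],
})

thresholds = candidate_thresholds(toy, "Age", "target")
split_names = [f"Age_{t:g}" for t in thresholds]
print(split_names)
```

Each threshold then defines a binary partition (for example, Age below the threshold versus Age at or above it) that can be scored like any other split. Instances that share the same age but differ in target level need extra care, which this sketch does not handle.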
In this part, you will assume the Education descriptive feature is at the root node (NOTE: This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the annual income target variable.
Your answer to this part needs to be a Pandas data frame and it needs to be called "df_prediction". Also, it needs to have the following columns:
Assuming that Education is the root node, you will populate and display your df_prediction data frame. As an example, continuing the spam prediction problem, assume the suspicious words descriptive feature is at the root node. The df_prediction data frame would look something like the table below.
Leaf_Condition | Spam_Prob | Ham_Prob | Leaf_Prediction |
---|---|---|---|
suspicious words == true | ? | ? | ? |
suspicious words == false | ? | ? | ? |
HINT: Your df_prediction data frame should have only 3 rows.
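For a depth-1 tree, each level of the root feature becomes a leaf, and the leaf's class probabilities are simply the target proportions among the instances that reach it. The sketch below illustrates this with `pd.crosstab(..., normalize="index")` on hypothetical toy data; the column name `feature` and its levels are assumptions, and the output columns mirror the spam example above rather than the columns required for this assignment.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "feature": ["a", "a", "a", "b", "b", "c"],
    "target": ["spam", "spam", "ham", "ham", "ham", "spam"],
})

# Row-normalised counts: one row per leaf, one column per target level.
probs = pd.crosstab(toy["feature"], toy["target"], normalize="index")

df_prediction = pd.DataFrame({
    "Leaf_Condition": [f"feature == '{level}'" for level in probs.index],
    "Spam_Prob": probs["spam"].to_numpy(),
    "Ham_Prob": probs["ham"].to_numpy(),
    # idxmax picks the most probable level; ties go to the first column.
    "Leaf_Prediction": probs.idxmax(axis=1).to_numpy(),
})
print(df_prediction)
```

With three target levels, as in this assignment, the same approach would yield one probability column per level.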
For marking purposes, please display your "df_splits" and "df_prediction" data frames (in separate cells!) as the last thing in your notebook for this question. Thank you.
This question is inspired by Exercise 6 in Chapter 8 of the textbook. On Canvas, you will see a CSV file named "A2_Q2.csv". You will use this dataset for this question.
Assume true is the positive target level. Using a score threshold of 0.5, work out the confusion matrix (using pd.crosstab() or as a Jupyter notebook table) with appropriate row and column labels.
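A minimal sketch of this step with `pd.crosstab()` follows. The column names `target` and `score`, the toy values, and the choice to predict "true" when the score is greater than or equal to the threshold are all assumptions; check the actual column names in "A2_Q2.csv" and the convention used in the course materials.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the A2_Q2.csv data).
toy = pd.DataFrame({
    "target": ["true", "true", "true", "false", "false", "false"],
    "score": [0.95, 0.65, 0.35, 0.75, 0.25, 0.15],
})

threshold = 0.5
# Assumption: a score at or above the threshold is predicted as "true".
toy["prediction"] = (toy["score"] >= threshold).map({True: "true", False: "false"})

confusion = pd.crosstab(
    toy["target"],
    toy["prediction"],
    rownames=["Actual"],
    colnames=["Predicted"],
)
print(confusion)
```

Rows are the actual levels and columns the predicted levels, so `confusion.loc["true", "true"]` is the true positive count.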
Compute the following 5 metrics:
You will need to display your answers as a Pandas data frame called "df_metrics" (with 5 rows, one row for each metric) with the following 2 columns:
Marking Note for Part B: If your confusion matrix is incorrect, you will not get full credit for a correct follow-through.
By varying the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute TPR and FPR values. You will need to display your answers as a Pandas data frame called "df_roc" with the following columns:
HINT: Your df_roc data frame should have 9 rows. For this part and the next, you might find Cells #20 and #21 in our SK4 Tutorial useful.
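The threshold sweep can be sketched as a loop that recomputes the true positive rate (TP / actual positives) and false positive rate (FP / actual negatives) at each threshold. The toy data, the input column names `target` and `score`, and the output column names `Threshold`, `TPR`, and `FPR` are assumptions for illustration; use the column names the assignment specifies.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the A2_Q2.csv data).
toy = pd.DataFrame({
    "target": ["true", "true", "true", "false", "false", "false"],
    "score": [0.95, 0.65, 0.35, 0.75, 0.25, 0.15],
})

is_positive = toy["target"] == "true"
n_positive = is_positive.sum()
n_negative = (~is_positive).sum()

rows = []
for k in range(1, 10):
    threshold = k / 10  # 0.1, 0.2, ..., 0.9 without accumulating float error
    predicted_positive = toy["score"] >= threshold  # assumed convention
    tp = (predicted_positive & is_positive).sum()
    fp = (predicted_positive & ~is_positive).sum()
    rows.append({
        "Threshold": threshold,
        "TPR": tp / n_positive,
        "FPR": fp / n_negative,
    })

df_roc = pd.DataFrame(rows)
print(df_roc)
```

Generating thresholds as `k / 10` rather than repeatedly adding 0.1 keeps the values exact enough that a score sitting right on a threshold is classified consistently.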
Using your answer to the previous part, display an ROC curve with appropriate axis labels and a title.
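A minimal plotting sketch with Matplotlib is shown below, assuming a data frame with hypothetical `FPR` and `TPR` columns. The values are made up for illustration, and the diagonal reference line is an optional addition rather than a stated requirement.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical TPR/FPR values for illustration only.
df_roc = pd.DataFrame({
    "Threshold": [0.1, 0.3, 0.5, 0.7, 0.9],
    "TPR": [1.0, 1.0, 0.67, 0.33, 0.33],
    "FPR": [1.0, 0.33, 0.33, 0.33, 0.0],
})

fig, ax = plt.subplots()
# FPR goes on the x-axis and TPR on the y-axis.
ax.plot(df_roc["FPR"], df_roc["TPR"], marker="o", label="Classifier")
# Diagonal line: the expected performance of random guessing.
ax.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
ax.set_title("ROC Curve")
ax.legend()
```

In a Jupyter notebook the figure is normally rendered when the cell finishes; in a plain script, a call to `plt.show()` would be needed.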
For marking purposes, please display your "df_metrics" and "df_roc" data frames (in separate cells!) as the last thing in your notebook right before your plot for Part D. Thank you.
www.featureranking.com