Model Understanding: The objective of this report is to build a model that predicts whether a loan was approved or denied. The outcome serves as the binary response variable (Denied = 1 for a denial, Denied = 0 for an approval) that the models aim to predict, and several metrics are evaluated to gauge the capabilities of different models, including ROC curves, lift curves, and the confusion matrix. This report also examines ensemble modeling, which combines the outputs of several distinct models into a single averaged prediction for better accuracy.
Data Understanding: The data set contained over 166,000 records. Certain variables were excluded from model building: continuous variables with too many missing values and nominal variables with only one level. For example, the year and state variables only contained the values 2016 and LA (Louisiana), respectively. STATE_FIPS was excluded for the same reason, as it is simply a two-digit code representing Louisiana. RateSpread_num was also excluded because it contained over 150,000 missing values, too many to impute; several other continuous variables were candidates for imputation because they had only a small number of missing values. HOEPADescription had only two levels (non-HOEPA and HOEPA loans), and 99.97% of all loans in the data set were non-HOEPA, so this variable was also excluded. Several variables, such as race and loan type, were binned. For example, VA and FHA loans had very similar Denied = 1 rates (see Figure 5 for the differences in the response variable by loan type), which led to the creation of three bins for that variable. The entire data set was then split for validation in the following proportions: 60% training, 30% validation, and 10% test. A sketch of this preparation in code follows below.
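The report's preparation was performed in JMP, but as an illustration, the sketch below shows an equivalent workflow in Python with pandas and scikit-learn. The file name and column names (e.g., LoanType) are hypothetical stand-ins for the actual HMDA extract.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for the Louisiana 2016 HMDA extract.
df = pd.read_csv("hmda_la_2016.csv")

# Drop single-level variables and the mostly-missing rate spread.
df = df.drop(columns=["STATE", "YEAR", "STATE_FIPS",
                      "RateSpread_num", "HOEPADescription"])

# Bin loan type: VA and FHA had similar Denied = 1 rates, so group them.
loan_bins = {"VA": "VA/FHA", "FHA": "VA/FHA",
             "Conventional": "Conventional", "FSA/RHS": "FSA/RHS"}
df["LoanTypeBin"] = df["LoanType"].map(loan_bins)

# 60/30/10 split: carve off training, then split the remaining 40% as 3:1.
train, rest = train_test_split(df, train_size=0.60, random_state=1)
valid, test = train_test_split(rest, train_size=0.75, random_state=1)  # 0.75 * 0.40 = 0.30
```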
Analysis: After cleaning the data set and generating the validation column, four model types were created and compared for prediction: logistic regression, boosted tree, k-nearest neighbor (k-NN), and neural network.
Confusion Matrix: The confusion matrices for each model are shown in Figure 1. The false positive rate is the percentage of truly approved (Denied = 0) cases incorrectly predicted as Denied = 1, and the false negative rate is the percentage of truly denied (Denied = 1) cases incorrectly predicted as Denied = 0. Sensitivity and specificity indicate the model's ability to correctly identify true positives (Denied = 1) and true negatives (Denied = 0), respectively. In other words, sensitivity is the percentage of truly denied cases that the model predicted as denied, and specificity is the percentage of truly approved cases that the model predicted as not denied. The overall error is calculated as the sum of all incorrect predictions divided by the total number of cases.
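The sketch below computes these rates directly from 0/1 label arrays, using the Denied = 1 coding from this report; the function name and inputs are illustrative, not JMP output.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Rates described above, from 0/1 arrays (1 = Denied, 0 = Not Denied)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # correctly predicted denials
    tn = np.sum((y_pred == 0) & (y_true == 0))   # correctly predicted approvals
    fp = np.sum((y_pred == 1) & (y_true == 0))   # approvals predicted as denials
    fn = np.sum((y_pred == 0) & (y_true == 1))   # denials predicted as approvals
    return {
        "sensitivity":         tp / (tp + fn),
        "specificity":         tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "overall_error":       (fp + fn) / (tp + tn + fp + fn),
    }
```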
ROC Curve: The ROC curve is a metric that allows easy comparison between models. The classification models discussed in the Confusion Matrix section all predict the probability that an observation belongs to the target class Denied = 1 (with the exception of k-NN). A cutoff value (set to 0.5 in this case) is then chosen to decide whether that probability (which lies between 0 and 1) is high enough to generate a Denied = 1 prediction. For example, if a model generates a value of 0.51 for a data point, that case is predicted as Denied = 1 because the value exceeds the 0.5 cutoff. A comparison of all ROC curves is shown in Figure 2. The diagonal straight line represents a default prediction method that assigns classes at random. The area under each model's curve is the AUC (area under the curve); a random classifier has an AUC of 0.5, and the higher the AUC above that baseline, the better the model is at predicting Denied = 1 relative to random classification. An ROC curve closer to the top left of the graph is more desirable. Before the ensemble model was created, the model with the highest AUC was the boosted neural network, and the model with the lowest AUC was the logistic regression, meaning it performed worst of all the models relative to random predictions. Figure 6 shows the cumulative gains curve, which indicates the percentage of Denied = 1 cases "gained" by targeting a given percentage of the total cases. For each of the models, including the ensemble, around 85% of the Denied = 1 loans are gained within the first 50% of the data.
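For readers reproducing the curve outside JMP, the following sketch builds an ROC curve and AUC with scikit-learn; the arrays here are random placeholders standing in for validation labels and a model's predicted Denied = 1 probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder validation data; in practice these come from the fitted models.
rng = np.random.default_rng(1)
y_valid = rng.integers(0, 2, size=1000)                  # true Denied = 0/1 labels
prob_denied = np.clip(y_valid * 0.3 + rng.uniform(size=1000) * 0.7, 0, 1)

fpr, tpr, cutoffs = roc_curve(y_valid, prob_denied)      # points on the ROC curve
auc = roc_auc_score(y_valid, prob_denied)                # area under the curve

# The 0.5 cutoff described above turns probabilities into class predictions.
pred_class = (prob_denied >= 0.5).astype(int)
print(f"AUC = {auc:.3f}")
```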
Lift Curve: The lift curve is another method of model comparison. Like the ROC curve, it compares the prediction capabilities of multiple models to a random classification, represented here by a baseline lift of 1 (selecting cases at random). A "good" lift curve begins well above 1 on the left side of the graph and then falls steeply toward 1 on the right side. Figure 3 shows the lift curves for the prediction models, and each model exhibits the qualities of a good lift curve. One note about the logistic regression lift curve, shown in red: it starts above 1 but not as high as the other three curves, which begin at values above 3, whereas the regression curve begins around a value of 2.
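As a minimal sketch of a cumulative lift calculation, assuming 0/1 labels and predicted Denied = 1 probabilities: JMP produces this plot directly, so the decile logic below is purely illustrative.

```python
import numpy as np

def lift_curve(y_true, prob_denied, n_bins=10):
    """Cumulative lift per decile: the Denied = 1 rate among the top-scored
    cases divided by the overall Denied = 1 rate (baseline lift of 1)."""
    order = np.argsort(prob_denied)[::-1]            # highest scores first
    y_sorted = np.asarray(y_true)[order]
    overall_rate = y_sorted.mean()
    lifts = []
    for k in range(1, n_bins + 1):
        top = y_sorted[: int(len(y_sorted) * k / n_bins)]
        lifts.append(top.mean() / overall_rate)      # lift at the k-th decile
    return lifts
```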
Conclusion: Figure 4 shows the measures of fit across validation levels for each model developed for this report, except the k-NN, for which JMP does not produce these statistics. This is because the k-NN model does not generate a probability, nor does it rely on a probability formula to create its predictions. Instead, this type of model compares every data point to every other data point, hence the concern about long computation times. The logistic regression model had the highest sensitivity of all, but also the highest false positive rate at 42%; with this type of model, there is a concern of overfitting. Overfitting is also an issue with decision tree models, but the boosted tree platform has penalty settings to protect against that effect. The boosted tree model had the lowest sensitivity at 25%, making it the worst at correctly identifying true Denied = 1 cases. Neural networks have an advantage over regression models because they can capture complex relationships between predictors. With all of that said, the logic behind ensemble modeling is simple: there is usually a benefit to combining multiple different models for prediction instead of relying on just one. Because each model has its own limitations, averaging the models together helps to increase prediction accuracy. Based on the metrics of the ensemble model produced for this report, it is the best candidate for prediction purposes. Although it does not have the highest AUC or sensitivity, it has the second highest of each, along with the lowest false positive rate, and it matches the k-NN and boosted neural network on false negative rate and overall error rate. It offers a good balance of all evaluation metrics, instead of being superior on only one.
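As a sketch of the averaging described above, the snippet below combines hypothetical predicted probabilities from three of the models (k-NN is omitted since, as noted, it does not produce a probability) and applies the 0.5 cutoff; the arrays are random placeholders, not the report's fitted models.

```python
import numpy as np

# Hypothetical Denied = 1 probabilities from three fitted models on the same cases.
rng = np.random.default_rng(7)
p_logistic, p_boosted_tree, p_neural_net = rng.uniform(size=(3, 1000))

# Simple ensemble: average the three probabilities, then apply the cutoff.
p_ensemble = np.column_stack([p_logistic, p_boosted_tree, p_neural_net]).mean(axis=1)
pred_ensemble = (p_ensemble >= 0.5).astype(int)
```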