You have been retained as a data scientist. Suppose you have collected a dataset of already-classified instances from the UCI Machine Learning Repository (excluding IRIS), and you have to build an ensemble-type classifier.
Ensembles can give you a boost in accuracy on your dataset. You can create ensembles of machine learning algorithms in R; there are three main techniques (Boosting, Bagging and Stacking) with which you can do so.
The three most popular methods for combining the predictions from different models are:
• Bagging: Building multiple models (typically of the same type) from different subsamples of the training dataset.
• Boosting: Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
• Stacking: Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
You can combine the predictions of multiple caret models using the caretEnsemble package. Given a list of caret models, the caretStack() function can be used to specify a higher-order model that learns how to best combine the predictions of the sub-models, as sketched below.
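The snippet below is a minimal sketch of this two-step pattern, assuming a training data frame named training with a factor outcome Class; the two placeholder sub-models and the GLM meta-learner are illustrative assumptions, not requirements of the assignment.

library(caret)
library(caretEnsemble)

# Shared resampling settings; class probabilities are saved so that the
# meta-learner can combine them
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = "final", classProbs = TRUE)

# Step 1: train a list of caret sub-models under a common resampling scheme
models <- caretList(Class ~ ., data = training,
                    trControl = control,
                    methodList = c("rpart", "knn"))   # placeholder sub-models

# Step 2: caretStack() fits a higher-order model (here a GLM) on the
# sub-model predictions
stack <- caretStack(models, method = "glm",
                    trControl = trainControl(method = "cv", number = 10))
print(stack)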
This assignment is focused on Bagging and Stacking and on how you can continue to ratchet up the accuracy of the models on your own datasets.
The base machine learning algorithms for Bagging that will be examined in this assignment are:
• Bagged CART
• Random Forest
The base machine learning algorithms for Stacking that will be examined in this assignment are:
• Classification and Regression Trees (CART)
• K-Nearest Neighbors (KNN)
• Naïve Bayes (NB)
You need to select a dataset of your own choice (e.g. you may use a dataset already used in the lab, or one drawn from your literature review) for the purposes of building, training and validating the above types of classifier (Bagging, Stacking). With the aid of an R package, visualise and justify the properties of the selected dataset. [15 Marks]
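As an illustration of what this summarisation and visualisation step might look like, the sketch below uses the Sonar data from the mlbench package purely as a placeholder; substitute the UCI dataset you actually select.

library(mlbench)   # provides several UCI-style datasets
library(caret)

data(Sonar)        # placeholder dataset; replace with your own choice
dataset <- Sonar

dim(dataset)           # number of instances and attributes
str(dataset)           # attribute types
summary(dataset)       # per-attribute distributions
table(dataset$Class)   # class balance

# Box plots of a few attributes grouped by class
featurePlot(x = dataset[, 1:4], y = dataset$Class, plot = "box")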
Assuming we have collected one large dataset of already-classified instances, you need to look into methods of forming training and test sets from this single dataset in R as described below.
The process of splitting the data into k folds can be repeated a number of times; this is called Repeated k-fold Cross-Validation (repeatedcv). The final model accuracy is taken as the mean across the repeats. [10 Marks]
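A minimal sketch of this resampling setup is shown below; the 80/20 hold-out split, the fold count, the repeat count and the random seed are illustrative assumptions.

library(caret)
set.seed(7)

# Hold out a test set, stratified on the class label
index    <- createDataPartition(dataset$Class, p = 0.80, list = FALSE)
training <- dataset[index, ]
testing  <- dataset[-index, ]

# Repeated 10-fold cross-validation (3 repeats) for model training;
# the reported accuracy is the mean across all 30 resamples
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)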
You need to construct, train and test a Bagging-type classifier in R, based on the Bagged CART and Random Forest base classifiers. Train and test the Bagging classifier using the training and test sets generated with the method tried as part of the 2nd Task. [20 Marks]
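A sketch of the Bagging-type classifiers is given below, assuming the training, testing and control objects from the previous sketch; "treebag" and "rf" are the standard caret method identifiers for Bagged CART and Random Forest ("treebag" requires the ipred package).

library(caret)
set.seed(7)

# Bagged CART
fit_treebag <- train(Class ~ ., data = training, method = "treebag",
                     trControl = control)

# Random Forest
fit_rf <- train(Class ~ ., data = training, method = "rf",
                trControl = control)

# Compare the cross-validated accuracy of the two bagging models
results <- resamples(list(treebag = fit_treebag, rf = fit_rf))
summary(results)

# Evaluate both models on the held-out test set
confusionMatrix(predict(fit_treebag, testing), testing$Class)
confusionMatrix(predict(fit_rf, testing), testing$Class)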
You need to construct, train and test a Stacking-type classifier in R, based on the CART, KNN and NB base classifiers. Train and test your Stacking classifier using the training and test sets generated with the method tried as part of the 2nd Task. [25 Marks]
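The sketch below outlines the Stacking-type classifier, again assuming the training and testing objects defined earlier; the GLM meta-learner and the random seed are illustrative assumptions. Note that classProbs = TRUE requires the class levels to be valid R names, and the "nb" method requires the klaR package.

library(caret)
library(caretEnsemble)
set.seed(7)

stack_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              savePredictions = "final", classProbs = TRUE)

# Base learners: CART, KNN and Naive Bayes
base_models <- caretList(Class ~ ., data = training,
                         trControl = stack_control,
                         methodList = c("rpart", "knn", "nb"))

# Check that the base learners are not too highly correlated before stacking
modelCor(resamples(base_models))

# Combine the base learners with a GLM meta-learner
stack_glm <- caretStack(base_models, method = "glm",
                        trControl = trainControl(method = "cv", number = 10))
print(stack_glm)

# Predictions on the held-out test set (class labels in caretEnsemble 2.x;
# newer versions may return class probabilities instead)
test_pred <- predict(stack_glm, newdata = testing)
confusionMatrix(test_pred, testing$Class)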
For each ensemble-type classifier, calculate and display the following performance-related metrics in R. Critically comment on the importance of each metric for each ensemble-type classifier. Use the ROCR library (library(ROCR)); a sketch follows the list below.
1. Confusion matrix
2. Precision vs. Recall
3. Accuracy
4. ROC (receiver operating characteristic) curve
5. AUC (area under the ROC curve)
6. Training time
7. Testing time
8. Based on the above metrics, briefly discuss how we can increase the reliability and consistency of the data classification task at hand. [30 Marks]
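The sketch below illustrates how these metrics could be produced, assuming a two-class problem, the fitted model fit_rf from the Bagging sketch, and the training, testing and control objects defined earlier; the same pattern applies to the other models.

library(caret)
library(ROCR)

# 1-3. Confusion matrix, precision, recall and accuracy on the test set
pred_class <- predict(fit_rf, testing)
confusionMatrix(pred_class, testing$Class, mode = "prec_recall")

# 2, 4-5. Precision vs. recall curve, ROC curve and AUC via ROCR.
# Column 2 of the probability matrix is P(second factor level), which ROCR
# treats as the positive class by default.
prob      <- predict(fit_rf, testing, type = "prob")[, 2]
rocr_pred <- prediction(prob, testing$Class)

plot(performance(rocr_pred, measure = "prec", x.measure = "rec"),
     main = "Precision vs. Recall")
plot(performance(rocr_pred, measure = "tpr", x.measure = "fpr"),
     main = "ROC curve")
performance(rocr_pred, measure = "auc")@y.values[[1]]   # AUC

# 6-7. Training and testing time
train_time <- system.time(
  train(Class ~ ., data = training, method = "rf", trControl = control)
)
test_time <- system.time(predict(fit_rf, testing))
train_time
test_time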
The Coursework will be marked based on the following marking criteria:
• Data set summary of main properties (5 marks)
• Visualisation in R of main data set properties (5 marks)
• Feature selection (5 marks)
Formation of training and test sets in R using the methods below:
• Repeated CV for the Bagging-type classifier (5 marks)
• Repeated CV for the Stacking-type classifier (5 marks)
• Building of the Random Forest classifier in R (5 marks)
• Building of the Bagged CART classifier in R (5 marks)
• Testing of the Bagging-type classifier (10 marks)
• Building of the CART classifier (for Stacking) in R (5 marks)
• Building of the Naïve Bayes classifier in R (5 marks)
• Building of the K-NN classifier in R (5 marks)
• Testing of the Stacking-type classifier in R (10 marks)
• Confusion matrix estimation (4 marks)
• Precision vs. Recall estimation (4 marks)
• Accuracy estimation (4 marks)
• ROC (receiver operating characteristic) curve plot (3 marks)
• AUC (area under the ROC curve) (3 marks)
• Training time (3 marks)
• Testing time (3 marks)
• Based on the above metrics, briefly discuss how we can increase the reliability and consistency of the data classification task at hand (6 marks)