# Naive Bayes classifier on Windows dataset

INSTRUCTIONS TO CANDIDATES

We will fit a naive Bayes classifier to classify emails into two categories: those pertaining to the X Window System (Y = 1), which is common in Unix operating systems, and those pertaining to the Microsoft Windows operating system (Y = 2). The dataset windows.RData can be found in the Project folder on Moodle. We adopt a bag-of-words model to extract features of an email, i.e. X = (X1, X2, . . . , Xd) ∈ {0, 1}^d is a binary vector denoting the absence (Xi = 0) or presence (Xi = 1) of each word in the object vocab. These features and the corresponding class labels (last column) are stored in dataset.

1 Handling dataset

1. We will use the first N = 900 emails as our training dataset and the last M = 900 emails as our testing dataset. Implement this split and store the training and testing datasets as training and testing, respectively. [2 marks]
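
A minimal R sketch of this split. The object name `dataset` follows the description above; since windows.RData is not available here, a synthetic stand-in replaces `load("windows.RData")`:

```r
# Assumed layout: one row per email, binary word features, label in last column.
# Synthetic stand-in for the real data loaded via load("windows.RData"):
set.seed(1)
dataset <- data.frame(matrix(rbinom(1800 * 4, 1, 0.5), nrow = 1800),
                      Y = rep(1:2, each = 900))

N <- 900   # size of the training set (first N emails)
M <- 900   # size of the testing set (last M emails)
training <- dataset[1:N, ]
testing  <- dataset[(nrow(dataset) - M + 1):nrow(dataset), ]
```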

2. What class of object is vocab? [1 mark]

3. How many words are there in the bag of words? [1 mark]

4. Which positions in the bag of words correspond to the words “retrieval” and “subject”? [2 marks]

5. Compute the sample proportions of training emails that pertain to the X Window System and to Microsoft Windows. [2 marks]

6. For training emails that pertain to the X Window System, compute the sample proportions of these emails that contain the words “retrieval” and “subject”. [3 marks]

7. For training emails that pertain to Microsoft Windows, compute the sample proportions of these emails that contain the words “retrieval” and “subject”. [3 marks]
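
Questions 1.2–1.7 reduce to basic indexing. A sketch, assuming `vocab` is a character vector aligned with the feature columns and the last column of `training` holds the labels (the tiny synthetic data here is a stand-in, not the real dataset):

```r
# Synthetic stand-in: a 4-word vocabulary and a small labelled training set
vocab <- c("retrieval", "subject", "window", "file")
training <- data.frame(X1 = c(0, 1, 0, 1), X2 = c(1, 1, 0, 1),
                       X3 = c(1, 0, 1, 0), X4 = c(0, 0, 1, 1),
                       Y  = c(1, 1, 2, 2))

class(vocab)                                  # Question 1.2: class of the object
length(vocab)                                 # Question 1.3: size of the bag of words
pos <- which(vocab %in% c("retrieval", "subject"))  # Question 1.4: word positions

# Question 1.5: sample proportions of each class in the training set
mean(training$Y == 1)
mean(training$Y == 2)

# Questions 1.6-1.7: within-class proportions of emails containing each word
colMeans(training[training$Y == 1, pos])
colMeans(training[training$Y == 2, pos])
```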

2 Fitting model

1. Discuss whether the assumption that the features X = (X1, X2, . . . , Xd) are conditionally independent given the class label is sensible. [2 marks]

2. Using the naiveBayes function in the e1071 package, fit a naive Bayes classifier on the training dataset. [2 marks]

3. Using the output of the naiveBayes function, what is the estimated prior distribution of the classes? Does this agree with your results in Question 1.5? [2 marks]

4. Using the output of the naiveBayes function, what are the estimated class conditional distributions of the features corresponding to the words “retrieval” and “subject”? Do these agree with your results in Questions 1.6–1.7? [3 marks]
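
A sketch of Questions 2.2–2.4 with e1071 (synthetic data again stands in for the real training set). Note that the binary features should be coded as factors; otherwise naiveBayes treats them as Gaussian:

```r
library(e1071)

# Synthetic stand-in: two binary word features coded as factors, labels in Y
set.seed(1)
training <- data.frame(X1 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       X2 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       Y  = factor(rep(1:2, each = 10)))

model <- naiveBayes(Y ~ ., data = training)

model$apriori / sum(model$apriori)  # estimated prior over classes (Question 2.3)
model$tables$X1                     # conditional distribution of feature X1 (Question 2.4)
```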

5. Given the estimates obtained in Question 2.4, show analytically that the naive Bayes classifier will deem both classes impossible (i.e. assign each probability zero) if the word “retrieval” is present in a testing email. Explain your arguments clearly. [3 marks]

6. Given the estimates obtained in Question 2.4, show analytically that the naive Bayes classifier will deem both classes impossible (i.e. assign each probability zero) if the word “subject” is not present in a testing email. Explain your arguments clearly. [3 marks]
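
The key step in both arguments is the naive Bayes factorisation (writing hatted symbols for estimated probabilities):

```latex
\hat{P}(Y = y \mid X = x) \;\propto\; \hat{P}(Y = y) \prod_{i=1}^{d} \hat{P}(X_i = x_i \mid Y = y), \qquad y \in \{1, 2\}.
```

If a single factor is estimated as zero for both classes, e.g. \(\hat{P}(X_j = 1 \mid Y = y) = 0\) for the position \(j\) of “retrieval” and both \(y = 1, 2\), then the whole product vanishes for both classes whenever \(x_j = 1\), so neither class receives positive probability. The argument for “subject” in Question 2.6 is symmetric, with \(\hat{P}(X_j = 0 \mid Y = y) = 0\) and \(x_j = 0\).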

7. The behavior described in Questions 2.5–2.6 is known as overfitting. Explain why the naive Bayes classifier overfits the words “retrieval” and “subject”. [2 marks]

8. Investigate whether other words in vocab suffer from overfitting. Provide a short summary of your findings. [5 marks]
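
One way to carry out this investigation is to scan the fitted conditional tables for zero entries. A self-contained sketch (the synthetic data below deliberately contains a word, “retrieval”, that never occurs, so its conditional probability is estimated as zero):

```r
library(e1071)

# Synthetic stand-in: 4 binary word features; X1 ("retrieval") never occurs
set.seed(1)
vocab <- c("retrieval", "subject", "window", "file")
training <- data.frame(X1 = factor(rep(0, 20), levels = 0:1),
                       X2 = factor(rbinom(20, 1, 0.9), levels = 0:1),
                       X3 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       X4 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       Y  = factor(rep(1:2, each = 10)))
model <- naiveBayes(Y ~ ., data = training)

# Flag every feature whose conditional table contains a zero probability
zero_prob <- sapply(model$tables, function(tab) any(tab == 0))
vocab[zero_prob]   # candidate overfitted words
```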

9. To mitigate the difficulties mentioned in Questions 2.5–2.6, one could consider prediction with modified probabilities, i.e. replace probabilities that are zero with a small positive threshold. Implement this prediction rule by specifying the threshold argument in the predict function. Compute the misclassification rates for thresholds 10^-9, 10^-6, 10^-3 on the testing dataset, and explain which threshold we should prefer. [3 marks]
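
A sketch of this comparison. Synthetic training/testing data stand in for the real split; predict.naiveBayes in e1071 exposes a `threshold` argument that replaces (near-)zero cell probabilities before the product is taken:

```r
library(e1071)

# Synthetic stand-in for the training/testing split
set.seed(1)
make_data <- function(n) {
  data.frame(X1 = factor(rbinom(n, 1, 0.3), levels = 0:1),
             X2 = factor(rbinom(n, 1, 0.7), levels = 0:1),
             Y  = factor(rep(1:2, each = n / 2)))
}
training <- make_data(100)
testing  <- make_data(100)
model <- naiveBayes(Y ~ ., data = training)

# Misclassification rate on the testing set for each threshold
for (t in c(1e-9, 1e-6, 1e-3)) {
  pred <- predict(model, testing, threshold = t)
  cat("threshold", t, "misclassification rate", mean(pred != testing$Y), "\n")
}
```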

10. A principled approach to deal with overfitting, without modifying probabilities as considered in Question 2.9, is to learn model parameters using Bayesian inference. A specific instance of Bayesian inference in this setting amounts to adding the following four artificial data points to the training dataset:

(i) an email classified as X Window System that contains none of the words in the bag of words;

(ii) an email classified as X Window System that contains all the words in the bag of words;

(iii) an email classified as Microsoft Windows that contains none of the words in the bag of words;

(iv) an email classified as Microsoft Windows that contains all the words in the bag of words;

and fitting a naive Bayes classifier on the augmented dataset. Implement this approach and compute the misclassification rate on the testing dataset. [6 marks]
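
A sketch of the augmentation on a small synthetic training set (two features here; for the real data the artificial rows would span all d features):

```r
library(e1071)

# Synthetic stand-in: binary features coded as factors, labels in Y
set.seed(1)
training <- data.frame(X1 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       X2 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       Y  = factor(rep(1:2, each = 10)))

# The four artificial emails: all-zeros and all-ones for each class
artificial <- data.frame(X1 = factor(c(0, 1, 0, 1), levels = 0:1),
                         X2 = factor(c(0, 1, 0, 1), levels = 0:1),
                         Y  = factor(c(1, 1, 2, 2)))
augmented <- rbind(training, artificial)

model <- naiveBayes(Y ~ ., data = augmented)

# Every conditional probability is now strictly positive
all(sapply(model$tables, function(tab) all(tab > 0)))
```

For binary features this augmentation adds one count to every (class, level) cell, so the resulting conditional tables match `naiveBayes(Y ~ ., data = training, laplace = 1)`; the only difference is a small shift in the estimated class priors.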
