# Naive Bayes classifier on Windows dataset

INSTRUCTIONS TO CANDIDATES

We will fit a naive Bayes classifier to classify emails into two categories: those pertaining to the X Window System (Y = 1), which is common in Unix operating systems, and those pertaining to the Microsoft Windows operating system (Y = 2). The dataset windows.RData can be found in the Project folder on Moodle. We adopt a bag-of-words model to extract features of an email, i.e. X = (X1, X2, . . . , Xd) ∈ {0, 1}^d is a binary vector denoting the absence (Xi = 0) or presence (Xi = 1) of each word in the object vocab. These features and the corresponding class labels (last column) are stored in dataset.

1 Handling dataset

1. We will use the first N = 900 emails as our training dataset and the last M = 900 emails as our testing dataset. Implement this split and store the training and testing datasets as training and testing, respectively. [2 marks]
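
A minimal R sketch of this split. The object name `dataset` follows the description above; since windows.RData is not available here, a synthetic stand-in replaces `load("windows.RData")`:

```r
# Assumed layout: one row per email, binary word features, label in last column.
# Synthetic stand-in for the real data loaded via load("windows.RData"):
set.seed(1)
dataset <- data.frame(matrix(rbinom(1800 * 4, 1, 0.5), nrow = 1800),
                      Y = rep(1:2, each = 900))

N <- 900   # size of the training set (first N emails)
M <- 900   # size of the testing set (last M emails)
training <- dataset[1:N, ]
testing  <- dataset[(nrow(dataset) - M + 1):nrow(dataset), ]
```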

2. What class of object is vocab? [1 mark]

3. How many words are there in the bag of words? [1 mark]

4. Which positions in the bag of words correspond to the words “retrieval” and “subject”? [2 marks]

5. Compute the sample proportions of training emails that pertain to the X Window System and to Microsoft Windows. [2 marks]

6. For training emails that pertain to the X Window System, compute the sample proportions of these emails that contain the words “retrieval” and “subject”. [3 marks]

7. For training emails that pertain to Microsoft Windows, compute the sample proportions of these emails that contain the words “retrieval” and “subject”. [3 marks]
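
Questions 1.2–1.7 reduce to basic indexing. A sketch, assuming `vocab` is a character vector aligned with the feature columns and the last column of `training` holds the labels (the tiny synthetic data here is a stand-in, not the real dataset):

```r
# Synthetic stand-in: a 4-word vocabulary and a small labelled training set
vocab <- c("retrieval", "subject", "window", "file")
training <- data.frame(X1 = c(0, 1, 0, 1), X2 = c(1, 1, 0, 1),
                       X3 = c(1, 0, 1, 0), X4 = c(0, 0, 1, 1),
                       Y  = c(1, 1, 2, 2))

class(vocab)                                  # Question 1.2: class of the object
length(vocab)                                 # Question 1.3: size of the bag of words
pos <- which(vocab %in% c("retrieval", "subject"))  # Question 1.4: word positions

# Question 1.5: sample proportions of each class in the training set
mean(training$Y == 1)
mean(training$Y == 2)

# Questions 1.6-1.7: within-class proportions of emails containing each word
colMeans(training[training$Y == 1, pos])
colMeans(training[training$Y == 2, pos])
```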

2 Fitting model

1. Discuss whether the assumption that the features X = (X1, X2, . . . , Xd) are conditionally independent given the class label is sensible. [2 marks]

2. Using the naiveBayes function in the e1071 package, fit a naive Bayes classifier on the training dataset. [2 marks]

3. Using the output of the naiveBayes function, what is the estimated prior distribution of the classes? Does this agree with your results in Question 1.5? [2 marks]

4. Using the output of the naiveBayes function, what are the estimated class conditional distributions of the features corresponding to the words “retrieval” and “subject”? Do these agree with your results in Questions 1.6–1.7? [3 marks]
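
A sketch of Questions 2.2–2.4 with e1071 (synthetic data again stands in for the real training set). Note that the binary features should be coded as factors; otherwise naiveBayes treats them as Gaussian:

```r
library(e1071)

# Synthetic stand-in: two binary word features coded as factors, labels in Y
set.seed(1)
training <- data.frame(X1 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       X2 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       Y  = factor(rep(1:2, each = 10)))

model <- naiveBayes(Y ~ ., data = training)

model$apriori / sum(model$apriori)  # estimated prior over classes (Question 2.3)
model$tables$X1                     # conditional distribution of feature X1 (Question 2.4)
```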

5. Given the estimates obtained in Question 2.4, show analytically that the naive Bayes classifier will deem both classes impossible (i.e. assign each probability zero) if the word “retrieval” is present in a testing email. Explain your arguments clearly. [3 marks]

6. Given the estimates obtained in Question 2.4, show analytically that the naive Bayes classifier will deem both classes impossible (i.e. assign each probability zero) if the word “subject” is not present in a testing email. Explain your arguments clearly. [3 marks]
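
The key step in both arguments is the naive Bayes factorisation (writing hatted symbols for estimated probabilities):

```latex
\hat{P}(Y = y \mid X = x) \;\propto\; \hat{P}(Y = y) \prod_{i=1}^{d} \hat{P}(X_i = x_i \mid Y = y), \qquad y \in \{1, 2\}.
```

If a single factor is estimated as zero for both classes, e.g. \(\hat{P}(X_j = 1 \mid Y = y) = 0\) for the position \(j\) of “retrieval” and both \(y = 1, 2\), then the whole product vanishes for both classes whenever \(x_j = 1\), so neither class receives positive probability. The argument for “subject” in Question 2.6 is symmetric, with \(\hat{P}(X_j = 0 \mid Y = y) = 0\) and \(x_j = 0\).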

7. The behavior described in Questions 2.5–2.6 is known as overfitting. Explain why the naive Bayes classifier overfits the words “retrieval” and “subject”. [2 marks]

8. Investigate whether other words in vocab suffer from overfitting. Provide a short summary of your findings. [5 marks]
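
One way to carry out this investigation is to scan the fitted conditional tables for zero entries. A self-contained sketch (the synthetic data below deliberately contains a word, “retrieval”, that never occurs, so its conditional probability is estimated as zero):

```r
library(e1071)

# Synthetic stand-in: 4 binary word features; X1 ("retrieval") never occurs
set.seed(1)
vocab <- c("retrieval", "subject", "window", "file")
training <- data.frame(X1 = factor(rep(0, 20), levels = 0:1),
                       X2 = factor(rbinom(20, 1, 0.9), levels = 0:1),
                       X3 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       X4 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       Y  = factor(rep(1:2, each = 10)))
model <- naiveBayes(Y ~ ., data = training)

# Flag every feature whose conditional table contains a zero probability
zero_prob <- sapply(model$tables, function(tab) any(tab == 0))
vocab[zero_prob]   # candidate overfitted words
```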

9. To mitigate the difficulties mentioned in Questions 2.5–2.6, one could consider prediction with modified probabilities, i.e. replace probabilities that are zero with a small positive threshold. Implement this prediction rule by specifying the threshold argument in the predict function. Compute the misclassification rates for thresholds 10^-9, 10^-6, 10^-3 on the testing dataset, and explain which threshold we should prefer. [3 marks]
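
A sketch of this comparison. Synthetic training/testing data stand in for the real split; predict.naiveBayes in e1071 exposes a `threshold` argument that replaces (near-)zero cell probabilities before the product is taken:

```r
library(e1071)

# Synthetic stand-in for the training/testing split
set.seed(1)
make_data <- function(n) {
  data.frame(X1 = factor(rbinom(n, 1, 0.3), levels = 0:1),
             X2 = factor(rbinom(n, 1, 0.7), levels = 0:1),
             Y  = factor(rep(1:2, each = n / 2)))
}
training <- make_data(100)
testing  <- make_data(100)
model <- naiveBayes(Y ~ ., data = training)

# Misclassification rate on the testing set for each threshold
for (t in c(1e-9, 1e-6, 1e-3)) {
  pred <- predict(model, testing, threshold = t)
  cat("threshold", t, "misclassification rate", mean(pred != testing$Y), "\n")
}
```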

10. A principled approach to deal with overfitting, without modifying probabilities as considered in Question 2.9, is to learn model parameters using Bayesian inference. A specific instance of Bayesian inference in this setting amounts to adding the following four artificial data points to the training dataset:

(i) an email classified as X Window System that contains none of the words in the bag of words;

(ii) an email classified as X Window System that contains all the words in the bag of words;

(iii) an email classified as Microsoft Windows that contains none of the words in the bag of words;

(iv) an email classified as Microsoft Windows that contains all the words in the bag of words;

and fitting a naive Bayes classifier on the augmented dataset. Implement this approach and compute the misclassification rate on the testing dataset. [6 marks]
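
A sketch of the augmentation on a small synthetic training set (two features here; for the real data the artificial rows would span all d features):

```r
library(e1071)

# Synthetic stand-in: binary features coded as factors, labels in Y
set.seed(1)
training <- data.frame(X1 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       X2 = factor(rbinom(20, 1, 0.5), levels = 0:1),
                       Y  = factor(rep(1:2, each = 10)))

# The four artificial emails: all-zeros and all-ones for each class
artificial <- data.frame(X1 = factor(c(0, 1, 0, 1), levels = 0:1),
                         X2 = factor(c(0, 1, 0, 1), levels = 0:1),
                         Y  = factor(c(1, 1, 2, 2)))
augmented <- rbind(training, artificial)

model <- naiveBayes(Y ~ ., data = augmented)

# Every conditional probability is now strictly positive
all(sapply(model$tables, function(tab) all(tab > 0)))
```

For binary features this augmentation adds one count to every (class, level) cell, so the resulting conditional tables match `naiveBayes(Y ~ ., data = training, laplace = 1)`; the only difference is a small shift in the estimated class priors.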
