This project considers the use of classifiers to categorize emails. Answer all questions. Save the R code you have written as an R script. All plots and discussion should be saved separately as a document. Submit all files via email to heng@essec.edu by 10pm on 14 June 2021. Late submissions will not be accepted.
We will fit a naive Bayes classifier to classify emails into two categories, i.e. pertaining to the X Window System (Y = 1) that is common in Unix operating systems, or the Microsoft Windows operating system (Y = 2). The dataset windows.RData can be found in the Project folder on Moodle. We adopt a bag of words model to extract features of an email, i.e. X = (X_1, X_2, . . . , X_d) ∈ {0, 1}^d is a binary vector denoting absence (X_i = 0) or presence (X_i = 1) of each word in the object vocab. These features and the corresponding class labels (last column) are stored in dataset.
Section 1

1. We will use the first N = 900 emails as our training dataset and the last M = 900 emails as our testing dataset. Implement this split and store the training and testing datasets as training and testing, respectively. [2 marks]
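A minimal sketch of this split, assuming windows.RData loads the objects dataset and vocab described above and that dataset has exactly N + M = 1800 rows with the class labels in its last column:

```r
# Load the data and split into the first 900 (training) and last 900 (testing)
# emails; `dataset` and `vocab` come from the provided windows.RData file.
load("windows.RData")

N <- 900
M <- 900
training <- dataset[1:N, ]
testing  <- dataset[(N + 1):(N + M), ]
```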
2. What class of object is vocab? [1 mark]
3. How many words are there in the bag of words? [1 mark]
4. Which positions in the bag of words correspond to the words “retrieval” and “subject”? [2 marks]
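Questions 1.2–1.4 can be answered by inspecting vocab directly; a sketch, assuming vocab is a character vector of words:

```r
class(vocab)                 # Q1.2: class of the object
length(vocab)                # Q1.3: number of words in the bag of words
which(vocab == "retrieval")  # Q1.4: position of "retrieval"
which(vocab == "subject")    # Q1.4: position of "subject"
```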
5. Compute the sample proportions of training emails that pertain to the X Window System and to Microsoft Windows. [2 marks]
6. For training emails that pertain to the X Window System, compute the sample proportions of these emails that contain the words “retrieval” and “subject”. [3 marks]
7. For training emails that pertain to Microsoft Windows, compute the sample proportions of these emails that contain the words “retrieval” and “subject”. [3 marks]
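A sketch for Questions 1.5–1.7, continuing from the split above and assuming the first d columns of training are numeric 0/1 indicators with the labels in the last column:

```r
d <- ncol(training) - 1
y <- training[, d + 1]          # class labels of the training emails

# Q1.5: sample proportions of the two classes
mean(y == 1)                    # X Window System
mean(y == 2)                    # Microsoft Windows

# Q1.6-1.7: per-class proportions of emails containing each word
pos_retrieval <- which(vocab == "retrieval")
pos_subject   <- which(vocab == "subject")
mean(training[y == 1, pos_retrieval])   # "retrieval" among X Window emails
mean(training[y == 1, pos_subject])     # "subject"   among X Window emails
mean(training[y == 2, pos_retrieval])   # "retrieval" among MS Windows emails
mean(training[y == 2, pos_subject])     # "subject"   among MS Windows emails
```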
Section 2

1. Discuss whether the assumption that the features X = (X_1, X_2, . . . , X_d) are conditionally independent given the class label is sensible. [2 marks]
2. Using the naiveBayes function in the e1071 package, fit a naive Bayes classifier on the training dataset. [2 marks]
3. Using the output of the naiveBayes function, what is the estimated prior distribution of the classes? Does this agree with your results in Question 1.5? [2 marks]
4. Using the output of the naiveBayes function, what are the estimated class-conditional distributions of the features corresponding to the words “retrieval” and “subject”? Does this agree with your results in Questions 1.6–1.7? [3 marks]
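One way to fit and inspect the model for Questions 2.2–2.4, continuing from the earlier sketches. naiveBayes treats numeric columns as Gaussian, so the binary features are converted to factors first; the exact layout of the data is an assumption here:

```r
library(e1071)

# Convert 0/1 columns to factors so naiveBayes estimates Bernoulli tables
X_train <- as.data.frame(lapply(as.data.frame(training[, 1:d]),
                                factor, levels = c(0, 1)))
y_train <- factor(y)

fit <- naiveBayes(X_train, y_train)

fit$apriori                  # Q2.3: class counts behind the estimated prior
fit$tables[[pos_retrieval]]  # Q2.4: class-conditional table for "retrieval"
fit$tables[[pos_subject]]    # Q2.4: class-conditional table for "subject"
```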
5. Given the estimates obtained in Question 2.4, show analytically that the naive Bayes classifier will assign zero probability to both classes if the word “retrieval” is present in a testing email. Explain your arguments clearly. [3 marks]
6. Given the estimates obtained in Question 2.4, show analytically that the naive Bayes classifier will assign zero probability to both classes if the word “subject” is not present in a testing email. Explain your arguments clearly. [3 marks]
7. The behavior described in Questions 2.5–2.6 is known as overfitting. Explain why the naive Bayes classifier overfits on the words “retrieval” and “subject”. [2 marks]
8. Investigate whether other words in vocab suffer from overfitting. Provide a short summary of your findings. [5 marks]
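One way to carry out this investigation is to scan the fitted class-conditional tables for probabilities that are exactly zero; this assumes the entries of fit$tables follow the column order of vocab:

```r
# Words with an estimated zero probability in some class-conditional table
has_zero <- sapply(fit$tables, function(tab) any(tab == 0))
vocab[has_zero]   # candidate overfitted words
sum(has_zero)     # how many words are affected
```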
9. To mitigate the difficulties mentioned in Questions 2.5–2.6, one could consider prediction with modified probabilities, i.e. replace probabilities that are zero with a small positive threshold. Implement this prediction rule by specifying the threshold argument in the predict function. Compute the misclassification rates for thresholds 10^{-9}, 10^{-6}, 10^{-3} on the testing dataset, and explain which threshold should be preferred. [3 marks]
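A sketch of the thresholded prediction for Question 2.9; with its default eps = 0, predict.naiveBayes replaces zero probabilities by the value of its threshold argument:

```r
# Prepare the test features in the same factor encoding as the training set
X_test <- as.data.frame(lapply(as.data.frame(testing[, 1:d]),
                               factor, levels = c(0, 1)))
y_test <- factor(testing[, d + 1])

for (thr in c(1e-9, 1e-6, 1e-3)) {
  pred <- predict(fit, X_test, threshold = thr)
  cat("threshold =", thr, "-> misclassification rate =",
      mean(pred != y_test), "\n")
}
```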
10. A principled approach to deal with overfitting, without modifying probabilities as considered in Question 2.9, is to learn model parameters using Bayesian inference. A specific instance of Bayesian inference in this setting amounts to adding the following four artificial data points to the training dataset:
(i) an email classified as X Window System that contains none of the words in the bag of words;
(ii) an email classified as X Window System that contains all the words in the bag of words;
(iii) an email classified as Microsoft Windows that contains none of the words in the bag of words;
(iv) an email classified as Microsoft Windows that contains all the words in the bag of words;
and fitting a naive Bayes classifier on the augmented dataset. Implement this approach and compute the misclassification rate on the testing dataset. [6 marks]
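A sketch of the augmentation, assuming training is stored as a numeric matrix (a data frame would need matching column names when rbind-ing):

```r
# Append the four artificial emails: (all-zeros, all-ones) x (class 1, class 2)
augmented <- rbind(training,
                   c(rep(0, d), 1),   # (i)   X Window System, no words
                   c(rep(1, d), 1),   # (ii)  X Window System, all words
                   c(rep(0, d), 2),   # (iii) Microsoft Windows, no words
                   c(rep(1, d), 2))   # (iv)  Microsoft Windows, all words

X_aug <- as.data.frame(lapply(as.data.frame(augmented[, 1:d]),
                              factor, levels = c(0, 1)))
y_aug <- factor(augmented[, d + 1])
fit_aug <- naiveBayes(X_aug, y_aug)

pred_aug <- predict(fit_aug, X_test)
mean(pred_aug != y_test)   # misclassification rate on the testing dataset
```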
Section 3

In this section, we will consider a methodology to remove words from the bag of words that are not so relevant for the classification task. The feature selection procedure first measures the relevance of each word by computing the mutual information between each feature X_i ∈ {0, 1} and the class label Y ∈ {1, 2}, defined as

$$I(X_i, Y) = \sum_{x \in \{0,1\}} \sum_{y \in \{1,2\}} P(X_i = x, Y = y) \log \frac{P(X_i = x, Y = y)}{P(X_i = x)\, P(Y = y)},$$

for word i = 1, 2, . . . , d. We then rank words according to this measure of relevance, and select the top K ∈ {1, 2, . . . , d} features to fit the naive Bayes classifier.
1. Show analytically that I(X_i, Y) = 0 if X_i and Y are independent. Explain why the mutual information provides a sensible measure of word relevance for the classification task. [2 marks]
2. Show analytically that

$$I(X_i, Y) = (1 - \pi)(1 - \theta_{i,1}) \log \frac{1 - \theta_{i,1}}{1 - \theta_i} + \pi (1 - \theta_{i,2}) \log \frac{1 - \theta_{i,2}}{1 - \theta_i} + (1 - \pi)\, \theta_{i,1} \log \frac{\theta_{i,1}}{\theta_i} + \pi\, \theta_{i,2} \log \frac{\theta_{i,2}}{\theta_i},$$

where P(Y = 2) = π, P(X_i = 1 | Y = 1) = θ_{i,1}, P(X_i = 1 | Y = 2) = θ_{i,2}, and θ_i = P(X_i = 1) = (1 − π)θ_{i,1} + πθ_{i,2} denotes the marginal probability that word i is present. [3 marks]
3. Using the parameters estimated in Question 2.10 and the expression in Question 3.2, compute the mutual information for each word in the bag of words. [6 marks]
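A vectorised sketch of this computation, continuing from the augmented fit above. The row/column layout assumed for the fit_aug$tables entries (rows = classes 1 and 2, columns = feature levels "0"/"1") is how e1071 stores them, but is worth checking on your own fit:

```r
# Extract the parameter estimates from the augmented fit
pi_hat <- fit_aug$apriori[2] / sum(fit_aug$apriori)          # P(Y = 2)
theta1 <- sapply(fit_aug$tables, function(tab) tab[1, "1"])  # P(X_i=1 | Y=1)
theta2 <- sapply(fit_aug$tables, function(tab) tab[2, "1"])  # P(X_i=1 | Y=2)
theta  <- (1 - pi_hat) * theta1 + pi_hat * theta2            # P(X_i = 1)

# Mutual information via the closed form of Question 3.2
mi <- (1 - pi_hat) * (1 - theta1) * log((1 - theta1) / (1 - theta)) +
      pi_hat       * (1 - theta2) * log((1 - theta2) / (1 - theta)) +
      (1 - pi_hat) * theta1       * log(theta1 / theta) +
      pi_hat       * theta2       * log(theta2 / theta)

vocab[order(mi, decreasing = TRUE)][1:5]   # Q3.4: top five words
```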
4. What are the five words with the highest mutual information? By running a Google search on these words (if they are unfamiliar), explain why they are the most relevant for this classification task. [4 marks]
5. Using the augmented training dataset in Question 2.10, compute the misclassification rate when selecting only the top K features to fit the naive Bayes classifier, for all K ∈ {1, 2, . . . , d}. Plot the misclassification rate against K. Report the value of K that minimizes the misclassification rate and briefly discuss your findings. [7 marks]
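A direct (and somewhat slow) sketch of the top-K sweep, refitting the classifier from scratch for each K; it reuses X_aug, y_aug, X_test, y_test, and mi from the earlier sketches:

```r
ranked <- order(mi, decreasing = TRUE)   # word indices, most relevant first
rates  <- numeric(d)
for (K in 1:d) {
  keep     <- ranked[1:K]
  fit_K    <- naiveBayes(X_aug[, keep, drop = FALSE], y_aug)
  pred_K   <- predict(fit_K, X_test[, keep, drop = FALSE])
  rates[K] <- mean(pred_K != y_test)
}

plot(1:d, rates, type = "l", xlab = "K",
     ylab = "Misclassification rate")
which.min(rates)   # value of K minimizing the misclassification rate
```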