CS5100J Coursework Assignment 2: R Programming
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Learning outcomes assessed

The learning outcomes assessed are:

develop, validate, evaluate, and effectively use machine learning models

apply methods and techniques such as decision trees and ensemble algorithms

extract value and insight from data

implement machine-learning algorithms in R

Instructions

In order to submit, copy all your submission files into a directory, e.g., DA2, and create a compressed file, e.g., DA2.zip. Upload DA2.zip to Moodle through the following link:

Coursework Assignment 2 CS5100J

You should submit files with the following names:

BDS.R should contain your source code for boosted decision stumps (these can be split into files such as BDS1.R for task 1, BDS2.R for task 2, and BDS3.R for task 3); please avoid submitting unnecessary files such as old versions, back-up copies made by the editor, etc.;

report.pdf should contain the numerical results and (optionally) discussion.

The files you submit cannot be overwritten by anyone else, and they cannot be read by any other student. You can, however, overwrite your submission as often as you like, by resubmitting, though only the last version submitted will be kept. Submissions after the deadline will be accepted but they will be automatically recorded as late and are subject to College Regulations on late submissions.

Tasks

You are to implement decision stumps (DS) and boosted decision stumps (BDS) for regression. As explained in Chapter 6, decision stumps are decision trees with one split. Decision trees are covered in Chapter 1 and Lab Worksheet 2. Boosting is covered in Chapter 6 and Lab Worksheet 5. (See also Chapter 8 of [1].) For further details, see below.

Your programs should be written in R. You are not allowed to use any existing implementations of decision trees or boosting in R, or any other language, and should code DS and BDS from first principles.

You should apply your DS and BDS programs to the Boston data set to predict medv given lstat and rm. (In other words, medv is the label and lstat and rm are the attributes). There is no need to normalize the attributes, of course.

Split the data set randomly into two equal parts, which will serve as the training set and the test set. Use your birthday (in the format MMDD) as the seed for the pseudorandom number generator. The same training and test sets should be used throughout this assignment.
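For concreteness, here is a minimal sketch of the split, assuming the Boston data set from the MASS package; 1225 is a placeholder birthday and should be replaced with your own MMDD value:

    library(MASS)    # provides the Boston data set

    set.seed(1225)   # placeholder: use your own birthday in MMDD format
    n <- nrow(Boston)
    train.id <- sample(n, n %/% 2)   # indices of a random half

    train <- Boston[train.id,  c("lstat", "rm", "medv")]
    test  <- Boston[-train.id, c("lstat", "rm", "medv")]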

Answer the following questions:

1. Train your DS implementation on the training set. Find the MSE on the test set. Include it in your report.

2. Train your BDS implementation on the training set for learning rate η = 0.01 and B = 1000 trees. Find the MSE on the test set. Include it in your report.

3. Plot the test MSE for a fixed value of η as a function of B ∈ [1, B0] (the number of trees) for as large B0 as possible. Do you observe overfitting? Include the plot and answer in your report.

Feel free to include in your report anything else that you find interesting.

Decision stumps

Decision trees in general are described on slides 39–53 of Chapter 1. Decision stumps are a special case corresponding to stopping after the first split. The description below is a streamlined (for this special case and for our data set) version of the general description.

The DS algorithm

A decision stump is specified by its attribute (lstat or rm) and the threshold s. (Consider, e.g., s = 1.8, 1.9, ..., 37.9 in the case of lstat and s = 3.6, 3.7, ..., 8.7 in the case of rm.) The training RSS of a decision stump (lstat, s) is

$$\sum_{i:\,\mathrm{lstat}_i < s} (y_i - \hat{y}_<)^2 + \sum_{i:\,\mathrm{lstat}_i \ge s} (y_i - \hat{y}_\ge)^2,$$

where both sums are over the training observations, $\hat{y}_<$ is the mean label $y_i$ for the training observations satisfying $\mathrm{lstat}_i < s$, and $\hat{y}_\ge$ is the mean label $y_i$ for the training observations satisfying $\mathrm{lstat}_i \ge s$. The training RSS of a decision stump (rm, s) is defined similarly. Find the decision stump with the smallest training RSS. This will be your prediction rule.
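In R, this exhaustive search might be sketched as follows. The helper name fit.stump() is our own invention, the data frame train is the training half from the split above, and the label argument y is kept general so that the same function can later be refitted to residuals when boosting:

    # Find the decision stump (attribute, threshold) with the smallest
    # training RSS; y defaults to the medv labels but may be any vector
    # of labels (e.g., residuals) aligned with the rows of train
    fit.stump <- function(train, y = train$medv) {
      grids <- list(lstat = seq(1.8, 37.9, by = 0.1),
                    rm    = seq(3.6,  8.7, by = 0.1))
      best <- NULL
      for (attr in names(grids)) {
        for (s in grids[[attr]]) {
          below <- train[[attr]] < s
          if (!any(below) || all(below)) next   # skip splits with an empty side
          rss <- sum((y[below]  - mean(y[below]))^2) +
                 sum((y[!below] - mean(y[!below]))^2)
          if (is.null(best) || rss < best$rss)
            best <- list(attr = attr, s = s, rss = rss,
                         yhat.below = mean(y[below]),
                         yhat.above = mean(y[!below]))
        }
      }
      best
    }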

Suppose the decision stump with the smallest training RSS is (rm, s) (i.e., this is your prediction rule). The test RSS of this decision stump is 

$$\sum_{j:\,\mathrm{rm}_j < s} (y_j - \hat{y}_<)^2 + \sum_{j:\,\mathrm{rm}_j \ge s} (y_j - \hat{y}_\ge)^2,$$

where both sums are over the test observations, $\hat{y}_<$ is the mean label $y_i$ for the training observations satisfying $\mathrm{rm}_i < s$, and $\hat{y}_\ge$ is the mean label $y_i$ for the training observations satisfying $\mathrm{rm}_i \ge s$. The test MSE (to be given in your report) is the test RSS divided by the size m of the test set.

Remark. Using both RSS and MSE is somewhat superfluous, but both measures are standard. Remember that the test MSE is the test RSS divided by the size m of the test set, and the training MSE is the training RSS divided by the size n of the training set.
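A sketch of the evaluation, using the fit.stump() helper above together with a hypothetical stump.predict() helper for applying a fitted stump to new data:

    # Apply a fitted stump to new data
    stump.predict <- function(stump, newdata) {
      ifelse(newdata[[stump$attr]] < stump$s, stump$yhat.below, stump$yhat.above)
    }

    ds   <- fit.stump(train)              # Task 1: fit on the training set
    pred <- stump.predict(ds, test)
    mean((test$medv - pred)^2)            # test MSE = test RSS divided by m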

Boosted decision stumps

Boosting regression trees is described on slide 23 of Chapter 6. The description below is a more detailed version of that description.

The BDS algorithm

1. Set $\hat{f}(x) := 0$ and $r_i := y_i$ for all $i = 1, \ldots, n$.

2. For $b = 1, 2, \ldots, B$, repeat:

(a) fit a decision stump $\hat{f}_b$ to the training data $(x_i, r_i)$, $i = 1, \ldots, n$; in other words, $\hat{f}_b$ is the decision stump with the smallest training MSE;

(b) remember the decision stump $\hat{f}_b$ for future use;

(c) update $\hat{f}$ by adding in a shrunken version of the new stump: $\hat{f}(x) := \hat{f}(x) + \eta \hat{f}_b(x)$ (but we do not really need this!);

(d) update the residuals: $r_i := r_i - \eta \hat{f}_b(x_i)$, $i = 1, \ldots, n$.

The prediction rule is

$$\hat{f}(x) := \sum_{b=1}^{B} \eta \hat{f}_b(x).$$
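A sketch of the whole procedure, reusing the hypothetical fit.stump() and stump.predict() helpers from the decision-stump sketches:

    # Boosted decision stumps: returns the list of B fitted stumps
    fit.bds <- function(train, B = 1000, eta = 0.01) {
      r <- train$medv                  # step 1: residuals start as the labels
      stumps <- vector("list", B)
      for (b in seq_len(B)) {          # step 2
        stumps[[b]] <- fit.stump(train, r)                 # steps (a) and (b)
        r <- r - eta * stump.predict(stumps[[b]], train)   # step (d)
      }
      stumps
    }

    # The prediction rule: the sum of the shrunken stump predictions
    bds.predict <- function(stumps, newdata, eta = 0.01) {
      preds <- sapply(stumps, stump.predict, newdata = newdata)  # m-by-B matrix
      eta * rowSums(preds)
    }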

To compute the test MSE (to be given in your report) of this prediction rule, use the formula

$$\frac{1}{m} \sum_{j} \Bigl( y_j - \sum_{b=1}^{B} \eta \hat{f}_b(x_j) \Bigr)^2,$$

where the sum is over the test set and m is the size of the test set.
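In code, with the hypothetical helpers above, Task 2 is a one-liner; accumulating the sum stump by stump also gives the test MSE as a function of B for the Task 3 plot:

    stumps <- fit.bds(train, B = 1000, eta = 0.01)

    # Task 2: test MSE of the full ensemble
    mean((test$medv - bds.predict(stumps, test))^2)

    # Task 3: test MSE as a function of the number of trees B
    cum.pred <- rep(0, nrow(test))
    mse.by.B <- numeric(length(stumps))
    for (b in seq_along(stumps)) {
      cum.pred    <- cum.pred + 0.01 * stump.predict(stumps[[b]], test)
      mse.by.B[b] <- mean((test$medv - cum.pred)^2)
    }
    plot(mse.by.B, type = "l", xlab = "B (number of trees)", ylab = "test MSE")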

Marking criteria

To be awarded full marks you need both to submit correct code and to obtain correct results on the given data set. Even if your results are not correct, marks will be awarded for correct or partially correct code (up to a maximum of 75%). Correctly implementing decision stumps (Task 1) will give you at least 50%.
