Weka

This assignment must be completed using WEKA and the techniques covered within the module.

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS


 

Download and unzip this folder of data: data-folder.zip. The data folder contains two items: a .arff file of data to use in Tasks 1 and 2 (the class label for this dataset is "HeartDisease"), and a folder of text data to use in Task 3.

 

For this assignment, you will also need your student ID number - take the last four digits of your ID number and use it as a randomisation seed, as demonstrated in the practical class. For example, if your ID number is 21012345, then you will need to use the number "2345" as the seed. This means that your results will be unique to you.
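As a minimal sketch of the seed rule above, the following Python helper takes the last four digits of an ID. The function name `seed_from_student_id` is hypothetical and is not part of WEKA; in WEKA itself the seed is typed into the relevant dialog.

```python
def seed_from_student_id(student_id: str) -> int:
    """Return the last four digits of a student ID as an integer seed."""
    # Keep digits only, in case the ID is written with spaces or dashes.
    digits = "".join(ch for ch in student_id if ch.isdigit())
    if len(digits) < 4:
        raise ValueError("student ID must contain at least four digits")
    return int(digits[-4:])


print(seed_from_student_id("21012345"))  # prints 2345
```

Note that an ID whose last four digits start with zeros (for example "0045") yields a shorter number (45) once it is treated as an integer.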

 

Task 1 (may be completed after Unit 6):

 

For this task, use the .arff file included in the data folder - the class label for this dataset is "HeartDisease".

Set the randomisation seed to the last four digits of your student ID number, and use a 70:30 train:test split (70% of the data for training, 30% for testing).
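The idea of a seeded percentage split can be sketched in plain Python as follows. This is only an illustration: `percentage_split` is a hypothetical helper, the default seed 2345 is the example value from this brief, and WEKA uses its own random number generator, so the partition produced here will not match the one WEKA produces for the same seed.

```python
import random


def percentage_split(rows, train_percent=70, seed=2345):
    """Shuffle rows with a fixed seed, then cut them into train and test parts."""
    rng = random.Random(seed)      # same seed -> same shuffle every run
    shuffled = list(rows)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


train, test = percentage_split(range(100))
print(len(train), len(test))  # prints 70 30
```

Because the shuffle depends only on the seed, rerunning with the same seed reproduces the same split, which is what makes each student's results repeatable.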

Using a libsvm model with the RBF kernel, create a 2 x 2 table (two gamma values by two cost values) showing the average result on the test data for each combination. Also include the confusion matrix from each of the four results. Explain how these results demonstrate the model's variation with respect to its parameter values.
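One possible way to lay out such a table, sketched in Python: accuracy is computed from each confusion matrix (diagonal sum divided by total) and arranged as gamma rows by cost columns. The helper names are hypothetical, and the gamma values, cost values and counts below are made-up placeholders, not results from the HeartDisease data.

```python
def accuracy(confusion):
    """Fraction of correctly classified instances: diagonal sum / total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def results_table(results):
    """Format {(gamma, cost): confusion_matrix} as a gamma x cost text table."""
    gammas = sorted({g for g, _ in results})
    costs = sorted({c for _, c in results})
    lines = ["gamma \\ cost" + "".join(f"{c:>10}" for c in costs)]
    for g in gammas:
        cells = "".join(f"{accuracy(results[(g, c)]):>10.1%}" for c in costs)
        lines.append(f"{g:<12}" + cells)
    return "\n".join(lines)


# Placeholder counts only; replace with the matrices WEKA reports.
example = {
    (0.01, 1): [[40, 10], [5, 45]],
    (0.01, 10): [[42, 8], [6, 44]],
    (0.1, 1): [[38, 12], [9, 41]],
    (0.1, 10): [[35, 15], [10, 40]],
}
print(results_table(example))
```

Keeping the confusion matrices alongside the accuracies makes it easier to say which kind of error changes as gamma or cost changes.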

Use a grid-search on an appropriate range to find optimised values for the gamma and cost parameters.
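The logic of a grid search can be sketched as below, assuming a caller-supplied `evaluate(gamma, cost)` function that returns a score such as test accuracy. Both helper names are hypothetical; this shows the general technique, not the search tool provided in WEKA. Powers of ten are a common choice of range for gamma and cost, but the appropriate range is for you to justify.

```python
from itertools import product


def log_grid(min_exp, max_exp, step=1, base=10.0):
    """Return base**e for each exponent e from min_exp to max_exp inclusive."""
    return [base ** e for e in range(min_exp, max_exp + 1, step)]


def grid_search(evaluate, gammas, costs):
    """Try every (gamma, cost) pair and return (best_score, gamma, cost)."""
    best = None
    for gamma, cost in product(gammas, costs):
        score = evaluate(gamma, cost)
        if best is None or score > best[0]:
            best = (score, gamma, cost)
    return best


# Toy scoring function standing in for "train the model and measure accuracy".
def toy_score(gamma, cost):
    return -abs(gamma - 0.01) - abs(cost - 10.0)


score, gamma, cost = grid_search(toy_score, log_grid(-3, 0), log_grid(-1, 2))
print(gamma, cost)
```

A coarse logarithmic grid followed by a finer grid around the best cell is one common way to keep the number of evaluations manageable.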

Task 2 (may be completed after Unit 7):

 

For this task, use the .arff file included in the data folder - the class label for this dataset is "HeartDisease".

Set the randomisation seed to the last four digits of your student ID number, and use a 70:30 train:test split (70% of the data for training, 30% for testing).

Using a J48 (decision-tree) model, create a table of three confidence factor values showing the average result on the test data for each value. Also include the confusion matrix from each of the three results. Explain how these results demonstrate the model's variation with respect to its parameter value.

Use a parameter-search on an appropriate range to find an optimised confidence factor value.

Now compare the libsvm and J48 models:

 

Perform 10-fold cross-validation with the libsvm model using the optimal values for cost and gamma you found in Task 1. Explain how you have done so and the results you obtained.

Perform 10-fold cross-validation with the J48 model, using the optimal value for the confidence factor you found in Task 2. Explain how you have done so and the results you obtained.
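For reference, the mechanics of k-fold cross-validation can be sketched in Python as follows. `k_fold_indices` is a hypothetical helper and the default seed is the example value from this brief. This simple version does not stratify the folds by class, so it is an illustration of the idea rather than a reproduction of WEKA's procedure.

```python
import random


def k_fold_indices(n_instances, k=10, seed=2345):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(25):
    print(len(train), len(test))
```

Each instance is used for testing exactly once and for training in the other nine folds; the reported result is the average over the ten test folds.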

Is one of the models better than the other? Which one? Explain why you think so.

Task 3 (may be completed after Unit 8):

 

For this task, use the folder of text data included in the data folder.

Convert the text data into a set of attribute-value pairs. Explain how you have done so and describe the dataset that you have ended up with. Make your own choices with respect to:

use of TF/IDF or word counts

use of a stoplist

use of a stemmer (do NOT try to use the Snowball stemmer; it has no effect)
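To make the difference between word counts and TF/IDF concrete, here is a minimal Python sketch using the textbook weighting tf x log(N / df). The stoplist, the tokeniser and the function names are illustrative assumptions; WEKA's own text-conversion filter may tokenise and weight terms differently, and no stemming is applied here.

```python
import math
import re
from collections import Counter

STOPLIST = {"the", "a", "is", "of", "and"}  # tiny illustrative stoplist


def tokenize(text, stoplist=STOPLIST):
    """Lower-case the text, split into alphabetic words, drop stoplist words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stoplist]


def word_counts(docs):
    """One Counter of raw term frequencies per document."""
    return [Counter(tokenize(d)) for d in docs]


def tf_idf(docs):
    """Weight each count by log(N / number of documents containing the term)."""
    counts = word_counts(docs)
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]


docs = ["The cat sat", "The dog sat"]
print(word_counts(docs))
print(tf_idf(docs))
```

In this toy example "sat" occurs in every document, so its TF/IDF weight is zero, while "cat" and "dog" keep a positive weight; with plain word counts all three would be weighted equally.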

Balance the dataset using any technique you like. Explain how you have done so and how the dataset has changed. Make sure you save a copy of the dataset.
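Random oversampling is one of several possible balancing techniques; a minimal sketch of it follows. The `oversample` helper and its `label_of` argument are hypothetical, the default seed is the example value from this brief, and the rows are placeholders. The brief leaves the choice of technique to you.

```python
import random
from collections import Counter


def oversample(rows, label_of, seed=2345):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Top up smaller classes by sampling with replacement.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


rows = [("doc1", "pos"), ("doc2", "pos")] + [(f"doc{i}", "neg") for i in range(3, 9)]
balanced = oversample(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))
```

Comparing the class counts before and after, as the final line does, is a simple way to describe how the dataset has changed.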

Optimise any required parameters and perform 10-fold cross-validation on each of NaiveBayesSimple, LibSVM and J48, producing a table of the three average results and a confusion matrix in each case. Explain how you have done so and the results you obtained.

Compare the performance of the three algorithms. Which is better? Why?
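When comparing algorithms, per-class precision and recall can reveal differences that the average result hides. The sketch below assumes a confusion matrix whose rows are actual classes and whose columns are predicted classes; the function name, class labels and counts are placeholders rather than real results.

```python
def per_class_metrics(confusion, labels):
    """Compute precision and recall per class (rows = actual, columns = predicted)."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        actual = sum(confusion[i])                   # all instances of this class
        predicted = sum(row[i] for row in confusion)  # all predictions of this class
        metrics[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return metrics


# Placeholder counts only; replace with a matrix reported by WEKA.
print(per_class_metrics([[40, 10], [5, 45]], ["yes", "no"]))
```

Two models with similar accuracy can differ sharply in recall on the minority class, which is often the more useful basis for saying one is better.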

 

 

Submission requirements:

 

A single PDF document containing your report, of no more than 10 pages. Write your student ID number at the start of your report. Do not include your name or other details, to keep the marking process anonymous.

 

 

Marks awarded for:

 

Task 1 (20 marks):

a) Screenshot of entering part of ID number for seed (2 marks)

 

b) Table of results and confusion matrices with 2-3 sentences highlighting the effect of changing parameter values (10 marks)

 

c) Explanation of and results from using a grid-search to find the optimal parameters. Note: a screenshot of WEKA's dialog is not acceptable and will be awarded 0 marks. You need to explain which parameters you are setting, and why. (8 marks)

 

Task 2 (30 marks):

 

a) Screenshot of entering part of ID number for seed (2 marks)

 

b) Table of results and confusion matrices, with 2-3 sentences highlighting the effect of changing confidence factor (8 marks)

 

c) Explanation of and results from using a parameter-search to find the optimal confidence factor value. Note: a screenshot of WEKA's dialog is not acceptable and will be awarded 0 marks. You need to explain which parameters you are setting, and why. (8 marks)

 

d) Comparison: Explain in each case how you performed the cross-validation process, the parameter values you chose and set, and finally give 2-3 sentences evaluating the results (12 marks)

 

Task 3 (50 marks):

 

a) Conversion: marks awarded for explaining each of the steps and parameter values that you have used, and a description of your final dataset (15 marks)

 

b) Balancing: explain your choice of technique, any parameter values you had to set, and describe the resulting dataset (10 marks)

 

c) Optimise: for each algorithm, explain any parameter optimisation you had to perform, and the results of the 10-fold cross-validation (15 marks)

 

d) Compare: 5-6 sentences evaluating the differences in performance of the three algorithms, taking into account the average results and confusion matrices (10 marks)

 

 

A reminder that all work should be your own. Reports without screenshots showing the use of your student ID number as a seed will be marked as zero.

Reports exceeding the maximum length may not be marked beyond the 10 page limit.

Apart from the two required screenshots showing your use of your student ID number, no screenshots should be included in your report. Other data required from WEKA should be placed into your own tables; other results, such as confusion matrices, can be copy-pasted into your report.

No research beyond what was covered in the module is required, and so your report should not include citations or a reference list.

 
