
This assignment must be completed using WEKA and the techniques covered within the module.

Download and unzip this folder of data: data-folder.zip. The data folder contains two items: a .arff file of data to use in Tasks 1 and 2 (the class label for this dataset is "HeartDisease"), and a folder of text data to use in Task 3.

For this assignment, you will also need your student ID number - take the last four digits of your ID number and use it as a randomisation seed, as demonstrated in the practical class. For example, if your ID number is 21012345, then you will need to use the number "2345" as the seed. This means that your results will be unique to you.
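For intuition, extracting the seed from an ID number can be sketched in a few lines of Python (WEKA itself only needs the four-digit number typed into its seed field):

```python
def seed_from_id(student_id: int) -> int:
    """Return the last four digits of a student ID, for use as a
    randomisation seed, e.g. 21012345 -> 2345."""
    return student_id % 10_000

print(seed_from_id(21012345))  # 2345, matching the example above
```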

Task 1 (may be completed after Unit 6):

For this task, use the .arff file included in the data folder - the class label for this dataset is "HeartDisease".

Set the randomisation seed to the last four digits of your student ID number, and use a train:test split of 70%.
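The percentage split can be pictured as a seeded shuffle followed by a 70:30 cut. WEKA performs the equivalent step internally when you choose a percentage split; the following is only a minimal Python sketch of the idea:

```python
import random

def train_test_split(instances, seed, train_fraction=0.7):
    """Shuffle with the given seed, then cut the data 70:30."""
    shuffled = instances[:]                # copy; leave the original intact
    random.Random(seed).shuffle(shuffled)  # seeded, reproducible shuffle
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))                    # stand-in for 100 instances
train, test = train_test_split(data, seed=2345)
print(len(train), len(test))               # 70 30
```

Because the shuffle is seeded, repeating the split with the same seed reproduces exactly the same train and test sets, which is why your results will be unique to your ID.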

Using a LibSVM model with the RBF kernel, create a table of 2 gamma x 2 cost values, showing the average result on the test data for each of the four combinations. Also include the confusion matrix from each of the four results. Explain how these results demonstrate the model's variation with respect to its parameter values.
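If the average result you report is the percentage of correctly classified instances (WEKA's headline figure), it can be read directly off each confusion matrix: the sum of the diagonal divided by the sum of all cells. A small sketch, using a made-up 2x2 matrix for the binary HeartDisease label:

```python
def accuracy(confusion):
    """Accuracy = correctly classified / total, i.e. the sum of the
    diagonal of the confusion matrix over the sum of all cells."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical matrix: rows = actual class, columns = predicted class.
cm = [[50, 10],
      [ 5, 35]]
print(f"{accuracy(cm):.1%}")  # 85.0%
```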

Use a grid-search on an appropriate range to find optimised values for the gamma and cost parameters.
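The logic of a grid search is an exhaustive sweep over candidate (gamma, cost) pairs, keeping the best-scoring combination; both parameters are conventionally searched on a log scale. In the sketch below, evaluate() is a hypothetical stand-in with made-up placeholder scores for the step WEKA's grid-search performs for you, namely training LibSVM with those parameters and measuring accuracy:

```python
import itertools

def evaluate(gamma, cost):
    """Placeholder for: train LibSVM with (gamma, cost), return the
    test accuracy. The scores below are invented for illustration."""
    scores = {(0.01, 1): 0.78, (0.01, 10): 0.81,
              (0.1, 1): 0.84, (0.1, 10): 0.80}
    return scores[(gamma, cost)]

gammas = [0.01, 0.1]   # log-scale candidate values
costs = [1, 10]

best = max(itertools.product(gammas, costs),
           key=lambda gc: evaluate(*gc))
print(best)  # (0.1, 1) -- the pair with the highest placeholder score
```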

Task 2 (may be completed after Unit 7):

For this task, use the .arff file included in the data folder - the class label for this dataset is "HeartDisease".

Set the randomisation seed to the last four digits of your student ID number, and use a train:test split of 70%.

Using a J48 (decision-tree) model, create a table of 3 confidence factor values, showing the average result on the test data for each. Also include the confusion matrix from each of the three results. Explain how these results demonstrate the model's variation with respect to its parameter value.

Use a parameter-search on an appropriate range to find an optimised confidence factor value.

Now compare the LibSVM and J48 models:

Perform 10-fold cross-validation with the LibSVM model using the optimal values for cost and gamma you found in Task 1. Explain how you have done so and the results you obtained.

Perform 10-fold cross-validation with the J48 model, using the optimal value for the confidence factor you found in Task 2. Explain how you have done so and the results you obtained.
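For intuition, 10-fold cross-validation partitions the shuffled data into ten disjoint folds; each fold serves once as the test set while the other nine form the training set, and the ten results are averaged. WEKA handles this internally (and stratifies the folds by class), but the fold construction can be sketched as:

```python
import random

def cross_val_folds(n_instances, k=10, seed=2345):
    """Shuffle instance indices with the seed, then deal them
    round-robin into k disjoint folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = cross_val_folds(100)
print([len(f) for f in folds])             # ten folds of 10 instances
print(len({i for f in folds for i in f}))  # 100 -- every instance used once
```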

Is one of the models better than the other? Which one? Explain why you think so.

Task 3 (may be completed after Unit 8):

For this task, use the folder of text data included in the data folder.

Convert the text data into a set of attribute-value pairs. Explain how you have done so and describe the dataset that you have ended up with. Make your own choices with respect to:

- use of TF/IDF or word counts
- use of a stoplist
- use of a stemmer (do NOT try to use the Snowball stemmer, it has no effect)

Balance the dataset using any technique you like. Explain how you have done so and how the dataset has changed. Make sure you save a copy of the dataset.
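For intuition behind the TF/IDF choice: a term's weight is its frequency in a document, scaled down when the term appears in many documents, so common words carry less weight than distinctive ones. The sketch below uses the textbook tf x ln(N/df) form on a tiny made-up corpus; WEKA's StringToWordVector filter computes its own variant of this for you:

```python
import math

# Tiny invented corpus, each document already tokenised into words.
docs = [["heart", "disease", "risk"],
        ["heart", "rate", "exercise"],
        ["exercise", "diet", "risk"]]

def tf_idf(term, doc, docs):
    """Term frequency in the document, scaled by the inverse
    document frequency ln(N / df)."""
    tf = doc.count(term)
    df = sum(term in d for d in docs)  # how many documents contain it
    return tf * math.log(len(docs) / df)

# "disease" (1 of 3 docs) outweighs "heart" (2 of 3 docs):
print(tf_idf("heart", docs[0], docs) < tf_idf("disease", docs[0], docs))  # True
```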

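One simple balancing technique is random oversampling: duplicate randomly chosen minority-class instances until the classes are the same size (WEKA filters such as supervised Resample can achieve a similar effect). A minimal sketch on made-up labels:

```python
import random

def oversample(instances, labels, seed=2345):
    """Duplicate randomly chosen minority-class instances until all
    classes reach the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for inst, lab in zip(instances, labels):
        by_class.setdefault(lab, []).append(inst)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for lab, insts in by_class.items():
        extra = [rng.choice(insts) for _ in range(target - len(insts))]
        balanced += [(i, lab) for i in insts + extra]
    return balanced

data = oversample(["a", "b", "c", "d", "e"],
                  ["pos", "pos", "pos", "pos", "neg"])
counts = {}
for _, lab in data:
    counts[lab] = counts.get(lab, 0) + 1
print(counts)  # {'pos': 4, 'neg': 4}
```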
Optimise any required parameters and perform 10-fold cross-validation on each of NaiveBayesSimple, LibSVM and J48, producing a table of the three average results and a confusion matrix in each case. Explain how you have done so and the results you obtained.

Compare the performance of the three algorithms. Which is better? Why?

Submission requirements:

A single PDF document containing your report, to a maximum of 10 pages. Write your student ID number at the start of your report. Do not include your name or other details, to keep the marking process anonymous.

Marks awarded for:

Task 1 (20 marks):

a) Screenshot of entering part of ID number for seed (2 marks)

b) Table of results and confusion matrices with 2-3 sentences highlighting the effect of changing parameter values (10 marks)

c) Explanation of and results from using a grid-search to find the optimal parameters. Note: a screenshot of WEKA's dialog is not acceptable and will be awarded 0 marks. You need to explain which parameters you are setting, and why. (8 marks)

Task 2 (30 marks):

a) Screenshot of entering part of ID number for seed (2 marks)

b) Table of results and confusion matrices, with 2-3 sentences highlighting the effect of changing confidence factor (8 marks)

c) Explanation of and results from using a parameter-search to find the optimal confidence factor value. Note: a screenshot of WEKA's dialog is not acceptable and will be awarded 0 marks. You need to explain which parameters you are setting, and why. (8 marks)

d) Comparison: Explain in each case how you performed the cross-validation process, the parameter values you chose and set, and finally give 2-3 sentences evaluating the results (12 marks)

Task 3 (50 marks):

a) Conversion: marks awarded for explaining each of the steps and parameter values that you have used, and a description of your final dataset (15 marks)

b) Balancing: explain your choice of technique, any parameter values you had to set, and describe the resulting dataset (10 marks)

c) Optimise: for each algorithm, explain any parameter optimisation you had to perform, and the results of the 10-fold cross-validation (15 marks)

d) Compare: 5-6 sentences evaluating the differences in performance of the three algorithms, taking into account the average results and confusion matrices (10 marks)
