R Programming

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Data Mining Assignment

This assignment relates to motor insurance claim data. In every tutorial class we will spend some time working on this assignment and at the end of week 5 you will be expected to submit your part 1 assignment. There are two data sets used for this assignment.

Description of the first dataset

The data file motor20pct.csv holds data for more than 73000 policies that are associated with claims in a particular year. The variables for each of the policies in this data set are explained below:

CAR_AGE measures the age of the insured car in years

DRIVERS measures the number of people who are specified as designated drivers

EXPOSURE measures the fraction of the year for which the policy was active

MILEAGE measures the expected mileage travelled in a single year

PRIMAGE gives the age of the primary driver in years

TOTAL gives the total amount claimed on the policy in the year

EXCESS = 0, 75 or 100 indicating the excess claim amount associated with each policy. The insurance company will not pay out claims below this excess amount.

USAGE specifies how the car is used (S=only social, SB=strictly business, SC=social and business, ST=social and taxi)

CLAIM=1 if there was at least one claim during the year, 0 otherwise.

Create a variable called CatClaim set equal to “Yes” when CLAIM=1 and “No” for CLAIM=0.
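The recoding described above can be sketched as follows. The tutorials for this assignment work in R, so this Python version is only an illustration; the function name `add_cat_claim` and the two sample rows are hypothetical, and CLAIM is assumed to arrive as 0/1 (an integer, or a string as read from a CSV file).

```python
def add_cat_claim(rows):
    """Return copies of the policy rows with a CatClaim field added.

    Each row is a dict whose CLAIM entry is 0 or 1 (int or string).
    """
    labelled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["CatClaim"] = "Yes" if int(row["CLAIM"]) == 1 else "No"
        labelled.append(new_row)
    return labelled


# Made-up rows for illustration only; not taken from motor20pct.csv.
example = [{"CLAIM": "1", "TOTAL": "250"}, {"CLAIM": "0", "TOTAL": "0"}]
labelled_example = add_cat_claim(example)
```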

You will be working with a random sample of 10000 of these policies. Instructions for generating this sample are provided in Tutorial 1.
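Tutorial 1 remains the authoritative source for how the sample is drawn. As a rough illustration only, a reproducible simple random sample without replacement might look like the Python sketch below; the function name `sample_policies` and the seed value are assumptions, not part of the assignment.

```python
import random


def sample_policies(rows, n=10000, seed=1):
    """Draw n rows without replacement, reproducibly for a given seed."""
    if n > len(rows):
        raise ValueError("sample size exceeds the number of policies")
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return rng.sample(rows, n)
```

Fixing the seed means the same 10000 policies are selected on every run, which keeps tables and graphs stable between sessions.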

Question 1 (5 marks)

Suggest a list of questions that could be answered using this data in your assignment. Consider the CLAIM variable as a possible TARGET variable in your models and the total claim amount as a possible RISK variable, reflecting the risk associated with any claim.

Question 2 (20 marks)

Instructions for this question are provided in Tutorials 2 and 3.

Summarise your data using descriptive statistics and graphs. Some suggestions are provided below. All tables and graphs must be numbered/labelled and discussed/interpreted.

i) Summary statistics for your data

ii) Boxplots for numeric input variables for claim categories

iii) Pairs plot for all numeric input variables

iv) Correlation plot for all numeric input variables

v) Hierarchical correlation plot for all numeric input variables

vi) Bar charts for the categorical variables USAGE and CLAIM

vii) Other exciting plots
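To make the numeric side of these suggestions concrete, here is a minimal Python sketch of a six-number summary and a Pearson correlation, the quantities that underlie the summary table and the correlation plot. Tutorials 2 and 3 describe the intended workflow; the function names `summarise` and `pearson` are hypothetical, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions.

```python
import math
import statistics


def summarise(values):
    """Six-number summary (min, quartiles, mean, max) of a numeric variable."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(ordered),
        "q3": q3,
        "max": ordered[-1],
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Applying `summarise` separately to the rows with CatClaim equal to "Yes" and "No" gives the numbers that the boxplots by claim category display.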

Question 3 (15 marks)

Tutorial 5 provides guidelines for this question.

Partition your data with 70% for training, 15% for validation and 15% for testing. Number and label all your tables and graphs and discuss/interpret the results.
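The 70/15/15 split can be sketched as a shuffle followed by slicing. This is an illustration in Python rather than the procedure from Tutorial 5; the function name `partition` and the seed are assumptions.

```python
import random


def partition(rows, seed=42, train=0.70, validate=0.15):
    """Shuffle the rows and split them into training, validation and test sets.

    Whatever remains after the training and validation shares becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train * len(shuffled))
    n_validate = round(validate * len(shuffled))
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validate]
    testing = shuffled[n_train + n_validate:]
    return training, validation, testing
```

With 10000 policies this yields 7000, 1500 and 1500 rows, and no policy appears in more than one set.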

a) Produce a Tree to predict CatClaim. Then Draw your tree and ask for the Rules.

b) How do the results change when you re-run your tree assuming a loss matrix with losses half as big for a false positive (CatClaim=“Yes”) as for a false negative (CatClaim=“No”)?

c) How do your results change when you re-run your tree assuming priors of 20% for CatClaim = Yes and 80% for CatClaim = No? These were the percentages for the original data file “motor20pct”.
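Parts b) and c) both change how a leaf's class probability turns into a prediction. The Python sketch below illustrates the arithmetic under stated assumptions: "Yes" is treated as the positive class, the losses are taken as 1 for a false positive and 2 for a false negative (any pair in the ratio 1:2 would do), and the function names are hypothetical. Tree software may also use losses and priors when choosing splits, which this sketch does not attempt to show.

```python
def predict_with_losses(p_yes, loss_fp=1.0, loss_fn=2.0):
    """Label a leaf with the class that has the smaller expected loss.

    Predicting "Yes" risks a false positive with probability 1 - p_yes;
    predicting "No" risks a false negative with probability p_yes.
    """
    loss_if_yes = (1.0 - p_yes) * loss_fp
    loss_if_no = p_yes * loss_fn
    return "Yes" if loss_if_yes < loss_if_no else "No"


def reweight_to_priors(p_yes, sample_yes, prior_yes=0.20):
    """Rescale a leaf's P(Yes) from the sample class mix to the stated priors.

    sample_yes is the proportion of "Yes" cases in the data the tree was
    grown on; its value must come from your own sample.
    """
    w_yes = prior_yes / sample_yes
    w_no = (1.0 - prior_yes) / (1.0 - sample_yes)
    return p_yes * w_yes / (p_yes * w_yes + (1.0 - p_yes) * w_no)
```

With losses of 1 and 2 a leaf is labelled "Yes" once P(Yes) exceeds 1/3 rather than 1/2, so more policies are predicted to claim. If the sample proportion already equals the prior, `reweight_to_priors` returns the probability unchanged.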

Description of the second dataset

For this question consider the data set MBAmotor2.csv which was created using MBAmotor.csv. This file tells us what type of claim was posted by each of the policy holders during the year. There is at least one type of claim for all these policies.

WSCLMS = WS for windshield claims

ADCLMS = AD for accidental damage

FTCLMS = FT for fire or theft

PDCLMS = PD for personal damage claims

PICLMS = PI for personal injury claims

Question 4 (10 marks)

Tutorial 4 provides guidelines for this question.

Conduct an association analysis using these data and discuss your results. In particular, you should define the terms support and confidence and determine the strongest and the most common associations between the above types of claim. Number and label all tables and figures and discuss/interpret the results.
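As a small illustration of the two measures, the Python sketch below uses the usual definitions: the support of a set of claim types is the fraction of policies containing all of them, and the confidence of a rule A → B is the support of A and B together divided by the support of A. The function names and the four example policies are made up; they are not taken from MBAmotor2.csv, and Tutorial 4 describes the intended tooling.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        raise ValueError("left-hand side never occurs in the transactions")
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Made-up policies, each listed with the claim types it posted in the year.
policies = [{"WS"}, {"WS", "AD"}, {"AD", "PD"}, {"WS", "AD", "PI"}]
ws_ad_support = support(policies, {"WS", "AD"})          # 2 of 4 policies
ws_to_ad_confidence = confidence(policies, {"WS"}, {"AD"})  # 2 of the 3 WS policies
```

In these terms the "most common" associations are those with the highest support, while the "strongest" are usually judged by confidence (or a related measure such as lift).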
