Classification using kNN (k nearest-neighbors)
We will be looking at the “Breast Cancer Wisconsin Diagnostic” data set, which is donated by researchers of the University of Wisconsin and includes measurements from digitized images of fine-needle aspirate of a breast mass. The data includes 30 numeric features which measure various characteristics related to shape and size of the mass, and a feature identifying the cancer diagnosis. The diagnosis feature is coded B for benign mass and M for malignant mass. Each example represents an independent cancer biopsy. In effect, we have 569 labeled examples. We would like to employ a classification algorithm that can learn from the data that we have and predict the diagnosis of new unlabeled examples, i.e., classify undiagnosed breast masses as either benign or malignant. Use R and present all your answers/explanations/visualizations in a Word document, using font ‘Courier New’.
The data file was recorded in comma-separated-variable form in file “wisc_bc_data”.
You will need to watch the tutorials to complete some of these task.
1. Which feature is your target and what type of feature is it? If this is a categorical/nominal feature be sure it’s coded as a Factor in R. What do we call the process of predicting a target variable of this type?
2. How many examples do we have in this data set? In your own words what does each example represent here?
3. Construct a table of frequencies and a separate table of relative frequencies for the target feature. We should have a good mix of examples at each level of our target for our algorithm to work well. Using the summary function quickly peruse the 30 other features.
4. The kNN algorithm is ‘scale-variant’, which means that features with larger values will have undue influence on the outcome. I have already ‘Normalized’ the numerical features for you in this data set, you should have noticed that when you looked at the summaries of the 30 numerical features in the previous question. (nothing to do on this question other than to confirm that the range of each potential predictor in the data set is 0 to 1)
5. Let’s split the data into a Training set and a Test set. First, we should mix up the examples to protect against any type of nonrandom ordering in the rows of the data. I have done the randomizing the rows step for you already. Knowing that they are well mixed, take the first 469 rows for our Training set and the last 100 for our Test set.
Tip: This will require you to subset the raw data set into 4 parts for eventual use in the kNN function, carefully selecting particular rows and columns by using square brackets. You must watch the video tutorials where I do this and explain.
6. With the Training and Test sets successfully split in the previous step, you can now use the kNN algorithm to predict the classifications, i.e., the diagnoses of the Test examples. You need a few items here:
1. Choose an integer for k, close to the square root of the number of examples in the Training data set.
2. Load the “class” package into your current R environment using code: library(class)
3. Define model with the knn function (template provided here): m1 <- knn(train= , test= , cl= , k=
7. Create a Confusion Matrix with the predicted diagnoses of the Test examples on the horizontal axis and their actual diagnoses on the vertical axis. Comment on the performance of our kNN algorithm. What was the Accuracy Rate expressed as a proportion? What was the Error Rate expressed as a proportion? Of the incorrectly classified examples which ones do you think are most dangerous in this context?
CS 340 Milestone One Guidelines and Rubric Overview: For this assignment, you will implement the fundamental operations of create, read, update,
Retail Transaction Programming Project Project Requirements: Develop a program to emulate a purchase transaction at a retail store. This
7COM1028 Secure Systems Programming Referral Coursework: Secure
Create a GUI program that:Accepts the following from a user:Item NameItem QuantityItem PriceAllows the user to create a file to store the sales receip
CS 340 Final Project Guidelines and Rubric Overview The final project will encompass developing a web service using a software stack and impleme