Homework Assignment 3
The purpose of this problem is to give you some experience in clustering data sets. Many clustering methods are available, but a very common one is k-means clustering, an unsupervised machine learning method.
This assignment gives you a data set of 768 patients. Each of the patients is a female member of the Pima Indian tribe in the southwest of the U.S. This tribe is known for its very high prevalence of diabetes.
You will use the patients in the data set to develop k-means clustering models (using the R programming language) of whether a patient has diabetes, for k = 2 and k = 4. In general, you need to examine a number of different k values to determine the optimal one, but for this assignment just use the two k values provided. Likewise, machine learning methods generally require dividing the data set into training and test sets; for simplicity, you are not asked to do so here and only need to train the k-means clustering model using the entire data set as your training set.
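For context on what examining several k values looks like in practice, here is a minimal sketch of the common "elbow" approach. It is not required for this assignment. It runs on synthetic two-cluster data rather than the diabetes data set, and the names toy, k_values, and wss are made up for illustration.

```r
# Illustrative only: synthetic data, not the assignment's data set.
set.seed(42)  # k-means starts from random centers; a fixed seed makes runs repeatable
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),  # 50 points around (0, 0)
  matrix(rnorm(100, mean = 5), ncol = 2)   # 50 points around (5, 5)
)

k_values <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k beyond which adding clusters stops reducing wss sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data like this, the curve typically drops steeply from k = 1 to k = 2 and then flattens, which suggests two clusters.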
The data set, named MI6426_hw_3_diabetes_data_set, is in the Course Content area of the Modules. It is a comma-delimited text file with one patient per row. There are nine entries for each patient: eight for various measured characteristics of the patient and one for the diagnosis (not diabetes/diabetes). A number of the characteristics were not recorded for some patients, and you need to account for that in your analysis. Below is a description of the text file entries and their possible values:
Column Description
1 Number of times pregnant
2 Plasma glucose concentration after 2 hours in an oral glucose tolerance test (mg/dl)
3 Diastolic blood pressure (mm Hg)
4 Triceps skin fold thickness (mm)
5 2-Hour serum insulin (mu U/ml)
6 Body mass index (weight in kg/(height in m)^2)
7 Diabetes pedigree function
8 Age (years)
9 Class variable (0 for not diabetes or 1 for diabetes)
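One way to look for unrecorded values before clustering is sketched below. It assumes that a missing measurement is stored as 0 in columns where a true zero is implausible; that is an inference from the zero-removal steps in the provided code, not something the data description states. The three patients, the data frame toy, and the vector suspect are made up for illustration.

```r
# Hypothetical three-patient example in the assignment's column order;
# the values are invented for illustration.
toy <- data.frame(
  preg = c(6, 1, 8), glc = c(148, 85, 0), dbp = c(72, 66, 64),
  skin = c(35, 29, 0), insulin = c(0, 94, 0), bmi = c(33.6, 26.6, 23.3),
  pedigree = c(0.627, 0.351, 0.672), age = c(50, 31, 32), class = c(1, 0, 1)
)

# Columns where a zero is assumed to mean "not recorded" (assumption)
suspect <- c("glc", "dbp", "skin", "insulin", "bmi")

# Count zeros per suspect column
zero_counts <- colSums(toy[, suspect] == 0)
print(zero_counts)

# Recode those zeros as NA so summary() and complete.cases() treat them as missing
toy[suspect] <- lapply(toy[suspect], function(x) replace(x, x == 0, NA))

# Keep only patients with every measurement present
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

Recoding to NA first, rather than deleting rows immediately, makes it possible to see how many patients each column would cost before deciding how to handle the gaps.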
The R source code to solve this problem is as follows (note that you will need to change the file reference for the data file since it refers to my directory path and file name):
library(cluster)
install.packages("factoextra")
library("factoextra")
dm_data <- read.table("C:/Users/gcravens/Documents/NSU/MI6426_hw_3_diabetes_data_set.txt", sep = ",")
col_nmes <- c('preg', 'glc', 'dbp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'class')
colnames(dm_data) <- col_names
for(i in c(2, 4)) {
kmeans_results <- kmeans(dm_data, i, iter.max = 10)
clusplot(dm_data, kmeans_results$cluster, color = TRUE, shade = TRUE, labels = 4, main = paste("Original data, K = ", i),
col.clus = c(1, 3, 5, 7), col.p = c(1, 3), plotchar = TRUE, lines = 0)
}
pca.tot <- prcomp(dm_data[, 1:8])
fviz_contrib(pca.tot, choice = "var")
fviz_pca_var(pca.tot, col.var = "contrib", gradient.cols = c("yellow", "blue", "red"), repel= TRUE)
dm_data <- dm_data[-which(dm_data$glc == 0), ]
dm_data <- dm_data[-which(dm_data$dbp == 0), ]
dm_data <- dm_data[-which(dm_data$skin == 0), ]
dm_data <- dm_data[-which(dm_data$insulin == 0), ]
dm_data <- dm_data[-which(dm_data$bmi == 0), ]
summary(dm_data)
for(i in c(2, 4)) {
kmeans_results <- kmeans(dm_data, i, iter.max = 10)
clusplot(dm_data, kmeans_results$cluster, diss = FALSE,
color = TRUE, shade = TRUE, labels = 4, main = paste("Missing data deleted, K = ", i), col.clus = c(2, 4, 6, 8),
col.p = c(2, 4), plotchar = TRUE, lines = 0, xlab = paste("Component = 1"), ylab = paste("Component = 2"))
}
pca.tot <- prcomp(dm_data[1:8,])
fviz_contrib(pca.tot, choice = "var")
fviz_pca_var(pca.tot, col.var = "contrib", gradient.cols = c("yellow", "blue", "red"), repel= TRUE)
Note that I have placed one or two bugs in the above code to give you experience debugging. Debug the code and provide an exact description of the process you used to do so. You must also comment each line of code, explaining what it does and why. Finally, execute the code and submit the resulting plots.
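Because kmeans() chooses its starting centers at random, cluster assignments and plots can differ from run to run. The sketch below shows one way to make a run repeatable by calling set.seed() before kmeans(). It uses synthetic data, and the seed values and object names (toy, fit_a, fit_b) are arbitrary choices for illustration, not part of the assignment code.

```r
# Illustrative only: synthetic data, not the assignment's data set.
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),  # 30 points around (0, 0)
  matrix(rnorm(60, mean = 4), ncol = 2)   # 30 points around (4, 4)
)

# Resetting the same seed before each call gives the same random starting centers
set.seed(123)
fit_a <- kmeans(toy, centers = 2, iter.max = 10)

set.seed(123)
fit_b <- kmeans(toy, centers = 2, iter.max = 10)

# Check that both runs assigned every point to the same cluster
print(identical(fit_a$cluster, fit_b$cluster))
```

Recording the seed alongside submitted plots makes it possible to regenerate the same figures later.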