Assignment 3
K-Means Cluster Analysis
Cluster analysis has many useful applications. In this assignment we are going to apply it to market segmentation. Evgeniou (2015) has a very nice introduction to this topic. The scenario below and the dataset are both from his work. See the references section for his full article.
Scenario:
“The management team of a large shopping mall would like to understand the types of people who are, or could be, visiting their mall. They have good reasons to believe that there are a few different market segments, and they are considering designing and positioning the shopping mall services better in order to attract mainly a few profitable market segments, or to differentiate their services (e.g. invitations to events, discounts, etc.) across market segments.” (Evgeniou, 2016)
The dataset
The Market Research Survey Questions
• V1: Shopping is fun (scale 1-7)
• V2: Shopping is bad for your budget (scale 1-7)
• V3: I combine shopping with eating out (scale 1-7)
• V4: I try to get the best buys while shopping (scale 1-7)
• V5: I don't care about shopping (scale 1-7)
• V6: You can save a lot of money by comparing prices (scale 1-7)
• Income: the household income of the respondent (in dollars)
• Mall.Visits: how often they visit the mall (scale 1-7)
• Gender: m = male, f = female
Procedure:
We will follow Evgeniou’s 8-step approach to analyze the data. These steps will help you better understand the procedure for conducting a cluster analysis. If you are interested in how he explains these steps, please see his article cited in the references section at the end of these assignment instructions.
1. Confirm the data is metric.
The dataset that we are going to use is Data – Market Segmentation 3.xlsx, which is different from Evgeniou’s. Use Read Excel to read it.
Cluster analysis can only handle numeric data. All columns in the dataset except ‘gender’ are numeric, so all we need to do is determine whether this non-numeric variable can be converted to a numeric one. Fortunately, ‘gender’ can be converted easily. Let’s use Generate Attributes to convert ‘gender’ into 0’s (female) and 1’s (male). A screenshot is shown below. See also the discussion in class for details.
Define the ID variable as an identifier using the Set Role operator. Select all variables except the original Gender column before moving on.
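For readers who want to see the recoding outside RapidMiner, here is a minimal Python sketch of the same idea. It is not the Generate Attributes operator; the sample rows and the column name Gender_num are made up for illustration, and only the coding (female = 0, male = 1) comes from this assignment.

```python
# Hypothetical sample rows; the real data comes from the assignment's Excel file.
rows = [
    {"ID": 1, "Gender": "f", "Income": 35000},
    {"ID": 2, "Gender": "m", "Income": 62000},
    {"ID": 3, "Gender": "f", "Income": 48000},
]

# Same coding as the assignment: female = 0, male = 1.
GENDER_CODES = {"f": 0, "m": 1}

def encode_gender(row):
    """Return a copy of the row with a numeric Gender_num column added."""
    encoded = dict(row)
    encoded["Gender_num"] = GENDER_CODES[row["Gender"].strip().lower()]
    return encoded

encoded_rows = [encode_gender(r) for r in rows]
print([r["Gender_num"] for r in encoded_rows])
```

The original Gender column is left in place, which mirrors the instruction to exclude it later rather than delete it.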
2. Decide whether to scale or standardize the data.
V1 – V6 are on a 7-point Likert scale, but income and Mall.Visits are not. If you remember our discussion in class, variables with a wide range of values (e.g., income) will affect the Euclidean distance calculation a lot more than those with smaller ranges.
In this step you will normalize every variable to the range [0, 1]. Note: A variable that has been given the identifier role will not be rescaled. Normalization will have no effect on it.
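Rescaling to [0, 1] is usually done with the min-max formula x' = (x - min) / (max - min). The Python sketch below illustrates that formula on made-up values; it is only an assumption about what the rescaling amounts to, not RapidMiner's own implementation, and the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 (a choice made here).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values on very different scales.
income = [20000, 50000, 80000]
likert = [1, 4, 7]

print(min_max_normalize(income))  # [0.0, 0.5, 1.0]
print(min_max_normalize(likert))  # [0.0, 0.5, 1.0]
```

After rescaling, income and a Likert item cover the same range, so neither dominates the distance calculation simply because of its units.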
3. Decide which variables to use for clustering
We will use all variables except the original ‘gender’ variable. In real life, you will consider those variables that are relevant to your specific study.
Note: It is still OK to include the ID variable, since we had specifically defined it with an identifier role in a previous step. A variable with this role will not be used in calculation and model building even if you include it.
4. Define similarity measures between observations
There are several distance/similarity measures available for K-means. Since Euclidean Distance is the most popular one, let’s just use it.
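To make the distance measure concrete, the sketch below computes the Euclidean distance between two hypothetical respondents described by a V1 score and an income, first on the raw values and then after min-max rescaling. The respondents and the assumed ranges (V1 from 1 to 7, income from 20000 to 80000) are invented for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical respondents: (V1 score, income in dollars).
# They disagree completely on V1 but have almost the same income.
p = (1, 50000)
q = (7, 50500)
raw = euclidean(p, q)

# The same respondents after min-max rescaling, assuming V1 spans 1..7
# and income spans 20000..80000 (assumed ranges, not taken from the dataset).
p_scaled = ((1 - 1) / 6, (50000 - 20000) / 60000)
q_scaled = ((7 - 1) / 6, (50500 - 20000) / 60000)
scaled = euclidean(p_scaled, q_scaled)

print(round(raw, 2), round(scaled, 4))
```

On the raw values the distance is almost entirely the 500-dollar income gap; after rescaling it is almost entirely the V1 disagreement, which is the effect the normalization step is meant to achieve.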
5. Visualize individual attributes and pair-wise distance between the observations
We could examine the histograms for this purpose, but you do not have to do this step for this assignment.
6. Select the clustering method and decide how many clusters to have
a. We will use K-Means for this assignment. Be sure to check the “use local random seed” parameter in the K-Means operator, and set “local random seed” to 1992.
b. Use DBI to determine the best K. Do the following:
i. Enter your data and draw the line chart. Show K = 2, …, 10.
ii. Q1: How did you select the best K? What is your best K and its DBI value? Answer this question in Assignment 3.docx. Be sure to include a screenshot of the DBI values.
iii. Move your design to a sub-process and disable the sub-process.
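The idea behind the DBI sweep can be sketched in plain Python as follows. This is a bare-bones Lloyd's k-means plus the standard Davies-Bouldin definition (lower is better), not RapidMiner's operators, so it will not reproduce RapidMiner's numbers even with the same seed, and RapidMiner may display the value with a different sign convention. The toy points and all function names are hypothetical, and K only runs from 2 to 6 because the toy dataset is tiny.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

def k_means(points, k, seed, max_iter=100):
    """Basic Lloyd's algorithm with a fixed random seed; returns cluster labels."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean_point(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids)

def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index computed from the non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: mean_point(m) for lab, m in clusters.items()}
    # Scatter: average distance from each member to its own centroid.
    scatter = {lab: sum(dist(p, centroids[lab]) for p in m) / len(m)
               for lab, m in clusters.items()}
    worst_ratios = []
    for i in clusters:
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters if j != i))
    return sum(worst_ratios) / len(worst_ratios)

# Hypothetical, already-normalized toy data with three visible groups.
points = [
    (0.10, 0.10), (0.15, 0.20), (0.20, 0.10), (0.12, 0.18),
    (0.80, 0.85), (0.90, 0.80), (0.85, 0.90), (0.82, 0.78),
    (0.10, 0.90), (0.20, 0.85), (0.15, 0.80), (0.18, 0.92),
]

dbi_by_k = {k: davies_bouldin(points, k_means(points, k, seed=1992))
            for k in range(2, 7)}
best_k = min(dbi_by_k, key=dbi_by_k.get)

for k, value in dbi_by_k.items():
    print(k, round(value, 3))
print("lowest DBI at K =", best_k)
```

Because k-means depends on its starting centroids, a single seed can give a poor solution for some K. Fixing the seed makes the sweep repeatable, but it does not guarantee that the lowest DBI marks the most natural grouping.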
7. Profile and interpret the clusters
a. Q2: Re-design your process to perform k-means with the best K from the previous section. All the data pre-processing should be the same, and the random seed should be the same as well. Replace the Optimize Parameter operator with the actual k-means analysis. (Note that your results may be a little different from Evgeniou’s because of the normalization procedure that we used.) Include the following:
i. A screenshot of the cluster centroids
ii. A screenshot of the line plot of all cluster centroids.
iii. Q3: A paragraph or two describing, variable by variable, the differences between the clusters of customers who do not like shopping.
iv. Q4: Locate the cluster that represents the customers who love shopping. Describe the patterns that you see about these customers variable-by-variable. Provide a strategy to your manager to engage these customers.
8. Assess the robustness of our clusters
a. Evgeniou recommends the following ideas. You do not have to do this part in the assignment, but it is a good idea to try out the ideas and see how the group membership changes.
i. using different subsets of the original data
ii. using variations of the original segmentation attributes
iii. using different distance metrics
iv. using different segmentation methods
v. using different numbers of clusters
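One way to quantify how much group membership changes between two clustering runs is the Rand index: the share of observation pairs that both runs treat the same way (together in both, or apart in both). The Python sketch below is an illustration with made-up labels. The Rand index is a measure chosen here, not one prescribed by Evgeniou or by this assignment.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
    return agreements / len(pairs)

# Hypothetical cluster labels for six respondents from three runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]   # same grouping, different cluster numbering
run_3 = [0, 0, 0, 1, 2, 2]   # one respondent moved to another cluster

print(rand_index(run_1, run_2))  # 1.0
print(rand_index(run_1, run_3))  # 0.8
```

The measure ignores how clusters are numbered, which matters because k-means may number the same groups differently from one run to the next.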
What to submit
1. RapidMiner’s .rmp file.
2. Assignment 3.docx. Make sure the questions and your answers are clearly shown. If I have a hard time locating your answer(s), I will first take off 15% before grading your assignment. It is a good idea to include section headings and questions.
3. I hope you enjoy the real-life datasets used in this assignment. Your learning should not stop here. See if you can uncover some additional insights using these datasets.