Better World Shopping Mall (BWSM) is a shopping center that caters specifically to the apparel needs of urban residents. Since last year, its revenue has been declining and many retail shops have decided to move out of the mall. The management of BWSM wants to study the amount of money that its customers spend on shopping and divide them into groups for promotion. The management has decided to use clustering to better profile the customers according to their demographics and spending power. You are given a dataset (BWSM.csv), described in Table 1, to help BWSM carry out this data mining project.
Table 1. Description of BWSM.csv
Attribute | Description | Labels/Values
CustomerID | Unique identifier of the customer | Unique code
Gender | Gender of the customer | “M” for Male / “F” for Female
Education | Education level of the customer | “1” for High school or below / “2” for Bachelor’s degree / “3” for Master’s degree or above
Age | Age of the customer | Integer measurement
Income | Annual income of the customer | Dollar measurement
Household | Household type of the customer | “1” for Single / “2” for Couple / “3” for Family with children / “4” for Extended family (i.e. children and grandparents) / “5” for Others
VisitFrequency | How many times the customer visits BWSM per month | Integer measurement
AvgSpent | The average amount the customer spent in BWSM per visit | Dollar measurement
(a) With reference to the CRISP-DM framework, discuss how you plan to carry out this data mining project. (24 marks)
(b) Analyse the data based on the summary statistics given in Table 2.
Table 2. Summary statistics of the attributes
(i) Explain if there is a need to perform data transformation. (4 marks)
(ii) Describe a scenario where z-score normalisation is preferable to min-max normalisation. In your answer, differentiate between these two (2) categories of data normalisation techniques. (4 marks)
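For reference, the two normalisation techniques named in (b)(ii) can be sketched as follows. This is a minimal illustration only: the income values are made up and are not taken from BWSM.csv, and the helper names `min_max` and `z_score` are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical annual incomes with one extreme value (assumption, not real data).
incomes = [30000, 40000, 50000, 60000, 1000000]

scaled = min_max(incomes)        # bounded to [0, 1]
standardised = z_score(incomes)  # mean 0, standard deviation 1, unbounded

print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardised])
```

With these made-up numbers, min-max scaling squeezes the four ordinary incomes into a narrow band near 0 because the range is set by the single extreme value, whereas z-scores are expressed in standard deviations from the mean and are not forced into a fixed interval.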
(c) Assume that you have built two clustering models, Model A and Model B. The details of each model are given in Table 3. In each model, you are able to clearly describe the profile of each cluster. Based on Table 3, identify the model that you believe is better for deployment. Defend your choice by providing good reasons.
Table 3. Description of Model A and Model B
Description | Model A | Model B
Number of clusters | 3 | 5
Number of clustering criteria | 4 | 4
Ease of interpretation of the profile of each cluster | High | High
Average Silhouette coefficient | 0.75 | 0.79
Size of each cluster | Cluster 1: 23%, Cluster 2: 46%, Cluster 3: 31% | Cluster 1: 17%, Cluster 2: 25%, Cluster 3: 26%, Cluster 4: 20%, Cluster 5: 12%
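As background for the "Average Silhouette coefficient" row in Table 3, the sketch below shows how that measure is usually computed. It assumes Euclidean distance and uses four toy points invented for illustration; it does not reproduce Model A or Model B, and the function name `silhouette_values` is hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+
from statistics import mean

def silhouette_values(points, labels):
    """Return one silhouette value per point, each in [-1, 1]."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            # Common convention: a single-member cluster scores 0
            values.append(0.0)
            continue
        a = mean(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels) if other != own
        )
        values.append((b - a) / max(a, b))
    return values

# Toy data (assumption): two tight, well-separated clusters.
points = [(1, 1), (1, 2), (8, 8), (8, 9)]
labels = [0, 0, 1, 1]

values = silhouette_values(points, labels)
average = mean(values)
print(round(average, 3))
```

Values close to 1 indicate points that sit much nearer to their own cluster than to the next nearest one; the average over all points is the figure reported per model in Table 3.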
Question 2
You are a data scientist in the Quality Control Department of a wine production company. You would like to understand the factors affecting wine quality by developing a classification tree that can predict whether a wine is of “Low quality” or “High quality”. A dataset related to a particular type of white wine produced by your company was collected. It contains 4898 white wine samples.
In the dataset (winequality-white.csv), there are 11 attributes related to the physicochemical properties of the wine and 1 attribute, “quality”, indicating the quality of the wine. Table 4 shows the attributes in winequality-white.csv and the range of each attribute.
Table 4. Description of winequality-white.csv
Attribute (unit) | Range
fixed acidity (g(tartaric acid)/dm³) | 3.8-14.2
volatile acidity (g(acetic acid)/dm³) | 0.1-1.1
citric acid (g/dm³) | 0-1.7
residual sugar (g/dm³) | 0.6-65.8
chlorides (g(sodium chloride)/dm³) | 0.01-0.35
free sulfur dioxide (mg/dm³) | 2-289
total sulfur dioxide (mg/dm³) | 9-440
density (g/cm³) | 0.987-1.039
pH | 2.7-3.8
sulphates (g(potassium sulphate)/dm³) | 0.2-1.1
alcohol (vol.%) | 8.0-14.2
quality | 3-9
(a) The quality of the wine is initially determined by 30 wine experts on a scale that ranges from 0 (bad) to 10 (excellent) using blind wine tasting. The attribute “quality” in the dataset is the final wine quality score based on the 30 scores. Discuss whether the attribute “quality” should be the “mode”, “mean” or “median” of the scores given by the wine experts. Provide an explanation to support your answer. (6 marks)
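To make the three candidate summaries in (a) concrete, the sketch below computes each of them for 30 expert scores. The scores are entirely hypothetical (including the few extreme ratings) and are not drawn from winequality-white.csv.

```python
from statistics import mean, median, multimode

# 30 hypothetical expert scores on the 0-10 scale (assumption, not real data):
# most experts give 5-7, while three give extreme ratings.
scores = [6] * 12 + [7] * 10 + [5] * 5 + [0, 1, 10]

score_mean = mean(scores)        # arithmetic average, pulled by extreme ratings
score_median = median(scores)    # middle value of the sorted scores
score_modes = multimode(scores)  # most frequent score(s); may return several

print(len(scores), round(score_mean, 2), score_median, score_modes)
```

For this made-up panel, the median and the mode stay on the integer scale, while the mean is fractional and shifts when the extreme ratings change.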
(b) Based on the attribute “quality”, you decided to perform binning in the IBM SPSS Modeler to categorise the wine quality into two classes: Low Quality and High Quality. After the Binning node has been executed, there is a new attribute “quality_BIN” created with two bin values: “1” and “2”. Figure 1 shows the setting in the Binning Node. Discuss the purpose of binning and describe the meaning of the two bins in this context.
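The Binning node settings in Figure 1 are not reproduced in the text, so the following is only a sketch of one possible configuration. It assumes fixed-width binning into two bins over the quality range 3-9 from Table 4, treats the shared boundary as belonging to the upper bin, and assumes that bin “1” corresponds to Low Quality and bin “2” to High Quality. The function name `fixed_width_bin` is hypothetical and is not part of IBM SPSS Modeler.

```python
def fixed_width_bin(value, lo, hi, n_bins):
    """Return the 1-based index of the equal-width bin that contains value."""
    width = (hi - lo) / n_bins
    index = int((value - lo) // width) + 1
    # The maximum value would otherwise fall just past the last bin.
    return min(index, n_bins)

# Assumed mapping of bin values to classes (not confirmed by the source).
BIN_LABELS = {1: "Low Quality", 2: "High Quality"}

# Quality scores spanning the range given in Table 4.
quality_scores = [3, 4, 5, 6, 7, 8, 9]
quality_bin = [fixed_width_bin(q, lo=3, hi=9, n_bins=2) for q in quality_scores]

for q, b in zip(quality_scores, quality_bin):
    print(q, b, BIN_LABELS[b])
```

Under these assumptions the cut point falls at 6, so scores 3 to 5 land in bin 1 and scores 6 to 9 land in bin 2; a different binning method or boundary rule in Figure 1 would give a different split.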