Part I Small Course Project (20 Marks)
Given the following dataset which contains 14 two-dimensional data objects (each data object has its own x and y coordinates) and we want to carry out a mini data mining project to perform k- Means clustering on this dataset to generate clusters. We use the Euclidean distance to calculate the distance between a pair of data objects in the clustering.
Index x y
1 1.5 1
2 1 1.5
3 1.7 1.2
4 0.7 1.1
5 4 9
6 3.9 7.7
7 4.1 8
8 4.3 7.9
9 3.7 7
10 3.7 1
11 4.2 0.4
12 4 1.5
13 4.3 1.8
14 0.8 7.2
Students are encouraged to manually produce the clustering results in this project to fully understand how k-Means clustering works.
Answer the following questions:
Q1. How many iterations are needed for k-Means clustering (In other words, how many times you have to run k-Means before it can be terminated) on the dataset when k=2 with the initial cluster centres being (0.8, 7.2) and (0.7, 1.1). Please present the result of each iteration of the clustering process as well as visualize the final clustering result using a scatterplot that highlights clusters using different colors or shapes; (5 Marks)
Q2. How many iterations are needed for k-Means clustering on the dataset when k=3 with the initial cluster centres being (0.8, 7.2), (0.7, 1.1) and (4.3, 1.8). Please present the result of each iteration of the clustering process as well as visualize the final clustering result using a scatterplot that highlights clusters using different colors or shapes; (5 Marks)
Q3. How many iterations are needed for k-Means clustering on the dataset when k=3 with the initial cluster centres being (4, 9), (0.7, 1.1) and (4.3, 1.8). Please present the result of each
iteration of the clustering process as well as visualize the final clustering result using a scatterplot that highlights clusters using different colors or shapes; (5 Marks)
Q4. Based on your observations on the clustering results of Questions 1-3, please comment on the impact of the values of k and the initial selection of the cluster centres on the efficiency (i.e., the number of iterations taken) and accuracy of clustering of the k-Means clustering on the dataset; (4 Marks)
Q5. Which data in the dataset are more likely considered as outlier(s) and why? (1 Marks)
Note:
• For Q1-3, the result of each iteration of clustering is presented in the following format which contains the information of the centroid as well as the data in each cluster:
Iteration i: Centroid of Cluster k (x, y)
Data in Cluster k (data1, data 2, …data n)
o For example:
Iteration 1: Centroid of Cluster 3 (0.5, 0.7)
Data in Cluster 3 (#1, #4, #7)
means in the first iteration, the centroid of the 3rd cluster is (0.5, 0.7) and the cluster contains three data objects which are the 1st, 4th and 7th data in the dataset, respectively.
• Please kindly note that you do not need to do any programming or implementation to complete the small project. As you can see, the small project is actually a few questions around k-Means on a toy dataset. Given the extremely small number of data in the dataset, it is expected that you go through each iteration of the clustering process manually so that you can gain a better hands-on understanding on how k-Means works. Of course, there is no way for me to stop you from using any existing tools or systems. However, please be mindful that many of the existing tools/systems only provide the last result of the clustering without giving you the results of the intermediate iterations.
• To present clusters using different colors in Excel, you need to create different data series corresponding to different clusters. Right-click your initial scatterplot and choose to select data from the spreadsheet for creating different data series. Then, Excel will use different colors automatically to show different data series in the scatterplot. You can also use any other visualisation tools or packages to present the clusters.
Part II Major Course Project (80 Marks)
2.1 Project options
There are two possible types of major projects that you can conduct in this course based on your personal preference.
Option 1: Implementation and application of the existing data mining algorithms
You can choose to implement at least ONE of the following mainstream data mining algorithms:
• k-Means algorithm for clustering
• Apriori algorithm for associate rule mining
• LOF algorithm for outlier detection
Besides implementing one of the above algorithms, you need to apply the algorithm to at least one real-life dataset to perform the corresponding data mining function, either clustering, association rule mining or outlier detection. You need to present the results obtained by applying the algorithm together with any useful, interesting findings, such as the patterns or knowledge, discovered from the dataset.
Option 1 is appropriate for the students with strong programming background who are interested in implementing mainstream data mining algorithms in order to understand how those techniques work under the hood. Necessary graphical user interface should be developed to allow friendly human-computer interaction and result visualisation. It's preferred that the algorithms be implemented using one of the popular programming languages such as C/C++, Java or Python, but we accept other programming languages for implementation as well.
It is important to note that in implementing the algorithm(s), you are not allowed to directly call the algorithm if it has already been well encapsulated in the programming language. In other words, you are not allowed to call k-Means algorithm (Apriori or LOF algorithm) directly as a function from the library if it has already been implemented by the programming language. Source codes will be checked by the marker to ensure this requirement is reinforced.
Option 2: Conduct your own project
You can also opt in doing a research project in data mining if you already have some problems which can be solved using data mining methods based on your working or studying experience.
It is expected that the project is consistent with the first option in terms of the difficulty level and workload. Overly simple projects may be subject to a great loss of marks. Students choosing this option are strongly encouraged to approach the examiner for consultation to clear any doubts about their project topics.
The projects in this category should encompass the following ingredients, though your report may not be organised in the exact same way:
• Motivations: the students should establish the motivations for pursuing this project. What are the problems you want to solve using data mining methods?
• Methodology: what are the data mining techniques and/or systems you want to use to solve the problem and how to do this?
• Datasets: what are the dataset(s) to be used for this project and how do you acquire them?
• Experimental evaluation: how well the applied data mining methods/systems solve your problem in terms of efficiency (speed) and effectiveness or against other performance metrics if appropriate?
You can carry out programming and/or use the existing data mining software or tools such as
WEKA or Rapidminer to complete the project.
Option 2 is appropriate to the students who already have some practical problems in mind to solve using data mining techniques and may have some previous experience in data analytics.
2.2 Deliverables of the major project
You need to deliver the following several items in the major project, regardless of the option you choose.
a) Proposal of the major project
A proposal (1–2 pages) for the final major project needs to be submitted which outlines the background, motivations or aims, problem formulation, possible data mining solution, system architecture design, datasets and a timeline for completing the project. What you need to submit is a standalone PDF file.
It is expected that your proposal is consistent with the final project you will conduct, even though opportunities are provided for updating your proposal in the final submission.
b) Report of the major project
Typically, the report should be over 20-page long (single spacing with reasonable margins) and must at least contain the following sections:
• A 1-page cover page at the beginning of the report which contains the following basic information:
o Your name;
o Your USQ ID;
o Your project option (1 or 2);
o Abstract of your project (Briefly present the problem you want to solve, the data mining method(s) you have used and the final mining results)
• Background and motivation. Discuss the background of the project and motivation behind your decision for carrying out the project;
• Problem formulation. Present the formulation of the problem you are going to resolve using data mining techniques and point out the possible challenges in solving the problem;
• Literature review. A short literature review that you have conducted to survey the existing work related to the data mining problem that you are going to tackle and the existing methods in literature. This review is supposed to be brief and the coverage of
over 10 related and recent (e.g., published in the last five years) papers is deemed sufficient. Complete and correct reference information need to be provided at the end of the report for the papers cited;
• Design and architecture. Present your design as to how the problem can be solved. For example, you can present the workflow to show the steps or procedures to solve the problem. A diagram showing the different functional modules involved in the system and their interaction and relationships can also be presented;
• Dataset(s). A description of the dataset(s) that you have used in your project for validating the performance of your data mining method;
• Snapshots. The complete set of snapshots of the system interfaces. You can capture the screen using the screen capture software;
• Use instructions. A document contains instructions as to how to compile and execute your program. If you are conducting your own project in Option 2, you should provide instruction on how to use your data mining methods/systems to solve your problem. Please be as specific as possible in the instructions;
• Findings, lessons and experiences. Please discuss any useful, interesting knowledge and patterns discovered from your project as well as the lessons you learn from this project and any experiences you would like to share from this work;
• Conclusion. Conclude the whole project and possibly identify the limitation of the current system and the possible future work.
CS 340 Milestone One Guidelines and Rubric Overview: For this assignment, you will implement the fundamental operations of create, read, update,
Retail Transaction Programming Project Project Requirements: Develop a program to emulate a purchase transaction at a retail store. This
7COM1028 Secure Systems Programming Referral Coursework: Secure
Create a GUI program that:Accepts the following from a user:Item NameItem QuantityItem PriceAllows the user to create a file to store the sales receip
CS 340 Final Project Guidelines and Rubric Overview The final project will encompass developing a web service using a software stack and impleme