
INSTRUCTIONS TO CANDIDATES

ANSWER ALL QUESTIONS

Assessment Requirements

This coursework is an individual assignment. You must implement it independently.

In this coursework, you are asked to analyse the "2020 Stack Overflow Annual Developer Survey" dataset, which is available at https://insights.stackoverflow.com/survey. The data were collected by Stack Overflow, a question and answer site for professional and enthusiast programmers. In 2020, there were 65,000 responses from over 180 countries and dependent territories. The survey examines all aspects of programmers’ experience, from career satisfaction to opinions on open source software. This information could be useful to you as software developers. For an overview of the 2020 Stack Overflow Developer Survey, you may refer to https://insights.stackoverflow.com/survey/2020.

The following tasks are required in the coursework:

(1) To understand the data set. You must implement this task by exploratory data analysis through Python programming.

(2) To describe characteristics of low-income developers and high-income developers. You must implement this task by cluster analysis through Python programming. Note: low income is defined as a salary up to and including the median of all developers’ annual salaries, while high income is defined as a salary above that median. Income is measured by the column ConvertedComp, which represents salary converted to annual USD.

(3) To build machine learning models for predicting whether a developer is in the high-income group based on survey data. You must implement this task by classification through Python programming.
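The median-based income definition used in tasks (2) and (3) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the toy values are invented, the helper name add_income_label and the column name HighIncome are hypothetical, and dropping rows with a missing ConvertedComp is an assumption that you should justify in your own report.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5, 6],
    "ConvertedComp": [20000.0, 55000.0, None, 90000.0, 55000.0, 150000.0],
})


def add_income_label(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a salary and add a binary HighIncome column."""
    # Assumption: rows with no reported salary cannot be labelled, so drop them.
    labelled = frame.dropna(subset=["ConvertedComp"]).copy()
    median_salary = labelled["ConvertedComp"].median()
    # Low income: salary <= median. High income: salary > median.
    labelled["HighIncome"] = (labelled["ConvertedComp"] > median_salary).astype(int)
    return labelled


labelled = add_income_label(df)
print(labelled)
```

Because salaries equal to the median fall into the low-income group, the two groups need not be exactly the same size.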

There are over 60 columns in the data set. Some attributes (e.g. country or education) are likely to have a bigger impact on income than others. Therefore, you must decide for yourself which features to use in the data analysis tasks. A good selection of features may improve the performance of your machine learning models.

You will present the coursework in the form of a technical report containing the six sections listed in Table 1. A report template is provided at the end of this specification. Please follow the template.

Table 1. Report sections

Section                                                 Weighting   Recommended Pages per Section
1. Introduction                                         0.1         1-2
2. Data Understanding and Exploratory Data Analysis     0.2         7-10
3. Cluster Analysis                                     0.1         2-4
4. Machine Learning Methods and their Implementation    0.3         7-10
5. Evaluation of Machine Learning Models                0.2         4-5
6. Discussions and Conclusions                          0.1         1

A report of 20-30 pages is recommended. The report must not exceed 30 pages in total (excluding the title page, contents page, references, and appendices), with the main text set in Calibri at size 11 or 12. A penalty of a single grade will be incurred if you exceed the 30-page limit. You may put extra information in appendices, which do not count towards the 30-page limit.

You are asked to write the report using the report template provided at the end of this specification. It is recommended that you cite and list references using the Harvard referencing style (see https://www.ntu.ac.uk/m/library/referencing-made-easy). However, other (author, year) styles such as APA are also accepted.

By the submission deadline, you are expected to submit both your report (in MS Word or PDF format) and the Python source code (in plain text format) to NOW Dropbox under the 'report' folder.

Your report will be assessed according to the assessment criteria provided in Section II.

The remainder of this specification provides you with detailed requirements for each area of content – you should read it very carefully.

1. Introduction

• Describe the CRISP-DM methodology for data analytics projects (Wirth and Hipp, 2000) and discuss its applications and importance with appropriate reference to the literature.

• State the data analytic task in your coursework and explain how you are applying CRISP-DM to your coursework.

• State the insight you intend to gain in the coursework.

2. Data Understanding, Data Pre-processing, Exploratory Data Analysis

• Describe the Survey Data Set (Stack Overflow Developer Survey, 2020)

• Briefly describe the data attributes, giving each attribute's name, description and data type, using descriptive statistics and exploratory data analysis (Larose and Larose, 2015). Note: it is not necessary to include all results (e.g., tables and figures) in the main text. Select only the most important results for the main text and leave the others in the appendix.

• Describe the characteristics of the data set, such as (though not limited to) the number of instances, possible duplicate or conflicting instances, missing values, erroneous values, and outliers.

• If any duplicate or conflicting instances, missing values, erroneous values, or outliers exist, describe the process of cleaning these data.

• Conduct exploratory data analysis on the data set. For example (though not limited to): identify outliers using a histogram, box plot, or scatter plot; visualise the percentage of each class using a pie chart; explore the relation between features and the target variable using a crosstab and a stacked bar plot.
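The checks listed above can be sketched with pandas as shown below. This is only an illustration under stated assumptions: the rows are invented toy values rather than survey data, the 1.5 * IQR rule is one common outlier heuristic among several, and the HighIncome and IsOutlier column names are hypothetical.

```python
import pandas as pd

# Toy rows that mimic a few survey columns; values are invented for illustration.
df = pd.DataFrame({
    "Country": ["UK", "UK", "India", "India", "USA", "USA", "USA"],
    "EdLevel": ["BSc", "MSc", "BSc", None, "MSc", "BSc", "BSc"],
    "ConvertedComp": [40000.0, 60000.0, 10000.0, 12000.0, 120000.0, None, 2000000.0],
})
# Append an exact copy of the first row to simulate a duplicate response.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Basic characteristics: size, missing values per column, duplicate rows.
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Cleaning step (an assumption): drop duplicates and rows without the target.
clean = df.drop_duplicates().dropna(subset=["ConvertedComp"]).copy()

# Flag salary outliers with the 1.5 * IQR rule.
q1, q3 = clean["ConvertedComp"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean["IsOutlier"] = (clean["ConvertedComp"] < lower) | (clean["ConvertedComp"] > upper)

# Relate a feature to the binary target with a crosstab.
median_salary = clean["ConvertedComp"].median()
clean["HighIncome"] = (clean["ConvertedComp"] > median_salary).astype(int)
income_by_country = pd.crosstab(clean["Country"], clean["HighIncome"])

print(n_rows, n_cols, n_duplicates)
print(missing_per_column)
print(income_by_country)
```

Calling income_by_country.plot(kind="bar", stacked=True) on the crosstab would produce the stacked bar plot mentioned above, provided matplotlib is installed.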

3. Cluster Analysis

• Split the data set into two subsets, labelled low-income (salary up to and including the median of all developers’ annual salaries) and high-income (salary above that median). Note: income is given in the column ConvertedComp, which represents salary converted to annual USD. Perform cluster analysis on these two subsets separately using clustering methods such as k-Means and hierarchical clustering.

• If applicable to a machine learning method, describe the process of data transformation and normalisation used in that method.

• Implement cluster analysis on each subset. Describe parameter setting, initialisation, stopping criterion and discuss your choice of cluster number.

• Describe characteristics of low-income developer clusters and high-income developer clusters found in cluster analysis.
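One possible shape for this analysis is sketched below with scikit-learn. It is a hedged illustration only: the data are randomly generated rather than taken from the survey, the two numeric features are an arbitrary choice, the helper best_kmeans is hypothetical, and selecting the cluster number by silhouette score is one option among several (the elbow method is another).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; real work would load and clean the survey CSV.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "YearsCodePro": rng.integers(0, 30, size=n).astype(float),
    "WorkWeekHrs": rng.normal(40, 5, size=n),
    "ConvertedComp": rng.lognormal(mean=11, sigma=0.8, size=n),
})

# Split at the median: low is <= median, high is > median.
median_salary = df["ConvertedComp"].median()
subsets = {
    "low": df[df["ConvertedComp"] <= median_salary],
    "high": df[df["ConvertedComp"] > median_salary],
}
features = ["YearsCodePro", "WorkWeekHrs"]  # arbitrary illustrative choice


def best_kmeans(frame, k_values=range(2, 6)):
    """Standardise the features, then pick k by the highest silhouette score."""
    X = StandardScaler().fit_transform(frame[features])
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
    return best_k, model, scores


results = {}
for name, frame in subsets.items():
    best_k, model, scores = best_kmeans(frame)
    # Per-cluster feature means give a simple profile of each cluster.
    profile = frame[features].groupby(model.labels_).mean()
    results[name] = (best_k, profile)
    print(name, "k =", best_k)
    print(profile)
```

In KMeans, n_init controls the number of random initialisations, while max_iter and tol govern the stopping criterion; these are the settings the report is asked to describe.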

4. Machine Learning for Classification and their Implementation

• Describe the workflow of machine learning for classification using one or more flow charts.

• State and describe the classification methods that are used in your coursework. At least three classifiers should be chosen for the classification problem. The methods may be chosen from those taught in this module, such as k-Nearest Neighbour, Decision Trees, Logistic Regression, and Artificial Neural Networks. You may also choose methods that are not taught in this module.

• Describe parameter setting, data transformation and normalisation in the methods that you have chosen for the task.

• If applicable to a machine learning method, describe the process of data transformation and normalisation used in that method.

• Build and implement machine learning models and tune the hyper-parameters of these models for good performance. You may implement these models using scikit-learn (sklearn) modules. You may also use any other Python libraries that are not taught in this module.

• Implement ensemble learning by combining your classifiers. Describe the ensemble method(s) that you are using.
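The steps above (three base classifiers, hyper-parameter tuning, and an ensemble) can be sketched with scikit-learn as follows. This is an illustration under assumptions, not a required design: the data come from make_classification rather than the survey, the parameter grids are deliberately tiny, and soft voting is only one of several ensemble methods you could choose.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the prepared survey features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Distance- and gradient-based models get a scaler; the tree does not need one.
knn = Pipeline([("scale", StandardScaler()), ("clf", KNeighborsClassifier())])
logreg = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
tree = DecisionTreeClassifier(random_state=0)

# Small illustrative grids, tuned with 5-fold cross-validation on the training set.
searches = {
    "knn": GridSearchCV(knn, {"clf__n_neighbors": [3, 5, 11]}, cv=5),
    "logreg": GridSearchCV(logreg, {"clf__C": [0.1, 1.0, 10.0]}, cv=5),
    "tree": GridSearchCV(tree, {"max_depth": [3, 5, None]}, cv=5),
}
best = {name: s.fit(X_train, y_train).best_estimator_ for name, s in searches.items()}

# Soft voting averages the predicted class probabilities of the tuned models.
ensemble = VotingClassifier(estimators=list(best.items()), voting="soft")
ensemble.fit(X_train, y_train)

test_accuracy = {name: model.score(X_test, y_test) for name, model in best.items()}
test_accuracy["ensemble"] = ensemble.score(X_test, y_test)
print(test_accuracy)
```

Putting the scaler inside each Pipeline means it is refitted on the training folds only during cross-validation, which avoids leaking information from the validation folds.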

5. Evaluation of Machine Learning Models

• Evaluate and compare the performance of the machine learning models (both base and ensemble models). You should use at least one of the following performance metrics, as appropriate: accuracy, confusion matrix, recall and precision, or ROC curve.

• Explain your results. Generate tables to list the results or figures to visualize the results.

• Review the performance of the different models (base and ensemble models). You may critically review which model performed best and which hyper-parameter settings were most effective. Provide the necessary explanations.
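The metrics named above can be computed with sklearn.metrics, as in the short sketch below. The labels and scores are hand-written toy values chosen so the results are easy to verify by hand; they are not results from any model or from the survey data, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and predicted probabilities for the positive (high-income) class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])

# Turn probabilities into class predictions with a 0.5 threshold (an assumption).
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
accuracy = accuracy_score(y_true, y_pred)    # 6 of 8 correct
precision = precision_score(y_true, y_pred)  # 3 true positives of 4 predicted positives
recall = recall_score(y_true, y_pred)        # 3 true positives of 4 actual positives
auc = roc_auc_score(y_true, y_score)         # threshold-free ranking quality

print(cm)
print(accuracy, precision, recall, auc)
```

Here the confusion matrix is [[3, 1], [1, 3]], accuracy, precision and recall are all 0.75, and the ROC AUC is 0.6875, because 11 of the 16 positive-negative pairs are ranked correctly.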

6. Discussions and Conclusions

• Summarise your task and your findings from the data analysis on this survey data set.

• Describe the insight that you have gained from the module “machine learning for data analytics”.

• Explain whether, and how well, the module has developed your understanding of machine learning for data analytics.

Finally, note that a few notebooks on Kaggle and GitHub already work with the Stack Overflow Annual Developer Survey 2020. You may study these notebooks, but you must write your own code for your coursework. You have one chance to check the similarity between your coursework and others by submitting your report and source code to Draft report and Draft code respectively on NOW Dropbox. The Turnitin similarity score must be below 30% for your report and below 50% for your code. Coursework that violates these thresholds will be downgraded.
