Appendix B Variables in FRAMINGHAM dataset
Variable name | Variable description
Status | Person’s status (Alive, Dead)
Sex | Sex (Female, Male)
Height | Person’s height in inches
Weight | Person’s weight in kg
Diastolic | Person’s blood pressure in the arteries when the heart rests between beats. A normal diastolic blood pressure measurement is 80 or below.
Systolic | Person’s blood pressure in the arteries: when the heart beats, it contracts and pushes blood through the arteries to the rest of the body. A normal systolic blood pressure measurement is 120 or below.
Smoking | Number of cigarettes smoked per week
Death_age | Age at death in years
Cholesterol | Person’s cholesterol measurement
Chol_status | Cholesterol status (borderline, desirable, high)
Bpressure_status | Blood pressure (High, Normal, Optimal)
Weight_status | Weight status (Underweight, Normal, Overweight)
Smoke_status | Smoking status (Nonsmoker, Light, Moderate, Heavy, Very heavy)
Note: Age at death is an interesting variable for exploratory data analyses (hence its inclusion); however, it is recommended that this variable is rejected before applying directed data mining techniques.
The assignment is an individual assessment, requiring you to analyse the FRAMINGHAM heart study dataset within SAS Enterprise Miner, using the data mining techniques covered in the IMAT5238 module, and detailing your results, interpretations, conclusions and recommendations in a well-structured technical report.
• A sample of data from the FRAMINGHAM dataset is shown in Appendix A.
• The FRAMINGHAM dataset contains 5,209 observations and 13 variables. The variables in the dataset are shown in Appendix B.
• The coursework will be assessed according to the marking grid in Appendix C.
Each of you will work on your own random sample of the data, typically generated by entering the last 5 digits of your DMU student ID number as the random seed in the Data Partition node. (Note: If any of the models produces spurious output, enter the last 4 digits (or the last 3 digits) of your DMU student ID number as the random seed instead, so that you can generate sensible output that you can interpret.)
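The seeding rule above can be illustrated outside SAS Enterprise Miner. The following is a minimal Python sketch, not the Data Partition node itself: the helper names `seed_from_student_id` and `partition`, the example ID `P12345678` and the 60/40 split are all hypothetical choices made for illustration only.

```python
import random


def seed_from_student_id(student_id: str, n_digits: int = 5) -> int:
    """Return the last n_digits digits of a student ID as an integer seed."""
    digits = "".join(ch for ch in student_id if ch.isdigit())
    if len(digits) < n_digits:
        raise ValueError("student ID has fewer digits than requested")
    return int(digits[-n_digits:])


def partition(n_rows: int, seed: int, train_fraction: float = 0.6):
    """Shuffle row indices with a fixed seed and split them into two sets."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


# Hypothetical ID; 5,209 is the number of observations stated in the brief.
seed = seed_from_student_id("P12345678")
train_rows, valid_rows = partition(5209, seed)
```

Because the generator is seeded, rerunning the split with the same ID gives the same sample, which is what makes each student's results reproducible. Falling back to 4 or 3 digits corresponds to calling `seed_from_student_id(student_id, 4)` or `seed_from_student_id(student_id, 3)`.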
• Conduct a thorough exploratory data analysis including investigation of outliers;
• Develop Regression, Decision Tree and Neural Network models, using data replacement, data transformation and variable selection methods, where appropriate;
• If appropriate, you may choose to add one of the nodes from the research activity published in the general discussion board (DB Activity 3);
• Use your data mining skills and appropriate lift and diagnostic charts to ensure you get the best model from the above directed data mining techniques;
• Detail your results, interpretation, conclusions, recommendations and evaluation (including a critical appraisal of the data analyses conducted) of the first flow of the data mining cycle in a report. Assume the report is for the Chief Executive Officer (CEO) of The National Health Service.
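For the lift charts mentioned in the list above, the arithmetic behind a single point on a cumulative lift chart can be sketched in Python. This is an illustrative sketch under stated assumptions, not SAS Enterprise Miner output: the function name `cumulative_lift` and the toy data are hypothetical, and the binning used by the software may differ.

```python
def cumulative_lift(actual, scores, depth=0.1):
    """Lift of the top `depth` fraction of cases, ranked by descending score.

    actual: 0/1 outcomes (1 = event of interest); scores: model probabilities.
    """
    if len(actual) != len(scores) or not actual:
        raise ValueError("actual and scores must be non-empty and equal length")
    overall_rate = sum(actual) / len(actual)
    if overall_rate == 0:
        raise ValueError("no positive cases, so lift is undefined")
    # Rank cases from highest to lowest predicted score.
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * depth))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    # Lift = event rate among the top-ranked cases / overall event rate.
    return top_rate / overall_rate


# Toy data, made up for illustration only.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.5, 0.05]
lift_top_half = cumulative_lift(actual, scores, depth=0.5)
```

A lift above 1 at a given depth means the model concentrates events among its highest-scored cases better than random selection would; comparing this value across candidate models on validation data is one way to support the choice of final model.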
You are required to submit a well-structured, detailed, technical report of 20 pages maximum (not including appendices) that gives details of:
• a description of the business problem and the corresponding data mining problem, and use of an appropriate data mining framework;
• the exploratory data analysis;
• the technical interpretation of the models considered;
• a comprehensive assessment of the models;
• a full justification of the choice of final directed data mining model;
• if you have added a node from the research activity, a justification for its use and what you anticipate finding; you should also assess its performance and limitations. You may wish to include one of the presentations in the appendix for reference; if so, cite the original author;
• conclusions and recommendations resulting from your work and a critique of the analyses conducted.
• In the Appendix you must provide a workflow diagram of your process and a short reflection identifying any lessons learned in doing the assignment, indicating any difficulties and how you overcame them.
This assignment contributes 70% of your final module grade. You will need to submit a copy of your report using the TurnitinUK link in the assessments section of the Data Mining module shell on Blackboard (to be made available prior to the coursework deadline).
It is not necessary to submit a hardcopy to the CEM advice centre; however, you might find it useful to print a hardcopy for proofreading your work. Note that research has demonstrated that people are more critical when reading paper materials than when reading on electronic screens.
Note: The filename of your report should consist of your p-number and your name.
Assessment deadline: Week 33, Friday 21st May 2021 at noon
This coursework adheres to the Faculty of CEM Late Work Guidelines.
Cardiovascular disease (CVD) is the leading cause of death and serious illness in the United States. In 1948, the Framingham Heart Study, under the direction of the National Heart Institute, embarked on an ambitious project. At that time, little was known about the general causes of heart disease, but the death rates for CVD had been increasing steadily since the beginning of the century and had become an American epidemic.
The objective of the Framingham Heart Study was to identify the risk factors that contribute to CVD by following its development over a long period of time in a large group of participants who had not yet developed symptoms of CVD or had not suffered a heart attack or stroke. The researchers recruited 5,209 men and women between the ages of 30 and 62 from the town of Framingham, Massachusetts, and began the first round of extensive physical examinations and lifestyle interviews that they would later analyse for common patterns related to CVD development.
Since 1948, the subjects have continued to return to the study every two years for a detailed medical history, physical examination, and laboratory tests. In 1971, the Study enrolled a second generation, the original participants' adult children and their spouses, to participate in similar examinations. Then in 2002 the study entered a new phase: the enrolment of a third generation of participants, the grandchildren of the original cohort. Today, the Framingham Heart Study remains a world-class epicentre for cutting-edge heart research, and its work has been acclaimed as some of the most important in medicine. The concept of CVD risk factors has led to the development of effective treatment and preventive strategies in clinical practice.
You are a data miner and have been commissioned by The National Health Service to analyse the FRAMINGHAM dataset and to submit a technical report detailing your findings, which should include the best model that should be used to identify the risk factors that contribute to Cardiovascular Disease (CVD) and the characteristics of the participants after cluster analysis.
Note: The analysis being conducted represents the first flow of the virtuous cycle of data mining.
Notes
Your report should be no longer than 20 A4 sides, with a minimum font size of 12. Marks are awarded for technical correctness, descriptions of models, appropriate justification of node and parameter choices, appropriate actions to guard against overfitting, indications of model robustness, model limitations, data insights, and analysis supported by appropriate charts.
Reports are expected to be written to a professional standard: clear, concise and with good use of English. Text should be supported by a relevant choice of diagrams and tables to summarise data. Avoid repetition and redundancy, make appropriate use of appendices, and include a table of contents, page numbers, table and figure numbering, and an informative abstract or executive summary.
All diagrams must be legible and appropriately labelled; if short of space, use appendices. If you use coloured diagrams to illustrate or contrast points, then a hardcopy of your report must be printed in colour. Assume that the audience of the report is senior management with no knowledge of the technical details of data mining.
You should budget half your time for using the SAS Enterprise Miner software to generate informative models and to explore nodes and appropriate non-default options. The other half of your time should be budgeted for producing a clear, well-structured report which describes your work and addresses the assignment brief.
The SAS software is a professional data mining product full of features and options. You are free to explore these options; however, if you use something not covered in the course you must justify its use to receive credit. You will be penalised for inappropriate use of features you cannot adequately justify or explain. There will come a point of diminishing returns where, no matter how much effort you put into the software, you cannot improve on model performance. At that point you should focus on writing the report and analysing your results.
You should start work on the assignment immediately and make allowance for any difficulties you might face. Leaving the work till the last minute will result in a poor-quality report. You should not underestimate the time it takes to become fluent in the use of the software. If you have followed all the labs attentively and with understanding, you should not face too many hurdles. The completion of the assignment represents a capstone moment which will integrate everything you have learnt on the course.
A suggested plan is that, after completing each week's lab activities on the HMEQ data set, you apply what you have learnt to the Framingham data set. That way you do not leave it all to the end. Leaving it all to the end would be Tim Urban’s approach, which is not a good idea… you know how that ended up for him.
Note: You will have three weeks over the Easter vacation (weeks 28-30) in which you should be able to do the bulk of your work, and you will have the knowledge to complete the later tasks. It is suggested that you split the work into 7 tasks.
1) Description of the business problem and appropriate data mining problem, use of an appropriate data mining framework, and the exploratory data analysis
2) Develop Regression Model
3) Develop Decision tree
4) Develop Neural network
5) Research node (Discussion Board Activity 3)
6) Model comparison and recommendations
7) Proof reading and editing