Parkinson's disease is a brain disorder that leads to shaking, stiffness, and difficulty with walking, balance, and coordination. Approximately 60,000 Americans are diagnosed with Parkinson's disease each year. More than 10 million people worldwide are living with Parkinson’s disease. Parkinson’s disease can be diagnosed based on medical history, a review of signs and symptoms, and a neurological and physical examination. Additionally, speech data are useful for non-invasive diagnosis. This project aims to analyze speech data from 40 subjects and develop machine learning models to predict Parkinson’s disease.
Data: The attached dataset consists of training and testing files. The training data come from 20 subjects with Parkinson’s disease (6 female, 14 male) and 20 healthy individuals (10 female, 10 male). Multiple types of sound recordings (26 voice samples including sustained vowels, numbers, words, and short sentences) are taken from each subject. A group of 26 linear and time-frequency-based features is extracted from each voice sample. The testing dataset is collected from 28 Parkinson’s disease patients. These patients are asked to say only the sustained vowels 'a' and 'o' three times each, which makes a total of 168 recordings. The same 26 features are extracted from the voice samples of this dataset.
Each subject has 26 voice samples including sustained vowels, numbers, words and short sentences. The voice samples in the training data file are given in the following order:
Sample # - corresponding voice samples
1: sustained vowel (aaa), 2: sustained vowel (ooo), 3: sustained vowel (uuu), 4-13: numbers from 1 to 10, 14-17: short sentences, 18-26: words.
The 28 PD patients are asked to say only the sustained vowels 'a' and 'o' three times each, which makes a total of 168 recordings (each subject has 6 voice samples). The voice samples in the test data file are given in the following order:
Sample # - corresponding voice samples
1-3: sustained vowel (aaa), 4-6: sustained vowel (ooo)
Feature Information:
Training Data File:
Column 1: Subject id, Columns 2-27: features
Features 1-5: Jitter (local), Jitter (local, absolute), Jitter (rap), Jitter (ppq5), Jitter (ddp),
Features 6-11: Shimmer (local), Shimmer (local, dB), Shimmer (apq3), Shimmer (apq5), Shimmer (apq11), Shimmer (dda),
Features 12-14: AC, NTH, HTN (measures of the ratio of the total noise component in the voice),
Features 15-19: Median pitch, Mean pitch, Standard deviation, Minimum pitch, Maximum pitch,
Features 20-23: Number of pulses, Number of periods, Mean period, Standard deviation of period,
Features 24-26: Fraction of locally unvoiced frames, Number of voice breaks, Degree of voice breaks
Column 28: class information
Test Data File:
Column 1: Subject id, Columns 2-27: features
Features 1-5: Jitter (local), Jitter (local, absolute), Jitter (rap), Jitter (ppq5), Jitter (ddp),
Features 6-11: Shimmer (local), Shimmer (local, dB), Shimmer (apq3), Shimmer (apq5), Shimmer (apq11), Shimmer (dda),
Features 12-14: AC, NTH, HTN,
Features 15-19: Median pitch, Mean pitch, Standard deviation, Minimum pitch, Maximum pitch,
Features 20-23: Number of pulses, Number of periods, Mean period, Standard deviation of period,
Features 24-26: Fraction of locally unvoiced frames, Number of voice breaks, Degree of voice breaks
Column 28: class information
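The column layout described above could be mapped onto named fields when the files are read. The following is a minimal sketch, not a prescribed loader: it assumes the CSV files are comma-separated with no header row and exactly 28 columns, and the short feature names and the read_rows helper are hypothetical labels introduced only for illustration. If the actual files carry a header or extra columns, the length check will fail and the layout needs adjusting.

```python
import csv
import io

# Hypothetical short names for the 26 features, in the column order listed above.
FEATURE_NAMES = [
    "jitter_local", "jitter_local_abs", "jitter_rap", "jitter_ppq5", "jitter_ddp",
    "shimmer_local", "shimmer_local_db", "shimmer_apq3", "shimmer_apq5",
    "shimmer_apq11", "shimmer_dda",
    "ac", "nth", "htn",
    "pitch_median", "pitch_mean", "pitch_std", "pitch_min", "pitch_max",
    "n_pulses", "n_periods", "period_mean", "period_std",
    "frac_unvoiced", "n_voice_breaks", "degree_voice_breaks",
]
COLUMNS = ["subject_id"] + FEATURE_NAMES + ["label"]  # 28 columns in total


def read_rows(handle):
    """Parse header-less CSV rows into (subject_id, features, label) tuples."""
    rows = []
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        if len(record) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(record)}")
        subject_id = int(float(record[0]))             # column 1
        features = [float(v) for v in record[1:27]]    # columns 2-27
        label = int(float(record[27]))                 # column 28
        rows.append((subject_id, features, label))
    return rows


# Tiny synthetic row (made-up numbers, not real data) to show the call.
demo = io.StringIO(",".join(["1"] + ["0.5"] * 26 + ["1"]) + "\n")
demo_rows = read_rows(demo)
```

With a real file, the same function would be called on an open file handle, for example `read_rows(open("train_data.csv", newline=""))`.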
Analysis (use train_data.csv for steps 1-3 and test_data.csv for step 4):
Step 1:
a. From the training dataset, select one subject (26 samples) as the testing set; use the remaining subjects’ data to train a classification model.
b. Use the model to compute predictions for the testing set. Take the majority vote of the predictions as the class of the testing subject.
c. Repeat steps a and b for all subjects; compare the predicted and true classes to evaluate the model performance.
d. Repeat steps a-c for different tuning parameters to identify the best model.
e. Try Logistic Regression, SVM, and Random Forest, and compare their performance.
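Steps a-c above amount to leave-one-subject-out validation with a per-subject majority vote. The sketch below shows one possible shape for that loop in plain Python. The NearestCentroid class is only a stand-in classifier so the example runs without extra libraries; it is not one of the three models named in step e. The leave_one_subject_out helper and the toy data are likewise hypothetical. Any object whose fit method returns the fitted model and which exposes predict (the convention scikit-learn estimators follow) could be passed in instead.

```python
from collections import Counter, defaultdict


class NearestCentroid:
    """Stand-in classifier for illustration (NOT one of the models in step e)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
        self.centroids_ = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids_, key=lambda c: dist(row, self.centroids_[c]))
                for row in X]


def leave_one_subject_out(X, y, subjects, make_model):
    """Return {subject: (true_class, majority_vote_prediction)}."""
    by_subject = defaultdict(list)
    for i, s in enumerate(subjects):
        by_subject[s].append(i)

    results = {}
    for held_out, test_idx in by_subject.items():
        held = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held]
        # Step a: train on every subject except the held-out one.
        model = make_model().fit([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        # Step b: predict each sample, then take the majority vote.
        votes = model.predict([X[i] for i in test_idx])
        # Assumption: ties go to the class seen first among the votes.
        predicted = Counter(votes).most_common(1)[0][0]
        results[held_out] = (y[test_idx[0]], predicted)
    return results


# Toy data (made up): 4 subjects, 3 samples each, one feature.
X = [[0.1], [0.2], [0.15], [0.12], [0.22], [0.18],
     [0.9], [1.0], [0.95], [0.85], [1.1], [0.8]]
y = [0] * 6 + [1] * 6
subjects = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]

results = leave_one_subject_out(X, y, subjects, NearestCentroid)
# Step c: compare predicted and true classes across all subjects.
accuracy = sum(t == p for t, p in results.values()) / len(results)
```

With 26 samples per subject a 13-13 tie is possible, so the tie rule is worth stating explicitly in the report. For step d, the same function can be called once per candidate parameter setting by passing a different make_model each time.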
Step 2:
a. Calculate central tendency and dispersion features for each subject:
(1) Central tendency features: mean, median, and trimmed mean of the 26 voice samples of each subject, for each attribute.
(2) Dispersion features: standard deviation, interquartile range, and mean absolute deviation of the 26 voice samples of each subject, for each attribute.
- Trimmed mean is the average of the samples after removing 25% of the largest and smallest values.
- Mean absolute deviation is $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$, where $\bar{x}$ is the mean of $x$.
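The six summary statistics above can be computed with the Python standard library. The sketch below is one reading of the definitions, with several assumptions that should be checked against the course material: the trimmed mean drops 25% of the samples from each end (rounded down, so 6 of 26 per tail), the standard deviation is the sample version (n - 1 denominator), and the interquartile range uses the default quantile method of statistics.quantiles, which can differ slightly from other tools. The function names are hypothetical.

```python
import statistics


def trimmed_mean(values, proportion=0.25):
    """Mean after dropping `proportion` of the samples from EACH tail (assumption)."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number removed per tail, rounded down
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


def mean_absolute_deviation(values):
    """(1/n) * sum(|x_i - mean(x)|), matching the formula above."""
    centre = statistics.fmean(values)
    return statistics.fmean([abs(v - centre) for v in values])


def summarise(values):
    """Six summary statistics for one attribute of one subject."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "trimmed_mean": trimmed_mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumption)
        "iqr": q3 - q1,
        "mad": mean_absolute_deviation(values),
    }


def subject_features(samples):
    """Turn one subject's sample rows into a single row of summary features."""
    row = []
    for column in zip(*samples):  # iterate attribute by attribute
        row.extend(summarise(list(column)).values())
    return row


# Made-up numbers for one attribute: 1, 2, ..., 8.
demo_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
demo_summary = summarise(demo_values)

# Made-up subject with 8 samples and 2 attributes -> 2 * 6 = 12 features.
demo_samples = [[float(i), 10.0 * i] for i in range(1, 9)]
demo_row = subject_features(demo_samples)
```

Applied to the real data, each subject's 26 samples with 26 attributes would yield 26 × 6 = 156 subject-level features.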
Step 3:
a. From the training dataset, select one subject as the testing set; use the remaining subjects’ data to train a classification model.
b. Use the model to compute the prediction for the testing set.
c. Repeat steps a and b for all subjects; compare the predicted and true classes to evaluate the model performance.
d. Repeat steps a-c for different tuning parameters to identify the best model.
e. Try Logistic Regression, SVM, and Random Forest, and compare their performance.
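Comparing predicted and true classes across the held-out subjects can be summarised with accuracy, sensitivity, and specificity. The sketch below is one way to compute them. It assumes the class column codes Parkinson's disease as 1 and healthy as 0, which the description above does not state, and the evaluate helper is a hypothetical name.

```python
def evaluate(true_classes, predicted_classes, positive=1):
    """Accuracy, sensitivity, and specificity from per-subject predictions."""
    pairs = list(zip(true_classes, predicted_classes))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Undefined rates are reported as None rather than guessed.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up example: 3 PD subjects (1) and 3 healthy subjects (0).
demo_true = [1, 1, 1, 0, 0, 0]
demo_pred = [1, 1, 0, 0, 0, 1]
demo_metrics = evaluate(demo_true, demo_pred)
```

Running the same function for each model and each tuning setting gives a like-for-like table for steps d and e.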
Step 4:
a. Use the classification models (Logistic Regression, SVM, and Random Forest) from steps 1 and 3 to predict the subjects’ status in the testing dataset; evaluate the model performance.
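Because the testing dataset contains only Parkinson's disease patients, accuracy on it equals sensitivity (the fraction of subjects classified as PD), and specificity cannot be estimated from it. The sketch below shows one way to score sample-level predictions on such a set by voting within each subject. It assumes PD is coded as 1, and it counts a 3-3 tie among a subject's 6 samples as not detected; both choices, and the detection_rate name, are assumptions made for illustration.

```python
from collections import defaultdict


def detection_rate(subject_ids, sample_predictions, positive=1):
    """Fraction of subjects whose samples are classified positive by strict majority."""
    votes = defaultdict(list)
    for subject, prediction in zip(subject_ids, sample_predictions):
        votes[subject].append(prediction)

    detected = 0
    for predictions in votes.values():
        n_positive = sum(p == positive for p in predictions)
        # Strict majority: a tie (for example 3 of 6) counts as not detected.
        if 2 * n_positive > len(predictions):
            detected += 1
    return detected / len(votes)


# Made-up predictions for 3 subjects with 6 samples each.
demo_subjects = [1] * 6 + [2] * 6 + [3] * 6
demo_predictions = ([1, 1, 1, 1, 0, 0]     # 4 of 6 positive: detected
                    + [0, 0, 0, 1, 1, 1]   # 3 of 6 positive: tie, not detected
                    + [1, 1, 1, 1, 1, 1])  # 6 of 6 positive: detected
demo_rate = detection_rate(demo_subjects, demo_predictions)
```

For the subject-level models from step 3, each test subject would first be reduced to one row of summary features (computed over 6 samples rather than 26), so no vote is needed and the rate is simply the share of rows predicted as PD.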
Report: Write a final project report using the template provided.