Data Selection and Pre-processing
● Select a data set consisting of at least 2,000 observations/records, and preferably more than 10,000. You are strongly encouraged to identify an anonymized data set that relates to the strategic objectives of the business.
● Briefly describe your data set and reference its origin.
● If you have 15 or fewer attributes, tabulate them with the attribute name, description, and data type, and then show the minimum/average/maximum and stdev values for the training set and test set. For nominal variables, show the most and least frequently occurring nominal value(s). If you have more than 15 attributes, group the attributes into themes (e.g. customer, orders, employees) and describe the type of information and the data types in each theme, including the number of variables of each type (e.g. nominal, interval, ratio, etc.). You may want to highlight significant variables identified by an attribute selection algorithm.
● Briefly tabulate the following characteristics of the entire data set: number of instances, patterns per target class (if classification), and limitations such as possible conflicting patterns, missing values, and outliers/erroneous values.
● Explain how you have sampled your data to create the ‘in sample’ and ‘out of sample’ data sets. If you have used instance weightings to balance your data set(s) then explain how the weightings were determined.
● Provide a statistical summary in tabular form for the resulting ‘in sample’ (training/validation set) and ‘out of sample’ (test set). Also, state whether or not there was any overlap in training and test set instances and if so, justify why your test set is not compromised.
● What pre-processing and transformation was performed on the variables and why? (e.g. standardizing numerical variables and/or using scaling, taking logs to reduce skewness, or log differences to reduce non-stationarity; converting numerical variables to discrete ones; converting numerical or symbolic patterns into bit patterns; removing patterns with missing or outlier values; adding noise or jitter to patterns to expand the data set; adding instance weightings or replicating certain pattern classes to improve class distributions; transforming time-series data into static training/test patterns)
● How did you ensure that your pre-processing did not compromise your test set (e.g. use of standardization)?
● Consideration will also be given to the ‘curse of dimensionality’, its issues and how its impact can be reduced.
  o If you reduced the number of dimensions (e.g. from 30 attributes to 10 attributes), how did you do this? Autoencoder? PCA? Filter using an InfoGain measurement? Clustering? How do these methods work and what are their advantages/disadvantages?
● If you increased the number of training instances, how did you do this?
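As an illustration of the sampling and pre-processing points above, the following is a minimal sketch (not a required approach) of creating disjoint ‘in sample’ and ‘out of sample’ sets and standardizing with statistics taken from the training data only. The function names, the 20% test fraction, the seed and the ten example values are all hypothetical and chosen only for the example.

```python
import random
import statistics

def split_in_out_of_sample(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices once and split them into disjoint train/test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [rows[i] for i in indices if i not in test_idx]
    test = [rows[i] for i in indices[:n_test]]
    return train, test

def fit_standardizer(train_values):
    """Estimate the mean and standard deviation from the training values only."""
    return statistics.fmean(train_values), statistics.stdev(train_values)

def standardize(values, mean, stdev):
    """Apply previously fitted statistics; they are never re-fitted on the test set."""
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

# Hypothetical single numeric attribute with 10 records.
values = [12.0, 15.5, 9.0, 20.0, 18.5, 11.0, 14.0, 16.0, 13.5, 17.0]
train, test = split_in_out_of_sample(values, test_fraction=0.2)
mean, stdev = fit_standardizer(train)
train_z = standardize(train, mean, stdev)
test_z = standardize(test, mean, stdev)  # uses the training mean and stdev
```

Because the test values are scaled with the training mean and stdev, no information from the test set reaches the fitted transformation. The same principle would apply to any other fitted pre-processing step, such as discretization cut points or attribute selection.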
Machine Learning Method(s) and their Implementation
▪ Clearly state the machine learning methods you will be using and the function(s) you will be expecting them to perform (e.g. classification, association, regression, clustering or combinations thereof for self-supervised learning). You must describe the expected ‘input to’ and ‘output from’ each model.
▪ Explain and justify the machine learning method(s) chosen for the task. You must also use a simple benchmark model with which to compare your chosen machine learning model(s) (e.g. benchmark a neural network trained with back-propagation against a simple OneR or Naive Bayes approach).
▪ Briefly highlight the strengths and weaknesses of the chosen learning method(s).
▪ Describe your ‘model fitting’ and ‘model selection’ process (e.g. leave-one-out validation, cross-validation, bagging and boosting, etc.). You must state and justify the hyper-parameters used for model fitting and how ‘over-training’ will be minimized.
▪ Describe what tool will be used to implement the machine learning method(s) (e.g. Weka/Java).
▪ You must either:
  a) use advanced features of the chosen analytics tool including (though not limited to) clear evidence of meaningful programming/scripting activity to use machine learning and/or pre-processing tools in a bespoke way (e.g. install and use advanced Weka packages via Package Manager – examples might be: simple recurrent network, convolutional neural network, self-organizing maps, time series processing with ARMA models); OR
  b) provide an in-depth mathematical treatment of the chosen machine learning method(s) with a clear explanation as to how you will optimize them using the built-in features of the data analytics tool.
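To illustrate the benchmark and model selection points above, the following is a minimal sketch of k-fold cross-validation comparing a simplified OneR-style rule learner against a majority-class baseline. It handles nominal attributes only and is not the implementation found in any particular tool. The toy records, the attribute values, the class labels and the choice of k=4 are hypothetical.

```python
from collections import Counter, defaultdict

def majority_class(labels):
    """Return the most frequent class label."""
    return Counter(labels).most_common(1)[0][0]

def train_majority(rows, labels):
    """Baseline: always predict the most frequent training class."""
    default = majority_class(labels)
    return lambda row: default

def train_one_r(rows, labels):
    """Pick the single nominal attribute whose value-to-class rule makes the fewest training errors."""
    default = majority_class(labels)
    best = None
    for a in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[a]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, attribute, rule = best
    # Attribute values not seen in training fall back to the majority class.
    return lambda row: rule.get(row[attribute], default)

def cross_validate(train_fn, rows, labels, k=5):
    """Mean accuracy over k folds; fold i holds out every k-th record starting at i."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        tr_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        tr_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_fn(tr_rows, tr_labels)
        correct = sum(model(rows[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical toy data: (outlook, windy) -> class, where outlook decides the class.
outlooks = ["sunny", "rainy", "overcast"] * 4
windy = ["yes", "no"] * 6
rows = list(zip(outlooks, windy))
labels = ["stay" if outlook == "sunny" else "play" for outlook in outlooks]

one_r_accuracy = cross_validate(train_one_r, rows, labels, k=4)
baseline_accuracy = cross_validate(train_majority, rows, labels, k=4)
```

Each model is trained only on the records outside the held-out fold, so the reported accuracy is an out-of-fold estimate. Evaluating the chosen method and the simple benchmark on identical folds keeps the comparison fair.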