TOBORRM has received an industry grant to develop malware-detection algorithms based on file behaviours and parameters. The software development team at TOBORRM wrote a file-download identifier that scoured the internet for downloadable content. The goal was to develop a data set that can be used to identify malware based on parameters such as:
• where the file came from;
• how big the file was and what type of file it is;
• many other characteristics (or features).
The programmers found millions of files and proceeded to classify them manually using a third-party system at virustotal.com. As noted in the previous assessment, for each file specimen collected:
1. TOBORRM’s data collector would send the file to virustotal.com
2. Files were tagged as “Malicious” if a majority of virustotal.com virus scanners recognised the file as containing malware (see Figure 1);
3. Files were tagged as “Clean” if ALL virustotal.com scanners identified the file as “Clean” (see Figure 1).
As such, the “Actually Malicious” field can be considered to be a generally accurate classification for each downloaded sample.
Initially, the security and software development teams believed they would be able to gain insight from various statistical analyses of the dataset. Their initial attempts to classify the data lacked sensitivity and produced many false positives; these poor results have been included in the “Initial Statistical Analysis” column.
The data set created by TOBORRM’s developers includes the following columns; each column’s source is described below:

Column | Description
Download Source | A description of where the sample came from
TLD | Top Level Domain of the site where the sample came from
Download Speed | Speed recorded when obtaining the sample
Ping Time To Server | Ping time to the server recorded when accessing the sample
File Size (Bytes) | The size of the sample file
How Many Times File Seen | How many other times this sample has been seen at other sites (and not downloaded)
Executable Code Maybe Present in Headers | The ‘CodeCheck’ program has flagged the file as possibly containing executable code in its file headers
Calls to Low-Level System Libraries | How many times low-level Windows system libraries were accessed when the file was opened or run
Evidence of Code Obfuscation | The ‘CodeCheck’ program indicates that the contents of the file may be obfuscated
Threads Started | How many threads were started when this file was accessed or launched
Mean Word Length of Extracted Strings | Mean length of text strings extracted from the file using the Unix ‘strings’ program
Similarity Score | An unknown scoring system used by ‘CodeCheck’; it appears to score how similar the file is to other files recognised by ‘CodeCheck’
Characters in URL | How long the URL is (after the .com / .net part). E.g., /index.html = 10 characters
Actually Malicious | The correct classification for the file
Previous System Performance | Performance of “FileSentry3000™ v1.0”
TOBORRM’s industry grant requires the company to provide a clear case for whether machine learning algorithms could solve the problem of classifying malicious software. Your task is to build on your previous work, run the data through appropriate machine learning modelling approaches, and tune those models to optimise their accuracy.
You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM’s initial attempt to classify the samples.
a) Import MLDATASET_PartiallyCleaned.xlsx into RStudio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.
b) Write the appropriate code in RStudio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows:
i. For How.Many.Times.File.Seen, set all values = 65535 to NA;
ii. Convert Threads.Started to a factor whose categories are given by
1 = 1 thread started
2 = 2 threads started
3 = 3 threads started
4 = 4 threads started
5 = 5 or more threads started
Hint: Replace all values greater than 5 with 5, then use the factor(.) function.
iii. Log-transform Characters.in.URL using the log(.) function, and remove the original Characters.in.URL column from the dataset (unless you have overwritten it with the log-transformed data)
iv. Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
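One way the cleaning steps above could be sketched in R is shown below. This is an illustration only: the small data frame is a synthetic stand-in for the real dataset, and the column names follow the brief but should be checked against the actual file.

```r
library(dplyr)

# Synthetic stand-in for MLDATASET_PartiallyCleaned (illustration only)
MLDATASET <- data.frame(
  How.Many.Times.File.Seen = c(3, 65535, 7),
  Threads.Started          = c(1, 8, 3),
  Characters.in.URL        = c(10, 25, 12)
)

MLDATASET.cleaned <- MLDATASET %>%
  # i. Treat the sentinel value 65535 as missing
  mutate(How.Many.Times.File.Seen =
           na_if(How.Many.Times.File.Seen, 65535)) %>%
  # ii. Cap thread counts at 5, then convert to a factor
  mutate(Threads.Started = factor(pmin(Threads.Started, 5),
                                  levels = 1:5,
                                  labels = c("1 thread started",
                                             "2 threads started",
                                             "3 threads started",
                                             "4 threads started",
                                             "5 or more threads started"))) %>%
  # iii. Log-transform URL length (overwriting the original column)
  mutate(Characters.in.URL = log(Characters.in.URL)) %>%
  # iv. Keep complete cases only
  na.omit()
```

On this toy input, the row containing the 65535 sentinel is dropped by na.omit(), leaving two complete cases.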
c) Write the appropriate code in RStudio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as CSV files; these will need to be submitted along with your code.
Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you will only use 30% of the data to train your ML models to save time.
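A minimal sketch of such a 30/70 split is given below. The data frame is a synthetic stand-in for the cleaned dataset, and the seed 12345678 is a placeholder to be replaced with your student ID.

```r
# Placeholder seed -- replace 12345678 with your student ID
set.seed(12345678)

# Synthetic stand-in for MLDATASET.cleaned (illustration only)
MLDATASET.cleaned <- data.frame(
  Download.Speed     = rnorm(100),
  Actually.Malicious = sample(c("Clean", "Malicious"), 100, replace = TRUE)
)

# 30% of rows go to the training set, the rest to the test set
n         <- nrow(MLDATASET.cleaned)
train.idx <- sample(seq_len(n), size = floor(0.3 * n))

train.data <- MLDATASET.cleaned[train.idx, ]
test.data  <- MLDATASET.cleaned[-train.idx, ]

# Export both partitions for submission
write.csv(train.data, "train_data.csv", row.names = FALSE)
write.csv(test.data,  "test_data.csv",  row.names = FALSE)
```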
a) Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.
library(tidyverse)
set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
"Logistic LASSO Regression",
"Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree",
"Bagging Tree",
"Random Forest")
myModels <- c("Binary Logistic Regression",
sample(models.list1,size=1),
sample(models.list2,size=1))
myModels %>% data.frame
For each of your ML modelling approaches, you will need to:
a) Run the ML algorithm in R on the training set with Actually.Malicious as the outcome variable. EXCLUDE Sample.ID and Initial.Statistical.Analysis from the modelling process.
b) Perform hyperparameter tuning to optimise each model (except for the Binary Logistic Regression model).
c) Evaluate the performance of each ML model on the test set. Provide the confusion matrices and report the following:
• Sensitivity (the detection rate for actual malicious samples)
• Specificity (the detection rate for actual non-malicious samples)
• Overall Accuracy
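These three metrics can be computed directly from a confusion matrix, as sketched below. The prediction and actual vectors here are illustrative counts only, with “Malicious” treated as the positive class.

```r
# Illustrative predictions and ground truth ("Malicious" = positive class)
predicted <- factor(c("Malicious", "Malicious", "Clean", "Clean", "Clean"),
                    levels = c("Clean", "Malicious"))
actual    <- factor(c("Malicious", "Clean",     "Clean", "Malicious", "Clean"),
                    levels = c("Clean", "Malicious"))

conf.mat <- table(Predicted = predicted, Actual = actual)

TP <- conf.mat["Malicious", "Malicious"]  # correctly flagged malicious files
TN <- conf.mat["Clean",     "Clean"]      # correctly passed clean files
FP <- conf.mat["Malicious", "Clean"]      # clean files flagged as malicious
FN <- conf.mat["Clean",     "Malicious"]  # malicious files missed

sensitivity <- TP / (TP + FN)             # detection rate for malicious samples
specificity <- TN / (TN + FP)             # detection rate for clean samples
accuracy    <- (TP + TN) / sum(conf.mat)  # overall proportion correct
```

The same layout applies to each model’s predictions on the test set, and to the Initial.Statistical.Analysis column in part e).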
d) Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony, accuracy, and to a lesser extent, interpretability should be taken into account.
e) Create a confusion matrix for the variable Initial.Statistical.Analysis in the test set. Recall that the data in this column correspond to TOBORRM’s initial attempt to classify the samples. Compare and comment on the performance of your optimal ML model in part d) to the initial analysis by the TOBORRM team.
Gather your findings into a report (maximum of 5 pages), citing sources where necessary.
Present how and why the data was manipulated, how the ML models were tuned, and how they performed relative to each other and to the initial analysis by TOBORRM. You may use graphs, tables and images where appropriate to help your reader understand your findings.
Make a final recommendation on which ML modelling approach is the best for this task.
Your final report should look professional, include appropriate headings and subheadings, should cite facts and reference source materials in APA-7th format.
• Your report (5 pages or less, excluding cover/contents page)
• A copy of your R code, and two csv files corresponding to your training and test datasets.
The report must be submitted through TURNITIN and checked for originality. The R code and data sets are to be submitted separately via a Blackboard submission link.
Note that no marks will be given if the results you have provided cannot be confirmed by your code. Furthermore, all pages exceeding the 5-page limit will not be read or examined.
Edith Cowan University regards academic misconduct of any form as unacceptable. Academic misconduct, which includes but is not limited to plagiarism, unauthorised collaboration, cheating in examinations, theft of other students’ work, collusion, and inadequate or incorrect referencing, will be dealt with in accordance with the ECU Rule 40 Academic Misconduct (including Plagiarism) Policy. Ensure that you are familiar with the Academic Misconduct Rules.
Applications for extensions must be completed using the ECU Application for Extension form, which can be accessed online.
Before applying for an extension, please check out the ECU Guidelines for Extensions which details circumstances that can and cannot be used to gain an extension. For example, normal work commitments, family commitments and extra-curricular activities are not accepted as grounds for granting you an extension of time because you are expected to plan ahead for your assessment due dates.
Please submit applications for extensions via email to both your tutor and the Unit Coordinator.
Where the assignment is submitted no more than 7 days late, the penalty shall, for each day that it is late, be 5% of the maximum assessment available for the assignment. Where the assignment is more than 7 days late, a mark of zero shall be awarded.