
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Creating a model to detect malware using supervised learning algorithms 

Background

Product Development Grant

TOBORRM has received an industry grant to develop malware detection algorithms based on behaviours and file parameters. The software development team at TOBORRM wrote a file-download identifier that scoured the internet for downloadable content. The goal was to develop a data set that can be used to identify malware based on parameters such as:

• Where the file came from

• How big the file was and what type of file it is

• Many other characteristics (or features).

MalwareSamples Data

The programmers found millions of files and proceeded to classify the files manually using a third-party system at virustotal.com. As noted in the previous assessment, for each file specimen collected:

1. TOBORRM’s data collector would send the file to virustotal.com

2. Files were tagged as “Malicious” if a majority of virustotal.com virus scanners recognised the file as containing malware (see Figure 1)

3. Files were tagged as “Clean” if ALL virustotal.com scanners identified the file as “Clean” (see Figure 1)

As such, the “Actually Malicious” field can be considered to be a generally accurate classification for each downloaded sample.
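The tagging rule above can be sketched as a small R function. This is only an illustration: the function name label_sample and the "malicious"/"clean" encoding of scanner verdicts are assumptions, and the brief does not say how a file flagged by a minority of scanners was handled, so that case is returned as NA here.

```r
# Hypothetical helper: `verdicts` is a character vector with one entry per
# scanner, each either "malicious" or "clean" (an assumed encoding).
label_sample <- function(verdicts) {
  n_malicious <- sum(verdicts == "malicious")
  if (n_malicious > length(verdicts) / 2) {
    "Malicious"        # a majority of scanners flagged the file
  } else if (all(verdicts == "clean")) {
    "Clean"            # every scanner reported the file as clean
  } else {
    NA_character_      # assumption: the brief does not cover this case
  }
}

label_sample(c("malicious", "malicious", "clean"))
label_sample(c("clean", "clean", "clean"))
label_sample(c("malicious", "clean", "clean"))
```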

Initially, the security and software development teams believed they would be able to gain insight from various statistical analyses of the dataset. Their initial attempts to classify data lacked sensitivity and had many false positives. The results of TOBORRM’s analysis have been included in the “Initial Statistical Analysis” column; the results of this analysis are poor.

Data set columns

The data set created by TOBORRM’s developers includes the following columns, each listed with a description of its source:

 

• Download Source: A description of where the sample came from

• TLD: Top Level Domain of the site where the sample came from

• Download Speed: Speed recorded when obtaining the sample

• Ping Time To Server: Ping time to the server recorded when accessing the sample

• File Size (Bytes): The size of the sample file

• How Many Times File Seen: How many other times this sample has been seen at other sites (and not downloaded)

• Executable Code Maybe Present in Headers: The ‘CodeCheck’ program has flagged the file as possibly containing executable code in file headers

• Calls to Low-Level System Libraries: When the file was opened or run, how many times low-level Windows system libraries were accessed

• Evidence of Code Obfuscation: The ‘CodeCheck’ program indicates that the contents of the file may be obfuscated

• Threads Started: How many threads were started when this file was accessed or launched

• Mean Word Length of Extracted Strings: Mean length of text strings extracted from the file using the Unix ‘strings’ program

• Similarity Score: An unknown scoring system used by ‘CodeCheck’; it seems to score how similar the file is to other files recognised by ‘CodeCheck’

• Characters in URL: How long the URL is (after the .com / .net part). E.g., /index.html = 10 characters

• Actually Malicious: The correct classification for the file

• Previous System Performance: Performance of “FileSentry3000™ v1.0”

 

SCENARIO

The industry grant awarded to TOBORRM requires that they provide a clear case for whether machine learning algorithms could solve the problem of classifying malicious software. Your task is to build on your previous work and run the data through appropriate machine learning modelling approaches, tuned to optimise their accuracy.

TASK

You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM’s initial attempt to classify the samples.

Part 1 – General data preparation and cleaning.

a) Import the MLDATASET_PartiallyCleaned.xlsx into R Studio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.

b) Write the appropriate code in R Studio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows:

i. For How.Many.Times.File.Seen, set all values = 65535 to NA;

ii. Convert Threads.Started to a factor whose categories are given by 

1 = 1 thread started 

2 = 2 threads started

3 = 3 threads started

4 = 4 threads started

5 = 5 or more threads started

Hint: Replace all values greater than 5 with 5, then use the factor(.) function.

iii. Log-transform Characters.in.URL using the log(.) function, and remove the original Characters.in.URL column from the dataset (unless you have overwritten it with the log-transformed data)

iv. Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.

Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
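As a rough sketch of steps i to iv, the snippet below applies them to a small made-up data frame in place of the real spreadsheet. The toy values are invented for illustration, and overwriting Characters.in.URL with its logarithm is one of the two options the brief allows.

```r
# Toy stand-in for MLDATASET_PartiallyCleaned (values are made up).
toy <- data.frame(
  How.Many.Times.File.Seen = c(3, 65535, 12, 7),
  Threads.Started          = c(1, 7, 5, 2),
  Characters.in.URL        = c(10, 25, 40, NA)
)

# i. Treat the sentinel value 65535 as missing.
is_sentinel <- which(toy$How.Many.Times.File.Seen == 65535)
toy$How.Many.Times.File.Seen[is_sentinel] <- NA

# ii. Cap thread counts at 5 ("5 or more"), then convert to a factor.
toy$Threads.Started <- factor(pmin(toy$Threads.Started, 5), levels = 1:5)

# iii. Log-transform in place, so no separate column has to be removed.
toy$Characters.in.URL <- log(toy$Characters.in.URL)

# iv. Keep only rows with no missing values.
MLDATASET.cleaned <- na.omit(toy)
MLDATASET.cleaned
```

In the actual workflow the data frame would come from the xlsx file rather than being typed in, for example via read_excel(.) from the readxl package.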

c) Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split (30% training, 70% test). Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as csv files; these will need to be submitted along with your code.

Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you will only use 30% of the data to train your ML models to save time. 
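One possible way to produce the 30/70 partition with base R is sketched below on a made-up data frame. The seed value, the toy columns and the output file names are placeholders, not values taken from the brief.

```r
set.seed(12345678)  # placeholder seed; the brief asks for your student ID

# Toy stand-in for MLDATASET.cleaned (made-up values).
toy <- data.frame(Sample.ID = 1:100, x = rnorm(100))

# Draw 30% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(toy), size = round(0.3 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Export both partitions (hypothetical file names, written to a temp dir).
write.csv(train, file.path(tempdir(), "train.csv"), row.names = FALSE)
write.csv(test,  file.path(tempdir(), "test.csv"),  row.names = FALSE)
```

This simple random split does not preserve the class balance of Actually.Malicious. A stratified split, for instance with createDataPartition(.) from the caret package, may be preferable if the classes are uneven.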

Part 2 – Compare the performances of different machine learning algorithms

a) Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.

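library(tidyverse)  # loaded for the %>% pipe used on the last line

# Replace the placeholder with your numeric student ID before running
set.seed(Enter your student ID)

# Pool 1: regularised logistic regression variants
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")
# Pool 2: tree-based methods
models.list2 <- c("Classification Tree",
                  "Bagging Tree",
                  "Random Forest")
# Binary logistic regression plus one random pick from each pool
myModels <- c("Binary Logistic Regression",
              sample(models.list1,size=1),
              sample(models.list2,size=1))
myModels %>% data.frame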

For each of your ML modelling approaches, you will need to:

a) Run the ML algorithm in R on the training set with Actually.Malicious as the outcome variable. EXCLUDE Sample.ID and Initial.Statistical.Analysis from the modelling process.

b) Perform hyperparameter tuning to optimise the model (except for the Binary Logistic Regression model).

c) Evaluate the performance of each ML model on the test set. Provide the confusion matrices and report the following:

• Sensitivity (the detection rate for actual malicious samples)

• Specificity (the detection rate for actual non-malicious samples)

• Overall Accuracy

d) Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony, accuracy, and to a lesser extent, interpretability should be taken into account. 

e) Create a confusion matrix for the variable Initial.Statistical.Analysis in the test set. Recall that the data in this column correspond to TOBORRM’s initial attempt to classify the samples. Compare and comment on the performance of your optimal ML model in part d) to the initial analysis by the TOBORRM team.
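The metrics requested in parts c) and e) can be computed from two label vectors, as in the sketch below. The helper name classification_metrics, the "YES"/"NO" encoding of the outcome and the example label vectors are all assumptions made for illustration; the real column may use a different coding.

```r
# Hypothetical helper: sensitivity, specificity and accuracy from two
# label vectors, treating `positive` as the malicious class.
classification_metrics <- function(actual, predicted, positive = "YES") {
  tp <- sum(actual == positive & predicted == positive)
  tn <- sum(actual != positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(actual))
}

# Made-up labels standing in for the test set.
actual   <- c("YES", "YES", "YES", "NO", "NO", "NO", "NO", "NO")
model    <- c("YES", "YES", "NO",  "NO", "NO", "NO", "NO", "YES")
baseline <- c("YES", "NO",  "NO",  "NO", "YES", "YES", "NO", "NO")

# Confusion matrix for the model, then metrics for both classifiers.
table(Predicted = model, Actual = actual)
rbind(model    = classification_metrics(actual, model),
      baseline = classification_metrics(actual, baseline))
```

Here baseline plays the role of the Initial.Statistical.Analysis column, so the two rows of the final table put the tuned model and TOBORRM’s initial attempt side by side on the same test samples.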

What to submit

Gather your findings into a report (maximum of 5 pages), citing sources if necessary.

Present how and why the data was manipulated, how the ML models were tuned, and finally how they performed compared to each other and to the initial analysis by TOBORRM. You may use graphs, tables and images where appropriate to help your reader understand your findings.

Make a final recommendation on which ML modelling approach is the best for this task.

Your final report should look professional, include appropriate headings and subheadings, should cite facts and reference source materials in APA-7th format.

Your submission must include the following:

• Your report (5 pages or less, excluding cover/contents page)

• A copy of your R code, and two csv files corresponding to your training and test datasets.

The report must be submitted through TURNITIN and checked for originality. The R code and data sets are to be submitted separately via a Blackboard submission link. 

Note that no marks will be given if the results you have provided cannot be confirmed by your code. Furthermore, all pages exceeding the 5-page limit will not be read or examined.

Academic Misconduct 

Edith Cowan University regards academic misconduct of any form as unacceptable. Academic misconduct, which includes but is not limited to plagiarism, unauthorised collaboration, cheating in examinations, theft of other students’ work, collusion, and inadequate and incorrect referencing, will be dealt with in accordance with the ECU Rule 40 Academic Misconduct (including Plagiarism) Policy. Ensure that you are familiar with the Academic Misconduct Rules.

Assignment Extensions

Applications for extensions must be completed using the ECU Application for Extension form, which can be accessed online.

Before applying for an extension, please check out the ECU Guidelines for Extensions which details circumstances that can and cannot be used to gain an extension. For example, normal work commitments, family commitments and extra-curricular activities are not accepted as grounds for granting you an extension of time because you are expected to plan ahead for your assessment due dates.

Please submit applications for extensions via email to both your tutor and the Unit Coordinator.

Where the assignment is submitted no more than 7 days late, the penalty shall, for each day that it is late, be 5% of the maximum assessment available for the assignment. Where the assignment is more than 7 days late, a mark of zero shall be awarded.

 
