Task
Build effective classifiers to distinguish between malware and benign files, which are described by static and dynamic features.
Learning Goals
• Gain experience building a variety of classification models.
• Learn how to handle data with significant class imbalance.
• Gain experience in using classification methods to identify malware, including cases where there are an enormous number of features.
Dataset Description
The dataset used for this lab was generated from data obtained from the UCI repository (footnote 1). The original data was distributed over three separate files, two of which contain descriptions of malware and one of benign files. The three files were merged into a single file for this lab (a few fields in the separate files did not overlap, so a bit of preprocessing was needed to merge them). The data file that you will need for this lab is located at: https://storm.cis.fordham.edu/~gweiss/classes/cisc5660/data/Malware-staDyn-data.csv
Table 1: Summary Statistics of Dataset
Name                  Malware-staDyn-data.csv
# Records             6,248
# Attributes          1,085
Class variable        "label", located as the last feature
Class values          0 (benign) and 1 (malware)
Class distribution    90.5% malware (n=5,653) and 9.5% benign (n=595)
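Before modeling, it can help to confirm that the file you loaded matches the summary statistics in Table 1. The following is a minimal sketch, assuming pandas is available and the file has been downloaded locally; the helper name `summarize_labels` and the tiny in-memory stand-in data are hypothetical and only illustrate the check.

```python
import io

import pandas as pd


def summarize_labels(df: pd.DataFrame, label_col: str = "label") -> dict:
    """Return row/column counts plus per-class counts and percentages."""
    counts = df[label_col].value_counts()
    percent = (counts / len(df) * 100).round(1)
    return {
        "n_records": len(df),
        # Column count includes the label column; whether the attribute
        # count in Table 1 includes it is not stated in the handout.
        "n_columns": df.shape[1],
        "counts": counts.to_dict(),
        "percent": percent.to_dict(),
    }


# Tiny stand-in data (NOT the real dataset) so the sketch runs offline.
demo_csv = io.StringIO("f1,f2,label\n0.1,3,1\n0.4,1,1\n0.9,2,1\n0.2,5,0\n")
demo = pd.read_csv(demo_csv)
summary = summarize_labels(demo)
print(summary)

# For the lab data, assuming the CSV was saved in the working directory:
# df = pd.read_csv("Malware-staDyn-data.csv")
# print(summarize_labels(df))
```

On the real file, the record count and the class counts should agree with Table 1; a mismatch usually points to a download or parsing problem.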
The original data did not contain a detailed description of each feature, but some general information can be obtained from the paper "Protecting from Malware Obfuscation Attacks through Adversarial Risk Analysis" (footnote 2; see Section 2.2 on feature extraction). As that paper discusses, static features include Assembly Language File (ASM) features, Hex dump features, and Portable Executable File Header (PE Header) features. Dynamic features are generated based on the run-time behavior of the binaries executed within a virtual machine (the Cuckoo Sandbox environment was used with a two-minute default time). The 12 features most relevant for malware detection were included.
Please follow footnote 2 and read over section 2.2 and browse other parts of the paper if interested.
What You Need to Do for this Lab
For this lab you need to build and evaluate Decision Tree, Random Forest, and kNN classifiers to distinguish between malware and benign examples from the supplied dataset. You will need to run all experiments on the unbalanced training data, one experiment on rebalanced training data using random oversampling (ROS), and several on rebalanced training data using SMOTE. You will also need to vary certain model parameters. You should generate and submit a table that is formatted like Table 1 below, which effectively specifies the details of the experiments. You also need to supply your code (I suggest you use Jupyter Notebook). The code must be well commented; this will affect your grade.
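The experimental setup described above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses scikit-learn, runs on synthetic imbalanced data from `make_classification` rather than the lab file, uses a hand-written `random_oversample` helper in place of a library ROS implementation, and reports balanced accuracy and F1 for the minority (benign) class, which are metric choices not specified in this handout. The key point it shows is that rebalancing is applied to the training split only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]


def evaluate(models, X_tr, y_tr, X_te, y_te):
    """Fit a fresh copy of each model and score it on the untouched test split."""
    scores = {}
    for name, model in models.items():
        pred = clone(model).fit(X_tr, y_tr).predict(X_te)
        scores[name] = {
            "balanced_accuracy": balanced_accuracy_score(y_te, pred),
            "f1_benign": f1_score(y_te, pred, pos_label=0),
        }
    return scores


# Synthetic stand-in with roughly 9.5% class 0, mimicking the lab's imbalance.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.095, 0.905], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Parameter values here are placeholders; the lab asks you to vary them.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

X_ros, y_ros = random_oversample(X_tr, y_tr)  # rebalance training data only
results = {
    "unbalanced": evaluate(models, X_tr, y_tr, X_te, y_te),
    "ROS": evaluate(models, X_ros, y_ros, X_te, y_te),
}
for sampling, scores in results.items():
    for name, s in scores.items():
        print(sampling, name, s)
```

For the SMOTE experiments, one option is the imbalanced-learn package, whose `imblearn.over_sampling.SMOTE` and `RandomOverSampler` classes expose `fit_resample(X, y)`; the resampled training arrays can then be passed to `evaluate` in the same way as `X_ros` and `y_ros`.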