Scenario
WA Cyber Command (WACY-COM) has acquired aggregate data about 200,000 identified cyber-attacks and scans. The data are sourced from a honeypot project, which places fake servers across the globe and records attacker activity and techniques. As honeypots are simulated networks and devices, they allow researchers to safely monitor malicious traffic without endangering real computers or networks.
When analysing cyber-attacks, the level of sophistication of attackers can range from low-level scammers right up to Advanced Persistent Threats (APTs), which are often associated with state-sponsored cyber-attacks. The attacker tools and techniques generally vary depending on the sophistication of the attacker.
A research project has been undertaken by WACY-COM to determine what patterns exist in state-sponsored APT attacks.
Typically, a complex attack can involve multiple attacking computers (with different source-IP addresses) and different payloads and targets. By coordinating attacks from multiple devices, the attacks can become more difficult to detect and stop.
Note: The scenario and data are loosely based on real-world cyber threats and attacks. However, this data set has been curated entirely to help you understand the types of data, correlations and issues that you may experience when handling real-world cyber security data.
Data description
The aggregated data available to WACY-COM are described by the following features (with data types given in square brackets):
[Categorical] Port – The port or service that was being attacked on the honey-pot network. Well known ports include 80/443 (Web traffic), 25 (Email reception), 993 (Email collection)
[Categorical] Protocol – The Internet Protocol in use to conduct the attack
[Numeric] Hits – How many ‘hits’ the attacker made against the network
[Numeric] Average Request Size (Bytes) – Average ‘payload’ sent by the attacker
[Numeric] Attack Window (Seconds) – Duration of the attack
[Numeric] Average Attacker Payload Entropy (Bits) – An attempt to qualify whether payload data were encrypted (higher Shannon entropy may indicate random data, data obfuscation or encryption)
[Categorical] Target Honeypot Server OS – The Operating System of the simulated server
[Numeric] Attack Source IP Address Count – How many unique IP addresses were used in the attack
[Numeric] Average ping to attacking IP (milliseconds) – Used to detect ‘distance’ to the attacker. The average ping time ‘back’ to the attacker’s IP addresses was calculated.
[Numeric] Average ping variability (st.dev) – High variability pings can indicate a saturated or unreliable link.
[Numeric] Individual URLs requested – How many different URLs were probed or attacked (Only relevant for Web Server ports)
[Categorical] Source OS (Detected) – The detected operating system of the attacking IP address. Acquired by scanning and fingerprinting the IP address of the attacking server.
[Categorical] Source Port Range – What range of source ports were used by the attacker. Typically, ‘low’ ports are reserved for system services. Higher ports are used by end-user applications.
[Categorical] Source IP Type (Detected) – Whether the IP of the attacker can be linked to known proxies/VPNs or TOR (technologies that can be used to hide the real source of the attack), or Likely ISP traffic (which may indicate the attacker is leveraging compromised end-user computers)
[Numeric] IP Range Trust Score – A trust score generated by an existing WACY-COM system. This system integrates with open-source intelligence (OS-Int) databases to identify potentially compromised or malicious IP addresses.
[Binary] APT – Was the attack conducted by a known Advanced Persistent Threat actor (APT).
The raw data for the above variables are contained in the ML_dataset2.csv file.
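To make the entropy feature more concrete, the following is a minimal sketch of Shannon entropy, H = -sum(p * log2(p)), computed over the byte values of a payload. The helper name payload_entropy and both example payloads are invented for illustration; how WACY-COM actually computed the Average Attacker Payload Entropy column is not stated in this brief.

```r
# Shannon entropy (bits per byte) of a raw payload.
# `payload_entropy` is a hypothetical helper, not part of the WACY-COM data set.
payload_entropy <- function(bytes) {
  counts <- table(as.integer(bytes))        # frequency of each byte value present
  p <- as.numeric(counts) / length(bytes)   # empirical probability of each value
  -sum(p * log2(p))                         # H = -sum(p * log2(p))
}

repetitive <- charToRaw("AAAAAAAAAAAAAAAA")  # one symbol repeated
uniform    <- as.raw(0:255)                  # every byte value exactly once

payload_entropy(repetitive)  # 0 bits: fully predictable
payload_entropy(uniform)     # 8 bits: the maximum for byte data
```

Payloads that look random (compressed, obfuscated or encrypted data) tend towards the upper end of this range, while plain-text requests usually sit well below it.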
Initially the research team believed they would be able to gain insight from various statistical analyses of the dataset. Their initial attempts to classify data lacked sensitivity and had many false positives. The results of WACY-COM’s analysis have been included in the Initial.Modelling.Result column – the results of this analysis are unacceptable.
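As a rough illustration of what "lacked sensitivity" and "many false positives" mean numerically, the sketch below builds a confusion matrix from two label vectors and derives both quantities. The vectors are toy values, and the "Yes"/"No" coding is an assumption; in the real data the comparison would be between the APT and Initial.Modelling.Result columns, whose coding should be checked first.

```r
# Toy labels only; in the real data these would come from the APT and
# Initial.Modelling.Result columns (the "Yes"/"No" coding is assumed here).
actual    <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["Yes", "Yes"]  # APT correctly flagged
fn <- cm["No",  "Yes"]  # APT missed
fp <- cm["Yes", "No"]   # non-APT wrongly flagged
tn <- cm["No",  "No"]   # non-APT correctly ignored

sensitivity         <- tp / (tp + fn)  # share of real APT attacks detected
false_positive_rate <- fp / (fp + tn)  # share of non-APT attacks flagged as APT
```

With these toy labels, sensitivity is 1/3 and the false positive rate is 2/5. The same two figures give a baseline against which any newly trained model can be compared.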
Objectives
You have been brought on as part of a data analysis team to determine if APT activity can be inferred from other attack parameters.
Task
You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to WACY-COM’s initial attempt to classify the samples.
Part 1 – General data preparation and cleaning.
a) Import the ML_dataset2.csv into R Studio. This version is the same as Assignment 1, but with an additional column at the end.
b) Write the appropriate code in R Studio to prepare and clean the ML_dataset2 dataset as follows:
i. Clean the whole dataset based on what you have suggested / feedback received for Assignment 1.
ii. For the feature Source.OS.Detected, merge its categories Windows 10 and Windows Server 2008 together to form a new category, say Windows_All. Similarly for Target.Honeypot.Server.OS, merge its categories Windows (Desktops) and Windows (Servers) to form the new category named Windows_DeskServ. Further, combine Linux and MacOS (All) to form the category MacOS_Linux. Hint: use the forcats::fct_collapse(.) function.
iii. Log-transform Average.ping.variability using the log(.) function, and remove the original Average.ping.variability column from the dataset (unless you have overwritten it with the log-transformed data). Similarly, transform the following features using the square root, i.e. sqrt(.), function instead.
1. Hits;
2. Attack.Source.IP.Address.Count;
3. Average.ping.to.attacking.IP.milliseconds;
4. Individual.URLs.requested.
iv. Select only the complete cases using the na.omit(.) function, and name the dataset ML_dataset_cleaned.
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
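Steps ii to iv might be sketched as follows. This is one possible approach, not a model answer: the data frame is a small invented stand-in for ML_dataset2.csv (only the columns touched by these steps are included, and the "Linux" level of Source.OS.Detected is assumed), the merged level is spelled MacOS_Linux, and the log-transformed values overwrite the original column, which the brief permits. Step i is omitted because it depends on your own Assignment 1 feedback.

```r
library(forcats)

# Toy stand-in for the real file; in practice something like
#   ML_dataset2 <- read.csv("ML_dataset2.csv", stringsAsFactors = TRUE)
ML_dataset2 <- data.frame(
  Source.OS.Detected = factor(c("Windows 10", "Windows Server 2008",
                                "Linux", "Windows 10")),
  Target.Honeypot.Server.OS = factor(c("Windows (Desktops)", "Windows (Servers)",
                                       "Linux", "MacOS (All)")),
  Hits = c(4, 9, 16, 25),
  Attack.Source.IP.Address.Count = c(1, 4, 9, NA),
  Average.ping.to.attacking.IP.milliseconds = c(100, 225, 400, 625),
  Individual.URLs.requested = c(0, 1, 4, 9),
  Average.ping.variability = c(1, exp(1), exp(2), exp(3))
)

# ii. Merge factor levels: new_level = c("old level", ...).
ML_dataset2$Source.OS.Detected <- fct_collapse(
  ML_dataset2$Source.OS.Detected,
  Windows_All = c("Windows 10", "Windows Server 2008")
)
ML_dataset2$Target.Honeypot.Server.OS <- fct_collapse(
  ML_dataset2$Target.Honeypot.Server.OS,
  Windows_DeskServ = c("Windows (Desktops)", "Windows (Servers)"),
  MacOS_Linux      = c("Linux", "MacOS (All)")
)

# iii. Log-transform in place. Note that log(0) is -Inf, which na.omit()
# does not remove, so zero values need checking beforehand.
ML_dataset2$Average.ping.variability <- log(ML_dataset2$Average.ping.variability)

# Square-root transform the four count-like features in place.
sqrt_cols <- c("Hits",
               "Attack.Source.IP.Address.Count",
               "Average.ping.to.attacking.IP.milliseconds",
               "Individual.URLs.requested")
ML_dataset2[sqrt_cols] <- lapply(ML_dataset2[sqrt_cols], sqrt)

# iv. Keep only rows with no missing values.
ML_dataset_cleaned <- na.omit(ML_dataset2)
```

In the toy data the fourth row has a missing IP address count, so ML_dataset_cleaned ends up with three rows. Running levels(.) on each factor afterwards is a quick way to confirm that the merged categories have replaced the originals.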
c) Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as CSV files; these must be submitted along with your code.
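A minimal sketch of the partitioning step, using base R only, is given below. It reads "30/70" as 30% training and 70% test, in the order the two sets are named; confirm this reading before relying on it. The seed value, the toy data frame, the output file names and the use of a temporary directory are all placeholders.

```r
set.seed(12345678)  # placeholder; replace with your own student ID

# Toy stand-in for ML_dataset_cleaned so that the sketch runs on its own.
ML_dataset_cleaned <- data.frame(id = 1:10, APT = rep(c("No", "Yes"), 5))

train_prop <- 0.30                                   # assumed share for training
n          <- nrow(ML_dataset_cleaned)
train_idx  <- sample(n, size = round(train_prop * n))  # random row indices

train_data <- ML_dataset_cleaned[train_idx, ]   # 30% of rows
test_data  <- ML_dataset_cleaned[-train_idx, ]  # remaining 70%

# Export both sets; tempdir() is used only to keep the sketch tidy.
out_dir <- tempdir()
write.csv(train_data, file.path(out_dir, "train_data.csv"), row.names = FALSE)
write.csv(test_data,  file.path(out_dir, "test_data.csv"),  row.names = FALSE)
```

Because set.seed(.) is called before sample(.), rerunning the script reproduces the same split. A simple random split does not preserve the proportion of APT cases in each set, so a stratified split may be worth considering if the classes are imbalanced.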