SPECIAL TOPICS: CYBERSECURITY
Lab 2: Developing Intrusion Detection System Using Machine
Learning
Lab Description: In this lab, you will write a Python script that implements binary and multiclass
classifiers to estimate whether a network packet belongs to an intrusion attack or not. The
dataset contains observations of 22 attack types plus normal network activity. Each
record in the dataset contains 41 features.
You will use different methods and tools to design an IDS and compare them based on their
performance and resource consumption. In order to evaluate the quality of your intrusion
detection, you will have to calculate two indicators for each attack: the false-positive ratio
(the fraction of all cases in which your IDS signals an attack while there is no actual attack)
and the false-negative ratio (the fraction of all cases in which your IDS fails to detect an
actual attack). These performance indicators are typical for the industry. Some publications
introduce an integral performance indicator by summing up those two.
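The two indicators above can be sketched directly from the prediction results. The snippet below is a minimal illustration, assuming labels are encoded as 0 for normal traffic and 1 for attack (the actual encoding in the provided scripts may differ):

```python
import numpy as np

def fp_fn_ratios(y_true, y_pred):
    """False-positive and false-negative ratios for a binary IDS.

    Assumed encoding: 0 = normal traffic, 1 = attack.
    Both ratios are taken over all cases, as the lab text specifies.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # alarm raised, but no attack
    fn = np.sum((y_pred == 0) & (y_true == 1))  # actual attack missed
    return fp / n, fn / n

# Toy example: 10 connections, 4 of them actual attacks
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1]
fp_ratio, fn_ratio = fp_fn_ratios(y_true, y_pred)
print(fp_ratio, fn_ratio)  # one false alarm and one miss out of 10 -> 0.1 0.1
```

The integral indicator some publications use is then simply `fp_ratio + fn_ratio`.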
After that, you will have to analyze the calculated ratios and determine whether they are high for
certain attack types. High ratios mean that your design does not work for those attacks. In this case,
you will have to analyze your results and try to determine the reasons why they are not good.
Dataset description: These data are based on the benchmark of the Defense Advanced Research
Projects Agency (DARPA) that was collected by the Lincoln Laboratory of Massachusetts Institute
of Technology in 1998, and was the first initiative to provide designers of Intrusion Detection
Systems with a benchmark, on which to evaluate different methodologies [see DARPA, Intrusion
Detection Evaluation. MIT Lincoln Laboratory, 1998
(https://www.ll.mit.edu/ideval/data/1998data.html)]. In order to collect these data, a simulation
had been made of a fictitious military network consisting of three “target” machines running
various operating systems and services. Three additional machines were then used to spoof
different IP addresses, thus generating traffic between different IP addresses. Finally, a sniffer was
used to record all network traffic in the TCP dump format. The total simulated period was
seven weeks. Normal connections were created to profile what is commonly expected in a military
network. Attacks fall into one of five categories: User to Root (U2R), Remote to Local (R2L), Denial
of Service (DoS), Data, and Probe. Packet information in the TCP dump files was summarized
into connections. Specifically, a connection was a sequence of TCP packets starting and ending at
some well-defined times, between which data flows from a source IP address to a target IP
address under some well-defined protocol. In 1999 the original TCP dump files were preprocessed
for utilization in the IDS benchmark of the International Knowledge Discovery and Data Mining
Tools Competitions. These data were compiled into different archives. You have to review those
archives and choose the one suitable for your further work.
Original Data Format for KDD data
The data description consists of a number of basic features:
1. Duration of the connection.
2. Protocol type, such as TCP, UDP or ICMP.
3. Service type, such as FTP, HTTP, Telnet.
4. Status flag.
5. Total bytes sent to destination host.
6. Total bytes sent to source host.
7. Whether source and destination addresses are the same or not.
8. Number of wrong fragments.
9. Number of urgent packets.
In addition to the above nine “basic features”, each connection is also described in terms of an
additional 32 derived features, falling into three categories:
1. Content features: Domain knowledge is used to assess the payload of the original TCP packets.
This includes features such as the number of failed login attempts.
2. Time-based traffic features: these features are designed to capture properties computed over
a 2-second temporal window. One example of such a feature is the number of connections
to the same host over the 2-second interval.
3. Host-based traffic features: utilize a historical window estimated over the number of
connections – in this case 100 – instead of time. Host based features are therefore designed to
assess attacks, which span intervals longer than 2 seconds.
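As an illustration of how a time-based traffic feature is derived, the sketch below counts, for each connection, how many earlier connections to the same destination host fall within the preceding 2-second window. The function name and event format are hypothetical; the real KDD features were computed during the original preprocessing:

```python
from collections import deque

def same_host_count(events, window=2.0):
    """For each (timestamp, dst_host) connection, count earlier connections
    to the same host within the preceding time window.

    A sketch of the 2-second time-based traffic feature described above;
    the event format here is an assumption for illustration.
    """
    recent = {}  # host -> deque of timestamps still inside the window
    counts = []
    for t, host in events:
        q = recent.setdefault(host, deque())
        while q and t - q[0] > window:  # drop timestamps older than the window
            q.popleft()
        counts.append(len(q))
        q.append(t)
    return counts

events = [(0.0, "a"), (0.5, "a"), (1.0, "b"), (2.4, "a"), (5.0, "a")]
print(same_host_count(events))  # [0, 1, 0, 1, 0]
```

Host-based features follow the same idea but keep the last 100 connections instead of a time window.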
Each record consists of 41 attributes and one target value that represents either the attack name
or normal (no attack) situation. (See: Lee W., S. Stolfo, and K. Mok, Mining in a Data-Flow
Environment: Experience in Network Intrusion Detection. In Proceedings of the 5th ACM SIGKDD,
1999 and Lee, W., S.J. Stolfo, and K.W. Mok, A Data Mining Framework for Building Intrusion
Detection Models. IEEE Symposium on Security and Privacy, 1999 for more information about the
record formats and attribute characteristics).
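A record in this format can be loaded as a headerless CSV with 42 columns (41 attributes plus the target). The sketch below names the nine basic features from the list above and uses placeholder names for the 32 derived features (the KDD archives document the real names); the two sample records are synthetic, purely to exercise the loader:

```python
import io
import pandas as pd

# Names for the 9 basic features listed above; the 32 derived-feature names
# are placeholders here -- consult the KDD archive documentation for the real ones.
basic = ["duration", "protocol_type", "service", "flag", "src_bytes",
         "dst_bytes", "land", "wrong_fragment", "urgent"]
cols = basic + [f"derived_{i}" for i in range(32)] + ["label"]

def load_kdd(source):
    """Load a KDD-formatted CSV: 41 attributes plus a label, no header row."""
    return pd.read_csv(source, header=None, names=cols)

# Two synthetic records in the KDD layout (values are made up for illustration).
sample = ",".join(["0", "tcp", "http", "SF", "181", "5450"] + ["0"] * 35 + ["normal."])
sample += "\n" + ",".join(["0", "icmp", "ecr_i", "SF", "1032", "0"] + ["0"] * 35 + ["smurf."])
df = load_kdd(io.StringIO(sample))
print(df.shape)               # (2, 42)
print(df["label"].tolist())   # ['normal.', 'smurf.']
```

In your own script you would pass the path of the downloaded archive file to `load_kdd` instead of the in-memory sample.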
Lab pipeline:
1. Install Anaconda: http://docs.anaconda.com/anaconda/install.html
2. Create the myidsenv environment (conda create --name myidsenv)
3. Activate the myidsenv environment (conda activate myidsenv)
4. Install the scikit-learn package (pip install scikit-learn)
Decision Tree:
1. Navigate to the folder with IDS based on decision tree
2. Run anomaly IDS (python anomaly_ids.py) and save confusion matrix and timings.
3. Run misuse IDS (python misuse_ids.py) and save confusion matrix and timings.
4. Calculate accuracy, precision, recall, and F1-score for anomaly IDS
5. Calculate accuracy for misuse IDS.
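The metrics in steps 4-5 can be computed with scikit-learn. The sketch below trains a decision-tree anomaly detector on synthetic stand-in data (the provided `anomaly_ids.py` handles the real KDD loading and encoding) and prints the confusion matrix, timings, and the four requested scores:

```python
from time import perf_counter

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed KDD features (0 = normal, 1 = attack).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 41))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

t0 = perf_counter()
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_time = perf_counter() - t0

y_pred = clf.predict(X_te)
print(confusion_matrix(y_te, y_pred))
print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1-score :", f1_score(y_te, y_pred))
print(f"training time: {train_time:.3f}s")
```

For the misuse IDS (step 5), the same `accuracy_score` call works unchanged with multiclass labels.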
Neural network:
In these experiments you can change the neural network hyperparameters and structure to build
a better model.
Hyperparameters:
1. batch_size – Size of minibatches for stochastic optimizers.
2. n_iter_no_change – Maximum number of epochs to not meet tol improvement.
3. tol – Tolerance for the optimization. When the loss or score is not improving by at least tol
for n_iter_no_change consecutive iterations, convergence is considered to be reached and
training stops.
Neural network structure:
You can change the number and structure of hidden layers by modifying ‘hidden_layer_sizes’. For
example, ‘hidden_layer_sizes=(10, 3)’ means that your network has two hidden layers: the first
hidden layer contains 10 neurons and the second contains three neurons.
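Put together, the hyperparameters and structure above map onto scikit-learn's `MLPClassifier` as sketched below. The chosen values are only starting points to vary, not recommendations, and the training data here is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(10, 3),  # two hidden layers: 10 neurons, then 3
    batch_size=200,              # minibatch size for the stochastic optimizer
    tol=1e-4,                    # minimum improvement counted as progress
    n_iter_no_change=10,         # epochs without tol improvement before stopping
    max_iter=300,                # hard cap on training epochs
    random_state=0,
)

# Synthetic stand-in for the 41-feature KDD records, just to exercise the model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 41))
y = (X[:, 0] > 0).astype(int)
mlp.fit(X, y)
print(mlp.n_layers_)  # input + 2 hidden + output = 4
```

When you record results for each experiment, noting the full tuple (`hidden_layer_sizes`, `batch_size`, `tol`, `n_iter_no_change`) makes the runs reproducible.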
Experiment steps:
1. Navigate to the folder with IDS based on MLP
2. Run anomaly IDS (python anomaly_ids.py) and save confusion matrix and timings.
3. Try various combinations of hyper-parameters and neural network structure to
outperform anomaly IDS based on decision tree. Write down results and parameters each
time.
4. Run misuse IDS (python misuse_ids.py) and save confusion matrix and timings.
5. Try various combinations of hyper-parameters and neural network structure to
outperform misuse IDS based on decision tree. Write down results and parameters each
time.
6. Run the IDS for detecting the “Satan” attack (python satan_ids.py) and save the confusion
matrix and timings.
7. Try various combinations of hyper-parameters and neural network structure to achieve
better results. Write down results and parameters each time.
8. Calculate accuracy, precision, recall, and F1-score for anomaly IDS.
9. Calculate accuracy, precision, recall, and F1-score for “satan” detection IDS.
10. Calculate accuracy for misuse IDS.
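For the misuse IDS, the confusion matrix is multiclass, so the per-attack false-positive and false-negative ratios described earlier come from the off-diagonal entries of each class's column and row. A sketch, using a made-up 3-class matrix for illustration:

```python
import numpy as np

def per_class_ratios(cm, labels):
    """Per-class false-positive and false-negative ratios (over all cases)
    from a multiclass confusion matrix with true classes as rows."""
    cm = np.asarray(cm)
    total = cm.sum()
    out = {}
    for k, name in enumerate(labels):
        fp = cm[:, k].sum() - cm[k, k]  # predicted as k, actually another class
        fn = cm[k, :].sum() - cm[k, k]  # actually k, predicted as another class
        out[name] = (fp / total, fn / total)
    return out

# Toy 3-class matrix (rows = actual, columns = predicted); values are made up.
cm = [[50, 2, 0],
      [1, 30, 4],
      [0, 3, 10]]
for name, (fp, fn) in per_class_ratios(cm, ["normal", "smurf", "satan"]).items():
    print(f"{name:7s} FP ratio {fp:.3f}  FN ratio {fn:.3f}")
```

Classes whose ratios stand out in this table are the ones the Analysis section asks you to investigate.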
Analysis:
1. Compare performance (accuracy, time, and other metrics) of anomaly IDS that are based
on decision tree and MLP
2. Compare performance (accuracy, time, and other metrics) of misuse IDS that are based on
decision tree and MLP
3. Which type of IDS works better? Why?
4. What conclusion can be made for Satan IDS?
