NETWORK INTRUSION DETECTION

Problem Statement

The goal is to build a predictive model that distinguishes bad connections (intrusions or attacks) from good (normal) connections. In this implementation, the input to the system is a military dataset whose information is encoded as numbers. The output is the predicted class, which is either 0 (good connection) or 1 (bad connection).

Development environment

1. Programming language - Python
2. IDE - Jupyter Notebook

Methodology

To predict attacks using the military dataset, we considered all the records in the csv file (494020 records). The input columns hold encoded data in numeric format. All numeric data is normalized, and categorical data is label encoded for better efficiency. Below is the flowchart of the process used to train various models in scikit-learn as well as TensorFlow.
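The two preprocessing steps named above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy DataFrame, its column names, and the helper names `zscore` and `label_encode` are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the dataset; column names and values are hypothetical.
df = pd.DataFrame({
    "duration": [0, 2, 4, 6],
    "src_bytes": [100, 200, 300, 400],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})


def zscore(series):
    """Scale a numeric column to zero mean and unit standard deviation."""
    std = series.std(ddof=0)
    if std == 0:
        # Constant column: centre it only, to avoid dividing by zero.
        return series - series.mean()
    return (series - series.mean()) / std


def label_encode(series):
    """Map each distinct category to an integer code (sorted category order)."""
    return series.astype("category").cat.codes


numeric_cols = ["duration", "src_bytes"]    # assumed numeric columns
categorical_cols = ["protocol_type"]        # assumed categorical column

encoded = df.copy()
for col in numeric_cols:
    encoded[col] = zscore(encoded[col].astype(float))
for col in categorical_cols:
    encoded[col] = label_encode(encoded[col])
```

After this step every column is numeric, which is what both the scikit-learn and the TensorFlow models expect as input.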

Details about all features available in dataset

Experimental results and Analysis

We predicted good connections as 0 and bad connections as 1 for the 494020 records. Over numerous trials, we found that removing duplicates, dropping a few unnecessary columns, and training the TensorFlow models in a loop (to avoid poor local minima) gave higher accuracy. The charts below give the details for each model (refer Fig 1 and 2).
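The idea of training in a loop can be sketched as repeated training runs from different random initializations, keeping the run with the best validation accuracy. The example below is only an illustration under stated assumptions: it uses scikit-learn's MLPClassifier as a lightweight stand-in for the TensorFlow models, synthetic data instead of the real records, and an arbitrary choice of five restarts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary data standing in for the real records (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

best_model, best_score, scores = None, -1.0, []
for seed in range(5):  # five restarts is an arbitrary choice
    # Each seed gives a different weight initialization, so each run
    # may settle in a different minimum of the loss surface.
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                          random_state=seed)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy of this run
    scores.append(score)
    if score > best_score:
        best_model, best_score = model, score
```

Only `best_model` is kept for later evaluation; the other runs are discarded.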

Note: The Convolutional Neural Network with Case = “Initial” was run with two Convolution and MaxPooling layers. Its accuracy was lower than that of the CNN with a single Convolution and MaxPooling layer.

Additional Features

1. Handle redundant records: To handle redundant records, we deleted the duplicate records and then processed the data. This gives higher efficiency and prevents the learning algorithms from being biased towards the more frequent records. The dataset has 494020 records in total; after removal of redundant records, 145585 records remain.

2. Delete unnecessary columns: We deleted two columns, “num_outbound_cmds” and “is_host_login”, because every record has the value 0 in these columns. A constant column makes no difference when training the models, so it is better to delete these columns.

3. Multi-class classification: We implemented multi-class classification and predicted the attack types based on their label-encoded values. Two models are implemented, Logistic Regression and TensorFlow Classification, to understand how performance differs between scikit-learn and TensorFlow models.
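The first two clean-up steps above (removing duplicate records and dropping constant columns) can be sketched with pandas. The toy DataFrame is an assumption for illustration; only the column names “num_outbound_cmds” and “is_host_login” come from the dataset described here.

```python
import pandas as pd

# Toy stand-in for the dataset; values are hypothetical.
df = pd.DataFrame({
    "duration": [0, 0, 5, 5, 9],
    "protocol_type": ["tcp", "tcp", "udp", "udp", "icmp"],
    "num_outbound_cmds": [0, 0, 0, 0, 0],
    "is_host_login": [0, 0, 0, 0, 0],
})

# 1. Remove exact duplicate rows so frequent records do not dominate training.
deduped = df.drop_duplicates().reset_index(drop=True)

# 2. Drop columns that hold a single value in every record.
constant_cols = [c for c in deduped.columns if deduped[c].nunique() == 1]
cleaned = deduped.drop(columns=constant_cols)
```

Detecting constant columns with `nunique()` rather than hard-coding the two names is a design choice of this sketch; it would also catch any other column that never varies.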

Process

• X (the input features) consists of 121 columns from the network_intrusion_detection csv file. All numeric columns are normalized using z-score, and categorical columns are label encoded.

• Y (the target label) is label encoded. In the case of TensorFlow, the number of classes passed as the second parameter to the to_categorical() function is 23, one for each unique connection type (good connections and bad connections).

• We split the data into train and test sets, using 80% of the data for training and the remaining 20% for testing.

• The next step is to train the models by applying the Logistic Regression and TensorFlow Classification methods.
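The split and training steps above can be sketched as below. This is a hedged illustration, not the notebook's code: the data is synthetic with 3 classes instead of the 23 in the dataset, the `random_state` values are arbitrary, and the one-hot encoding is done with NumPy to show the kind of matrix that to_categorical() produces for the TensorFlow model, without requiring TensorFlow to run the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic multi-class data standing in for the encoded records (assumption).
num_classes = 3
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=num_classes, random_state=0)

# 80% of the rows for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# scikit-learn's Logistic Regression takes the integer labels directly.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

# A neural-network classifier usually expects one-hot targets instead:
# row i has a 1 in column y_train[i] and 0 elsewhere.
y_train_onehot = np.eye(num_classes)[y_train]
```

The same train/test split feeds both models, so their test accuracies can be compared on identical records.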

Observations

Multi-class classification gave slightly higher accuracy for Logistic Regression than binary classification. Below is the list of observations for Logistic Regression and Fully-Connected Neural Networks (refer Fig 3 for binary and multi-class).