Image source: Google

Network Intrusion Detection System using Machine Learning with Feature Selection Techniques

Sayoni Sinha Chowdhury
10 min read · Feb 24, 2021

The Internet is a global public network, and with the growth of Internet traffic there has been an increasing need for security systems. There are both harmless and harmful users on the Internet, and information is available to both. Harmful users, also known as malicious users, can gain access to an organization's systems and cost the organization a great deal.

Therefore there has been rapid growth in the need for some kind of security for an organization's private resources. Every organization deploys firewalls to protect its network, but no network can be one hundred percent secure. Intrusion Detection Systems (IDS) work on top of firewalls: the firewall protects the organization from malicious attacks, while the IDS detects when someone breaks in through the firewall and tries to access a system on the trusted side, and then generates an alert.

Table of Content:

  1. Business Problem
  2. Dataset
  3. Exploratory Data Analysis
  4. Performance Metrics
  5. Base Model
  6. Feature Selection Techniques
  7. Different Model Tuning
  8. Conclusion
  9. Future work
  10. References

1. Business Problem:

Before building an intrusion detection system, the first step is to understand why an organization needs intrusion detection and prevention in the first place. An intrusion detection and prevention program needs to be implemented for the following reasons:

  • It is an important part of an overall security strategy.
  • It lets you obtain measurable metrics of actual attacks against your organization’s network.
  • It lets you better manage risk in your organization’s environment without impacting the day-to-day business processes.

Hence our goal is to build a security system that analyzes network traffic and predicts possible hostile activity using machine learning.

So, let’s dive deep into it :)

2.Dataset:

There are tons of datasets available online for research and experimentation purposes. In this case study I have used the NSL-KDD dataset, an improved version of the KDD Cup dataset with the redundant records removed. Details about the data can be found and downloaded from here.

The train dataset consists of 125,972 data points and 43 features, and the test dataset consists of 22,543 rows and 43 features.

Let’s import our data and rename the columns.
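Something like the following is a minimal sketch of that step. The file names and the final two column names (the label and the difficulty score) are assumptions; the 41 feature names are the standard KDD Cup names.

```python
import pandas as pd

# The NSL-KDD archive ships the data as comma-separated files without a header row.
COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "attack_type", "difficulty_level",
]

train = pd.read_csv("KDDTrain+.txt", header=None, names=COLUMNS)
test = pd.read_csv("KDDTest+.txt", header=None, names=COLUMNS)
print(train.shape, test.shape)
```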

The column descriptions and value ranges are described in much more detail at the link below.

After that I checked the entire dataset for duplicate entries and null values (there weren't any).

Let's move on to the EDA part of the data.

3. Exploratory Data Analysis:

Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

Our dataset contains 23 different attack types which are unevenly distributed. I have plotted a pie chart to visualize the distribution of different attack types.

Pie distribution of different attack types

The above plot gives an idea of the percentage share of each class. The normal class covers almost 53% of the data, followed by the Neptune class at 32%, with each of the remaining classes covering less than 3% of the entire dataset.

Next I performed univariate analysis of the features based on their correlation with the target variable. For correlations involving categorical features I have used the phik (φk) coefficient.
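As a sketch, the phik library registers a phik_matrix() accessor on pandas DataFrames; assuming the column names from the loading sketch above, the feature-target correlations could be computed like this:

```python
import phik  # noqa: F401  -- registers the .phik_matrix() accessor on DataFrames

# Numeric columns are treated as interval variables; everything else
# (protocol_type, service, flag, attack_type) is handled as categorical.
interval_cols = train.select_dtypes(include="number").columns.tolist()
phik_corr = train.phik_matrix(interval_cols=interval_cols)

# Correlation of every feature with the target column, sorted descending.
target_corr = phik_corr["attack_type"].drop("attack_type").sort_values(ascending=False)
print(target_corr.head(10))
```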

Correlation between features and target variables

Based on the correlations above, I have visualized the distributions of the top 10 features against the target variable. Let's look at a few of them.

3.1. Plot between wrong fragment and attack types:

Wrong fragment vs attack type

Observation: Most of the records with wrong_fragment values "3" and "1" belong to the attack types "teardrop" and "pod". Now let's remove these types from the target class and visualize the distribution again.

3.2. Plot between land and attack type:

Land vs attack type

Observation: As this feature is binary, we can say that the data points with land value 1 belong only to the attack type "land". We will visualize the distribution again after removing the "land" type from the class label.

3.3. Plot between Protocol type and attack type:

Protocol type vs attack type

3.4. Box Plot:

Next, I have plotted the box plot for the features to check for outliers.

3.4.1. Destination Bytes

3.4.2. Hot

3.4.3. Count

From the above plots we can see that there are definitely a few outliers in our data. We will deal with them later.

3.5. TSNE Visualization:

Let's visualize our data in a lower dimension and see if we can separate our points.
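A rough sketch of how such a projection can be produced with scikit-learn's TSNE on a random sample; the sample size and perplexity below are arbitrary choices, not the ones used in the post.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# t-SNE is expensive, so work on a random sample of the scaled numeric features.
sample = train.sample(n=10000, random_state=42)
X_num = StandardScaler().fit_transform(sample.select_dtypes(include="number"))

embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_num)

plt.figure(figsize=(8, 6))
for label in sample["attack_type"].value_counts().index[:5]:  # five most frequent classes
    mask = (sample["attack_type"] == label).values
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=4, label=label)
plt.legend()
plt.title("t-SNE projection of a 10k-point sample")
plt.show()
```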

TSNE Visualization

From the TSNE visualization we can see that the attack types normal and Neptune can be separated; however, the minority classes overlap.

3.6. PCA Visualization:

PCA

We can observe similar behavior with PCA: the attack types normal and Neptune can be separated, while the minority classes overlap.

Now that we are done with our EDA, let's move on to data preprocessing, build a base model and see how it performs.

Our dataset has 23 attack types, which can be grouped into 4 major network attack categories (DoS, Probe, R2L and U2R). Hence I have replaced the individual types with these 4 categories.
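A minimal sketch of that grouping, assuming the label column names from the loading sketch above; the lists of attack names below are the commonly used NSL-KDD grouping, not something shown in this post.

```python
# Commonly used grouping of the NSL-KDD attack names into the four major
# categories; the exact lists below are an assumption, not taken from the post.
ATTACK_CATEGORY = {
    "back": "DoS", "land": "DoS", "neptune": "DoS", "pod": "DoS",
    "smurf": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}

for df in (train, test):
    # "normal" and any test-only attack names not listed above keep their
    # original label; the latter would need their own entries in practice.
    df["attack_class"] = df["attack_type"].map(ATTACK_CATEGORY).fillna(df["attack_type"])

print(train["attack_class"].value_counts())
```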

4. Performance Metrics:

For our problem we want a low false positive rate, because if the model wrongly predicts normal traffic as malicious, that traffic may later be blocked by the network, which is likely to negatively impact business functions. Hence I have focused on FPR, precision, recall, micro F1 score and the confusion matrix as evaluation criteria.
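Scikit-learn does not report a per-class false positive rate directly, so below is a small helper I would use to derive it from the confusion matrix (a sketch; the helper name is my own):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_fpr(y_true, y_pred):
    """One-vs-rest false positive rate, FP / (FP + TN), for every class."""
    cm = confusion_matrix(y_true, y_pred)
    fp = cm.sum(axis=0) - np.diag(cm)  # predicted as this class, actually another
    tn = cm.sum() - (cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm))
    return fp / (fp + tn)
```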

5. Base Model:

Next I label-encoded my categorical features and then applied a standard scaler to each feature. I then used a linear Support Vector Machine as my base model, fed all the training data into it, and evaluated the model's performance on this dataset.
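Continuing from the earlier sketches, the preprocessing and base model might look roughly like this; the exact encoder and scaler handling in the post is not shown, so treat the details as assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

X_train = train.drop(columns=["attack_type", "attack_class"]).copy()
X_test = test.drop(columns=["attack_type", "attack_class"]).copy()
y_train, y_test = train["attack_class"], test["attack_class"]

# Label-encode the three categorical columns. Fitting the encoder on
# train + test together is a simplification so that categories appearing
# only in the test split do not break the transform.
for col in ["protocol_type", "service", "flag"]:
    le = LabelEncoder().fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])

# Scale every feature and fit a linear SVM as the base model.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

base_model = LinearSVC(C=1.0, max_iter=5000)
base_model.fit(X_train_s, y_train)
print(classification_report(y_test, base_model.predict(X_test_s)))
```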

Base model

Let’s evaluate the performance of the base model.

We are getting an F1 score of around 78%, and the FPR for all 4 classes is also not bad. Let's see if we can improve this with different feature selection techniques.

6. Feature Selection Techniques:

6.1. Correlation-based Feature Selection (CFS)

Here I focus on the correlation between each feature and the target variable. I have set a threshold of 0.5: features whose correlation exceeds this threshold are likely to have more impact on the target variable and are therefore selected as part of my new dataset.
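Using the phik correlations computed earlier, the threshold filter is only a couple of lines (a sketch; target_corr comes from the earlier snippet):

```python
# Keep only the features whose phi-k correlation with the target exceeds 0.5.
THRESHOLD = 0.5
cfs_features = target_corr[target_corr > THRESHOLD].index.tolist()
print(f"{len(cfs_features)} features selected:", cfs_features)

X_train_cfs = X_train[cfs_features]
X_test_cfs = X_test[cfs_features]
```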

I have then standardized my data with the new set of features and applied multiple ML algorithms (Gaussian Naive Bayes, K Nearest Neighbor, One-vs-Rest Classifier, Random Forest, Decision Tree, XGBoost, Linear SVM and CatBoost Classifier) to the updated dataset and compared their performance with each other.

6.2. Let's try another feature selection technique: Information Gain

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. It is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency. Here I have used mutual_info_classif from sklearn.feature_selection.
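A sketch of that step with mutual_info_classif; the cut-off used below to drop low-MI features is an assumption.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Estimate mutual information between each (encoded) feature and the class label.
mi = mutual_info_classif(X_train, y_train, discrete_features="auto", random_state=42)
mi_series = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(mi_series)

# Drop the features with the lowest MI scores from both splits.
low_mi = mi_series[mi_series < 0.01].index  # cut-off value is an assumption
X_train_mi = X_train.drop(columns=low_mi)
X_test_mi = X_test.drop(columns=low_mi)
```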

MI of different features

Let’s plot the different feature importance for visualization.

Based on the above values I have dropped the features with the lowest MI values from both the train and test sets and trained all the above-mentioned models on this new dataset.

I have also trained each of these models on the entire training data with all 43 features, so that we can check for any considerable increase in performance.

Evaluation on full dataset.

Let's compare our results for all 3 of the above.

From the tabulated results above we can conclude that the MI-based feature selection method works best for a few of the classifiers. I will therefore build my final model on the MI subset of the data and tune different machine learning models to improve performance.

In the next step I will remove the outliers from my dataset.
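The post does not show how the outliers were removed, so the snippet below is only one possible approach (an IQR rule on the columns flagged in the box plots):

```python
import pandas as pd

def iqr_filter(df, cols, k=3.0):
    """Keep rows whose values fall within k * IQR of the quartiles for every column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

numeric_cols = ["dst_bytes", "hot", "count"]  # columns flagged in the box plots above
train_clean = iqr_filter(train, numeric_cols)
print(len(train), "->", len(train_clean), "rows after outlier removal")
```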

7. Different Model Tuning:

7.1. Logistic Regression
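The post does not list the exact hyperparameters searched, but the tuning step might look roughly like this with GridSearchCV (the grid values below are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside the pipeline so the grid search does not leak test statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid, scoring="f1_micro", cv=3, n_jobs=-1)
grid.fit(X_train_mi, y_train)

print(grid.best_params_, grid.best_score_)
y_pred = grid.best_estimator_.predict(X_test_mi)
```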

Logistic regression gives me an 84% F1 score and 84% precision. We can also see a large reduction in FPR for class 2. Let's look at the confusion matrix.

Confusion matrix of Logistic Regression

The model performs well for the attack types DoS and Normal; however, for attack type U2R, which belongs to the minority class, it can still be improved.

Let’s try other models.

7.2. Random Forest

Random Forest gives me a 75% F1 score and 75% precision. I do not see much reduction in FPR. Let's look at the confusion matrix.

Confusion matrix of Random Forest

Let’s look at another model.

7.3. Decision Tree

Decision Tree is performing better than Random Forest on unseen data. Let’s plot the confusion matrix.

Confusion matrix of Decision Tree

So far the best performance has come from Logistic Regression.

7.4. Linear Support Vector Machine

SVC is giving us an accuracy of 82% with good FPR values. I’ll plot the confusion matrix as well.

Confusion matrix of SVM

7.5. Catboost Classifier

The CatBoost classifier gives me an accuracy of 73%. I will try a few other algorithms as well.

Confusion matrix of Catboost

7.6. K Nearest Neighbor

KNN gives me an accuracy of 79%.

Confusion matrix of KNN

7.7. XGBoost Classifier

My XGBoost model is also not improving the performance.

Confusion matrix of XGBoost

7.8. MLP Classifier

I can reach an accuracy of up to 80% using the MLP Classifier. Let's visualize the confusion matrix.

Confusion matrix of MLP Classifier

The classifier performs poorly on the minority class.

7.9. Stacking CV Classifier

The Stacking CV Classifier is an ensemble-learning meta-classifier for stacking that uses cross-validation to prepare the inputs for the level-2 classifier and prevent overfitting. You can refer to the link below for more details.
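For reference, a minimal sketch of such a stack with mlxtend's StackingCVClassifier; the choice of base learners and meta-classifier here is illustrative, not the exact combination used in the post.

```python
from mlxtend.classifier import StackingCVClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

stack = StackingCVClassifier(
    classifiers=[
        RandomForestClassifier(n_estimators=100, random_state=42),
        DecisionTreeClassifier(random_state=42),
    ],
    meta_classifier=LogisticRegression(max_iter=2000),
    cv=5,
    random_state=42,
)
# mlxtend works most reliably with plain numpy arrays.
stack.fit(X_train_mi.values, y_train.values)
print(stack.score(X_test_mi.values, y_test.values))
```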

After stacking my different models and tuning the hyperparameters of the meta-classifier, I do not see any improvement in model performance.

8. Conclusion:

Tabulated Result

From all the above models we can conclude that the Decision Tree performs best at classifying the minority class; however, the Logistic Regression model works best overall and gives us the best accuracy. Hence, we will save model1 (the Logistic Regression model) and use it for prediction.

For the entire code please refer to my GitHub profile.

My Linked In profile:

linkedin.com/in/sayoni-sinha-9442474b

9. Future Work:

To further improve performance on the minority classes, we can use the SMOTE algorithm to deal with the uneven distribution of the data.
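For example, with imbalanced-learn, oversampling the training split might look like this (a sketch, applied to the MI feature subset from earlier):

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample the minority attack classes in the training split only;
# the test split is left untouched.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_mi, y_train)
print(pd.Series(y_train_res).value_counts())
```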

10. References:

  1. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.680.6760&rep=rep1&type=pdf
  2. https://publications.waset.org/3936/network-intrusion-detection-design-using-feature-selection-of-soft-computing-paradigms
  3. https://www.appliedaicourse.com/
