Image source: Google

Network Intrusion Detection System using Machine Learning with Feature Selection Techniques

Sayoni Sinha Chowdhury
10 min read · Feb 24, 2021

The Internet is a global public network, and with the growth of Internet traffic there has been an increasing need for security systems. There are both harmless and harmful users on the Internet, and information is available to both. Harmful users, also known as malicious users, can gain access to an organization's systems and cost the organization a great deal.

Therefore there has been rapid growth in the need for some kind of security for an organization's private resources. Every organization deploys firewalls to protect its network, but no network can be one hundred percent secure. Intrusion Detection Systems (IDS) work on top of firewalls: the firewall protects the organization from malicious attacks, while the IDS detects when someone breaks in through the firewall and tries to access a system on the trusted side, and then generates an alert.

Table of Content:

  1. Business Problem
  2. Dataset
  3. Exploratory Data Analysis
  4. Performance Metrics
  5. Base Model
  6. Feature Selection Techniques
  7. Different Model Tuning
  8. Conclusion
  9. Future work
  10. References

1. Business Problem:

Before building an intrusion detection system, the first step is to understand why an organization needs intrusion detection and prevention in the first place. An intrusion detection and prevention program needs to be implemented for the following reasons:

  • It is an important part of an overall security strategy.
  • It lets you obtain measurable metrics of actual attacks against your organization’s network.
  • It lets you better manage risk in your organization’s environment without impacting the day-to-day business processes.

Hence our goal is to build a security system that analyzes network traffic and predicts possible hostile activity using machine learning.

So, let’s dive deep into it :)

2.Dataset:

There are tons of datasets available online for research and experimentation purposes. In this case study I have used the NSL-KDD dataset, an improved version of the KDD Cup dataset with the redundant records removed. Details about the data can be found and downloaded from here.

The train dataset consists of 125,972 data points and 43 features, and the test dataset consists of 22,543 rows and 43 features.

Let’s import our data and rename the columns.
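Something like the following is a minimal sketch of that step. The file names and the final two column names (the label and the difficulty score) are assumptions; the 41 feature names are the standard KDD Cup names.

```python
import pandas as pd

# The NSL-KDD archive ships the data as comma-separated files without a header row.
COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "attack_type", "difficulty_level",
]

train = pd.read_csv("KDDTrain+.txt", header=None, names=COLUMNS)
test = pd.read_csv("KDDTest+.txt", header=None, names=COLUMNS)
print(train.shape, test.shape)
```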

The column descriptions and value ranges are described in much more detail at the link below.

After that I checked the entire dataset for duplicate entries and null values (there weren't any).

Let's move on to the EDA part of the data.

3. Exploratory Data Analysis:

Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

Our dataset contains 23 different attack types which are unevenly distributed. I have plotted a pie chart to visualize the distribution of different attack types.

Pie distribution of different attack types

The above plot gives an idea of the percentage share of each class. The normal class covers almost 53% of the data, followed by the Neptune class at 32%, with each of the remaining classes covering less than 3% of the entire dataset.

Next I performed univariate analysis of the features based on their correlation with the target variable. For correlations involving categorical features I have used the phik (φk) coefficient.
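As a sketch, the phik library registers a phik_matrix() accessor on pandas DataFrames; assuming the column names from the loading sketch above, the feature-target correlations could be computed like this:

```python
import phik  # noqa: F401  -- registers the .phik_matrix() accessor on DataFrames

# Numeric columns are treated as interval variables; everything else
# (protocol_type, service, flag, attack_type) is handled as categorical.
interval_cols = train.select_dtypes(include="number").columns.tolist()
phik_corr = train.phik_matrix(interval_cols=interval_cols)

# Correlation of every feature with the target column, sorted descending.
target_corr = phik_corr["attack_type"].drop("attack_type").sort_values(ascending=False)
print(target_corr.head(10))
```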

Correlation between features and target variables

Based on the correlations above, I have visualized the distributions of the top 10 features against the target variable. Let's look at a few of them.

3.1. Plot between wrong fragment and attack types:

Wrong fragment vs attack type

Observation: Most of the records with wrong_fragment values "3" and "1" belong to the attack types "teardrop" and "pod". Now let's remove these types from the target class and visualize the distribution again.

3.2. Plot between land and attack type:

Land vs attack type

Observation: As this feature is binary, we can say that the data points with land value 1 belong only to the attack type "land". We will visualize the distribution again after removing the "land" type from the class label.

3.3. Plot between Protocol type and attack type:

Protocol type vs attack type

3.4. Box Plot:

Next, I have plotted the box plot for the features to check for outliers.

3.4.1. Destination Bytes

3.4.2. Hot

3.4.3. Count

From the above plots we can see that there are definitely a few outliers in our data. We will deal with them later.

3.5. TSNE Visualization:

Let's visualize our data in a lower dimension and see if we can separate our points.
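A rough sketch of how such a projection can be produced with scikit-learn's TSNE on a random sample; the sample size and perplexity below are arbitrary choices, not the ones used in the post.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# t-SNE is expensive, so work on a random sample of the scaled numeric features.
sample = train.sample(n=10000, random_state=42)
X_num = StandardScaler().fit_transform(sample.select_dtypes(include="number"))

embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_num)

plt.figure(figsize=(8, 6))
for label in sample["attack_type"].value_counts().index[:5]:  # five most frequent classes
    mask = (sample["attack_type"] == label).values
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=4, label=label)
plt.legend()
plt.title("t-SNE projection of a 10k-point sample")
plt.show()
```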

TSNE Visualization

From the TSNE visualization we can see that the attack types normal and Neptune can be separated; however, the minority classes overlap.

3.6. PCA Visualization:

PCA

We can observe similar behavior with PCA: the attack types normal and Neptune can be separated, while the minority classes overlap.

Now that we are done with our EDA, let's move on to data preprocessing, build a base model and see how it performs.

Our dataset has 23 attack types, which can be grouped into 4 major network attack categories (DoS, Probe, R2L and U2R). Hence I have replaced the individual types with these 4 categories.
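A minimal sketch of that grouping, assuming the label column names from the loading sketch above; the lists of attack names below are the commonly used NSL-KDD grouping, not something shown in this post.

```python
# Commonly used grouping of the NSL-KDD attack names into the four major
# categories; the exact lists below are an assumption, not taken from the post.
ATTACK_CATEGORY = {
    "back": "DoS", "land": "DoS", "neptune": "DoS", "pod": "DoS",
    "smurf": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}

for df in (train, test):
    # "normal" and any test-only attack names not listed above keep their
    # original label; the latter would need their own entries in practice.
    df["attack_class"] = df["attack_type"].map(ATTACK_CATEGORY).fillna(df["attack_type"])

print(train["attack_class"].value_counts())
```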

4. Performance Metrics:

For our problem we want a low false positive rate, because if the model wrongly predicts normal traffic as malicious, that traffic may later be blocked by the network, which is likely to negatively impact business functions. Hence I have focused on FPR, precision, recall, micro F1 score and the confusion matrix as evaluation criteria.
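Scikit-learn does not report a per-class false positive rate directly, so below is a small helper I would use to derive it from the confusion matrix (a sketch; the helper name is my own):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_fpr(y_true, y_pred):
    """One-vs-rest false positive rate, FP / (FP + TN), for every class."""
    cm = confusion_matrix(y_true, y_pred)
    fp = cm.sum(axis=0) - np.diag(cm)  # predicted as this class, actually another
    tn = cm.sum() - (cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm))
    return fp / (fp + tn)
```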

5. Base Model:

Next I label-encoded my categorical features and then applied a standard scaler to each feature. I then used a linear Support Vector Machine as my base model, fed all the training data into it, and evaluated the model's performance on this dataset.
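Continuing from the earlier sketches, the preprocessing and base model might look roughly like this; the exact encoder and scaler handling in the post is not shown, so treat the details as assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

X_train = train.drop(columns=["attack_type", "attack_class"]).copy()
X_test = test.drop(columns=["attack_type", "attack_class"]).copy()
y_train, y_test = train["attack_class"], test["attack_class"]

# Label-encode the three categorical columns. Fitting the encoder on
# train + test together is a simplification so that categories appearing
# only in the test split do not break the transform.
for col in ["protocol_type", "service", "flag"]:
    le = LabelEncoder().fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])

# Scale every feature and fit a linear SVM as the base model.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

base_model = LinearSVC(C=1.0, max_iter=5000)
base_model.fit(X_train_s, y_train)
print(classification_report(y_test, base_model.predict(X_test_s)))
```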

Base model

Let’s evaluate the performance of the base model.

We are getting an F1 score of around 78%, and the FPR for all 4 classes is also not bad. Let's see if we can improve this with different feature selection techniques.

6. Feature Selection Techniques:

6.1. Correlation-based Feature Selection (CFS)

Here I focus on the correlation between each feature and the target variable. I have set a threshold of 0.5: features whose correlation exceeds this threshold are likely to have more impact on the target variable and are therefore selected as part of my new dataset.
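Using the phik correlations computed earlier, the threshold filter is only a couple of lines (a sketch; target_corr comes from the earlier snippet):

```python
# Keep only the features whose phi-k correlation with the target exceeds 0.5.
THRESHOLD = 0.5
cfs_features = target_corr[target_corr > THRESHOLD].index.tolist()
print(f"{len(cfs_features)} features selected:", cfs_features)

X_train_cfs = X_train[cfs_features]
X_test_cfs = X_test[cfs_features]
```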

I have then standardized my data with the new set of features and applied multiple ML algorithms (Gaussian Naive Bayes, K Nearest Neighbor, One-vs-Rest Classifier, Random Forest, Decision Tree, XGBoost, Linear SVM and CatBoost Classifier) to the updated dataset and compared their performance with each other.

6.2. Let's try another feature selection technique: Information Gain

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. It is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency. Here I have used mutual_info_classif from sklearn.feature_selection.
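A sketch of that step with mutual_info_classif; the cut-off used below to drop low-MI features is an assumption.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Estimate mutual information between each (encoded) feature and the class label.
mi = mutual_info_classif(X_train, y_train, discrete_features="auto", random_state=42)
mi_series = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(mi_series)

# Drop the features with the lowest MI scores from both splits.
low_mi = mi_series[mi_series < 0.01].index  # cut-off value is an assumption
X_train_mi = X_train.drop(columns=low_mi)
X_test_mi = X_test.drop(columns=low_mi)
```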

MI of different features

Let’s plot the different feature importance for visualization.

Based on the above values I have dropped the features with the lowest MI values from both the train and test sets and trained all the above-mentioned models on this new dataset.

I have also trained each of these models on the entire training data with all 43 features, so that we can check for any considerable increase in performance.

Evaluation on full dataset.

Let's compare our results for all 3 of the above.

From the tabulated results above we can conclude that the MI-based feature selection method works best for a few of the classifiers. I will therefore build my final model on the MI subset of the data and tune different machine learning models to improve performance.

In the next step I will remove the outliers from my dataset.
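The post does not show how the outliers were removed, so the snippet below is only one possible approach (an IQR rule on the columns flagged in the box plots):

```python
import pandas as pd

def iqr_filter(df, cols, k=3.0):
    """Keep rows whose values fall within k * IQR of the quartiles for every column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

numeric_cols = ["dst_bytes", "hot", "count"]  # columns flagged in the box plots above
train_clean = iqr_filter(train, numeric_cols)
print(len(train), "->", len(train_clean), "rows after outlier removal")
```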

7. Different Model Tuning:

7.1. Logistic Regression
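The post does not list the exact hyperparameters searched, but the tuning step might look roughly like this with GridSearchCV (the grid values below are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside the pipeline so the grid search does not leak test statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid, scoring="f1_micro", cv=3, n_jobs=-1)
grid.fit(X_train_mi, y_train)

print(grid.best_params_, grid.best_score_)
y_pred = grid.best_estimator_.predict(X_test_mi)
```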

Logistic regression gives me an 84% F1 score and 84% precision. We can also see a large reduction in FPR for class 2. Let's look at the confusion matrix.

Confusion matrix of Logistic Regression

The model performs well for the attack types DoS and Normal; however, for attack type U2R, which belongs to the minority class, it can still be improved.

Let’s try other models.

7.2. Random Forest

Random Forest gives me a 75% F1 score and 75% precision. I do not see much reduction in FPR. Let's look at the confusion matrix.

Confusion matrix of Random Forest

Let’s look at another model.

7.3. Decision Tree

Decision Tree is performing better than Random Forest on unseen data. Let’s plot the confusion matrix.

Confusion matrix of Decision Tree

So far the best performance has come from Logistic Regression.

7.4. Linear Support Vector Machine

SVC is giving us an accuracy of 82% with good FPR values. I’ll plot the confusion matrix as well.

Confusion matrix of SVM

7.5. Catboost Classifier

The CatBoost classifier gives me an accuracy of 73%. I will try a few other algorithms as well.

Confusion matrix of Catboost

7.6. K Nearest Neighbor

KNN gives me an accuracy of 79%.

Confusion matrix of KNN

7.7. XGBoost Classifier

My XGBoost model is also not improving the performance.

Confusion matrix of XGBoost

7.8. MLP Classifier

I can reach an accuracy of up to 80% using the MLP Classifier. Let's visualize the confusion matrix.

Confusion matrix of MLP Classifier

The classifier performs poorly on the minority class.

7.9. Stacking CV Classifier

The Stacking CV Classifier is an ensemble-learning meta-classifier for stacking that uses cross-validation to prepare the inputs for the level-2 classifier and prevent overfitting. You can refer to the link below for more details.
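For reference, a minimal sketch of such a stack with mlxtend's StackingCVClassifier; the choice of base learners and meta-classifier here is illustrative, not the exact combination used in the post.

```python
from mlxtend.classifier import StackingCVClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

stack = StackingCVClassifier(
    classifiers=[
        RandomForestClassifier(n_estimators=100, random_state=42),
        DecisionTreeClassifier(random_state=42),
    ],
    meta_classifier=LogisticRegression(max_iter=2000),
    cv=5,
    random_state=42,
)
# mlxtend works most reliably with plain numpy arrays.
stack.fit(X_train_mi.values, y_train.values)
print(stack.score(X_test_mi.values, y_test.values))
```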

After stacking my different models and tuning the hyperparameters of the meta-classifier, I do not see any improvement in model performance.

8. Conclusion:

Tabulated Result

From all the above models we can conclude that the Decision Tree performs best at classifying the minority class; however, the Logistic Regression model works best overall and gives us the best accuracy. Hence, we will save model1 (the Logistic Regression model) and use it for prediction.

For the entire code please refer to my GitHub profile.

My Linked In profile:

linkedin.com/in/sayoni-sinha-9442474b

9. Future Work:

To further improve performance on the minority classes, we can use the SMOTE algorithm to deal with the uneven distribution of the data.
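For example, with imbalanced-learn, oversampling the training split might look like this (a sketch, applied to the MI feature subset from earlier):

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample the minority attack classes in the training split only;
# the test split is left untouched.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_mi, y_train)
print(pd.Series(y_train_res).value_counts())
```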

10. References:

  1. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.680.6760&rep=rep1&type=pdf
  2. https://publications.waset.org/3936/network-intrusion-detection-design-using-feature-selection-of-soft-computing-paradigms
  3. https://www.appliedaicourse.com/
