Network Intrusion Detection System using Machine learning with feature selection techniques

The Internet is a global public network, and as its traffic has grown, so has the need for security systems. Both harmless and harmful users are on the Internet, and information is available to both. Harmful (malicious) users can gain access to an organization’s systems, which can cost the organization dearly.

As a result, the need to protect an organization’s private resources has grown rapidly. Organizations deploy firewalls to protect their networks, but no network can be one hundred percent secure. Intrusion detection systems (IDS) work on top of firewalls: the firewall protects the organization from malicious attacks, while the IDS detects when someone tries to break through the firewall and access a system on the trusted side, and then generates an alert.

Table of Contents:

  1. Business Problem
  2. Dataset
  3. Exploratory Data Analysis
  4. Performance Metrics
  5. Base Model
  6. Feature Selection Techniques
  7. Different Model Tuning
  8. Conclusion
  9. Future work
  10. References

1. Business Problem:

  • An IDS is an important part of an overall security strategy.
  • It lets you obtain measurable metrics of actual attacks against your organization’s network.
  • It lets you better manage risk in your organization’s environment without impacting the day-to-day business processes.

Hence our goal is to build a security system that uses machine learning to analyze network traffic and predict possibly hostile activity.

So, let’s dive deep into it :)


2. Dataset:

The train dataset consists of 125972 data points and 43 features, and the test dataset consists of 22543 rows and 43 features.

Let’s import our data and rename the columns.

The columns and their value ranges are described in detail at the link below.

After that, I checked the entire dataset for duplicate entries and null values (there weren’t any).
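A minimal sketch of the import, rename and sanity-check steps, assuming the data is a header-less CSV read with pandas. The two rows and the eight column names below are made up for illustration; the real file has 43 columns.

```python
import io
import pandas as pd

# Two made-up rows standing in for the header-less CSV file.
raw_csv = io.StringIO(
    "0,tcp,http,SF,181,5450,normal,21\n"
    "0,icmp,ecr_i,SF,1032,0,smurf,18\n"
)

# Hypothetical subset of column names, for illustration only.
column_names = ["duration", "protocol_type", "service", "flag",
                "src_bytes", "dst_bytes", "attack", "level"]

# header=None keeps the first data row from being consumed as a header.
df = pd.read_csv(raw_csv, header=None, names=column_names)

n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
n_nulls = int(df.isnull().sum().sum())     # missing cells across all columns
print(df.shape, n_duplicates, n_nulls)
```

If either count were non-zero, `df.drop_duplicates()` or `df.dropna()` would be the usual next step.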

Let’s move on to the EDA part.

3. Exploratory Data Analysis:

Our dataset contains 23 different attack types which are unevenly distributed. I have plotted a pie chart to visualize the distribution of different attack types.

Pie distribution of different attack types

The above plot shows the percentage share of each class. The normal class covers almost 53% of the data, followed by the Neptune class at 32%; each of the remaining classes covers less than 3% of the dataset.
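The percentages behind such a pie chart can be computed directly from the label column. This is a small sketch on a made-up label series; the series name `attack` is an assumption, not the article's actual column name.

```python
import pandas as pd

# Made-up labels: 5 normal, 3 neptune and 2 rare attack types.
labels = pd.Series(
    ["normal"] * 5 + ["neptune"] * 3 + ["smurf", "satan"],
    name="attack",
)

# Share of each class in percent, largest first.
class_share = labels.value_counts(normalize=True).mul(100).round(1)
print(class_share)
```

With matplotlib installed, `class_share.plot.pie(autopct="%1.1f%%")` would draw the same numbers as a pie chart.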

Next, I ran a univariate analysis of the features, ranked by each feature’s correlation with the target. For categorical features I used the phik coefficient to measure correlation.

Correlation between features and target variables

From the ranking above, I visualized the distributions of the top 10 features against the target variable. Let’s look at a few of them.

3.1. Plot between wrong fragment and attack types:

Wrong fragment vs attack type

Observation: Most of the records with a wrong fragment value of “3” or “1” belong to the attack types “teardrop” and “pod”. Now let’s remove these types from the target class and visualize the distribution again.

3.2. Plot between land and attack type:

Land vs attack type

Observation: Since this feature is binary, we can say that the data points with a land value of 1 belong only to the attack type “land”. We will visualize the distribution again after removing the “land” type from the class label.

3.3. Plot between Protocol type and attack type:

Protocol type vs attack type

3.4. Box Plot:

3.4.1. Destination Bytes

3.4.2. Hot

3.4.3. Count

From the above plots we can see that there are definitely a few outliers in our data. We will deal with them later.
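A box plot usually flags points beyond 1.5 times the interquartile range as outliers. The sketch below applies that same rule with NumPy; the `dst_bytes` values are made up, and the article does not state which outlier rule it uses later, so treat this as one plausible choice.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up byte counts with one very small and one very large value.
dst_bytes = np.array([0, 10, 12, 15, 11, 13, 14, 5000])
mask = iqr_outlier_mask(dst_bytes)
print(dst_bytes[mask])
```

Rows can then be dropped with `dst_bytes[~mask]`, or the values can be clipped to the whisker limits instead of removed.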

3.5. TSNE Visualization:

TSNE Visualization

From the TSNE visualization we can see that the normal and Neptune classes form separable groups, whereas the minority classes overlap.

3.6. PCA Visualization:


PCA shows similar behavior: the normal and Neptune classes form separable groups, whereas the minority classes overlap.
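A two-dimensional projection of this kind can be sketched with scikit-learn as follows. The data here is synthetic (two well-separated Gaussian groups), purely to show the mechanics; it is not the article's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two well-separated groups in 5 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 5))
group_b = rng.normal(loc=6.0, scale=1.0, size=(50, 5))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)  # shape (n_samples, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)
```

Swapping `PCA(n_components=2)` for `sklearn.manifold.TSNE(n_components=2)` gives the t-SNE counterpart, which is slower but better at preserving local neighborhoods.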

Now that we are done with EDA, let’s move on to data preprocessing, build a base model and see how it performs.

Our dataset has 23 attack types, which can be grouped into 4 major categories of network attack. I therefore replaced the individual types with these 4 categories.
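The grouping can be expressed as a plain dictionary lookup. The mapping below assumes the standard KDD-style taxonomy (DoS, Probe, R2L, U2R, with normal traffic kept as its own label); the article does not list the exact mapping it used, and the column names are hypothetical.

```python
import pandas as pd

# Assumed KDD-style grouping of the 23 labels (22 attacks plus "normal").
ATTACK_CATEGORY = {
    "normal": "Normal",
    # Denial of service
    "neptune": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "pod": "DoS", "land": "DoS", "back": "DoS",
    # Probing / scanning
    "satan": "Probe", "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe",
    # Remote to local
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "multihop": "R2L", "warezmaster": "R2L", "warezclient": "R2L", "spy": "R2L",
    # User to root
    "buffer_overflow": "U2R", "loadmodule": "U2R", "rootkit": "U2R", "perl": "U2R",
}

df = pd.DataFrame({"attack": ["normal", "neptune", "rootkit"]})
df["category"] = df["attack"].map(ATTACK_CATEGORY)
print(df)
```

Any label missing from the dictionary maps to NaN, which makes unseen attack names in the test set easy to spot.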

4. Performance Metrics:

5. Base Model:

Base model

Let’s evaluate the performance of the base model.

We get an F1 score of around 78%, and the FPR values for all 4 classes are acceptable as well. Let’s see if we can improve this with different feature selection techniques.
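For reference, per-class FPR and F1 can both be read off a multi-class confusion matrix. This is a small sketch with a made-up 3-class matrix, assuming rows are true classes and columns are predicted classes (the scikit-learn convention).

```python
import numpy as np

def per_class_counts(cm):
    """Split a confusion matrix (rows = true, columns = predicted)
    into one-vs-rest TP, FP, FN and TN counts per class."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp   # actually the class, but predicted as another
    tn = cm.sum() - tp - fp - fn
    return tp, fp, fn, tn

def per_class_fpr(cm):
    tp, fp, fn, tn = per_class_counts(cm)
    return fp / (fp + tn)

def per_class_f1(cm):
    tp, fp, fn, tn = per_class_counts(cm)
    return 2 * tp / (2 * tp + fp + fn)

# Made-up confusion matrix with 10 samples per true class.
cm = [[8, 2, 0],
      [1, 7, 2],
      [0, 1, 9]]
fpr = per_class_fpr(cm)
f1 = per_class_f1(cm)
print(fpr, f1)
```

A low FPR matters for an IDS because every false positive is an alert raised on legitimate traffic.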

6. Feature Selection Techniques:

6.1. Correlation-based Feature Selection (CFS)

I then standardized the data with the new set of features, applied multiple ML algorithms (Gaussian Naive Bayes, K Nearest Neighbor, One-vs-Rest Classifier, Random Forest, Decision Tree, XGBoost, Linear SVM and CatBoost Classifier) to the updated dataset, and compared their performance.
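The standardize-then-compare loop can be sketched as below. It uses synthetic data and only three of the listed classifiers with default settings, so the scores say nothing about the article's results; the point is that the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the selected features.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "GaussianNB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test_s), average="weighted")

print(scores)
```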

6.2. Let’s try another feature selection technique: Information Gain

MI of different features

Let’s plot the feature importances for visualization.

Based on the above values, I dropped the features with the lowest MI (mutual information) values from both the train and test sets and trained all of the above-mentioned models on this new dataset.
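A hedged sketch of MI-based selection using scikit-learn's `mutual_info_classif`. The three features are synthetic and the cut-off `n_keep` is an arbitrary illustration; the article does not state how many features it dropped.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Synthetic features: one tracks the label closely, one weakly, one not at all.
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.1, size=n),
    "weak": y + rng.normal(0, 2.0, size=n),
    "noise": rng.normal(0, 1.0, size=n),
})

# Estimated mutual information between each feature and the label.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
mi = mi.sort_values(ascending=False)

n_keep = 2  # arbitrary cut-off for this illustration
selected = mi.index[:n_keep].tolist()
X_selected = X[selected]
print(mi)
```

The same `selected` list has to be applied to the test set, so that both sets end up with identical columns.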

I also trained each of these models on the full train data with all 43 features, so that we can check for any considerable difference in performance.

Evaluation on full dataset.

Let’s compare the results of all 3 approaches above.

From the tabulated results above we can conclude that the MI-based feature selection method works best for a few of the classifiers. I will therefore build my final model on the MI subset of the data and tune different machine learning models to improve performance.

In the next step I will remove the outliers from my dataset.

7. Different Model Tuning:

7.1. Logistic Regression

Logistic Regression gives me an F1 score of 84% and a precision of 84%. We can also see a large reduction in the FPR for class 2. Let’s look at the confusion matrix.
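Tuning of this kind is commonly done with a cross-validated grid search. The sketch below assumes the regularization strength `C` is the hyperparameter being tuned and uses synthetic data; the article does not say which parameters or search method were actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the MI-selected features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Scaling inside the pipeline keeps each CV fold free of leakage.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed search space: inverse regularization strength only.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=3)
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
print(best_C, search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refitted on all of the data with the winning setting.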

Confusion matrix of Logistic Regression

The model performs well for the attack types DOS and Normal; however, for the attack type U2R, which is a minority class, there is still room for improvement.

Let’s try other models.

7.2. Random Forest

Random Forest gives me an F1 score of 75% and a precision of 75%. I do not see much reduction in FPR. Let’s look at the confusion matrix.

Confusion matrix of Random Forest

Let’s look at another model.

7.3. Decision Tree

Decision Tree is performing better than Random Forest on unseen data. Let’s plot the confusion matrix.

Confusion matrix of Decision Tree

So far, Logistic Regression has given us the best performance.

7.4. Linear Support Vector Machine

The linear SVC gives us an accuracy of 82% with good FPR values. I’ll plot the confusion matrix as well.

Confusion matrix of SVM

7.5. Catboost Classifier

The Catboost classifier gives me an accuracy of 73%. I will try a few other algorithms as well.

Confusion matrix of Catboost

7.6. K Nearest Neighbor

KNN gives me an accuracy of 79%.

Confusion matrix of KNN

7.7. XGBoost Classifier

My XGBoost model does not improve the performance either.

Confusion matrix of XGBoost

7.8. MLP Classifier

I can reach an accuracy of up to 80% using MLP Classifier. Let’s visualize the confusion matrix.

Confusion matrix of MLP Classifier

The classifier performs poorly on the minority class.

7.9. Stacking CV Classifier

After stacking my different models and tuning the hyperparameters of the meta-classifier, I do not see any improvement in model performance.
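As an illustration of the idea, here is a stacking sketch using scikit-learn's `StackingClassifier`, which is a stand-in and not necessarily the stacking implementation used in the article. The base models, meta-classifier and data are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Base models produce out-of-fold predictions (cv=3) that become the
# input features of the meta-classifier.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(accuracy)
```

Stacking tends to help only when the base models make different kinds of errors, which may be why it brings no gain when all of them struggle on the same minority classes.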


Tabulated Result

8. Conclusion:

From all the above models we can conclude that Decision Tree performs best at classifying the minority class, whereas the Logistic Regression model works best overall and gives us the best accuracy. Hence, we will save model1 (the Logistic Regression model) and use it for prediction.
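Saving and reloading a fitted model can be sketched with Python's built-in `pickle` module. The file name `model1.pkl`, the temporary directory and the synthetic training data are placeholders for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the tuned Logistic Regression model.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model1 = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical location; a real project would use a fixed path.
model_path = os.path.join(tempfile.mkdtemp(), "model1.pkl")

with open(model_path, "wb") as f:
    pickle.dump(model1, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model1.predict(X)).all())
print(same_predictions)
```

Any fitted preprocessing (the scaler and the list of selected features) needs to be saved alongside the model, and pickle files should only be loaded from trusted sources.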

For the entire code please refer to my GitHub profile.

My LinkedIn profile:

9. Future Work:


