IoT Intrusion Detection using Gradient Boosting

A comprehensive machine learning project for detecting IoT network intrusions using optimized gradient boosting algorithms. This repository contains version 3.0.0 with critical improvements in validation methodology, data leakage prevention, and comprehensive ablation studies.

📊 Project Overview

This enhanced version addresses the critical challenge of IoT security by developing robust intrusion detection systems using state-of-the-art gradient boosting algorithms. The updated framework includes proper cross-validation, data leakage fixes, and detailed ablation studies to provide more reliable performance evaluations.

🎯 Key Features in v3.0.0

  • Critical Data Leakage Fix: Feature selection and scaling now performed only on training data
  • Enhanced Validation: 5-fold stratified cross-validation for all models
  • Comprehensive Ablation Studies: Direct comparison of full vs. reduced feature sets
  • Multiple Classification Tasks: Binary, 8-class, and 34-class classification
  • Advanced Algorithms: XGBoost, LightGBM, CatBoost with optimized hyperparameters
  • Feature Selection: 50% feature reduction (46→23) with minimal accuracy impact
  • Performance Metrics: Accuracy, F1-score, training time, and inference latency analysis

πŸ“ Project Structure

Supplementary_Materials_v3.0.0/
├── README.md                          # This file
├── LICENSE                            # CC BY 4.0 License
├── requirements.txt                   # Python dependencies
├── experiment_scripts/
│   ├── binary_classification_experiment.py     # Binary classification (Benign vs Attack)
│   ├── 8class_classification_experiment.py     # 8-class attack categorization
│   └── 34class_classification_experiment.py    # 34-class fine-grained classification
├── datasets/
│   ├── datasets.tar.xz                # Compressed preprocessed datasets
│   ├── binary_train_23features.csv    # Binary training data (23 features)
│   ├── binary_test_23features.csv     # Binary testing data (23 features)
│   ├── 8class_train_23features.csv    # 8-class training data
│   ├── 8class_test_23features.csv     # 8-class testing data
│   ├── 34class_train_23features.csv   # 34-class training data
│   └── 34class_test_23features.csv    # 34-class testing data
├── results/ (generated during execution)
│   ├── binary_classification/
│   │   ├── ablation_study/
│   │   ├── classification_reports/
│   │   ├── confusion_matrices/
│   │   ├── distributions/
│   │   ├── feature_selection/
│   │   ├── fold_results/
│   │   └── summary/
│   ├── 8class_classification/ (similar structure)
│   ├── 34class_classification/ (similar structure)
│   └── final_results.xlsx             # Comprehensive summary
└── figures/ (generated during execution)
    ├── figures_binary/
    ├── figures_8/
    └── figures_multi/

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • 16GB RAM minimum (32GB recommended for full experiments)
  • CUDA-compatible GPU (optional but recommended for acceleration)

Installation

  1. Clone the repository:
git clone https://github.com/sa1ah-ai/iot-intrusion-detection-gradient-boosting.git
cd iot-intrusion-detection-gradient-boosting
  2. Install dependencies:
pip install -r requirements.txt

Extract Preprocessed Datasets

cd datasets
tar -xJf datasets.tar.xz

Run Experiments

# Binary classification (Benign vs Attack)
python experiment_scripts/binary_classification_experiment.py

# 8-class classification (Attack families)
python experiment_scripts/8class_classification_experiment.py

# 34-class classification (All attack types)
python experiment_scripts/34class_classification_experiment.py

📈 Enhanced Results Summary (v3.0.0)

Binary Classification Performance

| Algorithm              | Accuracy | Macro-F1 | Inference Latency | Features |
|------------------------|----------|----------|-------------------|----------|
| XGBoost (23 features)  | 99.61%   | 0.995    | 0.355 µs/sample   | 23       |
| XGBoost (46 features)  | 99.62%   | 0.995    | 0.727 µs/sample   | 46       |
| LightGBM (23 features) | 99.51%   | 0.994    | 0.412 µs/sample   | 23       |
| CatBoost (23 features) | 99.57%   | 0.995    | 0.521 µs/sample   | 23       |
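Per-sample latencies like those in the table are typically obtained by timing batch prediction and dividing by batch size. The helper below is an illustrative sketch, not part of the repository; it works with any object exposing a `.predict` method, such as a fitted XGBoost or LightGBM classifier:

```python
import time

def per_sample_latency_us(model, X, repeats: int = 10) -> float:
    # Time `repeats` batch predictions and keep the fastest run to
    # suppress scheduler noise; report microseconds per sample.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        model.predict(X)
        best = min(best, time.perf_counter() - t0)
    return best / len(X) * 1e6

class EchoModel:
    # Stand-in with the same .predict interface as a fitted classifier.
    def predict(self, batch):
        return [0] * len(batch)

latency = per_sample_latency_us(EchoModel(), list(range(1_000)))
```

Taking the minimum over repeats (rather than the mean) gives a more stable estimate of the model's best-case per-sample cost.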

Feature Selection Impact

  • Feature Reduction: 46 → 23 features (50% reduction)
  • Accuracy Change: Β±0.014 percentage points
  • Inference Speedup: 30–51% reduction in latency
  • Training Time: 33–40% reduction
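As a quick sanity check on the quoted range, the XGBoost latencies from the binary results table reproduce the 51% upper bound:

```python
# XGBoost inference latency (µs/sample) from the binary results table:
# 0.727 with the full 46 features vs 0.355 with the selected 23.
latency_46, latency_23 = 0.727, 0.355
reduction_pct = (latency_46 - latency_23) / latency_46 * 100  # ≈ 51%
```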

8-Class Classification

  • Best Model: XGBoost
  • Categories: DDoS, DoS, Recon, WebBased, BruteForce, Spoofing, Mirai, Benign
  • Performance: High discrimination across all attack families

34-Class Classification

  • Challenge: High class imbalance (34 classes)
  • Solution: Stratified undersampling with minimum samples per class
  • Result: Maintained accuracy with optimized feature set
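Capped per-class sampling of this kind can be sketched in a few lines of pandas. `stratified_undersample` is a hypothetical helper for illustration, not the repository's actual function; classes already at or below the cap keep all their rows, which preserves a minimum presence for rare attack types:

```python
import pandas as pd

def stratified_undersample(df: pd.DataFrame, label_col: str,
                           cap: int, seed: int = 42) -> pd.DataFrame:
    # Sample at most `cap` rows per class; smaller classes keep
    # every row they have, so rare attack types are never dropped.
    parts = [grp.sample(n=min(len(grp), cap), random_state=seed)
             for _, grp in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)

# Toy example: a majority class capped at 20, a rare class kept whole.
df = pd.DataFrame({"label": ["DDoS"] * 100 + ["Mirai"] * 5,
                   "bytes": range(105)})
balanced = stratified_undersample(df, "label", cap=20)
```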

🔬 Enhanced Methodology (v3.0.0)

Critical Improvements

  1. Data Leakage Prevention: Train/test split before any feature selection or scaling
  2. Cross-Validation: 5-fold stratified CV for robust performance estimation
  3. Ablation Studies: Direct comparison of full vs. reduced feature sets
  4. Reproducibility: All random seeds fixed, complete logging
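Seed pinning as in point 4 usually looks like the sketch below; the function name `fix_seeds` is illustrative, not taken from the experiment scripts:

```python
import os
import random

import numpy as np

def fix_seeds(seed: int = 42) -> None:
    # Pin every source of randomness so splits, subsampling, and
    # model initialization are identical across reruns.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # The boosting libraries take the seed per estimator, e.g.
    # XGBClassifier(random_state=seed), LGBMClassifier(random_state=seed).
```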

Data Pipeline

  1. Stratified Train/Test Split (80/20)
  2. Feature Selection (LightGBM gain-based on training data only)
  3. Standardization (fit on training, transform on test)
  4. Model Training with 5-fold cross-validation
  5. Comprehensive Evaluation with ablation studies
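The five steps above can be sketched end to end with scikit-learn primitives. This is a minimal, self-contained illustration on synthetic data: the helper name `leakage_free_pipeline` is hypothetical, and `GradientBoostingClassifier` stands in for the LightGBM gain-based selector and XGBoost/LightGBM/CatBoost models used by the actual scripts:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.preprocessing import StandardScaler

def leakage_free_pipeline(X, y, n_features=23, seed=42):
    # 1. Stratified 80/20 split BEFORE any selection or scaling.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    # 2. Importance-based feature selection fit on training data only.
    selector = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    top = np.argsort(selector.feature_importances_)[::-1][:n_features]
    X_tr, X_te = X_tr[:, top], X_te[:, top]

    # 3. Standardize: fit on train, transform train and test.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # 4. 5-fold stratified cross-validation on the training split.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = GradientBoostingClassifier(random_state=seed)
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv)

    # 5. Final fit and a single held-out evaluation.
    model.fit(X_tr, y_tr)
    return cv_scores, model.score(X_te, y_te)

X, y = make_classification(n_samples=400, n_features=46,
                           n_informative=10, random_state=0)
cv_scores, test_acc = leakage_free_pipeline(X, y, n_features=23)
```

The key property is that every fitted transform (selection, scaling) sees only training rows, so the held-out score in step 5 is an honest estimate.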

Models Evaluated

  • Gradient Boosting: XGBoost, LightGBM, CatBoost
  • Evaluation: Accuracy, F1-score, training time, inference latency
  • Validation: 5-fold CV with detailed per-fold metrics

🎨 Generated Visualizations

Each experiment automatically generates:

  • Class distribution plots (original and undersampled)
  • Feature importance analysis (gain-based selection)
  • Classification reports (per-class metrics)
  • Confusion matrices (optimized for class count)
  • Ablation study comparisons (performance vs. efficiency)
  • Cross-validation results (fold-by-fold analysis)

πŸ› οΈ Technical Details

Dataset Information

  • Source: CICIoT2023 dataset from Canadian Institute for Cybersecurity
  • Original Size: ~46.7 million samples
  • Processed Size: 4–5 million samples (balanced)
  • Original Features: 46 network traffic attributes
  • Selected Features: 23 most discriminative features
  • Classes: 33 attack types + benign traffic

Hardware Requirements

  • Minimum: 16GB RAM, CPU-only processing
  • Recommended: 32GB RAM, CUDA-compatible GPU
  • Storage: 10GB for datasets and results

Software Dependencies

See requirements.txt for complete list:

  • numpy>=2.1.2
  • pandas>=2.3.0
  • scikit-learn>=1.6.1
  • xgboost>=3.0.2
  • lightgbm>=4.6.0

📚 Research Context

This enhanced version (v3.0.0) provides more reliable evaluations of gradient boosting algorithms for IoT intrusion detection. Key contributions include:

  1. Methodological Rigor: Proper validation preventing data leakage
  2. Efficiency Analysis: Quantified trade-offs between accuracy and inference speed
  3. Feature Selection: Demonstrated 50% reduction with minimal accuracy impact
  4. Reproducibility: Complete code and preprocessed datasets

🆕 What's New in v3.0.0

  1. Critical Bug Fix: Data leakage in feature selection pipeline
  2. Enhanced Validation: 5-fold cross-validation for all experiments
  3. Detailed Ablation Studies: Direct comparison of feature sets
  4. Improved Code Structure: Modular scripts with command-line interfaces
  5. Comprehensive Logging: Detailed experiment tracking
  6. Better Visualization: Enhanced plots and summary reports

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. Original data were obtained from the CICIoT2023 dataset by the Canadian Institute for Cybersecurity (UNB).

📞 Contact

For questions or collaboration opportunities:

πŸ™ Acknowledgments

  • Canadian Institute for Cybersecurity for providing the CICIoT2023 dataset
  • Prince Sattam Bin Abdulaziz University for computational resources
  • Research collaboration between Taiz University, Isra University, and Sa'adah University
  • The open-source community for excellent machine learning libraries

📋 Version History

  • v3.0.0 (Current): Enhanced version with data leakage fixes, cross-validation, ablation studies
  • v2.1.0: Original supplementary materials release (Zenodo: 17428082)
  • v1.0.0: Initial research implementation

Note: This enhanced version provides more reliable performance evaluations through proper validation methodology. For the original v2.1.0 materials, refer to the Zenodo record https://zenodo.org/records/17428082.

Research Paper: "An Optimized Gradient Boosting Framework for IoT Intrusion Detection: A Comprehensive Evaluation on the CICIoT2023 Dataset"