A comprehensive machine learning project for detecting IoT network intrusions using optimized gradient boosting algorithms. This repository contains version 3.0.0 with critical improvements in validation methodology, data leakage prevention, and comprehensive ablation studies.
This enhanced version addresses the critical challenge of IoT security by developing robust intrusion detection systems using state-of-the-art gradient boosting algorithms. The updated framework includes proper cross-validation, data leakage fixes, and detailed ablation studies to provide more reliable performance evaluations.
- Critical Data Leakage Fix: Feature selection and scaling now performed only on training data
- Enhanced Validation: 5-fold stratified cross-validation for all models
- Comprehensive Ablation Studies: Direct comparison of full vs. reduced feature sets
- Multiple Classification Tasks: Binary, 8-class, and 34-class classification
- Advanced Algorithms: XGBoost, LightGBM, CatBoost with optimized hyperparameters
- Feature Selection: 50% feature reduction (46 → 23) with minimal accuracy impact
- Performance Metrics: Accuracy, F1-score, training time, and inference latency analysis
```
Supplementary_Materials_v3.0.0/
├── README.md                                   # This file
├── LICENSE                                     # CC BY 4.0 License
├── requirements.txt                            # Python dependencies
├── experiment_scripts/
│   ├── binary_classification_experiment.py     # Binary classification (Benign vs Attack)
│   ├── 8class_classification_experiment.py     # 8-class attack categorization
│   └── 34class_classification_experiment.py    # 34-class fine-grained classification
├── datasets/
│   ├── datasets.tar.xz                         # Compressed preprocessed datasets
│   ├── binary_train_23features.csv             # Binary training data (23 features)
│   ├── binary_test_23features.csv              # Binary testing data (23 features)
│   ├── 8class_train_23features.csv             # 8-class training data
│   ├── 8class_test_23features.csv              # 8-class testing data
│   ├── 34class_train_23features.csv            # 34-class training data
│   └── 34class_test_23features.csv             # 34-class testing data
├── results/                                    # (generated during execution)
│   ├── binary_classification/
│   │   ├── ablation_study/
│   │   ├── classification_reports/
│   │   ├── confusion_matrices/
│   │   ├── distributions/
│   │   ├── feature_selection/
│   │   ├── fold_results/
│   │   └── summary/
│   ├── 8class_classification/                  # (similar structure)
│   ├── 34class_classification/                 # (similar structure)
│   └── final_results.xlsx                      # Comprehensive summary
└── figures/                                    # (generated during execution)
    ├── figures_binary/
    ├── figures_8/
    └── figures_multi/
```
- Python 3.8+
- 16GB RAM minimum (32GB recommended for full experiments)
- CUDA-compatible GPU (optional but recommended for acceleration)
- Clone the repository:

```bash
git clone https://github.com/sa1ah-ai/iot-intrusion-detection-gradient-boosting.git
cd iot-intrusion-detection-gradient-boosting
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Extract the datasets:

```bash
cd datasets
tar -xJf datasets.tar.xz
```

- Run the experiments:

```bash
# Binary classification (Benign vs Attack)
python experiment_scripts/binary_classification_experiment.py

# 8-class classification (Attack families)
python experiment_scripts/8class_classification_experiment.py

# 34-class classification (All attack types)
python experiment_scripts/34class_classification_experiment.py
```

| Algorithm | Accuracy | Macro-F1 | Inference Latency | Features |
|---|---|---|---|---|
| XGBoost (23 features) | 99.61% | 0.995 | 0.355 µs/sample | 23 |
| XGBoost (46 features) | 99.62% | 0.995 | 0.727 µs/sample | 46 |
| LightGBM (23 features) | 99.51% | 0.994 | 0.412 µs/sample | 23 |
| CatBoost (23 features) | 99.57% | 0.995 | 0.521 µs/sample | 23 |
- Feature Reduction: 46 → 23 features (50% reduction)
- Accuracy Change: ±0.014 percentage points
- Inference Speedup: 30β51% reduction in latency
- Training Time: 33β40% reduction
- Best Model: XGBoost
- Categories: DDoS, DoS, Recon, WebBased, BruteForce, Spoofing, Mirai, Benign
- Performance: High discrimination across all attack families
- Challenge: High class imbalance (34 classes)
- Solution: Stratified undersampling with minimum samples per class
- Result: Maintained accuracy with optimized feature set
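The repository's exact undersampling routine lives in the experiment scripts; a minimal sketch of the idea (cap over-represented classes while enforcing a per-class floor; the `cap` and `min_samples` values here are illustrative, not the paper's settings) could look like:

```python
import numpy as np

def stratified_undersample(y, cap, min_samples, seed=0):
    """Return sorted indices that cap each class at `cap` samples,
    keeping only classes with at least `min_samples` examples."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) < min_samples:
            continue  # drop classes too rare to stratify reliably
        take = min(len(idx), cap)
        keep.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(keep))

# Toy label vector: class 0 is heavily over-represented, class 2 is too rare.
y = np.array([0] * 1000 + [1] * 120 + [2] * 3)
idx = stratified_undersample(y, cap=200, min_samples=10)
counts = dict(zip(*np.unique(y[idx], return_counts=True)))
print(counts)  # class 0 capped at 200, class 1 kept whole, class 2 dropped
```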
- Data Leakage Prevention: Train/test split before any feature selection or scaling
- Cross-Validation: 5-fold stratified CV for robust performance estimation
- Ablation Studies: Direct comparison of full vs. reduced feature sets
- Reproducibility: All random seeds fixed, complete logging
- Stratified Train/Test Split (80/20)
- Feature Selection (LightGBM gain-based on training data only)
- Standardization (fit on training, transform on test)
- Model Training with 5-fold cross-validation
- Comprehensive Evaluation with ablation studies
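The split-then-select-then-scale order above is what prevents leakage: every statistic is estimated on the training split only. An illustrative reconstruction on synthetic data (the repository uses LightGBM gain importance; a scikit-learn ensemble stands in here, and all hyperparameters are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=46, n_informative=12,
                           random_state=42)

# 1. Stratified 80/20 split BEFORE any selection or scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 2. Importance-based selection of the top 23 features, fit on train only.
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=30, random_state=42),
    max_features=23, threshold=-np.inf).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# 3. Standardization: fit on train, transform both splits.
scaler = StandardScaler().fit(X_tr_sel)
X_tr_std, X_te_std = scaler.transform(X_tr_sel), scaler.transform(X_te_sel)

print(X_tr_std.shape, X_te_std.shape)  # (1600, 23) (400, 23)
```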
- Gradient Boosting: XGBoost, LightGBM, CatBoost
- Evaluation: Accuracy, F1-score, training time, inference latency
- Validation: 5-fold CV with detailed per-fold metrics
Each experiment automatically generates:
- Class distribution plots (original and undersampled)
- Feature importance analysis (gain-based selection)
- Classification reports (per-class metrics)
- Confusion matrices (optimized for class count)
- Ablation study comparisons (performance vs. efficiency)
- Cross-validation results (fold-by-fold analysis)
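The classification reports and confusion matrices are produced with standard scikit-learn utilities; a toy illustration with hypothetical predictions over three of the class names used in this project:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels/predictions for three of the traffic classes.
labels = ["Benign", "DDoS", "Mirai"]
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred, target_names=labels))
```

In the actual experiments the same calls are applied to the held-out test split, and the matrices are rendered as heatmaps sized to the class count (2, 8, or 34).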
- Source: CICIoT2023 dataset from Canadian Institute for Cybersecurity
- Original Size: ~46.7 million samples
- Processed Size: 4-5 million samples (balanced)
- Original Features: 46 network traffic attributes
- Selected Features: 23 most discriminative features
- Classes: 33 attack types + benign traffic
- Minimum: 16GB RAM, CPU-only processing
- Recommended: 32GB RAM, CUDA-compatible GPU
- Storage: 10GB for datasets and results
See requirements.txt for complete list:
- numpy>=2.1.2
- pandas>=2.3.0
- scikit-learn>=1.6.1
- xgboost>=3.0.2
- lightgbm>=4.6.0
This enhanced version (v3.0.0) provides more reliable evaluations of gradient boosting algorithms for IoT intrusion detection. Key contributions include:
- Methodological Rigor: Proper validation preventing data leakage
- Efficiency Analysis: Quantified trade-offs between accuracy and inference speed
- Feature Selection: Demonstrated 50% reduction with minimal accuracy impact
- Reproducibility: Complete code and preprocessed datasets
- Critical Bug Fix: Data leakage in feature selection pipeline
- Enhanced Validation: 5-fold cross-validation for all experiments
- Detailed Ablation Studies: Direct comparison of feature sets
- Improved Code Structure: Modular scripts with command-line interfaces
- Comprehensive Logging: Detailed experiment tracking
- Better Visualization: Enhanced plots and summary reports
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. Original data were obtained from the CICIoT2023 dataset by the Canadian Institute for Cybersecurity (UNB).
For questions or collaboration opportunities:
- GitHub Issues: [Repository Issues Page]
- Corresponding Author: Adel A. Nasser (adel@saada-uni.edu.ye)
- Canadian Institute for Cybersecurity for providing the CICIoT2023 dataset
- Prince Sattam Bin Abdulaziz University for computational resources
- Research collaboration between Taiz University, Isra University, and Sa'adah University
- The open-source community for excellent machine learning libraries
- v3.0.0 (Current): Enhanced version with data leakage fixes, cross-validation, ablation studies
- v2.1.0: Original supplementary materials release (Zenodo: 17428082)
- v1.0.0: Initial research implementation
Note: This enhanced version provides more reliable performance evaluations through proper validation methodology. For the original v2.1.0 materials, refer to the Zenodo record https://zenodo.org/records/17428082.
Research Paper: "An Optimized Gradient Boosting Framework for IoT Intrusion Detection: A Comprehensive Evaluation on the CICIoT2023 Dataset"