HPC-AI-Optimization-Lab is an educational and production-ready CUDA kernel library designed for AI inference workloads. It provides step-by-step optimized implementations of critical GPU operations, from basic elementwise operations to advanced Tensor Core matrix multiplication.
- 📚 Progressive Learning Path: Each module demonstrates optimization techniques from naive to expert-level
- 🔬 Production Quality: All kernels include comprehensive test coverage (GoogleTest + RapidCheck)
- 🚀 Modern C++20: Uses concepts, RAII, and contemporary design patterns
- 🐍 Python Bindings: Optional nanobind-based Python interface for rapid prototyping
hpc-ai-optimization-lab/
├── src/
│ ├── common/ # Shared utilities (Tensor, Timer, CUDA checks)
│ ├── 01_elementwise/ # ReLU, Sigmoid, Vector Add, Transpose
│ ├── 02_reduction/ # Softmax, LayerNorm, RMSNorm
│ ├── 03_gemm/ # 7-step GEMM optimization journey
│ ├── 04_convolution/ # Implicit GEMM, Winograd convolution
│ ├── 05_attention/ # FlashAttention, RoPE, TopK
│ ├── 06_quantization/ # INT8/FP8 quantization utilities
│ └── 07_cuda13_features/ # Experimental Hopper architecture features
├── tests/ # Comprehensive test suite
├── examples/ # CUDA and Python examples
├── python/ # Nanobind Python bindings
└── docs/ # Technical documentation
| Requirement | Version |
|---|---|
| CUDA Toolkit | 12.4+ |
| CMake | 3.24+ |
| C++ Compiler | GCC 11+ / Clang 14+ / MSVC 2022+ |
| NVIDIA GPU | Compute Capability 7.0+ |
# Clone repository
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab
# Configure and build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Run tests
ctest --test-dir build --output-on-failure

# Build examples
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build --target relu_example gemm_benchmark
# Run examples
./build/examples/relu_example
./build/examples/gemm_benchmark

# Build Python bindings
cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build
export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
# Verify installation
python -c "import hpc_ai_opt; print('Module loaded successfully!')"
# Run Python example
python examples/python/basic_usage.py

Optimizations covered:
- Naive implementation
- Vectorized loads/stores (`float4`)
- Grid stride loops for arbitrary input sizes
- Shared memory for transpose operations
#include "01_elementwise/relu.cuh"
// Using optimized Grid Stride implementation
hpc::elementwise::relu<float, hpc::elementwise::OptLevel::GridStride>(
d_input, d_output, n, stream);

Optimizations covered:
- Warp shuffle primitives
- Block-level reduction
- Online Softmax algorithm
- Welford's algorithm for numerical stability
#include "02_reduction/softmax.cuh"
// Online Softmax - single pass algorithm
hpc::reduction::softmax<float, hpc::reduction::SoftmaxOpt::OnlineSoftmax>(
d_input, d_output, batch, seq_len, stream);

The flagship module demonstrating progressive GEMM optimization:
| Step | Technique | FP32 TFLOPS | Key Insight |
|---|---|---|---|
| 1 | Naive | ~0.5 | Baseline - each thread computes one element |
| 2 | Shared Memory Tiling | ~2.0 | Reduce global memory access by TILE_SIZE |
| 3 | Double Buffering | ~3.5 | Hide memory latency with computation overlap |
| 4 | Register Tiling | ~6.0 | Reduce shared memory bank conflicts |
| 5 | Tensor Core WMMA | ~50+ | Hardware-accelerated matrix operations |
| 6 | Tensor Core MMA PTX | ~60+ | Fine-grained Tensor Core control |
| 7 | Software Pipelining | ~70+ | Multi-stage execution overlap |
#include "03_gemm/gemm.cuh"
// Using Tensor Core optimization
hpc::gemm::gemm<__half, hpc::gemm::GemmOpt::TensorCoreWMMA>(
d_A, d_B, d_C, M, N, K, 1.0f, 0.0f, stream);

- Implicit GEMM convolution (validated, production-ready)
- Winograd convolution (3×3 kernels, experimental fallback)
- FlashAttention forward pass with online softmax
- RoPE (Rotary Positional Embedding)
- MoE TopK routing
#include "05_attention/flash_attention.cuh"
hpc::attention::FlashAttnConfig config{
.batch_size = batch,
.num_heads = heads,
.seq_len = seq_len,
.head_dim = 64,
.scale = 1.0f / std::sqrt(64.0f),
.causal = true
};
hpc::attention::flash_attention_forward<float>(
d_Q, d_K, d_V, d_O, config, stream);

- INT8 per-row quantization/dequantization
- FP8 scaling utilities (placeholder for future Hopper support)
Note: These modules provide educational examples and fallback implementations. Full Hopper feature support requires SM 9.0+.
- TMA (Tensor Memory Accelerator) - async copy fallback
- Thread Block Clusters - portable reduction fallback
- FP8 GEMM - scaled FP16 demonstration
The project uses a two-tier testing strategy:
# Run all tests
ctest --test-dir build --output-on-failure
# Run specific test suite
./build/tests/gemm/test_gemm

Property-based tests automatically generate test cases to find edge cases:
RC_GTEST_PROP(GemmTest, Correctness, ()) {
auto M = *rc::gen::inRange<int>(1, 64);
auto N = *rc::gen::inRange<int>(1, 64);
auto K = *rc::gen::inRange<int>(1, 64);
// ... automatically verifies correctness for all combinations
}

| Document | Description | Difficulty |
|---|---|---|
| GEMM Optimization | 7-step matrix optimization journey | ⭐⭐⭐⭐ |
| Memory Optimization | Coalesced access, vectorization, shared memory | ⭐⭐ |
| Reduction Optimization | Warp shuffle, online algorithms | ⭐⭐⭐ |
| FlashAttention | IO-aware attention, tiling, online softmax | ⭐⭐⭐⭐ |
| CUDA 13 Features | Hopper architecture: TMA, Clusters, FP8 | ⭐⭐⭐⭐⭐ |
| API Reference | Complete C++/CUDA/Python API docs | ⭐⭐⭐ |
| Architecture | Design patterns and module organization | ⭐⭐ |
Beginner (1-2 weeks):
└── Memory Optimization → Reduction → GEMM (Steps 1-4)
Intermediate (2-4 weeks):
└── GEMM (Steps 5-7) → FlashAttention
Advanced (ongoing):
└── CUDA 13 Features → CUTLASS source code → Research papers
| CMake Option | Default | Description |
|---|---|---|
| `BUILD_EXAMPLES` | `OFF` | Build CUDA and Python examples |
| `BUILD_PYTHON_BINDINGS` | `OFF` | Build nanobind Python module |
| `CMAKE_CUDA_ARCHITECTURES` | `native` | Target GPU architecture(s) |
# Development build with all features
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DBUILD_EXAMPLES=ON \
-DBUILD_PYTHON_BINDINGS=ON
# Release build for specific GPU
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="80;90"  # A100 + H100

cd docker
docker-compose up -d
docker exec -it hpc-ai-lab bash

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
CI Scope Note: This repository does not currently provide full native CUDA build-and-test coverage in CI. The CI pipeline focuses on code formatting, consistency checks, and documentation builds. GPU-dependent tests require local execution or self-hosted runners.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make changes and add tests
- Ensure tests pass locally (`ctest --test-dir build --output-on-failure`)
- Commit with conventional commits (`git commit -m 'feat: add amazing feature'`)
- Push and create a Pull Request
This project is licensed under the Apache License 2.0 - see LICENSE for details.
- NVIDIA CUTLASS - Reference implementations
- FlashAttention - Attention optimization techniques
- How to Optimize a CUDA Matmul Kernel - Excellent tutorial
| Module | FP32 | FP16 | INT8 | Notes |
|---|---|---|---|---|
| Elementwise | ✅ | ✅ | - | All optimization levels |
| Reduction | ✅ | ✅ | - | Online algorithms |
| GEMM | ✅ | ✅ | ✅ | 7-step progression |
| Convolution | ✅ | - | - | Implicit GEMM validated |
| Attention | ✅ | - | - | head_dim=64 only |
| Quantization | ✅ | - | ✅ | Per-row scaling |
| Module | Status | Notes |
|---|---|---|
| Winograd Conv | Fallback | Uses implicit GEMM path |
| TMA | Fallback | Uses async copy instead |
| Thread Block Clusters | Fallback | Uses block reduction |
| FP8 GEMM | Demo | Scaled FP16 behavior |
Happy Learning! 🚀