HPC-AI-Optimization-Lab is an educational and production-ready CUDA kernel library designed for AI inference workloads. It provides step-by-step optimized implementations of critical GPU operations, from basic elementwise operations to advanced Tensor Core matrix multiplication.
- 📚 Progressive Learning Path: Each module demonstrates optimization techniques from naive to expert-level
- 🔬 Production Quality: All kernels include comprehensive test coverage (GoogleTest + RapidCheck)
- 🚀 Modern C++20: Uses concepts, RAII, and contemporary design patterns
- 🐍 Python Bindings: Optional nanobind-based Python interface for rapid prototyping
hpc-ai-optimization-lab/
├── src/
│ ├── common/ # Shared utilities (Tensor, Timer, CUDA checks)
│ ├── 01_elementwise/ # ReLU, Sigmoid, Vector Add, Transpose
│ ├── 02_reduction/ # Softmax, LayerNorm, RMSNorm
│ ├── 03_gemm/ # 7-step GEMM optimization journey
│ ├── 04_convolution/ # Implicit GEMM, Winograd convolution
│ ├── 05_attention/ # FlashAttention, RoPE, TopK
│ ├── 06_quantization/ # INT8/FP8 quantization utilities
│ └── 07_cuda13_features/ # Experimental Hopper architecture features
├── tests/ # Comprehensive test suite
├── examples/ # CUDA and Python examples
├── python/ # Nanobind Python bindings
└── docs/ # Technical documentation
| Requirement | Version |
|---|---|
| CUDA Toolkit | 12.4+ |
| CMake | 3.24+ |
| C++ Compiler | GCC 11+ / Clang 14+ / MSVC 2022+ |
| NVIDIA GPU | Compute Capability 7.0+ |
# Clone repository
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab
# Configure and build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Run tests
ctest --test-dir build --output-on-failure

# Build examples
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build --target relu_example gemm_benchmark
# Run examples
./build/examples/relu_example
./build/examples/gemm_benchmark

# Build Python bindings
cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build
export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
# Verify installation
python -c "import hpc_ai_opt; print('Module loaded successfully!')"
# Run Python example
python examples/python/basic_usage.py

Optimizations covered:
- Naive implementation
- Vectorized loads/stores (`float4`)
- Grid stride loops for arbitrary input sizes
- Shared memory for transpose operations
#include "01_elementwise/relu.cuh"
// Using optimized Grid Stride implementation
hpc::elementwise::relu<float, hpc::elementwise::OptLevel::GridStride>(
d_input, d_output, n, stream);

Optimizations covered:
- Warp shuffle primitives
- Block-level reduction
- Online Softmax algorithm
- Welford's algorithm for numerical stability
#include "02_reduction/softmax.cuh"
// Online Softmax - single pass algorithm
hpc::reduction::softmax<float, hpc::reduction::SoftmaxOpt::OnlineSoftmax>(
d_input, d_output, batch, seq_len, stream);

The flagship module demonstrating progressive GEMM optimization:
| Step | Technique | FP32 TFLOPS | Key Insight |
|---|---|---|---|
| 1 | Naive | ~0.5 | Baseline - each thread computes one element |
| 2 | Shared Memory Tiling | ~2.0 | Reduce global memory access by TILE_SIZE |
| 3 | Double Buffering | ~3.5 | Hide memory latency with computation overlap |
| 4 | Register Tiling | ~6.0 | Reduce shared memory bank conflicts |
| 5 | Tensor Core WMMA | ~50+ | Hardware-accelerated matrix operations |
| 6 | Tensor Core MMA PTX | ~60+ | Fine-grained Tensor Core control |
| 7 | Software Pipelining | ~70+ | Multi-stage execution overlap |
#include "03_gemm/gemm.cuh"
// Using Tensor Core optimization
hpc::gemm::gemm<__half, hpc::gemm::GemmOpt::TensorCoreWMMA>(
d_A, d_B, d_C, M, N, K, 1.0f, 0.0f, stream);

- Implicit GEMM convolution (validated, production-ready)
- Winograd convolution (3×3 kernels, experimental fallback)
- FlashAttention forward pass with online softmax
- RoPE (Rotary Positional Embedding)
- MoE TopK routing
#include "05_attention/flash_attention.cuh"
hpc::attention::FlashAttnConfig config{
.batch_size = batch,
.num_heads = heads,
.seq_len = seq_len,
.head_dim = 64,
.scale = 1.0f / std::sqrt(64.0f),
.causal = true
};
hpc::attention::flash_attention_forward<float>(
d_Q, d_K, d_V, d_O, config, stream);

- INT8 per-row quantization/dequantization
- FP8 scaling utilities (placeholder for future Hopper support)
Note: These modules provide educational examples and fallback implementations. Full Hopper feature support requires SM 9.0+.
- TMA (Tensor Memory Accelerator) - async copy fallback
- Thread Block Clusters - portable reduction fallback
- FP8 GEMM - scaled FP16 demonstration
The project uses a two-tier testing strategy:
# Run all tests
ctest --test-dir build --output-on-failure
# Run specific test suite
./build/tests/gemm/test_gemm

Property-based tests automatically generate test cases to find edge cases:
RC_GTEST_PROP(GemmTest, Correctness, ()) {
auto M = *rc::gen::inRange<int>(1, 64);
auto N = *rc::gen::inRange<int>(1, 64);
auto K = *rc::gen::inRange<int>(1, 64);
// ... automatically verifies correctness for all combinations
}

| Document | Description | Difficulty |
|---|---|---|
| GEMM Optimization | 7-step matrix optimization journey | ⭐⭐⭐⭐ |
| Memory Optimization | Coalesced access, vectorization, shared memory | ⭐⭐ |
| Reduction Optimization | Warp shuffle, online algorithms | ⭐⭐⭐ |
| FlashAttention | IO-aware attention, tiling, online softmax | ⭐⭐⭐⭐ |
| CUDA 13 Features | Hopper architecture: TMA, Clusters, FP8 | ⭐⭐⭐⭐⭐ |
| API Reference | Complete C++/CUDA/Python API docs | ⭐⭐⭐ |
| Architecture | Design patterns and module organization | ⭐⭐ |
Beginner (1-2 weeks):
└── Memory Optimization → Reduction → GEMM (Steps 1-4)
Intermediate (2-4 weeks):
└── GEMM (Steps 5-7) → FlashAttention
Advanced (ongoing):
└── CUDA 13 Features → CUTLASS source code → Research papers
| CMake Option | Default | Description |
|---|---|---|
| `BUILD_EXAMPLES` | `OFF` | Build CUDA and Python examples |
| `BUILD_PYTHON_BINDINGS` | `OFF` | Build nanobind Python module |
| `CMAKE_CUDA_ARCHITECTURES` | `native` | Target GPU architecture(s) |
# Development build with all features
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DBUILD_EXAMPLES=ON \
-DBUILD_PYTHON_BINDINGS=ON
# Release build for specific GPU
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="80;90"  # A100 + H100

cd docker
docker-compose up -d
docker exec -it hpc-ai-lab bash

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
CI Scope Note: This repository does not currently provide full native CUDA build-and-test coverage in CI. The CI pipeline focuses on code formatting, consistency checks, and documentation builds. GPU-dependent tests require local execution or self-hosted runners.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make changes and add tests
- Ensure tests pass locally (`ctest --test-dir build --output-on-failure`)
- Commit with conventional commits (`git commit -m 'feat: add amazing feature'`)
- Push and create a Pull Request
This project is licensed under the Apache License 2.0 - see LICENSE for details.
- NVIDIA CUTLASS - Reference implementations
- FlashAttention - Attention optimization techniques
- How to Optimize a CUDA Matmul Kernel - Excellent tutorial
| Module | FP32 | FP16 | INT8 | Notes |
|---|---|---|---|---|
| Elementwise | ✅ | ✅ | - | All optimization levels |
| Reduction | ✅ | ✅ | - | Online algorithms |
| GEMM | ✅ | ✅ | ✅ | 7-step progression |
| Convolution | ✅ | - | - | Implicit GEMM validated |
| Attention | ✅ | - | - | head_dim=64 only |
| Quantization | ✅ | - | ✅ | Per-row scaling |
| Module | Status | Notes |
|---|---|---|
| Winograd Conv | Fallback | Uses implicit GEMM path |
| TMA | Fallback | Uses async copy instead |
| Thread Block Clusters | Fallback | Uses block reduction |
| FP8 GEMM | Demo | Scaled FP16 behavior |
Happy Learning! 🚀