
# Contributing Guide | 贡献指南

Thank you for your interest in contributing to HPC-AI-Optimization-Lab! This document provides guidelines and instructions for contributing.

感谢您对 HPC-AI-Optimization-Lab 的关注!本文档提供贡献指南和说明。


## Table of Contents | 目录

- [Code of Conduct](#code-of-conduct)
- [Getting Started](#getting-started)
- [Development Setup](#development-setup)
- [Code Style](#code-style)
- [Testing Guidelines](#testing-guidelines)
- [Commit Guidelines](#commit-guidelines)
- [Pull Request Process](#pull-request-process)

## Code of Conduct

This project adheres to the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to the project maintainers.


## Getting Started

### Ways to Contribute

| Type | Description |
| --- | --- |
| 🐛 Bug Reports | Submit issues with reproducible examples |
| 💡 Feature Requests | Propose new kernels or optimizations |
| 📝 Documentation | Improve guides, add examples |
| 🔧 Code | Implement features, fix bugs |
| 🧪 Tests | Add test coverage, improve test quality |

### Before You Start

1. Check existing issues to avoid duplicates
2. For major changes, open a discussion issue first
3. Ensure you have a CUDA-capable GPU and development environment

## Development Setup

### System Requirements

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| CUDA Toolkit | 12.4 | 12.4+ |
| CMake | 3.24 | 3.28+ |
| C++ Compiler | GCC 11 / Clang 14 | GCC 12+ / Clang 15+ |
| GPU | SM 7.0 (Volta) | SM 8.0+ (Ampere) |
| Python (optional) | 3.8 | 3.10+ |
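A quick way to sanity-check an installed tool against these minimums is a `sort -V` comparison. This is only a sketch — `version_ge` is a small throwaway helper, not part of this repo:

```shell
# Helper: succeeds when version $1 >= version $2 (natural version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Check the installed CMake against the 3.24 minimum, if it is on PATH.
if command -v cmake >/dev/null 2>&1; then
    cmake_ver=$(cmake --version | sed -n 's/^cmake version //p')
    if version_ge "$cmake_ver" "3.24"; then
        echo "CMake $cmake_ver meets the minimum"
    else
        echo "CMake $cmake_ver is too old (need 3.24+)"
    fi
else
    echo "cmake not found on PATH"
fi
```

The same pattern works for the CUDA toolkit: `nvcc --version` reports a line like `release 12.4`.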

### Initial Setup

```bash
# Fork and clone
git clone https://github.com/YOUR_USERNAME/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab

# Create development branch
git checkout -b feature/your-feature

# Configure with debug symbols
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Debug \
    -DBUILD_EXAMPLES=ON \
    -DBUILD_PYTHON_BINDINGS=ON

# Build
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure
```

### Docker Development

```bash
cd docker
docker-compose up -d
docker exec -it hpc-ai-lab bash
```

### Pre-commit Hooks

```bash
pip install pre-commit
pre-commit install
```

This runs automatic checks before each commit:

- Code formatting (clang-format)
- Static analysis (clang-tidy)
- Trailing whitespace, file endings
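If you need to reproduce these checks locally, a minimal `.pre-commit-config.yaml` covering them might look like the sketch below. The hook revisions are illustrative only — the config file shipped in this repo is authoritative:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0            # illustrative revision
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v18.1.8           # illustrative revision
    hooks:
      - id: clang-format
        types_or: [c++, cuda]
```

You can run all hooks against the whole tree at any time with `pre-commit run --all-files`.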

## Code Style

### C++/CUDA Guidelines

#### Naming Conventions

| Type | Convention | Example |
| --- | --- | --- |
| Namespaces | snake_case | `hpc::elementwise` |
| Classes/Structs | PascalCase | `Tensor`, `CudaTimer` |
| Functions | snake_case | `copy_from_host`, `relu` |
| Variables | snake_case | `block_size`, `num_threads` |
| Constants | UPPER_SNAKE_CASE | `TILE_SIZE`, `WARP_SIZE` |
| Macros | UPPER_SNAKE_CASE | `CUDA_CHECK` |
| Template Parameters | PascalCase | `typename T`, `int BlockSize` |

#### File Organization

```cpp
// kernel_name.cuh - Header file
#pragma once

#include <cuda_runtime.h>
#include <concepts>

namespace hpc::module {

// Public interface only
template <typename T, OptLevel Level = OptLevel::Default>
void kernel_name(const T* input, T* output, size_t n,
                 cudaStream_t stream = nullptr);

} // namespace hpc::module
```

```cpp
// kernel_name.cu - Implementation file
#include "kernel_name.cuh"
#include "../common/cuda_check.cuh"

namespace hpc::module {

// Anonymous namespace for private kernels
namespace {

template <typename T>
__global__ void kernel_impl(const T* __restrict__ input,
                            T* __restrict__ output, size_t n) {
    // Implementation
}

} // anonymous namespace

// Explicit specializations
template <>
void kernel_name<float, OptLevel::Default>(...) {
    kernel_impl<float><<<grid, block, 0, stream>>>(...);
    CUDA_CHECK_LAST();
}

} // namespace hpc::module
```

#### CUDA Best Practices

```cpp
// ✓ Good: Use restrict qualifier
__global__ void kernel(const float* __restrict__ input,
                       float* __restrict__ output) { }

// ✓ Good: Use constexpr for compile-time constants
constexpr int TILE_SIZE = 32;

// ✓ Good: Use __forceinline__ for small device functions
__device__ __forceinline__ float warp_reduce_sum(float val) { }

// ✓ Good: Use appropriate memory spaces
__shared__ float tile[TILE_SIZE][TILE_SIZE];  // Block-level sharing
__constant__ float params[256];               // Read-only, broadcast

// ✗ Bad: Mutable global state
__device__ int global_counter;  // Avoid if possible
```

#### Modern C++ Features

```cpp
// Use concepts for template constraints
template <typename T>
    requires std::is_same_v<T, float> || std::is_same_v<T, __half>
void kernel_function(const T* input, T* output);

// Use RAII for resource management
hpc::Tensor<float> data(n);  // Automatic cleanup

// Use [[nodiscard]] for important return values
[[nodiscard]] bool almost_equal(float a, float b);

// Use constexpr where possible
constexpr size_t calculate_tile_size(int smem_per_block) {
    return smem_per_block / sizeof(float);
}
```

### Python Style

Follow PEP 8 with these additions:

```python
# Keep bindings thin - just validation and CUDA calls
def relu(input: torch.Tensor, output: torch.Tensor) -> None:
    """
    Apply ReLU activation.

    Args:
        input: Input tensor (CUDA, float32)
        output: Output tensor (CUDA, float32, same shape as input)

    Raises:
        ValueError: If tensors are not CUDA tensors or shapes mismatch
    """
    if not input.is_cuda or not output.is_cuda:
        raise ValueError("Tensors must be on CUDA device")
    if input.shape != output.shape:
        raise ValueError("Input and output shapes must match")

    _cuda_relu(input, output)  # Call C++ binding
```

## Testing Guidelines

### Test Structure

```text
tests/
├── test_utils.hpp          # Shared test utilities
├── module_name/
│   └── test_kernel.cpp     # Kernel-specific tests
```
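Registering a new test binary with CTest typically follows the pattern below. The target and library names are illustrative — check the existing `tests/CMakeLists.txt` for the repo's actual conventions:

```cmake
add_executable(test_kernel module_name/test_kernel.cpp)
target_link_libraries(test_kernel PRIVATE GTest::gtest_main)

# Let CTest discover each TEST() case individually.
include(GoogleTest)
gtest_discover_tests(test_kernel)
```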

### Unit Tests

```cpp
#include <gtest/gtest.h>
#include "elementwise/relu.cuh"
#include "common/tensor.cuh"
#include "../test_utils.hpp"

TEST(ReluTest, BasicCorrectness) {
    // Arrange
    std::vector<float> input = {-1.0f, 0.0f, 1.0f, 2.0f};
    std::vector<float> expected = {0.0f, 0.0f, 1.0f, 2.0f};

    hpc::Tensor<float> d_input(input.size());
    hpc::Tensor<float> d_output(input.size());
    d_input.copy_from_host(input);

    // Act
    hpc::elementwise::relu<float>(d_input.data(), d_output.data(), input.size());
    cudaDeviceSynchronize();

    // Assert
    auto result = d_output.to_host();
    EXPECT_TRUE(hpc::test::vectors_almost_equal(result, expected));
}

TEST(ReluTest, HandlesEmptyInput) {
    hpc::Tensor<float> d_input(0);
    hpc::Tensor<float> d_output(0);

    EXPECT_NO_THROW(
        hpc::elementwise::relu<float>(d_input.data(), d_output.data(), 0)
    );
}
```

### Property-Based Tests

```cpp
#include <rapidcheck/gtest.h>

RC_GTEST_PROP(ReluTest, Idempotent, ()) {
    // Property: relu(relu(x)) == relu(x)
    auto size = *rc::gen::inRange<size_t>(1, 1024);
    auto input = *rc::gen::container<std::vector<float>>(size,
        rc::gen::map(rc::gen::arbitrary<float>(), [](float x) {
            return std::clamp(x, -10.0f, 10.0f);
        }));

    hpc::Tensor<float> d_input(size);
    hpc::Tensor<float> d_output1(size);
    hpc::Tensor<float> d_output2(size);

    d_input.copy_from_host(input);

    hpc::elementwise::relu<float>(d_input.data(), d_output1.data(), size);
    hpc::elementwise::relu<float>(d_output1.data(), d_output2.data(), size);
    cudaDeviceSynchronize();

    auto result1 = d_output1.to_host();
    auto result2 = d_output2.to_host();

    RC_ASSERT(result1 == result2);
}
```

### Performance Benchmarks

```cpp
TEST(GemmTest, PerformanceBenchmark) {
    constexpr int M = 1024, N = 1024, K = 1024;

    hpc::Tensor<float> A(M * K), B(K * N), C(M * N);

    hpc::CudaTimer timer;
    constexpr int WARMUP = 10;
    constexpr int ITERATIONS = 100;

    // Warmup
    for (int i = 0; i < WARMUP; ++i) {
        hpc::gemm::gemm<float>(A.data(), B.data(), C.data(), M, N, K);
    }
    cudaDeviceSynchronize();

    // Benchmark
    timer.start();
    for (int i = 0; i < ITERATIONS; ++i) {
        hpc::gemm::gemm<float>(A.data(), B.data(), C.data(), M, N, K);
    }
    timer.stop();

    float avg_time_ms = timer.elapsed_ms() / ITERATIONS;
    float tflops = 2.0 * M * N * K / (avg_time_ms * 1e9);

    std::cout << "Average time: " << avg_time_ms << " ms\n";
    std::cout << "Performance: " << tflops << " TFLOPS\n";
}
```

## Commit Guidelines

### Conventional Commits

We follow Conventional Commits:

```text
<type>(<scope>): <description>

[optional body]

[optional footer]
```

### Types

| Type | Description | Example |
| --- | --- | --- |
| feat | New feature | `feat(gemm): add FP8 support` |
| fix | Bug fix | `fix(softmax): handle empty input` |
| docs | Documentation | `docs: update README installation steps` |
| style | Formatting | `style: fix clang-format issues` |
| refactor | Code restructuring | `refactor(tensor): use RAII pattern` |
| perf | Performance | `perf(gemm): optimize register usage` |
| test | Tests | `test(attention): add property-based tests` |
| chore | Build/tooling | `chore: update CUDA version in Docker` |

### Examples

```bash
# Good commits
git commit -m "feat(gemm): add Tensor Core WMMA implementation"
git commit -m "fix(softmax): resolve numerical overflow for large inputs"
git commit -m "docs: add FlashAttention optimization guide"

# With body: read the full message (subject line included) from stdin,
# since a heredoc piped to `git commit -m` would be ignored
git commit -F - <<'EOF'
perf(gemm): implement double buffering

- Add double buffer shared memory
- Overlap memory load with computation
- 1.75x speedup on A100

Benchmarks: M=N=K=4096, time reduced from 2.0ms to 1.14ms
EOF
```

## Pull Request Process

### Before Submitting

- [ ] Code compiles without warnings
- [ ] All tests pass locally (`ctest --test-dir build --output-on-failure`)
- [ ] New code has test coverage
- [ ] Documentation updated if needed
- [ ] Commit messages follow conventional commits
- [ ] Branch is up to date with `main`
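Bringing a branch up to date is a `git fetch` plus a `git rebase` (or a merge, if you prefer merge commits). The throwaway-repo sketch below walks through the rebase flow end to end; in a real checkout you would rebase onto `upstream/main`, where `upstream` is an assumed name for the remote pointing at the main repository:

```shell
# Demonstrate the rebase flow in a scratch repository.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main demo && cd demo
git config user.email dev@example.com
git config user.name Dev

echo base > file.txt && git add file.txt && git commit -qm "chore: base"

# Feature branch with one commit.
git checkout -qb feature/your-feature
echo feature > feature.txt && git add feature.txt && git commit -qm "feat: work"

# Meanwhile, main moves ahead.
git checkout -q main
echo update >> file.txt && git commit -qam "fix: upstream change"

# Replay the feature commit on top of the new main.
git checkout -q feature/your-feature
git rebase -q main

git log --oneline   # the feature commit now sits on top of both main commits
```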

### PR Template

```markdown
## Description
Brief description of changes.

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Documentation update
- [ ] Performance improvement
- [ ] Refactoring

## Testing
Describe tests added/modified.

## Checklist
- [ ] Tests pass locally
- [ ] Documentation updated
- [ ] No compiler warnings
```

### Review Process

1. **Automated Checks**: CI runs formatting, linting, documentation builds
2. **Code Review**: At least one maintainer reviews
3. **Testing**: Reviewer verifies functionality on GPU machine if needed
4. **Approval**: Maintainer approves and merges

### After Merge

- Delete your feature branch
- Update your local `main` branch
- Look for your contribution in the next release notes

## Questions?


Thank you for contributing! 🎉