Extreme Infrastructure for GRPO & Large-Scale Reinforcement Learning.
RL-Kernel is a high-performance, memory-efficient infrastructure for Reinforcement Learning post-training. It eliminates the memory and latency bottlenecks in Large Language Model alignment, This project targets AI infrastructure engineers, algorithm researchers, and enterprise-level large model alignment scenarios, providing specialized kernels for algorithms like GRPO, PPO, and DPO.
1. Operator-Level Train-Inference Consistency The biggest hidden barrier in large-scale RL is the subtle numerical divergence between rollout engines (e.g., vLLM) and training engines (e.g., Megatron/DeepSpeed). RL-Kernel provides mathematically rigorous, fused operators that lock down the computational graph. By guaranteeing absolute numerical consistency and deterministic reduction orders across the entire RL loop, we prevent reward hacking and distribution drift at the operator level.
2. Extreme Memory & Compute Efficiency
We replace naive PyTorch paths—which suffer from prefix_shared_attention and fused_logp). This reduces VRAM consumption by up to 10x, unlocking massive batch sizes for GRPO workloads without triggering Out-Of-Memory (OOM) errors.
RL-Kernel sits strictly at the operator layer, acting as a non-intrusive bridge between high-level alignment orchestration (e.g., vime, slime) and foundational execution engines. We ensure maximum throughput and rigorous numerical parity without modifying upstream framework source code.
Note: RL-Kernel integrates natively into Rollout Engines (vLLM, sglang, LMDeploy) and Training Engines (Megatron, DeepSpeed) via non-intrusive custom operator hooks, powered by underlying CUDA, Triton, and ROCm backends.
RL-Kernel is designed to solve the
By implementing Pre-allocated Chunking, RL-Kernel maintains constant additional VRAM overhead regardless of the group size (
Testbed: NVIDIA A100 80GB | Model: Llama-3-8B | Vocab: 128,256 | SeqLen: 512
| Group Size ( |
TRL (Standard) | PyTorch Native | RL-Kernel (Ours) | Status |
|---|---|---|---|---|
| G = 64 | OOM | 15.66 GB | 16.15 GB | Success |
| G = 128 | OOM | 31.31 GB | 31.80 GB | Success |
| G = 256 | FAILED (OOM) | 62.63 GB | 63.12 GB | Optimized |
Note: RL-Kernel is the only solution that successfully scales G=256 on a single A100 by keeping extra VRAM usage to a constant ~0.5GB.
Integrating FlashInfer fused kernels to accelerate the bottleneck of RL training: the sampling phase.
| Batch Size ( |
Native PyTorch | RL-Kernel (Fused) | Speedup |
|---|---|---|---|
| 32 | 176.79 ms | 1.08 ms | 163x |
| 64 | 10.54 ms | 1.31 ms | 8x |
| 128 | 18.89 ms | 1.86 ms | 10x |
| 256 | 36.23 ms | 2.94 ms | 12x |
Testbed: NVIDIA A100 80GB | Model: Qwen3-30B-A3B | Vocab: 151,936 | dtype: fp16
Model weights consume 56.9 GB — only 23 GB headroom remaining for training computation.
- Zero-Growth Memory Pool: Uses pre-allocated buffers and micro-chunking to prevent VRAM spikes during advantage calculation.
- Fused Sampling Pipeline: Direct integration with FlashInfer and vLLM backends for sub-2ms sampling latency.
- Universal Backend Abstraction: Unified API supporting both NVIDIA (CUDA/FlashInfer) and AMD (ROCm/AITER).
- Post-Training Ready: Drop-in replacement for standard sampling and logprob operators in TRL or DeepSpeed-Chat.
RL-Kernel sits between high-level alignment libraries and low-level GPU kernels, ensuring maximum throughput without sacrificing flexibility.
# Clone the repository
git clone https://github.com/RL-Align/RL-Kernel.git
cd RL-Kernel
# Install core dependencies (CUDA 12.4+ recommended)
pip install -e .Inspired by the kernel designs of vLLM and DeepSpeed. As an active contributor to the AI Infrastructure ecosystem, RL-Kernel aims to push the boundaries of RL efficiency.
Target: Building the most efficient RLHF toolchain for the open-source community.
RL-Kernel builds on the shoulders of excellent open-source projects:
- FlashInfer — We integrate FlashInfer's fused sampling kernels as the NVIDIA backend for our sampling pipeline. The sub-2ms sampling latency results are enabled by FlashInfer's highly optimized CUDA operators.
- vLLM — Inspired by vLLM's kernel design philosophy and hardware-aware scheduling approach.
- DeepSpeed — Inspired by DeepSpeed's approach to memory-efficient training infrastructure.
We are grateful to these teams for their contributions to the open-source AI infrastructure ecosystem.



