Motivation
LMDeploy already provides PD disaggregation (DistServe) on the PyTorch engine via lmdeploy/pytorch/disagg/, with Mooncake as an optional KV migration backend (MooncakeBackend, using the Python mooncake.engine.TransferEngine).
TurboMind (C++) remains a high-performance path, but it is not integrated with Mooncake today, so Prefill/Decode-split deployments cannot combine TurboMind's paged KV cache and kernel stack with cross-node KV migration.
We would like to integrate Mooncake's Transfer Engine (or an equivalent C++ SDK) into TurboMind's C++ layer to:
- Align block lifecycle with SequenceManager / BlockManager (including optional prefix caching and consistent KV quantization layout);
- Asynchronously export KV after prefill on the prefill side, asynchronously pull on the decode side, and overlap transfer with compute;
- Align or stay compatible with Conductor / metadata protocols used on the PyTorch side to avoid diverging semantics.
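As a design sketch of the second bullet (asynchronous export after prefill, asynchronous pull on decode, overlapped with compute), here is a minimal Python model. `LoopbackBackend`, `export_blocks`, and `pull_blocks` are invented names standing in for a future C++ wrapper around Mooncake's Transfer Engine; nothing here is an existing LMDeploy or Mooncake API.

```python
# Minimal model of overlapping KV transfer with compute.
# All names are placeholders for a hypothetical C++ backend.
from concurrent.futures import ThreadPoolExecutor, Future

class LoopbackBackend:
    """Stands in for a transfer-engine wrapper; moves bytes in-process."""
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._store = {}

    def export_blocks(self, seq_id, blocks) -> Future:
        # Prefill side: publish KV blocks asynchronously, return immediately.
        return self._pool.submit(self._store.__setitem__, seq_id, list(blocks))

    def pull_blocks(self, seq_id) -> Future:
        # Decode side: fetch KV in the background while other sequences decode.
        return self._pool.submit(self._store.get, seq_id)

backend = LoopbackBackend()
fut = backend.export_blocks("req-0", [b"k0v0", b"k1v1"])
fut.result()                               # prefill can serve the next request meanwhile
pulled = backend.pull_blocks("req-0").result()
```

The futures mark the only synchronization points: the scheduler can keep a sequence out of the decode batch until its pull future resolves, which is exactly the state-machine question raised under "Additional context" below.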
Goal: enable end-to-end Mooncake-based PD disaggregation on TurboMind with minimal impact on throughput.
Related resources
- In-repo: lmdeploy/pytorch/disagg/backend/mooncake.py, lmdeploy/pytorch/disagg/config.py (MooncakeEngineConfig, MigrationBackend)
- Mooncake: https://github.com/kvcache-ai/Mooncake
- TurboMind pointers: src/turbomind/models/llama/SequenceManager.* and BlockManager.*, src/turbomind/engine/engine.cc, and the request/scheduling code
Additional context
- Current state: Mooncake PD disaggregation is implemented for PyTorch; TurboMind has no Mooncake / disagg integration.
- Challenges: block/chunk mapping between TurboMind and Mooncake, KV sharding under attn_tp / attn_cp, and scheduling/state machine when waiting for remote KV.
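To make the block/chunk mapping challenge concrete, here is a toy sketch of how one TurboMind paged KV block could map to per-rank transfer chunks when KV heads are sharded across attn_tp ranks. Every layout parameter here is invented for illustration; the real layout additionally depends on KV quantization and per-layer placement.

```python
# Toy model of mapping one paged KV block to per-rank transfer chunks.
# All layout parameters are illustrative, not TurboMind's actual layout.

def block_chunks(block_id, *, block_size_tokens, num_kv_heads, head_dim,
                 dtype_bytes, attn_tp):
    """Return (rank, offset, length) chunks for one KV block.

    Assumes heads are sharded evenly across attn_tp ranks and that each
    rank's slice of a block is contiguous in its own registered buffer.
    """
    assert num_kv_heads % attn_tp == 0
    heads_per_rank = num_kv_heads // attn_tp
    # K and V for this rank's heads over the whole block:
    chunk_len = 2 * block_size_tokens * heads_per_rank * head_dim * dtype_bytes
    base = block_id * chunk_len  # per-rank pools indexed by block id
    return [(rank, base, chunk_len) for rank in range(attn_tp)]

chunks = block_chunks(3, block_size_tokens=64, num_kv_heads=8, head_dim=128,
                      dtype_bytes=2, attn_tp=2)
```

Even this toy version shows why the mapping must live next to BlockManager: the chunk descriptors are a pure function of the block id and the parallelism config, so prefill and decode sides can derive them independently from shared metadata instead of shipping pointer tables.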
- Suggested phases: optional CMake dependency -> metadata/RPC -> memory transfer hooks -> two-machine e2e validation.
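For the first phase, the optional dependency could follow the usual CMake pattern, sketched below. The option name, package name, and targets are placeholders; LMDeploy defines none of them today, and Mooncake's exported CMake target names would need to be checked against its repository.

```cmake
# Sketch only: all names here are assumptions, not existing targets.
option(TURBOMIND_USE_MOONCAKE "Enable Mooncake-based KV migration in TurboMind" OFF)

if(TURBOMIND_USE_MOONCAKE)
  # Placeholder package/target names for Mooncake's C++ transfer engine.
  find_package(Mooncake REQUIRED)
  target_compile_definitions(turbomind PRIVATE TURBOMIND_USE_MOONCAKE=1)
  target_link_libraries(turbomind PRIVATE Mooncake::transfer_engine)
endif()
```

Guarding everything behind one compile definition keeps the default build unchanged, so the TurboMind hot path carries no new dependency unless the flag is enabled.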
One-liner
PyTorch engine supports PD disaggregation and KV migration via Mooncake (and DLSlime); TurboMind does not yet support Mooncake-based PD disaggregation.