Commit 3f12b57

chore: Update README.md
Signed-off-by: Neelay Shah <neelays@nvidia.com>
1 parent: a48ffc5

2 files changed: 63 additions & 6 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ APIs are subject to change.

### Hello World

-[Hello World](./runtime/rust/python-wheel/examples/hello_world)
+[Hello World](./lib/bindings/python/examples/hello_world)

A basic example demonstrating the Rust-based runtime and Python
bindings.

examples/python_rs/llm/vllm/README.md

Lines changed: 62 additions & 5 deletions
@@ -23,9 +23,9 @@ This example demonstrates how to use Triton Distributed to serve large language

Start required services (etcd and NATS):

-Option A: Using [Docker Compose](/runtime/rust/docker-compose.yml) (Recommended)
+Option A: Using [Docker Compose](/deploy/docker-compose.yml) (Recommended)
```bash
-docker-compose up -d
+docker compose -f ./deploy/docker-compose.yml up -d
```
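A quick way to confirm both services came up before proceeding (a sketch; assumes the same compose file):

```bash
# List the compose services; etcd and NATS should report a running state
docker compose -f ./deploy/docker-compose.yml ps
```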

Option B: Manual Setup
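The manual steps themselves sit outside this hunk. Purely as a sketch (assuming `nats-server` and `etcd` are installed locally and their default ports are free), the two services can be started directly:

```bash
# Start NATS with JetStream enabled (default port 4222)
nats-server -js &

# Start a single-member etcd with its default client listener (localhost:2379)
etcd &
```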
@@ -53,7 +53,7 @@ The example is designed to run in a containerized environment using Triton Distr

## Deployment

-#### 1. HTTP Server
+### 1. HTTP Server

Run the server (with debug level logging):
```bash
@@ -66,26 +66,58 @@ Add model to the server:
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B triton-init.vllm.generate
```

+##### Example Output
+```
++------------+------------------------------------------+-------------+-----------+----------+
+| MODEL TYPE | MODEL NAME                               | NAMESPACE   | COMPONENT | ENDPOINT |
++------------+------------------------------------------+-------------+-----------+----------+
+| chat       | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | triton-init | vllm      | generate |
++------------+------------------------------------------+-------------+-----------+----------+
+```
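Because the frontend speaks the OpenAI HTTP API, registration can also be checked over HTTP (a sketch; assumes the server listens on localhost:8080 as in the client examples below and exposes the standard model-listing endpoint):

```bash
# The registered model name should appear in the returned model list
curl localhost:8080/v1/models
```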
+
### 2. Workers

#### 2.1. Monolithic Deployment

In a separate terminal, run the vLLM worker:

```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
# Launch worker
cd /workspace/examples/python_rs/llm/vllm
python3 -m monolith.worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager
```

+##### Example Output
+
+```
+INFO 03-02 05:30:36 __init__.py:190] Automatically detected platform cuda.
+WARNING 03-02 05:30:36 nixl.py:43] NIXL is not available
+
+INFO 03-02 05:30:43 config.py:542] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
+INFO 03-02 05:30:43 base_engine.py:43] Initializing engine client
+INFO 03-02 05:30:43 api_server.py:206] Started engine process with PID 1151
+INFO 03-02 05:30:44 config.py:542] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
+
+<SNIP>
+
+INFO 03-02 05:32:20 llm_engine.py:476] init engine (profile, create kv cache, warmup model) took 4.22 seconds
+
+```
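With the HTTP server, model registration, and a worker in place, the monolithic deployment can be smoke-tested with a request mirroring the client call in section 3 (a sketch; assumes localhost:8080):

```bash
# Minimal chat-completion request against the monolithic deployment
curl localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 30
    }'
```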
+
#### 2.2. Disaggregated Deployment

This deployment option splits the model serving across prefill and decode workers, enabling more efficient resource utilization.

**Terminal 1 - Prefill Worker:**
```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
# Launch prefill worker
cd /workspace/examples/python_rs/llm/vllm
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
@@ -97,8 +129,21 @@ VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python3 -m disaggregat
    '{"kv_connector":"TritonNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
```

+##### Example Output
+
+```
+INFO 03-02 05:59:44 worker.py:269] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.40) = 19.01GiB
+INFO 03-02 05:59:44 worker.py:269] model weights take 14.99GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 2.78GiB.
+INFO 03-02 05:59:44 executor_base.py:110] # CUDA blocks: 1423, # CPU blocks: 2048
+INFO 03-02 05:59:44 executor_base.py:115] Maximum concurrency for 10 tokens per request: 2276.80x
+INFO 03-02 05:59:47 llm_engine.py:476] init engine (profile, create kv cache, warmup model) took 3.41 seconds
+```
+
**Terminal 2 - Decode Worker:**
```bash
+# Activate virtual environment
+source /opt/triton/venv/bin/activate
+
# Launch decode worker
cd /workspace/examples/python_rs/llm/vllm
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=1,2 python3 -m disaggregated.decode_worker \
@@ -112,6 +157,16 @@ VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=1,2 python3 -m disaggreg

The disaggregated deployment utilizes separate GPUs for prefill and decode operations, allowing for optimized resource allocation and improved performance. For more details on the disaggregated deployment, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/features/disagg_prefill.html).
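The decode command's remaining flags fall outside this hunk. As a hypothetical sketch only, the decode side plausibly mirrors the prefill worker's KV-transfer config with the consumer role and the next rank; the lines below are assumptions, not the elided text:

```bash
# Hypothetical decode-side config mirroring the prefill producer shown above:
# kv_role flips to "kv_consumer" and kv_rank advances to 1 (both assumptions);
# any intermediate flags from the full example are omitted here.
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=1,2 python3 -m disaggregated.decode_worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager \
    '{"kv_connector":"TritonNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
```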

+##### Example Output
+
+```
+INFO 03-02 05:59:44 worker.py:269] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.40) = 19.01GiB
+INFO 03-02 05:59:44 worker.py:269] model weights take 14.99GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 2.78GiB.
+INFO 03-02 05:59:44 executor_base.py:110] # CUDA blocks: 1423, # CPU blocks: 2048
+INFO 03-02 05:59:44 executor_base.py:115] Maximum concurrency for 10 tokens per request: 2276.80x
+INFO 03-02 05:59:47 llm_engine.py:476] init engine (profile, create kv cache, warmup model) took 3.41 seconds
+```
+

### 3. Client

@@ -126,7 +181,8 @@ curl localhost:8080/v1/chat/completions \
}'
```

-Expected output:
+##### Example Output
+
```json
{
  "id": "5b04e7b0-0dcd-4c45-baa0-1d03d924010c",
@@ -264,7 +320,8 @@ curl localhost:8080/v1/chat/completions -H "Content-Type: application/json"
  "max_tokens": 30
}'
```
-Expected output:
+##### Example Output
+
```json
{
  "id": "f435d1aa-d423-40a0-a616-00bc428a3e32",
