Commit 7bdb9f6

feat(telemetry): latency histograms for LLM request duration and TTFB (#463) (#782)
* feat(telemetry): latency histograms for LLM request duration and TTFB (#463)

  Adds request duration and time-to-first-token (TTFB) latency histograms via
  the plugin pattern established in #653. Includes custom OTel bucket views
  sized for LLM latencies, backend telemetry field assertions across all
  backends, and updated dev/published docs.

  Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* fix(telemetry): address PR review feedback on latency histograms

  - Fix TTFB measurement: capture the timestamp on the first chunk in both the
    non-blocking drain and blocking min-chunk loops, not after both loops
    complete (previously measured time-to-Nth-chunk, not time-to-first)
  - Add an inline comment explaining that the `@@@stream@@@` hard-coded string
    avoids a circular import between core and backends
  - Remove invalid `enable_metrics()` calls from plugin docstring examples;
    metrics are enabled via the `MELLEA_METRICS_ENABLED` env var only
  - Move `_METRICS_PLUGIN_CLASSES` to `metrics_plugins.py` alongside the
    classes it describes; import it in the `metrics.py` registration block
  - Change `LatencyMetricsPlugin` priority from 50 to 51 so plugins have a
    distinct execution order

  Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* refactor: extract `_record_ttfb` helper to deduplicate TTFB logic

  Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
1 parent 679b50f commit 7bdb9f6

19 files changed

Lines changed: 749 additions & 155 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ mkdir -p .bob && ln -s ../.agents/skills .bob/skills
 - Use `...` in `@generative` function bodies
 - Prefer primitives over classes
 - **Friendly Dependency Errors**: Wraps optional backend imports in `try/except ImportError` with a helpful message (e.g., "Please pip install mellea[hf]"). See `mellea/stdlib/session.py` for examples.
-- **Backend telemetry fields**: All backends must populate `mot.usage` (dict with `prompt_tokens`, `completion_tokens`, `total_tokens`), `mot.model` (str), and `mot.provider` (str) in their `post_processing()` method. Metrics are automatically recorded by `TokenMetricsPlugin` — don't add manual `record_token_usage_metrics()` calls.
+- **Backend telemetry fields**: All backends must populate `mot.usage` (dict with `prompt_tokens`, `completion_tokens`, `total_tokens`), `mot.model` (str), and `mot.provider` (str) in their `post_processing()` method. `mot.streaming` (bool) and `mot.ttfb_ms` (float | None) are set automatically in `astream()` — backends do not need to set them. Metrics are automatically recorded by `TokenMetricsPlugin` and `LatencyMetricsPlugin` — don't add manual `record_token_usage_metrics()` or `record_request_duration()` calls.

 ## 6. Commits & Hooks
 [Angular format](https://github.com/angular/angular/blob/main/CONTRIBUTING.md#commit): `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `release:`

docs/docs/evaluation-and-observability/metrics.md

Lines changed: 72 additions & 9 deletions
@@ -1,15 +1,15 @@
 ---
 title: "Metrics"
-description: "Collect token usage metrics and instrument your own code with OpenTelemetry counters, histograms, and up-down counters."
+description: "Collect token usage and latency metrics, and instrument your own code with OpenTelemetry counters, histograms, and up-down counters."
 # diataxis: how-to
 ---

 **Prerequisites:** [Telemetry](../evaluation-and-observability/telemetry)
 introduces the environment variables and telemetry architecture. This page
 covers metrics collection in detail.

-Mellea automatically tracks token consumption across all backends using
-OpenTelemetry metrics counters. Token metrics follow the
+Mellea automatically tracks token consumption and request latency across all
+backends using OpenTelemetry metrics. Metrics follow the
 [Gen-AI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/)
 for standardized observability. The metrics API also lets you create your own
 counters, histograms, and up-down counters for application-level instrumentation.
@@ -82,6 +82,57 @@ await mot.astream()
 await mot.avalue()
 ```

+## Latency histograms
+
+Mellea tracks request duration and time-to-first-token (TTFB) automatically
+after each LLM call. The `LatencyMetricsPlugin` auto-registers when
+`MELLEA_METRICS_ENABLED=true` alongside `TokenMetricsPlugin`. No code changes
+are required.
+
+### Latency instruments
+
+| Metric Name | Type | Unit | Description |
+| ----------- | ---- | ---- | ----------- |
+| `mellea.llm.request.duration` | Histogram | `s` | Total request duration, from call to full response |
+| `mellea.llm.ttfb` | Histogram | `s` | Time to first token (streaming requests only) |
+
+### Latency attributes
+
+| Attribute | Description | Example Values |
+| --------- | ----------- | -------------- |
+| `gen_ai.provider.name` | Backend provider name | `openai`, `ollama`, `watsonx`, `litellm`, `huggingface` |
+| `gen_ai.request.model` | Model identifier | `gpt-4`, `llama3.2:7b`, `granite-3.1-8b-instruct` |
+| `streaming` | Whether streaming mode was used (duration only) | `True`, `False` |
+
+### Histogram buckets
+
+Custom bucket boundaries are configured for LLM-sized latencies:
+
+- **`mellea.llm.request.duration`**: `0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120` seconds
+- **`mellea.llm.ttfb`**: `0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10` seconds
+
+### Latency recording timing
+
+- **`mellea.llm.request.duration`**: Recorded for every `generate_from_context` call,
+  both streaming and non-streaming.
+- **`mellea.llm.ttfb`**: Recorded only for streaming requests, measuring elapsed time
+  from the `generate_from_context` call until the first chunk arrives.
+
+Access latency data directly from a `ModelOutputThunk`:
+
+```python
+from mellea import start_session
+from mellea.backends import ModelOption
+
+with start_session() as m:
+    result = m.instruct(
+        "Explain quantum entanglement briefly",
+        model_options={ModelOption.STREAM: True},
+    )
+    if result.streaming and result.ttfb_ms is not None:
+        print(f"Time to first token: {result.ttfb_ms:.1f} ms")
+```
+
 ## Metrics export configuration

 Mellea supports multiple metrics exporters that can be used independently or
@@ -283,13 +334,14 @@ Check if metrics are enabled:
 from mellea.telemetry import is_metrics_enabled

 if is_metrics_enabled():
-    print("Token metrics are being collected")
+    print("Metrics are being collected")
 ```

-Access token usage data from a `ModelOutputThunk`:
+Access token usage and latency data from a `ModelOutputThunk`:

 ```python
 from mellea import start_session
+from mellea.backends import ModelOption

 with start_session() as m:
     result = m.instruct("Write a haiku about programming")
@@ -298,18 +350,29 @@ with start_session() as m:
     print(f"Prompt tokens: {result.usage['prompt_tokens']}")
     print(f"Completion tokens: {result.usage['completion_tokens']}")
     print(f"Total tokens: {result.usage['total_tokens']}")
+
+    # Streaming mode also exposes time-to-first-token
+    streamed = m.instruct(
+        "Describe the solar system",
+        model_options={ModelOption.STREAM: True},
+    )
+    print(f"Streaming: {streamed.streaming}")
+    if streamed.ttfb_ms is not None:
+        print(f"Time to first token: {streamed.ttfb_ms:.1f} ms")
 ```

 The `usage` field is a dictionary with three keys: `prompt_tokens`,
 `completion_tokens`, and `total_tokens`. All backends populate this field
-consistently.
+consistently. `streaming` and `ttfb_ms` are set automatically based on whether
+streaming mode was used.

 ## Performance

 - **Zero overhead when disabled**: When `MELLEA_METRICS_ENABLED=false` (default),
-  the `TokenMetricsPlugin` is not registered and all instrument calls are no-ops.
-- **Minimal overhead when enabled**: Counter increments are extremely fast
-  (~nanoseconds per operation).
+  neither `TokenMetricsPlugin` nor `LatencyMetricsPlugin` is registered and all
+  instrument calls are no-ops.
+- **Minimal overhead when enabled**: Counter increments and histogram recordings
+  are extremely fast (~nanoseconds per operation).
 - **Async export**: Metrics are batched and exported asynchronously (default:
   every 60 seconds).
 - **Non-blocking**: Metric recording never blocks LLM calls.
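To see how the documented bucket boundaries map a measurement to a bucket, here is a small sketch of OTel-style explicit-bucket selection (upper-inclusive boundaries, with a final overflow bucket above the last boundary). The `bucket_index` helper is illustrative only — it is not part of mellea or the OTel SDK:

```python
import bisect

# Bucket boundaries from the metrics documentation above, in seconds.
DURATION_BUCKETS = [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120]
TTFB_BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10]


def bucket_index(value_s: float, boundaries: list[float]) -> int:
    """Return the index of the histogram bucket a measurement falls into.

    Explicit-bucket histograms treat boundaries as upper-inclusive:
    bucket i covers (boundaries[i-1], boundaries[i]], and values above
    the last boundary land in the overflow bucket at index len(boundaries).
    """
    # bisect_left finds the first boundary >= value, which is exactly
    # the upper-inclusive bucket index.
    return bisect.bisect_left(boundaries, value_s)
```

For example, a 0.3 s request falls in the `(0.25, 0.5]` bucket, and a 200 s request lands in the overflow bucket.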

docs/docs/evaluation-and-observability/telemetry.md

Lines changed: 11 additions & 7 deletions
@@ -116,11 +116,14 @@ exporter configuration (Jaeger, Grafana Tempo, etc.), and debugging guidance.

 ## Metrics

-Mellea automatically tracks token consumption across all backends using
-OpenTelemetry counters (`mellea.llm.tokens.input` and
-`mellea.llm.tokens.output`). No code changes are required — the
-`TokenMetricsPlugin` records metrics via the plugin hook system after each
-LLM call completes.
+Mellea automatically tracks token consumption and request latency across all
+backends using OpenTelemetry metrics. No code changes are required — two
+plugins hook into the generation pipeline and record metrics automatically:
+
+- **`TokenMetricsPlugin`** — records `mellea.llm.tokens.input` and
+  `mellea.llm.tokens.output` counters after each LLM call.
+- **`LatencyMetricsPlugin`** — records `mellea.llm.request.duration` (every
+  request) and `mellea.llm.ttfb` (streaming requests only) histograms.

 The metrics API also exposes `create_counter`, `create_histogram`, and
 `create_up_down_counter` for instrumenting your own application code.
@@ -131,8 +134,9 @@ Mellea supports three exporters that can run simultaneously:
 - **OTLP** — export to production observability platforms
 - **Prometheus** — register with `prometheus_client` for scraping

-See [Metrics](../evaluation-and-observability/metrics) for token usage details,
-backend support matrix, exporter setup, custom instruments, and troubleshooting.
+See [Metrics](../evaluation-and-observability/metrics) for token usage and
+latency details, backend support matrix, exporter setup, custom instruments,
+and troubleshooting.

 ## Logging

docs/examples/telemetry/metrics_example.py

Lines changed: 15 additions & 4 deletions
@@ -2,7 +2,7 @@

 """Example demonstrating OpenTelemetry metrics exporters in Mellea.

-This example shows how to use token usage metrics with different exporters:
+This example shows how to use metrics with different exporters:
 - Console: Print metrics to console for debugging
 - OTLP: Export to OpenTelemetry Protocol collectors
 - Prometheus: Expose HTTP endpoint for Prometheus scraping
@@ -42,6 +42,7 @@
 import os

 from mellea import generative, start_session
+from mellea.backends import ModelOption
 from mellea.stdlib.requirements import req


@@ -58,7 +59,7 @@ def translate_to_spanish(text: str) -> str:
 def main():
     """Run example with metrics collection."""
     print("=" * 60)
-    print("Mellea Token Metrics Example")
+    print("Mellea Metrics Example")
     print("=" * 60)

     # Check if metrics are enabled
@@ -70,7 +71,7 @@ def main():
         print("=" * 60)
         return

-    print("✓ Token metrics enabled")
+    print("Metrics enabled")

     # When Prometheus is enabled, start an HTTP server to expose metrics
     if os.getenv("MELLEA_METRICS_PROMETHEUS", "false").lower() == "true":
@@ -118,8 +119,18 @@ def main():
         response = m.chat("What is the capital of France?")
         print(f"Response: {str(response)[:100]}...")

+        # Example 5: Streaming with latency metrics
+        print("\n5. Streaming request (latency metrics)...")
+        streamed = m.instruct(
+            "Name three programming languages in one sentence.",
+            model_options={ModelOption.STREAM: True},
+        )
+        print(f"Response: {str(streamed)[:100]}...")
+        if streamed.streaming and streamed.ttfb_ms is not None:
+            print(f" -> Time to first token: {streamed.ttfb_ms:.1f} ms")
+
     print("\n" + "=" * 60)
-    print("Example complete! Token metrics recorded.")
+    print("Example complete! Metrics recorded.")

     # When Prometheus is enabled, keep the process running so the endpoint can be scraped
     if os.getenv("MELLEA_METRICS_PROMETHEUS", "false").lower() == "true":

mellea/core/base.py

Lines changed: 37 additions & 0 deletions
@@ -306,6 +306,19 @@ def __init__(
         Populated by backends. None if unavailable.
         """

+        self.ttfb_ms: float | None = None
+        """Time to first token in milliseconds (streaming only).
+
+        Set when the first chunk is received from the backend.
+        None for non-streaming requests or when not measured.
+        """
+
+        self.streaming: bool = False
+        """Whether this generation used streaming mode.
+
+        Set from model options at the start of astream().
+        """
+
         # Used for tracking generation.
         self._context: list[Component | CBlock] | None = None
         self._action: Component | CBlock | None = None
@@ -328,8 +341,21 @@ def __init__(
         self._on_computed: Callable[[ModelOutputThunk], Coroutine] | None = None

         self._start: datetime.datetime | None = None
+        self._first_chunk_received: bool = False
         self._generate_log: GenerateLog | None = None

+    def _record_ttfb(self) -> None:
+        """Record time-to-first-byte if streaming and not yet recorded."""
+        if (
+            self.streaming
+            and not self._first_chunk_received
+            and self._start is not None
+        ):
+            self.ttfb_ms = (
+                datetime.datetime.now() - self._start
+            ).total_seconds() * 1000
+            self._first_chunk_received = True
+
     def _copy_from(self, other: ModelOutputThunk) -> None:
         """Copy computed-output fields from *other* into *self*.

@@ -345,6 +371,8 @@ def _copy_from(self, other: ModelOutputThunk) -> None:
         self.usage = other.usage
         self.model = other.model
         self.provider = other.provider
+        self.ttfb_ms = other.ttfb_ms
+        self.streaming = other.streaming
         self._generate_log = other._generate_log

     def is_computed(self) -> bool:
@@ -418,6 +446,9 @@ async def astream(self) -> str:
         )

         do_set_computed = False
+        # Use string directly to avoid importing ModelOption from backends into core (circular import).
+        # ModelOption.STREAM is defined in mellea/backends/model_options.py.
+        self.streaming = bool((self._model_options or {}).get("@@@stream@@@", False))

         if not self._generate_type == GenerateType.ASYNC:
             raise RuntimeError(
@@ -434,6 +465,7 @@ async def astream(self) -> str:
             try:
                 item = self._async_queue.get_nowait()
                 chunks.append(item)
+                self._record_ttfb()
             except asyncio.QueueEmpty:
                 # We've exhausted the current items in the queue.
                 break
@@ -449,6 +481,7 @@ async def astream(self) -> str:

             item = await self._async_queue.get()
             chunks.append(item)
+            self._record_ttfb()

         # Process the sentinel value if it's there.
         if chunks[-1] is None:
@@ -561,6 +594,8 @@ def __copy__(self) -> ModelOutputThunk:
         copied.usage = self.usage
         copied.model = self.model
         copied.provider = self.provider
+        copied.ttfb_ms = self.ttfb_ms
+        copied.streaming = self.streaming
         return copied

     def __deepcopy__(self, memo: dict) -> ModelOutputThunk:
@@ -593,6 +628,8 @@ def __deepcopy__(self, memo: dict) -> ModelOutputThunk:
         deepcopied.usage = deepcopy(self.usage) if self.usage else None
         deepcopied.model = self.model
         deepcopied.provider = self.provider
+        deepcopied.ttfb_ms = self.ttfb_ms
+        deepcopied.streaming = self.streaming
         return deepcopied
mellea/telemetry/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -67,7 +67,9 @@ def my_function():
     create_histogram,
     create_up_down_counter,
     is_metrics_enabled,
+    record_request_duration,
     record_token_usage_metrics,
+    record_ttfb,
 )
 from .tracing import (
     end_backend_span,
@@ -89,7 +91,9 @@ def my_function():
     "is_application_tracing_enabled",
     "is_backend_tracing_enabled",
     "is_metrics_enabled",
+    "record_request_duration",
     "record_token_usage_metrics",
+    "record_ttfb",
     "set_span_attribute",
     "set_span_error",
     "start_backend_span",
