DRAFT: feat(power): vendor-agnostic power/energy monitoring by viraatc · Pull Request #383 · mlcommons/endpoints

viraatc · 2026-06-30T21:06:23Z

What

First-class, opt-in, vendor-neutral power/energy monitoring. When settings.power.source is set, a best-effort sidecar streams telemetry during the performance phase; at finalization the trace is windowed, integrated into energy, and written to a sibling power.json (+ a report.txt section). Report / result_summary.json stay snapshot-pure — mirroring the profiling precedent.

Design: general framework + pluggable sources

The core PowerConfig is vendor-neutral — it knows nothing about GPUs/Prometheus. A source is a tiny registered builder (@power_source("name")) that turns config into a sidecar argv + parse spec. nvidia_smi is just the first reference source; all NVIDIA-specific bits live in its builder, not the schema.

from inference_endpoint.power import power_source, ResolvedSource

@power_source("redfish")                     # users add their own, no core edits
def _build(cfg):
    return ResolvedSource(argv=["redfish-power", cfg.options["bmc"]], fmt="jsonl",
                          value_kind="power_w", ts_field="ts", value_field="value",
                          label_field="label", csv_header=False)

Minimal exposed config: core = source, interval_s, options (+ 3 hidden advanced). Per-source settings live in options (nvidia_smi: {gpu_indices}, prometheus: {url, query}, command: {argv}); each builder owns + validates its own keys.
Built-ins: nvidia_smi, prometheus (DCGM exporter for remote GPUs), and command (any program emitting canonical JSONL — the process boundary is the plugin API for non-Python users).
Disabled = zero footprint: source=None serializes to {}.

Robustness / correctness

Sidecar via plain subprocess.Popen (not ServiceLauncher, which aborts the run on no-ready-signal); own process group, SIGTERM→SIGKILL, PR_SET_PDEATHSIG backstop, stdbuf line-buffering. Every path swallows errors — a broken collector never fails or perturbs the benchmark.
Windowed to the performance phase via the existing monotonic→wall anchor; trapezoid integration (power_w) or counter-delta (energy_j).
energy_per_output_token_j emitted only when tokens and energy share the same window, else suppressed with a note (no mixed-denominator metric).

Files

New src/inference_endpoint/power/ (sources.py registry, collector.py, parse.py, window.py, render.py, prom_poll.py); settings.power in config/schema.py; integration in commands/benchmark/execute.py. Docs: docs/POWER_MONITORING.md; example: examples/11_Power_Monitoring/.

Test plan

tests/unit/power/ — config/serialization, plugin registration, unknown-source error, nvidia/JSONL parsing (+ malformed drops), trapezoid + counter integration, J/token consistency guard, status precedence, end-to-end sidecar. uv run pytest tests/unit/{power,config,commands} → 231 passed; pre-commit run --all-files green.

Follow-ups (v2)

Multi-source totals + scope-aware double-count guards; entry-point auto-discovery (intentionally omitted — the in-process registry + command cover today's needs). Tracked alongside the execute.py refactor in #384.

🤖 Generated with Claude Code

github-actions · 2026-06-30T21:06:34Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces vendor-agnostic power and energy monitoring telemetry to benchmark runs. It launches a sidecar process during the performance phase to stream power samples from built-in sources (nvidia_smi, prometheus) or custom commands, subsequently writing the integrated energy metrics to a sibling power.json file and report.txt. The review feedback highlights several robustness and compatibility issues: potential uncaught exceptions in parse.py when handling malformed JSONL or non-integer CSV fields; a design bug in templates.py where headerless CSV commands fail with a KeyError; an incomplete counter-reset check in window.py that can miss intermediate resets; a Windows compatibility issue in collector.py due to platform-specific os functions; and a potential TypeError in prom_poll.py if a Prometheus series contains a null metric.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

viraatc · 2026-06-30T21:15:06Z

Note: this PR threads power start/stop hooks + a payload builder into the already-large commands/benchmark/execute.py (~1270 lines; _run_benchmark_async ~370). That growth-per-sidecar smell is tracked separately in #384 (refactor execute.py behind a uniform run-hook seam). Keeping this PR additive and behavior-preserving; the decomposition is intentionally deferred to #384.

Add first-class, opt-in power telemetry collected during the performance phase and written to a sibling power.json (+ a report.txt section); the Report stays a pure snapshot-derived struct, mirroring the profiling precedent. Design (council-reviewed): - Sources: nvidia_smi and prometheus built-ins, plus a `command` escape hatch — any program that prints one sample/line plugs in with no core changes. The process boundary is the agnosticism boundary. - Collection: best-effort sidecar subprocess (own process group, SIGTERM->SIGKILL) so it never perturbs the hot path or fails the run. - Windowing: trace sliced to the PERFORMANCE phase via the shared monotonic->wall anchor; trapezoid integration (power_w) or counter delta (energy_j). - energy_per_output_token_j emitted only when token count and energy share the same window, else suppressed with a note (avoids mixed denominators). Config under settings.power; docs in docs/POWER_MONITORING.md; example in examples/11_Power_Monitoring/. Unit tests cover parsing, integration, consistency guard, status precedence, and an end-to-end sidecar. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- parse.py _parse_jsonl: skip non-dict JSONL lines (a list/scalar would raise AttributeError on obj.get, escaping the except). - parse.py _parse_csv: wrap the field-index int() conversions in try/except so a bad mapping degrades to empty rather than raising in finalization. - window.py _energy_j: detect mid-run counter resets by summing non-negative adjacent deltas (last-minus-first missed an intermediate reset). - collector.py _signal_group: guard os.killpg/getpgid with hasattr for platforms without process groups (Windows). - prom_poll.py _series_label: handle a null metric dict (accept None, check first) to avoid TypeError on `key in None`. - test: add mid-run counter-reset regression test. (The templates.py headerless-command-CSV thread is obsolete: command sources are JSONL-only now; CSV is the built-in nvidia-smi source.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

viraatc requested a review from a team June 30, 2026 21:06

github-actions Bot requested review from arekay-nv and nvzhihanj June 30, 2026 21:06

gemini-code-assist Bot reviewed Jun 30, 2026

View reviewed changes

viraatc mentioned this pull request Jun 30, 2026

Refactor commands/benchmark/execute.py — split the benchmark God-module #384

Open

viraatc force-pushed the feat/power-monitoring branch 5 times, most recently from daa231e to 5fd49ee Compare June 30, 2026 21:57

viraatc force-pushed the feat/power-monitoring branch from 5fd49ee to 8a5b43d Compare June 30, 2026 22:04

viraatc changed the title ~~feat(power): vendor-agnostic power/energy monitoring~~ DRAFT: feat(power): vendor-agnostic power/energy monitoring Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DRAFT: feat(power): vendor-agnostic power/energy monitoring#383

DRAFT: feat(power): vendor-agnostic power/energy monitoring#383
viraatc wants to merge 2 commits into
mainfrom
feat/power-monitoring

viraatc commented Jun 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viraatc commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

viraatc commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Design: general framework + pluggable sources

Robustness / correctness

Files

Test plan

Follow-ups (v2)

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viraatc commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

viraatc commented Jun 30, 2026 •

edited

Loading