Skip to content

DRAFT: feat(power): vendor-agnostic power/energy monitoring#383

Open
viraatc wants to merge 2 commits into
mainfrom
feat/power-monitoring
Open

DRAFT: feat(power): vendor-agnostic power/energy monitoring#383
viraatc wants to merge 2 commits into
mainfrom
feat/power-monitoring

Conversation

@viraatc

@viraatc viraatc commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

What

First-class, opt-in, vendor-neutral power/energy monitoring. When settings.power.source is set, a best-effort sidecar streams telemetry during the performance phase; at finalization the trace is windowed, integrated into energy, and written to a sibling power.json (+ a report.txt section). Report / result_summary.json stay snapshot-pure — mirroring the profiling precedent.

Design: general framework + pluggable sources

The core PowerConfig is vendor-neutral — it knows nothing about GPUs/Prometheus. A source is a tiny registered builder (@power_source("name")) that turns config into a sidecar argv + parse spec. nvidia_smi is just the first reference source; all NVIDIA-specific bits live in its builder, not the schema.

from inference_endpoint.power import power_source, ResolvedSource

@power_source("redfish")                     # users add their own, no core edits
def _build(cfg):
    return ResolvedSource(argv=["redfish-power", cfg.options["bmc"]], fmt="jsonl",
                          value_kind="power_w", ts_field="ts", value_field="value",
                          label_field="label", csv_header=False)
  • Minimal exposed config: core = source, interval_s, options (+ 3 hidden advanced). Per-source settings live in options (nvidia_smi: {gpu_indices}, prometheus: {url, query}, command: {argv}); each builder owns + validates its own keys.
  • Built-ins: nvidia_smi, prometheus (DCGM exporter for remote GPUs), and command (any program emitting canonical JSONL — the process boundary is the plugin API for non-Python users).
  • Disabled = zero footprint: source=None serializes to {}.

Robustness / correctness

  • Sidecar via plain subprocess.Popen (not ServiceLauncher, which aborts the run on no-ready-signal); own process group, SIGTERM→SIGKILL, PR_SET_PDEATHSIG backstop, stdbuf line-buffering. Every path swallows errors — a broken collector never fails or perturbs the benchmark.
  • Windowed to the performance phase via the existing monotonic→wall anchor; trapezoid integration (power_w) or counter-delta (energy_j).
  • energy_per_output_token_j emitted only when tokens and energy share the same window, else suppressed with a note (no mixed-denominator metric).

Files

New src/inference_endpoint/power/ (sources.py registry, collector.py, parse.py, window.py, render.py, prom_poll.py); settings.power in config/schema.py; integration in commands/benchmark/execute.py. Docs: docs/POWER_MONITORING.md; example: examples/11_Power_Monitoring/.

Test plan

tests/unit/power/ — config/serialization, plugin registration, unknown-source error, nvidia/JSONL parsing (+ malformed drops), trapezoid + counter integration, J/token consistency guard, status precedence, end-to-end sidecar. uv run pytest tests/unit/{power,config,commands}231 passed; pre-commit run --all-files green.

Follow-ups (v2)

Multi-source totals + scope-aware double-count guards; entry-point auto-discovery (intentionally omitted — the in-process registry + command cover today's needs). Tracked alongside the execute.py refactor in #384.

🤖 Generated with Claude Code

@viraatc viraatc requested a review from a team June 30, 2026 21:06
@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj June 30, 2026 21:06
@github-actions

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces vendor-agnostic power and energy monitoring telemetry to benchmark runs. It launches a sidecar process during the performance phase to stream power samples from built-in sources (nvidia_smi, prometheus) or custom commands, subsequently writing the integrated energy metrics to a sibling power.json file and report.txt. The review feedback highlights several robustness and compatibility issues: potential uncaught exceptions in parse.py when handling malformed JSONL or non-integer CSV fields; a design bug in templates.py where headerless CSV commands fail with a KeyError; an incomplete counter-reset check in window.py that can miss intermediate resets; a Windows compatibility issue in collector.py due to platform-specific os functions; and a potential TypeError in prom_poll.py if a Prometheus series contains a null metric.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/power/parse.py
Comment thread src/inference_endpoint/power/templates.py Outdated
Comment thread src/inference_endpoint/power/window.py Outdated
Comment thread src/inference_endpoint/power/collector.py
Comment thread src/inference_endpoint/power/prom_poll.py Outdated
Comment thread src/inference_endpoint/power/parse.py
@viraatc

viraatc commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator Author

Note: this PR threads power start/stop hooks + a payload builder into the already-large commands/benchmark/execute.py (~1270 lines; _run_benchmark_async ~370). That growth-per-sidecar smell is tracked separately in #384 (refactor execute.py behind a uniform run-hook seam). Keeping this PR additive and behavior-preserving; the decomposition is intentionally deferred to #384.

@viraatc viraatc force-pushed the feat/power-monitoring branch 5 times, most recently from daa231e to 5fd49ee Compare June 30, 2026 21:57
Add first-class, opt-in power telemetry collected during the performance
phase and written to a sibling power.json (+ a report.txt section); the
Report stays a pure snapshot-derived struct, mirroring the profiling
precedent.

Design (council-reviewed):
- Sources: nvidia_smi and prometheus built-ins, plus a `command` escape
  hatch — any program that prints one sample/line plugs in with no core
  changes. The process boundary is the agnosticism boundary.
- Collection: best-effort sidecar subprocess (own process group,
  SIGTERM->SIGKILL) so it never perturbs the hot path or fails the run.
- Windowing: trace sliced to the PERFORMANCE phase via the shared
  monotonic->wall anchor; trapezoid integration (power_w) or counter delta
  (energy_j).
- energy_per_output_token_j emitted only when token count and energy share
  the same window, else suppressed with a note (avoids mixed denominators).

Config under settings.power; docs in docs/POWER_MONITORING.md; example in
examples/11_Power_Monitoring/. Unit tests cover parsing, integration,
consistency guard, status precedence, and an end-to-end sidecar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@viraatc viraatc force-pushed the feat/power-monitoring branch from 5fd49ee to 8a5b43d Compare June 30, 2026 22:04
@viraatc viraatc changed the title feat(power): vendor-agnostic power/energy monitoring DRAFT: feat(power): vendor-agnostic power/energy monitoring Jul 1, 2026
- parse.py _parse_jsonl: skip non-dict JSONL lines (a list/scalar would
  raise AttributeError on obj.get, escaping the except).
- parse.py _parse_csv: wrap the field-index int() conversions in try/except
  so a bad mapping degrades to empty rather than raising in finalization.
- window.py _energy_j: detect mid-run counter resets by summing non-negative
  adjacent deltas (last-minus-first missed an intermediate reset).
- collector.py _signal_group: guard os.killpg/getpgid with hasattr for
  platforms without process groups (Windows).
- prom_poll.py _series_label: handle a null metric dict (accept None, check
  first) to avoid TypeError on `key in None`.
- test: add mid-run counter-reset regression test.

(The templates.py headerless-command-CSV thread is obsolete: command sources
are JSONL-only now; CSV is the built-in nvidia-smi source.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant