tools: add cephtrace_report.py - a visual summary of trace output by taodd · Pull Request #140 · taodd/cephtrace

taodd · 2026-06-13T07:42:29Z

Summary

radostrace and osdtrace stream one line per IO - thousands of rows per capture. The existing analyzers turn that into numeric tables but draw nothing, and you have to know which flags to run. tools/cephtrace_report.py is the zero-flag visual glance: point it at a log (or pipe one in) and it auto-detects the tool, then prints bar charts plus a log-scale latency histogram so a capture's shape is understandable at a glance.

osdtrace (real output, ~116k ops)

where the time goes   (share of total latency, avg 4.1ms/op)
     messenger  ▏                                        0%
         queue  ███▋                                     9%
           osd  ▉                                        2%
     bluestore  ████████████████▌                        41%
     +kv_commit ██████████████▊                          37%  (rocksdb commit)

op latency distribution   p50 4.0ms   p95 9.5ms   p99 11.3ms   max 30.1ms
       <8ms  ████████████████████████████████████████ 39612
      <16ms  ██████████████████▍                      18164

The "where the time goes" block turns osdtrace's dense per-stage numbers into an instant "it's RocksDB commit, not the network or queue." A per-OSD p95 block appears when more than one OSD is captured, so an outlier OSD stands out directly.
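For readers who want to see how a block like this can be drawn, here is a minimal sketch. It is not taken from cephtrace_report.py: the function names, the 40-cell width, and the choice of denominator (sum of the stage totals passed in) are assumptions made for illustration, and the stage totals in the usage lines are made-up numbers.

```python
# Illustrative sketch only; names and the denominator are assumptions,
# not the code in tools/cephtrace_report.py.
BLOCKS = " ▏▎▍▌▋▊▉█"  # BLOCKS[i] fills i/8 of one character cell


def bar(fraction, width=40):
    """Render a 0..1 fraction as a bar with 1/8-cell resolution."""
    fraction = max(0.0, min(1.0, fraction))
    eighths = round(fraction * width * 8)
    full, rem = divmod(eighths, 8)
    text = "█" * full + (BLOCKS[rem] if rem else "")
    return text.ljust(width)


def stage_shares(totals):
    """Map each stage to its share of the summed latency."""
    grand = sum(totals.values())
    if grand == 0:
        return {name: 0.0 for name in totals}
    return {name: value / grand for name, value in totals.items()}


def render(totals):
    """Return one right-aligned label, bar, and percentage per stage."""
    lines = []
    for name, share in stage_shares(totals).items():
        lines.append(f"{name:>12}  {bar(share)} {share * 100:.0f}%")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical per-stage latency totals in milliseconds.
    print(render({"messenger": 2.0, "queue": 45.0, "osd": 10.0, "bluestore": 205.0}))
```

The eighth-block characters give each bar eight steps per cell, which is why a 2% stage can still show a visible sliver in a 40-cell bar.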

radostrace (real output)

radostrace summary - 353 client ops
    reads      152  ██████████████████████████████▎          43%
    writes     201  ████████████████████████████████████████ 56%

latency distribution   p50 3.5ms   p95 15.6ms   p99 19.4ms   max 19.7ms

Plus per-pool bars and a slowest-acting-set culprit hint.
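The latency distribution lines in both reports combine percentiles with a log-scale histogram. The sketch below shows one way to compute them, assuming power-of-two bucket bounds (inferred from the `<8ms` / `<16ms` labels above) and nearest-rank percentiles; the real tool's bucket edges and percentile method may differ, and both function names are hypothetical.

```python
# Illustrative sketch; bucket bounds and percentile method are assumptions.
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def log2_histogram(latencies_ms):
    """Count samples per power-of-two upper bound: <1ms, <2ms, <4ms, ..."""
    counts = {}
    for value in latencies_ms:
        bound = 1
        while value >= bound:  # smallest power of two strictly above value
            bound *= 2
        counts[bound] = counts.get(bound, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    samples = sorted([0.4, 3.5, 4.0, 7.9, 9.5, 30.1])  # made-up latencies
    print("p50", percentile(samples, 50), "max", samples[-1])
    for bound, count in log2_histogram(samples).items():
        print(f"{'<' + str(bound) + 'ms':>8}  {count}")
```

Doubling bucket widths keep a long tail readable: a 30ms outlier costs a few extra rows instead of stretching a linear axis.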

Design: reuse, don't duplicate

It deliberately does not add a second parser. Having read both existing analyzers end-to-end, this tool:

reuses analyze_osdtrace_output.parse_line (handles op_r/op_w/subop_r/subop_w, optional peers, optional bluestore details)
reuses analyze_radostrace_output.detect_file_format and the same column layout

so it can't drift from them. It complements them - it's the visual glance; the analyzers remain for the deep-dive numbers (percentile tables, -i stage contribution, iterative host-mapped culprit ranking), and the report links to each.

Quality / CI

Pure Python 3 stdlib, no dependencies.
flake8-clean, pylint 10.00/10. (Filename is snake_case - required by the lint's naming rule and so the pytest can import it - matching every other tool in tools/.)
Offline pytest smoke tests over a small committed osdtrace sample + the existing radostrace samples, covering both formats, the no-rows error path, --help, and run-as-script. These run automatically via the existing tox -e test in the "Run Tests" workflow.
Bold headings only on a TTY, so piping/redirecting stays clean.
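The TTY-only bold behaviour in the last point is a common pattern; a minimal sketch follows. The `heading` helper is a hypothetical name, not necessarily what the tool uses, and the escape codes are the standard ANSI bold/reset sequences.

```python
# Illustrative sketch; `heading` is a hypothetical helper name.
import sys

BOLD, RESET = "\033[1m", "\033[0m"  # standard ANSI bold on / attributes off


def heading(text, stream=None):
    """Wrap text in bold escapes only when the stream is a terminal."""
    stream = sys.stdout if stream is None else stream
    if hasattr(stream, "isatty") and stream.isatty():
        return f"{BOLD}{text}{RESET}"
    return text


if __name__ == "__main__":
    # Bold in a terminal, plain when piped to a file or another program.
    print(heading("where the time goes"))
```

Checking `isatty()` at print time means `report > out.txt` or `report | less` never contains raw escape sequences.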

Docs

New doc/visualizing-output.md (with real example output), plus pointers from analyze-osdtrace.md, analyze-radostrace.md, and the README.

Verified

Run against real captures from the test VMs (single-OSD squid, 3-OSD tentacle, both space and -o CSV formats); all op types and both formats parse and render correctly.

radostrace and osdtrace stream one line per IO - thousands of rows. The existing analyzers turn that into numeric tables but draw nothing, and require knowing which flags to run. cephtrace_report is the zero-flag visual glance: it auto-detects the tool and prints bar charts plus a log-scale latency histogram so a capture's shape is understandable at sight.

For osdtrace the headline is a 'where the time goes' block - the share of total latency per stage (messenger/queue/osd/bluestore, with kv_commit broken out) - which turns the dense per-stage numbers into an instant 'it's RocksDB commit, not the network'. radostrace gets a read/write split, per-pool bars, and a slowest-acting-set culprit hint.

It deliberately does NOT add a second parser: the osdtrace path reuses analyze_osdtrace_output.parse_line and the radostrace path reuses analyze_radostrace_output.detect_file_format and the same column layout, so it can't drift from the analyzers. It complements them (and links to them for the deep-dive numbers) rather than replacing them.

Reads a file or stdin, and both the space format and the CSV that 'radostrace -o' writes. Pure stdlib; flake8-clean and pylint 10/10 (the snake_case filename is required for both the linter and the pytest import). Adds a committed small osdtrace sample and offline pytest smoke tests that run via the existing tox -e test in CI.

Docs: a new visualizing-output.md plus pointers from the two analyzer docs and the README.

Adds a browser-viewable companion to the terminal summary. cephtrace_report.py --html <file> writes a single self-contained .html (no CDN, no server, works offline - attach it to a ticket) with everything the terminal view shows plus:

- latency over the capture (p95 per slice, arrival order) - reveals workloads that degrade or run in bursts, which a terminal can't show
- hover tooltips on every bar; sortable per-OSD/per-pool and slowest-ops tables; click a group row to filter the histogram to that OSD or pool

The HTML/CSS/JS lives in a separate report_template.html (so the Python stays flake8/pylint clean - long JS lines in an embedded string would trip E501). Python precomputes compact aggregates (overall + per-group histograms, per-slice percentiles, group stats, top-50 slowest) and inlines them as JSON; all rendering is vanilla JS over that payload, so a ~116k-op capture is a ~20KB file. The osdtrace path reuses parse_line and the radostrace path the shared column layout - still no second parser. '<' in the payload is escaped so an object name can't break out of the <script>.

Refactor: histogram bucketing factored into _bucketize/_HIST_* shared by the terminal and HTML paths; parsing factored into _load. Terminal output is unchanged.

pytest covers HTML generation for both tools (payload parses, self-contained, per-group histograms present, --html requires a filename); flake8-clean, pylint 10/10. Docs updated with rendered screenshots.

taodd · 2026-06-13T08:05:38Z

Added: `--html` self-contained interactive report (Option C)

./tools/cephtrace_report.py --html report.html osd.log now writes a single self-contained .html (no CDN, no server, works offline — attach it to a ticket) alongside the terminal view.

Beyond what the terminal shows:

Latency over the capture — p95 per slice in arrival order, so a workload that degrades or runs in bursts shows up as a shape (a terminal can't do time-series)
Interactive — hover any bar for exact counts/latencies; sortable per-OSD/per-pool and slowest-ops tables; click a group row to filter the histogram to that OSD/pool
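The "latency over the capture" series can be sketched as follows. This assumes the samples are already in arrival order and that slices hold equal numbers of ops rather than equal time spans; the source only says "p95 per slice in arrival order", so the slicing rule, the default of 60 slices, and the function name are assumptions.

```python
# Illustrative sketch; the equal-count slicing rule is an assumption.
import math


def p95_per_slice(latencies, slices=60):
    """Split samples (arrival order) into equal-count chunks; return each chunk's p95."""
    if not latencies:
        return []
    size = math.ceil(len(latencies) / slices)  # ops per slice, at least 1
    series = []
    for start in range(0, len(latencies), size):
        chunk = sorted(latencies[start:start + size])
        rank = max(1, math.ceil(95 * len(chunk) / 100))  # nearest-rank p95
        series.append(chunk[rank - 1])
    return series


if __name__ == "__main__":
    # Made-up capture: fast first half, degraded second half.
    capture = [1.0] * 10 + [9.0] * 10
    print(p95_per_slice(capture, slices=2))
```

Because only one number per slice is kept, the series stays small no matter how many ops the capture holds, which fits the compact-payload approach described in the implementation notes.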

Implementation notes:

HTML/CSS/JS in a separate tools/report_template.html so the Python stays flake8/pylint-clean (embedded long JS lines would trip E501)
Python precomputes compact aggregates (overall + per-group histograms, per-slice percentiles, group stats, top-50 slowest) and inlines them as JSON — a ~116k-op capture is a ~20 KB file; all rendering is vanilla JS
still no second parser (reuses parse_line / shared column layout); < escaped in the payload so an object name can't break out of <script>
pytest covers HTML generation for both tools; flake8-clean, pylint 10/10
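The inline-JSON step with `<` escaping described above can be sketched like this. The `/*PAYLOAD*/` marker and the `inline_payload` name are hypothetical; the actual placeholder in tools/report_template.html is not shown in this PR text.

```python
# Illustrative sketch; the marker string and function name are assumptions.
import json


def inline_payload(template, payload, marker="/*PAYLOAD*/"):
    """Serialise payload as compact JSON and substitute it into the template.

    Every '<' is written as the JSON escape \\u003c, so a string such as
    '</script>' inside the data cannot terminate the enclosing script tag.
    """
    text = json.dumps(payload, separators=(",", ":"))
    text = text.replace("<", "\\u003c")
    return template.replace(marker, text)


if __name__ == "__main__":
    template = "<script>const DATA = /*PAYLOAD*/;</script>"
    print(inline_payload(template, {"object": "</script><b>evil"}))
```

The escaped form is still valid JSON and valid JavaScript, so the browser decodes it back to the original string without any extra unescaping step.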

taodd added 2 commits June 13, 2026 16:42

taodd wants to merge 2 commits into main from feature/visual-report
