tools: add cephtrace_report.py - a visual summary of trace output #140

Open
taodd wants to merge 2 commits into main from feature/visual-report

taodd commented Jun 13, 2026

Owner

Summary

radostrace and osdtrace stream one line per IO, which means thousands of rows per capture. The existing analyzers turn that into numeric tables but draw nothing, and you have to know which flags to pass. tools/cephtrace_report.py is the zero-flag visual summary: point it at a log (or pipe one in) and it auto-detects the tool, then prints bar charts plus a log-scale latency histogram so the shape of a capture is clear at a glance.

osdtrace (real output, ~116k ops)

where the time goes   (share of total latency, avg 4.1ms/op)
     messenger  ▏                                        0%
         queue  ███▋                                     9%
           osd  ▉                                        2%
     bluestore  ████████████████▌                        41%
     +kv_commit ██████████████▊                          37%  (rocksdb commit)

op latency distribution   p50 4.0ms   p95 9.5ms   p99 11.3ms   max 30.1ms
       <8ms  ████████████████████████████████████████ 39612
      <16ms  ██████████████████▍                      18164

The "where the time goes" block turns osdtrace's dense per-stage numbers into an instant verdict: "it's RocksDB commit, not the network or queue." A per-OSD p95 block appears when more than one OSD is captured, so an outlier OSD stands out directly.
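For readers curious how such a block can be drawn, here is a minimal sketch of turning per-stage latency totals into shares and rendering each share as a bar with eighth-block characters. The function names (`render_bar`, `stage_shares`, `format_shares`) and the sample numbers are hypothetical and chosen for illustration; the actual code in cephtrace_report.py may be organised differently.

```python
# Sketch only: names and numbers are illustrative, not taken from the PR.
EIGHTHS = " ▏▎▍▌▋▊▉"  # index i draws i/8 of a cell; index 0 is never drawn


def render_bar(fraction, width=40):
    """Return a bar `width` cells wide, filled to `fraction` (0.0 to 1.0)."""
    fraction = max(0.0, min(1.0, fraction))
    eighths = round(fraction * width * 8)
    full, rest = divmod(eighths, 8)
    bar = "█" * full + (EIGHTHS[rest] if rest else "")
    return bar.ljust(width)


def stage_shares(stage_totals):
    """Map each stage name to its share of the summed latency."""
    total = sum(stage_totals.values())
    if total == 0:
        return {name: 0.0 for name in stage_totals}
    return {name: value / total for name, value in stage_totals.items()}


def format_shares(stage_totals, width=40):
    """Build one right-aligned 'name  bar  NN%' line per stage."""
    lines = []
    for name, share in stage_shares(stage_totals).items():
        lines.append(f"{name:>12}  {render_bar(share, width)} {share:4.0%}")
    return lines


if __name__ == "__main__":
    # Made-up totals in milliseconds, purely to exercise the functions.
    for line in format_shares({"queue": 90.0, "osd": 20.0, "bluestore": 410.0}):
        print(line)
```

Scaling each bar by its share of the total (rather than by the largest value) keeps the bar lengths directly comparable with the percentages printed next to them.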

radostrace (real output)

radostrace summary - 353 client ops
    reads      152  ██████████████████████████████▎          43%
    writes     201  ████████████████████████████████████████ 56%

latency distribution   p50 3.5ms   p95 15.6ms   p99 19.4ms   max 19.7ms

Plus per-pool bars and a slowest-acting-set culprit hint.
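The latency lines above combine a log-scale histogram with a handful of percentiles. A minimal sketch of both follows, assuming power-of-two bucket bounds (as the `<8ms` / `<16ms` labels suggest) and nearest-rank percentiles; the function names are hypothetical and the report's own bucketing helpers may use different bounds or a different percentile definition.

```python
import math
from collections import Counter


def bucket_upper_bound(latency_ms):
    """Smallest power of two (in ms) strictly greater than the latency."""
    bound = 1
    while latency_ms >= bound:
        bound *= 2
    return bound


def histogram(latencies_ms):
    """Return sorted (upper_bound_ms, count) pairs for non-empty buckets."""
    counts = Counter(bucket_upper_bound(v) for v in latencies_ms)
    return sorted(counts.items())


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    sample = sorted([0.5, 3.0, 7.9, 8.0, 12.0])  # made-up latencies in ms
    for bound, count in histogram(sample):
        print(f"<{bound}ms  {count}")
    print("p50", percentile(sample, 50), "max", sample[-1])
```

Doubling bucket widths keep the histogram short even when the slowest op is orders of magnitude slower than the median, which is the usual reason to prefer a log scale for latency.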

Design: reuse, don't duplicate

It deliberately does not add a second parser. Having read both existing analyzers end to end, this tool:

  • reuses analyze_osdtrace_output.parse_line (handles op_r/op_w/subop_r/subop_w, optional peers, optional bluestore details)
  • reuses analyze_radostrace_output.detect_file_format and the same column layout

so it can't drift from them. It complements them - it's the visual glance; the analyzers remain for the deep-dive numbers (percentile tables, -i stage contribution, iterative host-mapped culprit ranking), and the report links to each.

Quality / CI

  • Pure Python 3 stdlib, no dependencies.
  • flake8-clean, pylint 10.00/10. (The filename is snake_case, which the lint naming rule requires and which lets pytest import it; this matches every other tool in tools/.)
  • Offline pytest smoke tests over a small committed osdtrace sample + the existing radostrace samples, covering both formats, the no-rows error path, --help, and run-as-script. These run automatically via the existing tox -e test in the "Run Tests" workflow.
  • Bold headings only on a TTY, so piping/redirecting stays clean.
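The TTY-only bold headings can be sketched as follows. This is an illustration of the general technique using `isatty()` and ANSI escape codes; the helper name `heading` is hypothetical and the tool's own implementation may differ.

```python
import sys

BOLD = "\033[1m"   # ANSI escape: start bold
RESET = "\033[0m"  # ANSI escape: reset attributes


def heading(text, stream=None):
    """Return `text` in bold only when `stream` is an interactive terminal."""
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        return f"{BOLD}{text}{RESET}"
    return text


if __name__ == "__main__":
    # Bold in a terminal; plain text when piped or redirected to a file.
    print(heading("latency distribution"))
```

Checking the stream rather than an environment variable means `report.py log > out.txt` yields a file free of escape sequences without any extra flag.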

Docs

New doc/visualizing-output.md (with real example output), plus pointers from analyze-osdtrace.md, analyze-radostrace.md, and the README.

Verified

Run against real captures from the test VMs (single-OSD squid, 3-OSD tentacle, in both the space format and the CSV format that -o writes); all op types and both formats parse and render correctly.

taodd added 2 commits June 13, 2026 16:42
radostrace and osdtrace stream one line per IO - thousands of rows. The
existing analyzers turn that into numeric tables but draw nothing, and
require knowing which flags to run. cephtrace_report is the zero-flag
visual glance: it auto-detects the tool and prints bar charts plus a
log-scale latency histogram so a capture's shape is understandable at
sight.

For osdtrace the headline is a 'where the time goes' block - the share
of total latency per stage (messenger/queue/osd/bluestore, with
kv_commit broken out) - which turns the dense per-stage numbers into an
instant 'it's RocksDB commit, not the network'. radostrace gets a
read/write split, per-pool bars, and a slowest-acting-set culprit hint.

It deliberately does NOT add a second parser: the osdtrace path reuses
analyze_osdtrace_output.parse_line and the radostrace path reuses
analyze_radostrace_output.detect_file_format and the same column layout,
so it can't drift from the analyzers. It complements them (and links to
them for the deep-dive numbers) rather than replacing them. Reads a file
or stdin, and both the space format and the CSV that 'radostrace -o'
writes.

Pure stdlib; flake8-clean and pylint 10/10 (the snake_case filename is
required for both the linter and the pytest import). Adds a committed
small osdtrace sample and offline pytest smoke tests that run via the
existing tox -e test in CI. Docs: a new visualizing-output.md plus
pointers from the two analyzer docs and the README.

Adds a browser-viewable companion to the terminal summary. cephtrace_report.py
--html <file> writes a single self-contained .html (no CDN, no server, works
offline - attach it to a ticket) with everything the terminal view shows plus:

- latency over the capture (p95 per slice, arrival order) - reveals workloads
  that degrade or run in bursts, which a terminal can't show
- hover tooltips on every bar; sortable per-OSD/per-pool and slowest-ops
  tables; click a group row to filter the histogram to that OSD or pool

The HTML/CSS/JS lives in a separate report_template.html (so the Python stays
flake8/pylint clean - long JS lines in an embedded string would trip E501).
Python precomputes compact aggregates (overall + per-group histograms,
per-slice percentiles, group stats, top-50 slowest) and inlines them as JSON;
all rendering is vanilla JS over that payload, so a ~116k-op capture is a ~20KB
file. The osdtrace path reuses parse_line and the radostrace path the shared
column layout - still no second parser. '<' in the payload is escaped so an
object name can't break out of the <script>.

Refactor: histogram bucketing factored into _bucketize/_HIST_* shared by the
terminal and HTML paths; parsing factored into _load. Terminal output is
unchanged.

pytest covers HTML generation for both tools (payload parses, self-contained,
per-group histograms present, --html requires a filename); flake8-clean,
pylint 10/10. Docs updated with rendered screenshots.

taodd commented Jun 13, 2026

Owner Author

Added: --html self-contained interactive report (Option C)

./tools/cephtrace_report.py --html report.html osd.log now writes a single self-contained .html (no CDN, no server, works offline — attach it to a ticket) alongside the terminal view.

Beyond what the terminal shows:

  • Latency over the capture — p95 per slice in arrival order, so a workload that degrades or runs in bursts shows up as a shape (a terminal can't do time-series)
  • Interactive — hover any bar for exact counts/latencies; sortable per-OSD/per-pool and slowest-ops tables; click a group row to filter the histogram to that OSD/pool
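The "p95 per slice in arrival order" series can be sketched like this: split the latencies, in the order they were captured, into roughly equal slices and take the nearest-rank p95 of each. The slice count of 60 and the function names are assumptions made for illustration; the PR does not state how many slices the report uses.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def p95_per_slice(latencies_ms, slices=60):
    """Return one p95 per slice, keeping the input (arrival) order."""
    count = len(latencies_ms)
    if count == 0:
        return []
    slices = min(slices, count)  # never create an empty slice
    series = []
    for i in range(slices):
        start = i * count // slices
        end = (i + 1) * count // slices
        series.append(p95(latencies_ms[start:end]))
    return series


if __name__ == "__main__":
    # Made-up capture that degrades halfway through.
    print(p95_per_slice([1, 1, 1, 1, 9, 9, 9, 9], slices=2))
```

Because only one number per slice is kept, the series stays small no matter how many ops the capture contains, which fits the compact-payload goal described in the implementation notes.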

[Screenshot: osdtrace HTML report]

Implementation notes:

  • HTML/CSS/JS in a separate tools/report_template.html so the Python stays flake8/pylint-clean (embedded long JS lines would trip E501)
  • Python precomputes compact aggregates (overall + per-group histograms, per-slice percentiles, group stats, top-50 slowest) and inlines them as JSON — a ~116k-op capture is a ~20 KB file; all rendering is vanilla JS
  • still no second parser (reuses parse_line / shared column layout); < escaped in the payload so an object name can't break out of <script>
  • pytest covers HTML generation for both tools; flake8-clean, pylint 10/10
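The inline-JSON approach with `<` escaped can be sketched as below. The placeholder marker `/*PAYLOAD*/` and both function names are hypothetical; how report_template.html actually marks the insertion point is not shown in this PR text.

```python
import json

PLACEHOLDER = "/*PAYLOAD*/"  # hypothetical marker, not the real template's


def encode_payload(data):
    """Serialise `data` as compact JSON that is safe inside <script>."""
    text = json.dumps(data, separators=(",", ":"))
    # In valid JSON "<" can only occur inside strings, so swapping it for
    # the \u003c escape keeps the JSON equivalent while ensuring that a
    # value such as "</script>" cannot close the surrounding tag.
    return text.replace("<", "\\u003c")


def render_html(template, data):
    """Inline the encoded payload into the template text."""
    return template.replace(PLACEHOLDER, encode_payload(data))


if __name__ == "__main__":
    page = render_html(
        "<script>const DATA = /*PAYLOAD*/;</script>",
        {"slowest": [{"object": "</script><b>x", "ms": 30.1}]},
    )
    print(page)
```

The escaped payload parses back to the original data with any JSON parser, so the JavaScript side needs no special handling.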
