Add: AICore receive_time DFX field + swimlane phase model cleanup#1004
Add: AICore receive_time DFX field + swimlane phase model cleanup#1004hw-native-sys-bot wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR adds ChangesAICore Task Receive-to-Start Timing Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a v3 schema for L2 swimlane profiling to split the head overhead into AICPU-to-AICore NoC propagation and AICore-local setup (dcci + ack) costs. This is achieved by capturing a new receive_time timestamp on the AICore before the ack write, storing it as a 32-bit delta (receive_to_start_cycles) in the L2SwimlaneAicoreTaskRecord struct, and updating the host collector, python converter, and validation tests to process and display these new metrics. Feedback on the changes highlights a potential issue where the derived receive_time can precede the calculated base_time_cycles, resulting in negative relative timestamps. A code suggestion is provided to track start_cycles - r2s_cycles in the base time calculation to prevent this.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for row in aicore_rows: | ||
| # Column count varies (v2: 5, v3: 6); only the timing columns matter | ||
| # for base_time tracking. | ||
| _track(int(row[3])) | ||
| _track(int(row[4])) |
There was a problem hiding this comment.
Since receive_time is derived as start_time - receive_to_start_cycles, it can precede the calculated base_time_cycles if the earliest task's start_time is used as the base. This would result in a negative receive_time_us timestamp, which can cause rendering or validation issues in downstream trace analysis tools that expect non-negative relative timestamps.
To prevent negative timestamps, we should track the calculated receive_time (i.e., start_cycles - r2s_cycles) in the base_time_cycles calculation instead of just start_cycles.
| for row in aicore_rows: | |
| # Column count varies (v2: 5, v3: 6); only the timing columns matter | |
| # for base_time tracking. | |
| _track(int(row[3])) | |
| _track(int(row[4])) | |
| for row in aicore_rows: | |
| # Column count varies (v2: 5, v3: 6); only the timing columns matter | |
| # for base_time tracking. | |
| start_cycles = int(row[3]) | |
| r2s_cycles = int(row[5]) if len(row) > 5 else 0 | |
| _track(start_cycles - r2s_cycles) | |
| _track(int(row[4])) |
e4bdf5a to
99b5088
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@simpler_setup/tools/swimlane_converter.py`:
- Around line 103-119: Replace the Unicode minus sign (U+2212) with a standard
ASCII hyphen-minus (U+002D) in the module docstring where the expression
"start_time − receive_time" appears; locate the string in
simpler_setup/tools/swimlane_converter.py (the docstring describing aicore_tasks
v3 schema) and change "−" to "-" so the text reads "start_time - receive_time"
to avoid RUF002/clipboard issues.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 62ad57d5-5d19-498f-9096-412b834f6bc8
📒 Files selected for processing (12)
simpler_setup/tools/swimlane_converter.pysrc/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.hsrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/shared/host/l2_swimlane_collector.cppsrc/a2a3/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cppsrc/a5/platform/include/aicore/l2_swimlane_collector_aicore.hsrc/a5/platform/include/common/l2_swimlane_profiling.hsrc/a5/platform/shared/host/l2_swimlane_collector.cppsrc/a5/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpptests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py
| "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles, | ||
| end_cycles, receive_to_start_cycles], ...], | ||
| "aicpu_tasks": [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...], | ||
| "aicpu_scheduler_phases": [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ], | ||
| "aicpu_orchestrator_phases": [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ] | ||
| } | ||
|
|
||
| aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles | ||
| is a uint32 delta = AICore-side `start_time − receive_time`, where | ||
| receive_time is captured immediately after AICore's | ||
| `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task | ||
| dcci + ack pair). Lets DFX split per-task head_OH into the | ||
| AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware- | ||
| bound) and the AICore-local dcci + ack cost (receive_time → start_time, | ||
| software-tunable). Archived v2 JSON without this column still parses; | ||
| the field is exposed as 0 for those. | ||
|
|
There was a problem hiding this comment.
Fix Unicode minus sign in docstring.
Line 111 contains a Unicode MINUS SIGN (−, U+2212) instead of HYPHEN-MINUS (-, U+002D) in the expression start_time − receive_time. This can cause copy-paste issues and is flagged by Ruff RUF002.
🔧 Proposed fix
- is a uint32 delta = AICore-side `start_time − receive_time`, where
+ is a uint32 delta = AICore-side `start_time - receive_time`, where📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles, | |
| end_cycles, receive_to_start_cycles], ...], | |
| "aicpu_tasks": [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...], | |
| "aicpu_scheduler_phases": [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ], | |
| "aicpu_orchestrator_phases": [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ] | |
| } | |
| aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles | |
| is a uint32 delta = AICore-side `start_time − receive_time`, where | |
| receive_time is captured immediately after AICore's | |
| `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task | |
| dcci + ack pair). Lets DFX split per-task head_OH into the | |
| AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware- | |
| bound) and the AICore-local dcci + ack cost (receive_time → start_time, | |
| software-tunable). Archived v2 JSON without this column still parses; | |
| the field is exposed as 0 for those. | |
| "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles, | |
| end_cycles, receive_to_start_cycles], ...], | |
| "aicpu_tasks": [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...], | |
| "aicpu_scheduler_phases": [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ], | |
| "aicpu_orchestrator_phases": [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ] | |
| } | |
| aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles | |
| is a uint32 delta = AICore-side `start_time - receive_time`, where | |
| receive_time is captured immediately after AICore's | |
| `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task | |
| dcci + ack pair). Lets DFX split per-task head_OH into the | |
| AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware- | |
| bound) and the AICore-local dcci + ack cost (receive_time → start_time, | |
| software-tunable). Archived v2 JSON without this column still parses; | |
| the field is exposed as 0 for those. |
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 111-111: Docstring contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@simpler_setup/tools/swimlane_converter.py` around lines 103 - 119, Replace
the Unicode minus sign (U+2212) with a standard ASCII hyphen-minus (U+002D) in
the module docstring where the expression "start_time − receive_time" appears;
locate the string in simpler_setup/tools/swimlane_converter.py (the docstring
describing aicore_tasks v3 schema) and change "−" to "-" so the text reads
"start_time - receive_time" to avoid RUF002/clipboard issues.
dafc19c to
3ebad46
Compare
Core change ----------- Capture per-task `receive_time` on AICore immediately after `read_reg(DATA_MAIN_BASE)` returns a new task_id, BEFORE the per-task `dcci(payload, ENTIRE_DATA_CACHE) + write_reg(COND, ACK)` pair that precedes start_time. Stored as a 32-bit delta `start_time - receive_time` in the AICore record (size unchanged, 2 records per cache line). Splits the per-task head_OH into two physically distinct halves so DFX can attribute each: - `propagation_us` = receive_time - dispatch_ts: AICPU->AICore-ready delivery (NoC + any speculation overshoot). - `local_setup_us` = start_time - receive_time: AICore-local critical-path prep (dcci + ack on the common path; ack-only on the speculative-hit path). Semantic alignment with speculative early-dispatch (hw-native-sys#1079) --------------------------------------------------------- For not_ready==1 (speculative pre-staging) the dcci ran during the dependency-wait spin, off the critical path. receive_time is re-stamped at the moment the doorbell match exits the gate, so propagation absorbs both the original NoC delivery AND any speculation overshoot, while local_setup stays the pure ack-on-critical-path cost. Common-path emits unchanged. Visualization ------------- - Setup bar at level 1: swimlane_converter emits a `setup` sub-bar (ts=receive_time, dur=local_setup_us) directly before each kernel bar on Worker View. >=1-cycle filter suppresses warm-cache zero-width bars. Base_time tracking includes receive_time so the first cold task no longer renders at a negative offset. - Flow arrows (deps.json -> Worker View) now land at receive_time instead of start_time so the gap between arrow tip and kernel start visually equals local_setup. - New `--enable-swimlane-overhead` flag (pytest + standalone + conftest + L3 subprocess forwarding) opt-in for the 8 Overhead Analysis counter tracks from PR hw-native-sys#1039. Scheduler-phase model cleanup ----------------------------- - Drop PR hw-native-sys#1079 debug overlays: `Scan` (per-pass MMIO COND scan) and `Poll` (activity-fill attribution) were emitted at level>=3 but carried no actionable signal — "scheduler is polling when there's nothing to do" is the steady state, not a finding. Total Perfetto event count on qwen3 level=4 drops 51,381 -> 2,952. - `Prestage` -> `EarlyDispatch` (PR hw-native-sys#1079 internal jargon -> the feature name from its PR title). enum value, function name (`try_speculative_early_dispatch`), queue field, and converter color key all renamed consistently. - `Fanout` -> `Resolve`. The phase covers `on_task_complete`'s consumer-release walk (decrement consumer fanin, push newly-ready, ring speculative doorbells). "Fanout" overloaded the graph-theory term; "Resolve" names the action. - `on_mixed_task_complete` -> `on_task_complete`. The function fires for every task that completes (MIX or single-subtask), the "mixed" modifier was historical. - `bool mixed_complete` -> `task_complete` at the caller. - New `DummyTask` phase kind, emitted once per dummy in `dummy_drain`. Converter routes it to Worker View pid=4 DUMMY_T{thread} lanes so DAG fence/barrier nodes (no AICore presence) are visually present. The accompanying Resolve bar covers the consumer-release work. - Resolve emit >=1µs filter: drops the ~88% of tasks whose consumer-release walk is sub-microsecond, leaving only the broadcast/reduction Resolves that carry signal. `tasks_processed` now carries the real consumer-walk count plumbed back from `on_task_complete` (non-PROFILING return type uint32_t; PROFILING uses `CompletionStats::fanout_edges`). - Resolve emit from dummy_drain too (previously a measurement blind spot — work happened, no bar). pid renumber + Process rename ----------------------------- pid is now in pipeline order (top -> bottom in Perfetto), with sort_index == pid for self-evident layout: pid=1 AICPU Orchestrator submit envelope (earliest) pid=2 AICPU Scheduler Complete/Dispatch/Release/Resolve/EarlyDispatch pid=3 Scheduler View AICPU-eye dispatch->finish per worker pid=4 Worker View AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest) Was: pid=1 AICore View / pid=2 AICPU View — renamed to Worker View / Scheduler View since AICPU also serves as worker for dummy tasks. No compat shim for old captures. Enum cleanup ------------ `L2SwimlaneSchedPhaseKind` collapsed from 8 (with reserved Poll/Scan) to 6 sequential: 0 Complete, 1 Dispatch, 2 Release, 3 Resolve, 4 EarlyDispatch, 5 DummyTask. a5 mirror --------- Renames mirrored to a5 tensormap_and_ringbuffer (Fanout/Resolve, on _task_complete, mixed_complete) for cross-arch symmetry. a5 has no PR hw-native-sys#1079 speculative path so the not_ready re-stamp and Scan/Poll/ Prestage cleanup are a2a3-only. Hot-path cost ------------- - One extra get_sys_cnt_aicore() per task in the AICore executor's task-arrived branch (cycle MSR read, negligible vs the existing per-task dcci(payload, ENTIRE_DATA_CACHE)). - 1µs Resolve filter trades phase records for signal — saves emit bandwidth at level>=3. Verified -------- - a2a3 onboard build + a5 onboard build via `pip install` - ST: TestL2Swimlane + TestL2SwimlaneMixed + TestDummyTask + `--enable-swimlane-overhead` opt-in path - qwen3 decode_layer level=1 and level=4 end-to-end PASS, swimlane artifacts reviewed (setup bars + Resolve bars + DUMMY lanes visible where expected; Scan/Poll bars absent) - swimlane_converter `pid` mapping in output matches spec
3ebad46 to
5bebbd2
Compare
Summary
Adds the AICore-side
receive_timeDFX field that splits per-taskhead_OHinto AICPU→AICore-ready propagation and AICore-localcritical-path prep, and lands the surrounding swimlane phase model
cleanup that makes the new data legible:
dcci + ack; storedas 32-bit
start_time − receive_timedelta in the existing AICorerecord (size unchanged).
setupbar auto-emitted on Worker View at level 1 already(no need for level 3+), directly before each kernel bar.
receive_timeinstead ofstart_timesothe arrow→kernel-start gap equals
local_setupvisually.receive_timeis re-stamped after the gate match so propagation absorbs the
speculation overshoot and
local_setupstays the pureack-on-critical-path cost on both paths.
enum cleanup (see below).
layout matches pipeline order.
barrier nodes (no AICore presence) are visually present.
Why
Before this PR, head_OH lumped together NoC propagation latency
(hardware-bound, unfixable in software) and AICore-local dcci+ack
cost (software-tunable). Any "make head_OH smaller" investigation had
to guess which half it was targeting. With the split, the cold/warm
distribution is directly measurable from a single capture, and the
setupbar surfaces the cold/warm pattern at level 1 with noadditional instrumentation.
The follow-on cleanup landed in the same PR because the same
visualisations get used together — Resolve filtering, DummyTask
visibility, EarlyDispatch rename and Scan/Poll removal are all about
making the level-3/4 swimlane trace usable as a primary diagnosis tool
rather than a debug-only artifact.
What changed
Core: receive_time DFX field
AICore executorcapturesreceive_time = get_sys_cnt_aicore()just after
read_reg(DATA_MAIN_BASE)returns the new task_id,BEFORE
dcci(payload, ENTIRE_DATA_CACHE)andwrite_reg(COND, MAKE_ACK_VALUE).L2SwimlaneAicoreTaskRecordgainsreceive_to_start_cyclesinthe existing 4-byte tail, no size change, no extra cache line.
swimlane_converter.pyexposesreceive_time_us,local_setup_us, and (when AICPU records are joined)propagation_usper task.Speculative early-dispatch (#1079) semantic alignment
For
exec_payload->not_ready == 1(a task was staged before itsdependencies resolved), the dcci ran during the doorbell-wait spin,
off the critical path.
receive_timeis re-stamped at the momentthe gate-wait exits on doorbell match, so:
propagation_us=receive_time − dispatch_tsabsorbs theoriginal NoC delivery AND any speculation overshoot.
local_setup_usstays the pure ack-on-critical-path cost onboth paths (common path: dcci + ack; spec path: ack only — dcci
already hidden behind the gate-wait).
The common-path emit code is unchanged.
Setup bar on Worker View
swimlane_converter.pyemits asetupPerfetto event before eachkernel bar (same tid, name =
setup, ts = receive_time,dur = local_setup) at level 1. Cycle-or-shorter intervals are
filtered out (avoid invisible 0-width bars on warm cache).
Base_time tracking now includes the receive_time anchor so the
first cold task doesn't render at a negative offset.
Flow arrow target → receive_time
dependencyandhb_violationflow arrows on Worker View now landat
receive_time_usinstead ofstart_time_us, so the gap betweenarrow tip and kernel start equals
local_setupvisually. Fallsback to
start_time_usfor old captures without a v3 receive_time.--enable-swimlane-overheadopt-in flagAdds the 8 Overhead Analysis counter tracks (per-engine
idle/ready/overhead + system all/has overhead) from PR #1039 to the
swimlane JSON when requested. Plumbed through pytest (
conftest.py),standalone (
simpler_setup/scene_test.pyargparse),run_class_cases,_convert_case_swimlane,_run_swimlane_converter,and the L3 subprocess forwarding so pytest and standalone share the
same surface.
Scheduler-phase model cleanup
Drop PR feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell) #1079 debug overlays:
Scan(per-pass MMIO CONDscan) and
Poll(activity-fill attribution) were emitted atlevel >= SCHED_PHASESbut carried no actionable signal —"scheduler is polling when there's nothing to do" is the steady
state, not a finding. Total Perfetto event count on qwen3 level=4:
51,381 → 2,952 (17× reduction).
flush_activity_fill,fill_kind/fill_start/fill_endfields, and the per-iterationscan_cores++instrumentation all removed.Terminology unification: four renames touching enum, function,
variable, comment, and converter color keys:
Prestage/try_speculative_prestage/prestage_queueEarlyDispatch/try_speculative_early_dispatch/early_dispatch_queueFanout(phase)Resolveon_mixed_task_completeon_task_completebool mixed_completetask_completeDummyTask phase + Worker View DUMMY lanes: new
L2SwimlaneSchedPhaseKind::DummyTaskemitted once per dummy indummy_drain(1-cycle wide,tasks_processed= task_token rawlow 32 bits). Converter routes it to Worker View pid=4
DUMMY_T{thread} lanes so DAG fence / barrier nodes (no AICore
presence) are visually present. The accompanying Resolve bar on
the Scheduler track covers the consumer-release work that follows.
Resolve quality:
>= 1 µsfilter drops the ~88% of tasks whoseconsumer-release walk is sub-microsecond — leaving only the
broadcast / reduction Resolves that carry signal.
tasks_processednow carries the real consumer-walk count plumbed back from
on_task_complete(non-PROFILING return typeuint32_t;PROFILING uses
CompletionStats::fanout_edges). Resolve is alsoemitted from the
dummy_drainpath — previously a measurementblind spot (work happened, no bar).
pid renumber + Process rename
pid now reflects pipeline order (top → bottom in Perfetto), with
sort_index == pidfor self-evident layout:Was:
pid=1 AICore View/pid=2 AICPU View. Renamed toWorker View / Scheduler View since AICPU also serves as
worker for dummy tasks. No compat shim for old captures.
Enum cleanup
L2SwimlaneSchedPhaseKindcollapsed from 8 (with reservedPoll = 2/Scan = 5for legacy capture compat) to 6 sequential:a5 mirror
Renames mirrored to
a5 tensormap_and_ringbuffer:Fanout→Resolve,on_mixed_task_complete→on_task_complete,bool mixed_complete→bool task_complete. a5 has no PR #1079speculative path so the
not_readyre-stamp and Scan/Poll/Prestagecleanup are a2a3-only.
Hot-path cost
get_sys_cnt_aicore()per task in the AICore executor'stask-arrived branch (single-cycle MSR read; negligible vs the
existing per-task
dcci(payload, ENTIRE_DATA_CACHE)that alreadycosts ~50 cycles cold / ~0 warm).
not_ready == 1path: one additional MSR read at the gate-exit(only fires on speculative-staged tasks).
saves emit + storage at level ≥ 3.
Test plan
pip install --no-build-isolation .TestL2Swimlane+TestL2SwimlaneMixed+TestDummyTask(3/3)--enable-swimlane-overheadopt-in path verified (37 counter events when on; 0 when off)scan/pollevents in current capturespid=1Orchestrator @ sort_index 1 →pid=4Worker View @ sort_index 4)Schema notes
aicore_tasksJSON tuples grew from 5 to 6 columns (v3). Theconverter accepts both via
*restunpack — archived v2 JSON loadswith
receive_to_start_cyclesdefaulting to 0, no breakage.aicpu_scheduler_phasesrecords now includedummy_taskandresolvekinds; the converter recognises both. Old captures withscan/poll/fanout/prestagestrings still parse — thosekinds are reported as-is but the runtime no longer emits them.
.jsoncaptures opened in thecurrent converter (or vice-versa) will display with wrong process
names. Acceptable per the "latest tools display correctly" policy.