Add: AICore receive_time DFX field + swimlane phase model cleanup by hw-native-sys-bot · Pull Request #1004 · hw-native-sys/simpler

hw-native-sys-bot · 2026-06-08T04:14:32Z

Summary

Adds the AICore-side receive_time DFX field that splits per-task
head_OH into AICPU→AICore-ready propagation and AICore-local
critical-path prep, and lands the surrounding swimlane phase model
cleanup that makes the new data legible:

- receive_time field captured before per-task dcci + ack; stored
  as a 32-bit start_time − receive_time delta in the existing AICore
  record (size unchanged).
- setup bar auto-emitted on Worker View already at level 1
  (no need for level 3+), directly before each kernel bar.
- Flow arrows land at receive_time instead of start_time so
  the arrow→kernel-start gap equals local_setup visually.
- Speculative early-dispatch (#1079) interaction: receive_time
  is re-stamped after the gate match so propagation absorbs the
  speculation overshoot and local_setup stays the pure
  ack-on-critical-path cost on both paths.
- Scheduler-phase noise cleanup + terminology unification +
  enum cleanup (see below).
- Worker View / Scheduler View rename + pid renumber so trace
  layout matches pipeline order.
- DummyTask phase + Worker View DUMMY_T lanes so DAG fence /
  barrier nodes (no AICore presence) are visually present.

Why

Before this PR, head_OH lumped together NoC propagation latency
(hardware-bound, unfixable in software) and AICore-local dcci+ack
cost (software-tunable). Any "make head_OH smaller" investigation had
to guess which half it was targeting. With the split, the cold/warm
distribution is directly measurable from a single capture, and the
setup bar surfaces the cold/warm pattern at level 1 with no
additional instrumentation.

The follow-on cleanup landed in the same PR because the same
visualisations get used together — Resolve filtering, DummyTask
visibility, EarlyDispatch rename and Scan/Poll removal are all about
making the level-3/4 swimlane trace usable as a primary diagnosis tool
rather than a debug-only artifact.

What changed

Core: receive_time DFX field

- AICore executor captures receive_time = get_sys_cnt_aicore()
  just after read_reg(DATA_MAIN_BASE) returns the new task_id,
  BEFORE dcci(payload, ENTIRE_DATA_CACHE) and
  write_reg(COND, MAKE_ACK_VALUE).
- L2SwimlaneAicoreTaskRecord gains receive_to_start_cycles in
  the existing 4-byte tail: no size change, no extra cache line.
- swimlane_converter.py exposes receive_time_us,
  local_setup_us, and (when AICPU records are joined)
  propagation_us per task.
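
As a rough illustration of how the three per-task values relate, here is a minimal Python sketch. It is not the converter's actual code: the function name `derive_head_overhead`, the `CYCLES_PER_US` constant, and the assumption that AICPU and AICore timestamps share one counter time base are all placeholders introduced for this example.

```python
from typing import Optional

# Assumption: counter ticks per microsecond. The real frequency is
# platform-specific and is not stated in this PR; 50.0 is a placeholder.
CYCLES_PER_US = 50.0


def derive_head_overhead(
    start_cycles: int,
    receive_to_start_cycles: int,
    base_cycles: int,
    dispatch_cycles: Optional[int] = None,
) -> dict:
    """Split one task's head overhead into propagation and local setup."""
    # The record stores only the 32-bit delta, so receive_time is recovered
    # by subtracting the delta from start_time.
    receive_cycles = start_cycles - (receive_to_start_cycles & 0xFFFFFFFF)

    metrics = {
        "receive_time_us": (receive_cycles - base_cycles) / CYCLES_PER_US,
        "local_setup_us": (start_cycles - receive_cycles) / CYCLES_PER_US,
        "propagation_us": None,
    }
    # propagation needs the AICPU dispatch timestamp, which only exists
    # when AICPU records were joined with the AICore record.
    if dispatch_cycles is not None:
        metrics["propagation_us"] = (
            receive_cycles - dispatch_cycles
        ) / CYCLES_PER_US
    return metrics
```

For example, `derive_head_overhead(1000, 100, 500, dispatch_cycles=800)` places receive_time at cycle 900, so both the local setup and the propagation come out as 100 cycles under the placeholder frequency.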

Speculative early-dispatch (#1079) semantic alignment

For exec_payload->not_ready == 1 (a task was staged before its
dependencies resolved), the dcci ran during the doorbell-wait spin,
off the critical path. receive_time is re-stamped at the moment
the gate-wait exits on doorbell match, so:

- propagation_us = receive_time − dispatch_ts absorbs the
  original NoC delivery AND any speculation overshoot.
- local_setup_us stays the pure ack-on-critical-path cost on
  both paths (common path: dcci + ack; spec path: ack only, since
  dcci is already hidden behind the gate-wait).

The common-path emit code is unchanged.
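
The stamping rule can be modelled in a few lines of Python. This is only a host-side model of the semantics described above, not the AICore executor code; `stamp_receive_time`, `split_head_overhead`, and their parameter names are hypothetical.

```python
from typing import Optional, Tuple


def stamp_receive_time(
    task_seen_cycles: int,
    not_ready: bool,
    gate_exit_cycles: Optional[int] = None,
) -> int:
    """Return the counter value to record as receive_time for one task."""
    if not not_ready:
        # Common path: stamp right after the new task_id is observed.
        return task_seen_cycles
    if gate_exit_cycles is None:
        raise ValueError("speculative path needs the gate-exit timestamp")
    # Speculative path: re-stamp when the doorbell match releases the gate,
    # so the time spent waiting is attributed to propagation.
    return gate_exit_cycles


def split_head_overhead(
    dispatch_cycles: int, receive_cycles: int, start_cycles: int
) -> Tuple[int, int]:
    """Return (propagation, local_setup) in cycles."""
    return receive_cycles - dispatch_cycles, start_cycles - receive_cycles
```

With a task dispatched at cycle 100, first seen at 150, released by the doorbell at 400, and started at 410, the speculative path yields a propagation of 300 cycles and a local setup of 10 cycles. Without the re-stamp the same task would report a local setup of 260 cycles, most of which is gate-wait rather than critical-path work.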

Setup bar on Worker View

swimlane_converter.py emits a setup Perfetto event before each
kernel bar (same tid, name = setup, ts = receive_time,
dur = local_setup) at level 1. Cycle-or-shorter intervals are
filtered out (avoid invisible 0-width bars on warm cache).
Base_time tracking now includes the receive_time anchor so the
first cold task doesn't render at a negative offset.
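
A minimal sketch of such an event, using the Chrome/Perfetto JSON trace "complete" event shape (`ph: "X"`). The helper name `make_setup_event` and the `one_cycle_us` threshold parameter are assumptions for illustration; the converter's real filter and field plumbing may differ.

```python
from typing import Optional


def make_setup_event(
    tid: int,
    receive_time_us: float,
    local_setup_us: float,
    one_cycle_us: float,
    pid: int = 4,  # Worker View in the renumbered layout
) -> Optional[dict]:
    """Build the `setup` bar that precedes a kernel bar, or None if filtered."""
    # Intervals of one cycle or less are dropped so that warm-cache tasks
    # do not produce invisible zero-width bars.
    if local_setup_us <= one_cycle_us:
        return None
    return {
        "name": "setup",
        "ph": "X",  # complete event: timestamp plus duration
        "pid": pid,
        "tid": tid,  # same lane as the kernel bar
        "ts": receive_time_us,
        "dur": local_setup_us,
    }
```

Because `ts + dur` equals the kernel's start time, the bar ends exactly where the kernel bar begins on the same lane.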

Flow arrow target → receive_time

dependency and hb_violation flow arrows on Worker View now land
at receive_time_us instead of start_time_us, so the gap between
arrow tip and kernel start equals local_setup visually. Falls
back to start_time_us for old captures without a v3 receive_time.
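
The fallback can be sketched as follows. The task dictionary keys `tid` and `end_time_us`, and both helper names, are hypothetical; the `s`/`f` phases and the `bp` field are the standard flow-event fields of the JSON trace format.

```python
from typing import List


def flow_target_ts(task: dict) -> float:
    """Timestamp where an incoming arrow should land on this task."""
    ts = task.get("receive_time_us")
    # Old captures carry no receive_time, so fall back to the kernel start.
    return task["start_time_us"] if ts is None else ts


def make_flow_pair(
    flow_id: int,
    producer: dict,
    consumer: dict,
    name: str = "dependency",
    pid: int = 4,
) -> List[dict]:
    """Build the start/finish flow events for one producer->consumer edge."""
    start = {
        "name": name, "cat": "flow", "ph": "s", "id": flow_id,
        "pid": pid, "tid": producer["tid"], "ts": producer["end_time_us"],
    }
    finish = {
        "name": name, "cat": "flow", "ph": "f", "bp": "e", "id": flow_id,
        "pid": pid, "tid": consumer["tid"], "ts": flow_target_ts(consumer),
    }
    return [start, finish]
```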

`--enable-swimlane-overhead` opt-in flag

Adds the 8 Overhead Analysis counter tracks (per-engine
idle/ready/overhead + system all/has overhead) from PR #1039 to the
swimlane JSON when requested. Plumbed through pytest (conftest.py),
standalone (simpler_setup/scene_test.py argparse),
run_class_cases, _convert_case_swimlane, _run_swimlane_converter,
and the L3 subprocess forwarding so pytest and standalone share the
same surface.
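
To illustrate the plumbing pattern (one opt-in flag, forwarded unchanged to a child process), here is a small argparse sketch. It is not the code in simpler_setup/scene_test.py: `build_parser` and `child_argv` are hypothetical helpers, and the child command is whatever the caller supplies.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Standalone-style parser exposing the opt-in flag."""
    parser = argparse.ArgumentParser(description="illustrative scene runner")
    parser.add_argument(
        "--enable-swimlane-overhead",
        action="store_true",
        help="add the Overhead Analysis counter tracks to the swimlane JSON",
    )
    return parser


def child_argv(base_cmd: List[str], args: argparse.Namespace) -> List[str]:
    """Forward the flag to a subprocess command line only when it was set."""
    argv = list(base_cmd)
    if args.enable_swimlane_overhead:
        argv.append("--enable-swimlane-overhead")
    return argv
```

Using `store_true` keeps the default off, so existing invocations produce the same swimlane JSON as before.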

Scheduler-phase model cleanup

Drop PR #1079 debug overlays: Scan (per-pass MMIO COND
scan) and Poll (activity-fill attribution) were emitted at
level >= SCHED_PHASES but carried no actionable signal:
"scheduler is polling when there's nothing to do" is the steady
state, not a finding. Total Perfetto event count on qwen3 level=4:
51,381 → 2,952 (17× reduction). flush_activity_fill, the
fill_kind/fill_start/fill_end fields, and the per-iteration
scan_cores++ instrumentation are all removed.

Terminology unification: four renames touching enum, function,
variable, comment, and converter color keys:

| Old | New | Why |
|---|---|---|
| `Prestage` / `try_speculative_prestage` / `prestage_queue` | `EarlyDispatch` / `try_speculative_early_dispatch` / `early_dispatch_queue` | PR #1079 internal jargon → the feature name |
| `Fanout` (phase) | `Resolve` | "Fanout" overloaded the graph-theory term; the phase names the action (release/resolve successor fanin) |
| `on_mixed_task_complete` | `on_task_complete` | Fires for every task that completes (MIX or single-subtask); the "mixed" modifier was historical |
| `bool mixed_complete` | `task_complete` | Mirrors the above at the caller |

DummyTask phase + Worker View DUMMY lanes: new
L2SwimlaneSchedPhaseKind::DummyTask emitted once per dummy in
dummy_drain (1-cycle wide, tasks_processed = task_token raw
low 32 bits). Converter routes it to Worker View pid=4
DUMMY_T{thread} lanes so DAG fence / barrier nodes (no AICore
presence) are visually present. The accompanying Resolve bar on
the Scheduler track covers the consumer-release work that follows.

Resolve quality: the >= 1 µs filter drops the ~88% of tasks whose
consumer-release walk is sub-microsecond, leaving only the
broadcast / reduction Resolves that carry signal. tasks_processed
now carries the real consumer-walk count plumbed back from
on_task_complete (non-PROFILING return type uint32_t;
PROFILING uses CompletionStats::fanout_edges). Resolve is also
emitted from the dummy_drain path, previously a measurement
blind spot (work happened, no bar).
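
The DummyTask routing might look roughly like the sketch below. The record keys `kind`, `start_cycles`, and `end_cycles` follow the aicpu_scheduler_phases schema quoted in this PR; everything else is an assumption for illustration: the helper name `route_dummy_phase`, the `DUMMY_TID_BASE` value, and the way cycles are converted to microseconds.

```python
from typing import Optional

WORKER_VIEW_PID = 4
# Hypothetical: dummy lanes get tids after the 72 AIC/AIV lanes (0..71).
DUMMY_TID_BASE = 72


def route_dummy_phase(
    thread_idx: int,
    phase: dict,
    base_cycles: int,
    cycles_per_us: float,
) -> Optional[dict]:
    """Turn a dummy_task scheduler phase into a Worker View bar."""
    if phase.get("kind") != "dummy_task":
        return None  # other kinds stay on the AICPU Scheduler track
    ts = (phase["start_cycles"] - base_cycles) / cycles_per_us
    dur = (phase["end_cycles"] - phase["start_cycles"]) / cycles_per_us
    return {
        "name": "dummy",
        "ph": "X",
        "pid": WORKER_VIEW_PID,
        "tid": DUMMY_TID_BASE + thread_idx,
        "ts": ts,
        "dur": dur,
        "args": {
            "lane": f"DUMMY_T{thread_idx}",
            # For DummyTask records this field carries the low 32 bits
            # of the raw task token rather than a count.
            "task_token_lo32": phase.get("tasks_processed"),
        },
    }
```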

pid renumber + Process rename

pid now reflects pipeline order (top → bottom in Perfetto), with
sort_index == pid for self-evident layout:

pid=1  AICPU Orchestrator   submit envelope (earliest)
pid=2  AICPU Scheduler      Complete/Dispatch/Release/Resolve/EarlyDispatch
pid=3  Scheduler View       AICPU-eye dispatch→finish per worker
pid=4  Worker View          AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest)

Was: pid=1 AICore View / pid=2 AICPU View. Renamed to
Worker View / Scheduler View since AICPU also serves as
worker for dummy tasks. No compat shim for old captures.
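
For reference, the layout above maps onto the JSON trace format's metadata events (`ph: "M"`) roughly as follows. This is a sketch of the idea, not the converter's code; the `PROCESSES` list and `process_metadata_events` name are introduced here.

```python
from typing import List

# Names and order taken from the layout in this PR description.
PROCESSES = [
    (1, "AICPU Orchestrator"),
    (2, "AICPU Scheduler"),
    (3, "Scheduler View"),
    (4, "Worker View"),
]


def process_metadata_events() -> List[dict]:
    """Emit process_name and process_sort_index metadata for each pid."""
    events = []
    for pid, name in PROCESSES:
        events.append(
            {"name": "process_name", "ph": "M", "pid": pid,
             "args": {"name": name}}
        )
        # sort_index == pid keeps the on-screen order equal to pipeline order.
        events.append(
            {"name": "process_sort_index", "ph": "M", "pid": pid,
             "args": {"sort_index": pid}}
        )
    return events
```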

Enum cleanup

L2SwimlaneSchedPhaseKind collapsed from 8 (with reserved
Poll = 2 / Scan = 5 for legacy capture compat) to 6 sequential:

enum class L2SwimlaneSchedPhaseKind : uint32_t {
    Complete = 0,
    Dispatch = 1,
    Release = 2,
    Resolve = 3,
    EarlyDispatch = 4,
    DummyTask = 5,
};
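
A host-side mirror of this enum could look like the following Python sketch. The numeric values come from the definition above; the class name `SchedPhaseKind`, the `kind_name` helper, and the snake_case strings other than `resolve` and `dummy_task` (which the schema notes mention) are assumptions.

```python
from enum import IntEnum


class SchedPhaseKind(IntEnum):
    """Hypothetical Python mirror of L2SwimlaneSchedPhaseKind."""
    COMPLETE = 0
    DISPATCH = 1
    RELEASE = 2
    RESOLVE = 3
    EARLY_DISPATCH = 4
    DUMMY_TASK = 5


def kind_name(raw: int) -> str:
    """Map a raw enum value to a lowercase kind string."""
    try:
        return SchedPhaseKind(raw).name.lower()
    except ValueError:
        # Values outside 0..5 would come from a capture written with a
        # different enum layout; report them instead of guessing.
        return f"unknown_{raw}"
```

Because the values are now sequential with no reserved gaps, a raw value read from a capture made with the previous 8-value layout would be mapped to a different phase, so captures and tools need to come from the same revision.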

a5 mirror

Renames mirrored to a5 tensormap_and_ringbuffer:
Fanout → Resolve, on_mixed_task_complete → on_task_complete,
bool mixed_complete → bool task_complete. a5 has no PR #1079
speculative path so the not_ready re-stamp and Scan/Poll/Prestage
cleanup are a2a3-only.

Hot-path cost

- One extra get_sys_cnt_aicore() per task in the AICore executor's
  task-arrived branch (single-cycle MSR read; negligible vs the
  existing per-task dcci(payload, ENTIRE_DATA_CACHE) that already
  costs ~50 cycles cold / ~0 warm).
- not_ready == 1 path: one additional MSR read at the gate-exit
  (only fires on speculative-staged tasks).
- 1 µs Resolve emit filter trades phase record bandwidth for signal;
  it saves emit + storage at level ≥ 3.

Test plan

Schema notes

- aicore_tasks JSON tuples grew from 5 to 6 columns (v3). The
  converter accepts both via *rest unpack; archived v2 JSON loads
  with receive_to_start_cycles defaulting to 0, no breakage.
- aicpu_scheduler_phases records now include dummy_task and
  resolve kinds; the converter recognises both. Old captures with
  scan / poll / fanout / prestage strings still parse; those
  kinds are reported as-is but the runtime no longer emits them.
- pid renumber is a hard cut-over: old .json captures opened in the
  current converter (or vice-versa) will display with wrong process
  names. Acceptable per the "latest tools display correctly" policy.
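
The `*rest` unpack that lets v2 and v3 rows share one code path can be sketched like this. The column order follows the aicore_tasks schema quoted in the review below; the function name `parse_aicore_row` and the returned dictionary shape are illustrative only.

```python
from typing import Sequence


def parse_aicore_row(row: Sequence[int]) -> dict:
    """Parse one aicore_tasks tuple in either v2 (5) or v3 (6) layout."""
    core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, *rest = row
    # v2 rows have no trailing column, so the delta defaults to 0 and
    # receive_time collapses onto start_time.
    receive_to_start_cycles = int(rest[0]) if rest else 0
    return {
        "core_id": int(core_id),
        "task_token_raw": int(task_token_raw),
        "reg_task_id": int(reg_task_id),
        "start_cycles": int(start_cycles),
        "end_cycles": int(end_cycles),
        "receive_to_start_cycles": receive_to_start_cycles,
    }
```

A row with fewer than five columns still raises `ValueError` at the unpack, which is the desired behaviour for a malformed capture.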

coderabbitai · 2026-06-08T04:14:44Z

📝 Walkthrough

This PR adds receive_time timestamp capture to AICore task profiling in both a2a3 and a5 platforms. The runtime captures a new timing point after task ID is detected, computes a cycle delta (receive_to_start_cycles), exports it via JSON, and the host-side converter derives propagation and local-setup timing metrics. Statistics output now displays these new metrics.

Changes

AICore Task Receive-to-Start Timing Feature

| Layer / File(s) | Summary |
|---|---|
| Task record structure and function API: `src/a2a3/platform/include/common/l2_swimlane_profiling.h`, `src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h`, `src/a5/platform/include/common/l2_swimlane_profiling.h`, `src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h` | `L2SwimlaneAicoreTaskRecord` gains `receive_to_start_cycles` field (replacing `_pad`) in both a2a3 and a5. `l2_swimlane_aicore_record_task` signature updated to accept `receive_time` parameter and associated Doxygen documentation. |
| Runtime receive_time capture in executors: `src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`, `src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp` | Executors in both platforms and both build variants capture `receive_time` immediately after detecting new task ID from `DATA_MAIN_BASE`, before ACK and cache invalidation sequences. |
| Recording function delta computation: `src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h`, `src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h` | `l2_swimlane_aicore_record_task` implementations compute `receive_to_start_cycles` as the 32-bit cast of `(start_time - receive_time)` and write it to task record. |
| Executor calls to updated recording function: `src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`, `src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp` | Call sites to `l2_swimlane_aicore_record_task` updated to pass newly captured `receive_time` in function parameter list. |
| Host JSON export schema for aicore_tasks: `src/a2a3/platform/shared/host/l2_swimlane_collector.cpp`, `src/a5/platform/shared/host/l2_swimlane_collector.cpp` | `export_swimlane_json()` emits `receive_to_start_cycles` as an additional final element in each `aicore_tasks` tuple. Schema documentation updated to reflect new column layout. |
| Host converter v3 schema parsing and metric derivation: `simpler_setup/tools/swimlane_converter.py` | `read_perf_data` extended to parse v3 `aicore_tasks` format with `receive_to_start_cycles` column. Derives `dispatch_time_us`, `receive_time_us`, `local_setup_us`, and `propagation_us` from the delta when joining AICPU↔AICore records. Level-1 fallback updated to populate `local_setup_us` from optional column. |
| Task statistics printing with new propagation/local-setup metrics: `simpler_setup/tools/swimlane_converter.py` | `print_task_statistics` adds aggregation lists for propagation and local-setup metrics, conditionally populated when present. Table output extended with "Avg Prop(us)" and "Avg Local(us)" columns, rendering dashes for absent v3 data. |
| Test field validation updates: `tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py` | `_REQUIRED_TASK_FIELDS` extended to require `receive_time_us` and `local_setup_us` keys in perf task records. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

hw-native-sys/simpler#974: Directly modifies L2SwimlaneAicoreTaskRecord and l2_swimlane_aicore_record_task in the same profiling infrastructure; receive_time/receive_to_start_cycles changes build on or intersect with that PR's refactoring of task identity and record handling.
hw-native-sys/simpler#942: Changes the AICore task recording API signature and plumbing in l2_swimlane_collector_aicore.h across both platforms; this PR's addition of receive_time parameter depends on or is adjacent to that PR's API refactor.
hw-native-sys/simpler#985: Modifies simpler_setup/tools/swimlane_converter.py host converter logic for read_perf_data and JSON schema handling; this PR extends the same conversion path to handle v3 receive_to_start_cycles field and derived metrics.

Poem

🐰 A rabbit hops through cycles fine,
From receive to start, a timing line,
The delta dances, round and bright,
Profiling logs the AICore's flight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding an AICore receive_time DFX field and swimlane phase model cleanup, which directly reflects the primary objectives and changes throughout the PR. |
| Description check | ✅ Passed | The description is comprehensive and thoroughly related to the changeset, covering the receive_time field implementation, phase model cleanup, schema changes, and test verification across multiple files. |


gemini-code-assist

Code Review

This pull request introduces a v3 schema for L2 swimlane profiling to split the head overhead into AICPU-to-AICore NoC propagation and AICore-local setup (dcci + ack) costs. This is achieved by capturing a new receive_time timestamp on the AICore before the ack write, storing it as a 32-bit delta (receive_to_start_cycles) in the L2SwimlaneAicoreTaskRecord struct, and updating the host collector, python converter, and validation tests to process and display these new metrics. Feedback on the changes highlights a potential issue where the derived receive_time can precede the calculated base_time_cycles, resulting in negative relative timestamps. A code suggestion is provided to track start_cycles - r2s_cycles in the base time calculation to prevent this.

gemini-code-assist · 2026-06-08T04:16:38Z

+    for row in aicore_rows:
+        # Column count varies (v2: 5, v3: 6); only the timing columns matter
+        # for base_time tracking.
+        _track(int(row[3]))
+        _track(int(row[4]))


Since receive_time is derived as start_time - receive_to_start_cycles, it can precede the calculated base_time_cycles if the earliest task's start_time is used as the base. This would result in a negative receive_time_us timestamp, which can cause rendering or validation issues in downstream trace analysis tools that expect non-negative relative timestamps.

To prevent negative timestamps, we should track the calculated receive_time (i.e., start_cycles - r2s_cycles) in the base_time_cycles calculation instead of just start_cycles.

Suggested change

-    for row in aicore_rows:
-        # Column count varies (v2: 5, v3: 6); only the timing columns matter
-        # for base_time tracking.
-        _track(int(row[3]))
-        _track(int(row[4]))
+    for row in aicore_rows:
+        # Column count varies (v2: 5, v3: 6); only the timing columns matter
+        # for base_time tracking.
+        start_cycles = int(row[3])
+        r2s_cycles = int(row[5]) if len(row) > 5 else 0
+        _track(start_cycles - r2s_cycles)
+        _track(int(row[4]))

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/tools/swimlane_converter.py`:
- Around line 103-119: Replace the Unicode minus sign (U+2212) with a standard
ASCII hyphen-minus (U+002D) in the module docstring where the expression
"start_time − receive_time" appears; locate the string in
simpler_setup/tools/swimlane_converter.py (the docstring describing aicore_tasks
v3 schema) and change "−" to "-" so the text reads "start_time - receive_time"
to avoid RUF002/clipboard issues.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 62ad57d5-5d19-498f-9096-412b834f6bc8

📥 Commits

Reviewing files that changed from the base of the PR and between 98849e8 and e4bdf5a.

📒 Files selected for processing (12)

simpler_setup/tools/swimlane_converter.py
src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
src/a5/platform/include/common/l2_swimlane_profiling.h
src/a5/platform/shared/host/l2_swimlane_collector.cpp
src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py

coderabbitai · 2026-06-08T04:20:02Z

+          "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles,
+                            end_cycles, receive_to_start_cycles], ...],
          "aicpu_tasks":  [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...],
          "aicpu_scheduler_phases":     [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ],
          "aicpu_orchestrator_phases":  [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ]
        }

+    aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles
+    is a uint32 delta = AICore-side `start_time − receive_time`, where
+    receive_time is captured immediately after AICore's
+    `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task
+    dcci + ack pair). Lets DFX split per-task head_OH into the
+    AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware-
+    bound) and the AICore-local dcci + ack cost (receive_time → start_time,
+    software-tunable). Archived v2 JSON without this column still parses;
+    the field is exposed as 0 for those.
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Unicode minus sign in docstring.

Line 111 contains a Unicode MINUS SIGN (−, U+2212) instead of HYPHEN-MINUS (-, U+002D) in the expression start_time − receive_time. This can cause copy-paste issues and is flagged by Ruff RUF002.

🔧 Proposed fix

-    is a uint32 delta = AICore-side `start_time − receive_time`, where
+    is a uint32 delta = AICore-side `start_time - receive_time`, where

🧰 Tools

🪛 Ruff (0.15.15)

[warning] 111-111: Docstring contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

Core change
-----------
Capture per-task `receive_time` on AICore immediately after
`read_reg(DATA_MAIN_BASE)` returns a new task_id, BEFORE the per-task
`dcci(payload, ENTIRE_DATA_CACHE) + write_reg(COND, ACK)` pair that
precedes start_time. Stored as a 32-bit delta `start_time - receive_time`
in the AICore record (size unchanged, 2 records per cache line).

Splits the per-task head_OH into two physically distinct halves so DFX
can attribute each:

- `propagation_us` = receive_time - dispatch_ts: AICPU->AICore-ready
  delivery (NoC + any speculation overshoot).
- `local_setup_us` = start_time - receive_time: AICore-local
  critical-path prep (dcci + ack on the common path; ack-only on the
  speculative-hit path).

Semantic alignment with speculative early-dispatch (hw-native-sys#1079)
---------------------------------------------------------
For not_ready==1 (speculative pre-staging) the dcci ran during the
dependency-wait spin, off the critical path. receive_time is re-stamped
at the moment the doorbell match exits the gate, so propagation absorbs
both the original NoC delivery AND any speculation overshoot, while
local_setup stays the pure ack-on-critical-path cost. Common-path emits
unchanged.

Visualization
-------------
- Setup bar at level 1: swimlane_converter emits a `setup` sub-bar
  (ts=receive_time, dur=local_setup_us) directly before each kernel bar
  on Worker View. >=1-cycle filter suppresses warm-cache zero-width
  bars. Base_time tracking includes receive_time so the first cold task
  no longer renders at a negative offset.
- Flow arrows (deps.json -> Worker View) now land at receive_time
  instead of start_time so the gap between arrow tip and kernel start
  visually equals local_setup.
- New `--enable-swimlane-overhead` flag (pytest + standalone + conftest
  + L3 subprocess forwarding) opt-in for the 8 Overhead Analysis counter
  tracks from PR hw-native-sys#1039.

Scheduler-phase model cleanup
-----------------------------
- Drop PR hw-native-sys#1079 debug overlays: `Scan` (per-pass MMIO COND
  scan) and `Poll` (activity-fill attribution) were emitted at level>=3
  but carried no actionable signal: "scheduler is polling when there's
  nothing to do" is the steady state, not a finding. Total Perfetto
  event count on qwen3 level=4 drops 51,381 -> 2,952.
- `Prestage` -> `EarlyDispatch` (PR hw-native-sys#1079 internal jargon
  -> the feature name from its PR title). enum value, function name
  (`try_speculative_early_dispatch`), queue field, and converter color
  key all renamed consistently.
- `Fanout` -> `Resolve`. The phase covers `on_task_complete`'s
  consumer-release walk (decrement consumer fanin, push newly-ready,
  ring speculative doorbells). "Fanout" overloaded the graph-theory
  term; "Resolve" names the action.
- `on_mixed_task_complete` -> `on_task_complete`. The function fires for
  every task that completes (MIX or single-subtask); the "mixed"
  modifier was historical.
- `bool mixed_complete` -> `task_complete` at the caller.
- New `DummyTask` phase kind, emitted once per dummy in `dummy_drain`.
  Converter routes it to Worker View pid=4 DUMMY_T{thread} lanes so DAG
  fence/barrier nodes (no AICore presence) are visually present. The
  accompanying Resolve bar covers the consumer-release work.
- Resolve emit >=1µs filter: drops the ~88% of tasks whose
  consumer-release walk is sub-microsecond, leaving only the
  broadcast/reduction Resolves that carry signal. `tasks_processed` now
  carries the real consumer-walk count plumbed back from
  `on_task_complete` (non-PROFILING return type uint32_t; PROFILING uses
  `CompletionStats::fanout_edges`).
- Resolve emit from dummy_drain too (previously a measurement blind
  spot: work happened, no bar).

pid renumber + Process rename
-----------------------------
pid is now in pipeline order (top -> bottom in Perfetto), with
sort_index == pid for self-evident layout:

    pid=1  AICPU Orchestrator   submit envelope (earliest)
    pid=2  AICPU Scheduler      Complete/Dispatch/Release/Resolve/EarlyDispatch
    pid=3  Scheduler View       AICPU-eye dispatch->finish per worker
    pid=4  Worker View          AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest)

Was: pid=1 AICore View / pid=2 AICPU View; renamed to Worker View /
Scheduler View since AICPU also serves as worker for dummy tasks. No
compat shim for old captures.

Enum cleanup
------------
`L2SwimlaneSchedPhaseKind` collapsed from 8 (with reserved Poll/Scan) to
6 sequential: 0 Complete, 1 Dispatch, 2 Release, 3 Resolve,
4 EarlyDispatch, 5 DummyTask.

a5 mirror
---------
Renames mirrored to a5 tensormap_and_ringbuffer (Fanout/Resolve,
on_task_complete, mixed_complete) for cross-arch symmetry. a5 has no PR
hw-native-sys#1079 speculative path so the not_ready re-stamp and
Scan/Poll/Prestage cleanup are a2a3-only.

Hot-path cost
-------------
- One extra get_sys_cnt_aicore() per task in the AICore executor's
  task-arrived branch (cycle MSR read, negligible vs the existing
  per-task dcci(payload, ENTIRE_DATA_CACHE)).
- 1µs Resolve filter trades phase records for signal; it saves emit
  bandwidth at level>=3.

Verified
--------
- a2a3 onboard build + a5 onboard build via `pip install`
- ST: TestL2Swimlane + TestL2SwimlaneMixed + TestDummyTask +
  `--enable-swimlane-overhead` opt-in path
- qwen3 decode_layer level=1 and level=4 end-to-end PASS, swimlane
  artifacts reviewed (setup bars + Resolve bars + DUMMY lanes visible
  where expected; Scan/Poll bars absent)
- swimlane_converter `pid` mapping in output matches spec

gemini-code-assist Bot reviewed Jun 8, 2026

hw-native-sys-bot force-pushed the feat/aicore-receive-time branch from e4bdf5a to 99b5088 Compare June 8, 2026 04:20

coderabbitai Bot reviewed Jun 8, 2026

hw-native-sys-bot force-pushed the feat/aicore-receive-time branch 2 times, most recently from dafc19c to 3ebad46 Compare June 9, 2026 09:40

hw-native-sys-bot force-pushed the feat/aicore-receive-time branch from 3ebad46 to 5bebbd2 Compare June 22, 2026 03:53

hw-native-sys-bot changed the title ~~Add: AICore receive_time DFX field — split head_OH into NoC + dcci/ack~~ Add: AICore receive_time DFX field + swimlane phase model cleanup Jun 22, 2026

hw-native-sys-bot wants to merge 1 commit into hw-native-sys:main from hw-native-sys-bot:feat/aicore-receive-time

2 participants
