
Add: AICore receive_time DFX field + swimlane phase model cleanup#1004

Open
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:feat/aicore-receive-time


@hw-native-sys-bot hw-native-sys-bot commented Jun 8, 2026

Collaborator

Summary

Adds the AICore-side receive_time DFX field that splits per-task
head_OH into AICPU→AICore-ready propagation and AICore-local
critical-path prep, and lands the surrounding swimlane phase model
cleanup that makes the new data legible:

  • receive_time field captured before per-task dcci + ack; stored
    as 32-bit start_time − receive_time delta in the existing AICore
    record (size unchanged).
  • setup bar auto-emitted on Worker View from level 1 upward
    (level 3+ is not required), directly before each kernel bar.
  • Flow arrows land at receive_time instead of start_time so
    the arrow→kernel-start gap equals local_setup visually.
  • Speculative early-dispatch (#1079) interaction: receive_time
    is re-stamped after the gate match so propagation absorbs the
    speculation overshoot and local_setup stays the pure
    ack-on-critical-path cost on both paths.
  • Scheduler-phase noise cleanup + terminology unification +
    enum cleanup (see below).
  • Worker View / Scheduler View rename + pid renumber so trace
    layout matches pipeline order.
  • DummyTask phase + Worker View DUMMY_T lanes so DAG fence /
    barrier nodes (no AICore presence) are visually present.

Why

Before this PR, head_OH lumped together NoC propagation latency
(hardware-bound, unfixable in software) and AICore-local dcci+ack
cost (software-tunable). Any "make head_OH smaller" investigation had
to guess which half it was targeting. With the split, the cold/warm
distribution is directly measurable from a single capture, and the
setup bar surfaces the cold/warm pattern at level 1 with no
additional instrumentation.

The follow-on cleanup landed in the same PR because the same
visualisations get used together — Resolve filtering, DummyTask
visibility, EarlyDispatch rename and Scan/Poll removal are all about
making the level-3/4 swimlane trace usable as a primary diagnosis tool
rather than a debug-only artifact.

What changed

Core: receive_time DFX field

  • AICore executor captures receive_time = get_sys_cnt_aicore()
    just after read_reg(DATA_MAIN_BASE) returns the new task_id,
    BEFORE dcci(payload, ENTIRE_DATA_CACHE) and
    write_reg(COND, MAKE_ACK_VALUE).
  • L2SwimlaneAicoreTaskRecord gains receive_to_start_cycles in
    the existing 4-byte tail, no size change, no extra cache line.
  • swimlane_converter.py exposes receive_time_us,
    local_setup_us, and (when AICPU records are joined)
    propagation_us per task.
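
A minimal sketch of how the three per-task values could be derived on the host side. Only the field names (`receive_to_start_cycles`, `receive_time_us`, `local_setup_us`, `propagation_us`) come from this PR; the function name, `base_cycles`, and `cycles_per_us` are hypothetical, and the sketch assumes the AICPU and AICore timestamps share one counter time base.

```python
from typing import Optional


def derive_receive_metrics(
    start_cycles: int,
    receive_to_start_cycles: int,
    base_cycles: int,
    cycles_per_us: float,
    dispatch_cycles: Optional[int] = None,
) -> dict:
    """Derive receive / local-setup / propagation times in microseconds.

    base_cycles and cycles_per_us are illustrative assumptions; the real
    converter may obtain them differently.
    """
    # The record stores only a 32-bit delta, so receive_time is rebuilt by
    # subtracting that delta from the full-width start timestamp.
    delta = receive_to_start_cycles & 0xFFFFFFFF
    receive_cycles = start_cycles - delta

    metrics = {
        "receive_time_us": (receive_cycles - base_cycles) / cycles_per_us,
        "local_setup_us": delta / cycles_per_us,
        "propagation_us": None,
    }
    # Propagation needs the AICPU dispatch timestamp from the joined record.
    if dispatch_cycles is not None:
        metrics["propagation_us"] = (receive_cycles - dispatch_cycles) / cycles_per_us
    return metrics
```

Without a joined AICPU record, `propagation_us` stays `None`, which matches the "when AICPU records are joined" condition above.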

Speculative early-dispatch (#1079) semantic alignment

For exec_payload->not_ready == 1 (a task was staged before its
dependencies resolved), the dcci ran during the doorbell-wait spin,
off the critical path. receive_time is re-stamped at the moment
the gate-wait exits on doorbell match, so:

  • propagation_us = receive_time − dispatch_ts absorbs the
    original NoC delivery AND any speculation overshoot.
  • local_setup_us stays the pure ack-on-critical-path cost on
    both paths (common path: dcci + ack; spec path: ack only — dcci
    already hidden behind the gate-wait).

The common-path emit code is unchanged.
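
The two paths can be illustrated with a toy timeline model. This is a sketch under stated assumptions, not the runtime code: the function name and all parameters are hypothetical cycle counts, and the real stamping happens in the AICore executor.

```python
def split_head_overhead(dispatch_ts, arrival_ts, dcci_cycles, ack_cycles,
                        not_ready=False, doorbell_ts=None):
    """Toy model of the head_OH split; inputs are illustrative cycle counts."""
    if not_ready:
        if doorbell_ts is None:
            raise ValueError("speculative path needs the doorbell timestamp")
        # Speculative path: dcci already ran during the gate-wait, and
        # receive_time is re-stamped when the doorbell match exits the gate.
        receive_time = doorbell_ts
        start_time = receive_time + ack_cycles
    else:
        # Common path: stamp on arrival, then dcci + ack precede start_time.
        receive_time = arrival_ts
        start_time = receive_time + dcci_cycles + ack_cycles
    return {
        "propagation": receive_time - dispatch_ts,
        "local_setup": start_time - receive_time,
    }
```

In this model the speculative path reports a larger propagation (it includes the wait for the doorbell) and a smaller local_setup (ack only), which is the attribution described above.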

Setup bar on Worker View

swimlane_converter.py emits a setup Perfetto event before each
kernel bar (same tid, name = setup, ts = receive_time,
dur = local_setup) at level 1. Intervals of one cycle or less are
filtered out to avoid invisible zero-width bars on warm cache.
Base_time tracking now includes the receive_time anchor so the
first cold task doesn't render at a negative offset.
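
A minimal sketch of the setup event in Chrome trace-event JSON (the format Perfetto loads), assuming a hypothetical helper name and a caller-supplied `min_dur_us` threshold equal to one counter cycle expressed in microseconds:

```python
from typing import Optional


def make_setup_event(pid: int, tid: int, receive_time_us: float,
                     local_setup_us: float, min_dur_us: float) -> Optional[dict]:
    """Return a complete ("X") trace event for the setup bar, or None."""
    # Skip intervals at or below the threshold so warm-cache tasks do not
    # produce invisible zero-width bars.
    if local_setup_us <= min_dur_us:
        return None
    return {
        "name": "setup",
        "ph": "X",  # complete event: carries both ts and dur
        "pid": pid,
        "tid": tid,  # same tid as the kernel bar that follows
        "ts": receive_time_us,
        "dur": local_setup_us,
    }
```

Because `ts + dur` equals the kernel start time, the setup bar ends exactly where the kernel bar begins on the same lane.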

Flow arrow target → receive_time

dependency and hb_violation flow arrows on Worker View now land
at receive_time_us instead of start_time_us, so the gap between
arrow tip and kernel start equals local_setup visually. Falls
back to start_time_us for old captures without a v3 receive_time.
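
The fallback can be sketched as below. The helper name is hypothetical, and representing "no v3 receive_time" as a missing or `None` key is an assumption; the converter may signal old captures differently.

```python
def flow_arrow_target_us(task: dict) -> float:
    """Pick the timestamp a dependency or hb_violation arrow lands on."""
    # Old captures carry no receive_time, so fall back to the kernel start.
    receive = task.get("receive_time_us")
    return task["start_time_us"] if receive is None else receive
```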

--enable-swimlane-overhead opt-in flag

Adds the 8 Overhead Analysis counter tracks (per-engine
idle/ready/overhead + system all/has overhead) from PR #1039 to the
swimlane JSON when requested. Plumbed through pytest (conftest.py),
standalone (simpler_setup/scene_test.py argparse),
run_class_cases, _convert_case_swimlane, _run_swimlane_converter,
and the L3 subprocess forwarding so pytest and standalone share the
same surface.

Scheduler-phase model cleanup

  • Drop PR #1079 debug overlays: Scan (per-pass MMIO COND
    scan) and Poll (activity-fill attribution) were emitted at
    level >= SCHED_PHASES but carried no actionable signal —
    "scheduler is polling when there's nothing to do" is the steady
    state, not a finding. Total Perfetto event count on qwen3 level=4:
    51,381 → 2,952 (17× reduction). flush_activity_fill,
    fill_kind/fill_start/fill_end fields, and the per-iteration
    scan_cores++ instrumentation all removed.

  • Terminology unification: four renames touching enum, function,
    variable, comment, and converter color keys:

    | Old | New | Why |
    | --- | --- | --- |
    | Prestage / try_speculative_prestage / prestage_queue | EarlyDispatch / try_speculative_early_dispatch / early_dispatch_queue | PR #1079 internal jargon → the feature name |
    | Fanout (phase) | Resolve | "Fanout" overloaded the graph-theory term; the phase names the action (release/resolve successor fanin) |
    | on_mixed_task_complete | on_task_complete | Fires for every task that completes (MIX or single-subtask); the "mixed" modifier was historical |
    | bool mixed_complete | task_complete | Mirrors the above at the caller |

  • DummyTask phase + Worker View DUMMY lanes: new
    L2SwimlaneSchedPhaseKind::DummyTask emitted once per dummy in
    dummy_drain (1-cycle wide, tasks_processed = task_token raw
    low 32 bits). Converter routes it to Worker View pid=4
    DUMMY_T{thread} lanes so DAG fence / barrier nodes (no AICore
    presence) are visually present. The accompanying Resolve bar on
    the Scheduler track covers the consumer-release work that follows.

  • Resolve quality: >= 1 µs filter drops the ~88% of tasks whose
    consumer-release walk is sub-microsecond — leaving only the
    broadcast / reduction Resolves that carry signal. tasks_processed
    now carries the real consumer-walk count plumbed back from
    on_task_complete (non-PROFILING return type uint32_t;
    PROFILING uses CompletionStats::fanout_edges). Resolve is also
    emitted from the dummy_drain path — previously a measurement
    blind spot (work happened, no bar).
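
The effect of the 1 µs threshold can be reproduced offline on a list of phase records. This is only an illustration: in the PR the filter runs at emit time in the runtime, and the helper name and `cycles_per_us` parameter here are hypothetical. The `start_cycles` / `end_cycles` keys follow the aicpu_scheduler_phases schema.

```python
def filter_resolve_phases(phases, cycles_per_us, min_us=1.0):
    """Keep Resolve phases lasting at least min_us; report the dropped share."""
    threshold_cycles = min_us * cycles_per_us
    kept = [
        p for p in phases
        if (p["end_cycles"] - p["start_cycles"]) >= threshold_cycles
    ]
    # Fraction of records removed by the filter (0.0 for an empty input).
    dropped_fraction = (1.0 - len(kept) / len(phases)) if phases else 0.0
    return kept, dropped_fraction
```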

pid renumber + Process rename

pid now reflects pipeline order (top → bottom in Perfetto), with
sort_index == pid for self-evident layout:

pid=1  AICPU Orchestrator   submit envelope (earliest)
pid=2  AICPU Scheduler      Complete/Dispatch/Release/Resolve/EarlyDispatch
pid=3  Scheduler View       AICPU-eye dispatch→finish per worker
pid=4  Worker View          AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest)

Was: pid=1 AICore View / pid=2 AICPU View. Renamed to
Worker View / Scheduler View since AICPU also serves as
worker for dummy tasks. No compat shim for old captures.
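
A minimal sketch of the process layout as Chrome trace-event metadata records, using the `process_name` and `process_sort_index` metadata events. The `PROCESSES` table and function name are hypothetical; the pid values and process names are the ones listed above.

```python
# (pid, process name) in pipeline order, as listed in this PR.
PROCESSES = [
    (1, "AICPU Orchestrator"),
    (2, "AICPU Scheduler"),
    (3, "Scheduler View"),
    (4, "Worker View"),
]


def process_metadata_events():
    """Build metadata ("M") events that name and order each process."""
    events = []
    for pid, name in PROCESSES:
        events.append({"ph": "M", "pid": pid, "name": "process_name",
                       "args": {"name": name}})
        # sort_index == pid keeps the on-screen order equal to pipeline order.
        events.append({"ph": "M", "pid": pid, "name": "process_sort_index",
                       "args": {"sort_index": pid}})
    return events
```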

Enum cleanup

L2SwimlaneSchedPhaseKind collapsed from 8 (with reserved
Poll = 2 / Scan = 5 for legacy capture compat) to 6 sequential:

enum class L2SwimlaneSchedPhaseKind : uint32_t {
    Complete = 0,
    Dispatch = 1,
    Release = 2,
    Resolve = 3,
    EarlyDispatch = 4,
    DummyTask = 5,
};

a5 mirror

Renames mirrored to a5 tensormap_and_ringbuffer:
Fanout → Resolve, on_mixed_task_complete → on_task_complete,
bool mixed_complete → bool task_complete. a5 has no PR #1079
speculative path so the not_ready re-stamp and Scan/Poll/Prestage
cleanup are a2a3-only.

Hot-path cost

  • One extra get_sys_cnt_aicore() per task in the AICore executor's
    task-arrived branch (single-cycle MSR read; negligible vs the
    existing per-task dcci(payload, ENTIRE_DATA_CACHE) that already
    costs ~50 cycles cold / ~0 warm).
  • not_ready == 1 path: one additional MSR read at the gate-exit
    (only fires on speculative-staged tasks).
  • 1 µs Resolve emit filter trades phase record bandwidth for signal —
    saves emit + storage at level ≥ 3.

Test plan

  • a2a3 onboard build + a5 onboard build via pip install --no-build-isolation .
  • ST passes: TestL2Swimlane + TestL2SwimlaneMixed + TestDummyTask (3/3)
  • --enable-swimlane-overhead opt-in path verified (37 counter events when on; 0 when off)
  • qwen3 decode_layer level=1 and level=4 end-to-end PASS on a2a3 onboard
  • Setup bars visible per cold task (cold ~1 µs / warm ~200 ns clusters)
  • Resolve bars only present for >= 1 µs walks, count matches deps.json
  • DUMMY_T0 lane shows dummy bars on the dummy_task ST
  • No scan / poll events in current captures
  • pid mapping matches spec (pid=1 Orchestrator @ sort_index 1 → pid=4 Worker View @ sort_index 4)
  • CI: lint + build + ST gates

Schema notes

  • aicore_tasks JSON tuples grew from 5 to 6 columns (v3). The
    converter accepts both via *rest unpack — archived v2 JSON loads
    with receive_to_start_cycles defaulting to 0, no breakage.
  • aicpu_scheduler_phases records now include dummy_task and
    resolve kinds; the converter recognises both. Old captures with
    scan / poll / fanout / prestage strings still parse — those
    kinds are reported as-is but the runtime no longer emits them.
  • pid renumber is a hard cut-over: old .json captures opened in the
    current converter (or vice-versa) will display with wrong process
    names. Acceptable per the "latest tools display correctly" policy.
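
The v2/v3 tolerance described in the first note can be sketched with extended unpacking. The function name is hypothetical; the column order follows the aicore_tasks schema (core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, then the optional receive_to_start_cycles).

```python
def parse_aicore_row(row):
    """Parse one aicore_tasks tuple, accepting 5-column v2 and 6-column v3."""
    core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, *rest = row
    # v2 rows have no trailing column, so the delta defaults to 0.
    receive_to_start_cycles = int(rest[0]) if rest else 0
    return {
        "core_id": int(core_id),
        "task_token_raw": int(task_token_raw),
        "reg_task_id": int(reg_task_id),
        "start_cycles": int(start_cycles),
        "end_cycles": int(end_cycles),
        "receive_to_start_cycles": receive_to_start_cycles,
    }
```

With a default of 0, an archived v2 row yields a receive time equal to its start time, so it produces no setup interval.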

coderabbitai Bot commented Jun 8, 2026

Walkthrough

This PR adds receive_time timestamp capture to AICore task profiling in both a2a3 and a5 platforms. The runtime captures a new timing point after task ID is detected, computes a cycle delta (receive_to_start_cycles), exports it via JSON, and the host-side converter derives propagation and local-setup timing metrics. Statistics output now displays these new metrics.

Changes

AICore Task Receive-to-Start Timing Feature

| Layer | File(s) | Summary |
| --- | --- | --- |
| Task record structure and function API | src/a2a3/platform/include/common/l2_swimlane_profiling.h, src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h, src/a5/platform/include/common/l2_swimlane_profiling.h, src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h | L2SwimlaneAicoreTaskRecord gains receive_to_start_cycles field (replacing _pad) in both a2a3 and a5. l2_swimlane_aicore_record_task signature updated to accept receive_time parameter and associated Doxygen documentation. |
| Runtime receive_time capture in executors | src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp, src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp | Executors in both platforms and both build variants capture receive_time immediately after detecting new task ID from DATA_MAIN_BASE, before ACK and cache invalidation sequences. |
| Recording function delta computation | src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h, src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h | l2_swimlane_aicore_record_task implementations compute receive_to_start_cycles as the 32-bit cast of (start_time - receive_time) and write it to task record. |
| Executor calls to updated recording function | src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp, src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp | Call sites to l2_swimlane_aicore_record_task updated to pass newly captured receive_time in function parameter list. |
| Host JSON export schema for aicore_tasks | src/a2a3/platform/shared/host/l2_swimlane_collector.cpp, src/a5/platform/shared/host/l2_swimlane_collector.cpp | export_swimlane_json() emits receive_to_start_cycles as an additional final element in each aicore_tasks tuple. Schema documentation updated to reflect new column layout. |
| Host converter v3 schema parsing and metric derivation | simpler_setup/tools/swimlane_converter.py | read_perf_data extended to parse v3 aicore_tasks format with receive_to_start_cycles column. Derives dispatch_time_us, receive_time_us, local_setup_us, and propagation_us from the delta when joining AICPU↔AICore records. Level-1 fallback updated to populate local_setup_us from optional column. |
| Task statistics printing with new propagation/local-setup metrics | simpler_setup/tools/swimlane_converter.py | print_task_statistics adds aggregation lists for propagation and local-setup metrics, conditionally populated when present. Table output extended with "Avg Prop(us)" and "Avg Local(us)" columns, rendering dashes for absent v3 data. |
| Test field validation updates | tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py | _REQUIRED_TASK_FIELDS extended to require receive_time_us and local_setup_us keys in perf task records. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/simpler#974: Directly modifies L2SwimlaneAicoreTaskRecord and l2_swimlane_aicore_record_task in the same profiling infrastructure; receive_time/receive_to_start_cycles changes build on or intersect with that PR's refactoring of task identity and record handling.
  • hw-native-sys/simpler#942: Changes the AICore task recording API signature and plumbing in l2_swimlane_collector_aicore.h across both platforms; this PR's addition of receive_time parameter depends on or is adjacent to that PR's API refactor.
  • hw-native-sys/simpler#985: Modifies simpler_setup/tools/swimlane_converter.py host converter logic for read_perf_data and JSON schema handling; this PR extends the same conversion path to handle v3 receive_to_start_cycles field and derived metrics.

Poem

🐰 A rabbit hops through cycles fine,
From receive to start, a timing line,
The delta dances, round and bright,
Profiling logs the AICore's flight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding an AICore receive_time DFX field and swimlane phase model cleanup, which directly reflects the primary objectives and changes throughout the PR. |
| Description check | ✅ Passed | The description is comprehensive and thoroughly related to the changeset, covering the receive_time field implementation, phase model cleanup, schema changes, and test verification across multiple files. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a v3 schema for L2 swimlane profiling to split the head overhead into AICPU-to-AICore NoC propagation and AICore-local setup (dcci + ack) costs. This is achieved by capturing a new receive_time timestamp on the AICore before the ack write, storing it as a 32-bit delta (receive_to_start_cycles) in the L2SwimlaneAicoreTaskRecord struct, and updating the host collector, python converter, and validation tests to process and display these new metrics. Feedback on the changes highlights a potential issue where the derived receive_time can precede the calculated base_time_cycles, resulting in negative relative timestamps. A code suggestion is provided to track start_cycles - r2s_cycles in the base time calculation to prevent this.


Comment on lines +190 to +194
for row in aicore_rows:
    # Column count varies (v2: 5, v3: 6); only the timing columns matter
    # for base_time tracking.
    _track(int(row[3]))
    _track(int(row[4]))


medium

Since receive_time is derived as start_time - receive_to_start_cycles, it can precede the calculated base_time_cycles if the earliest task's start_time is used as the base. This would result in a negative receive_time_us timestamp, which can cause rendering or validation issues in downstream trace analysis tools that expect non-negative relative timestamps.

To prevent negative timestamps, we should track the calculated receive_time (i.e., start_cycles - r2s_cycles) in the base_time_cycles calculation instead of just start_cycles.

Suggested change
- for row in aicore_rows:
-     # Column count varies (v2: 5, v3: 6); only the timing columns matter
-     # for base_time tracking.
-     _track(int(row[3]))
-     _track(int(row[4]))
+ for row in aicore_rows:
+     # Column count varies (v2: 5, v3: 6); only the timing columns matter
+     # for base_time tracking.
+     start_cycles = int(row[3])
+     r2s_cycles = int(row[5]) if len(row) > 5 else 0
+     _track(start_cycles - r2s_cycles)
+     _track(int(row[4]))
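
A self-contained sketch of the suggested base-time tracking, assuming `_track` keeps a running minimum (its body is not shown in this thread) and using a hypothetical function name:

```python
def compute_base_time_cycles(aicore_rows):
    """Return the earliest timestamp, counting the derived receive_time."""
    base = None
    for row in aicore_rows:
        start_cycles = int(row[3])
        # v2 rows have 5 columns; treat the missing delta as 0.
        r2s_cycles = int(row[5]) if len(row) > 5 else 0
        for candidate in (start_cycles - r2s_cycles, int(row[4])):
            if base is None or candidate < base:
                base = candidate
    return base
```

With the derived receive_time included, `receive_time - base` is never negative for any row, which addresses the negative-offset concern raised above.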

@hw-native-sys-bot hw-native-sys-bot force-pushed the feat/aicore-receive-time branch from e4bdf5a to 99b5088 Compare June 8, 2026 04:20

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/tools/swimlane_converter.py`:
- Around line 103-119: Replace the Unicode minus sign (U+2212) with a standard
ASCII hyphen-minus (U+002D) in the module docstring where the expression
"start_time − receive_time" appears; locate the string in
simpler_setup/tools/swimlane_converter.py (the docstring describing aicore_tasks
v3 schema) and change "−" to "-" so the text reads "start_time - receive_time"
to avoid RUF002/clipboard issues.

📥 Commits

Reviewing files that changed from the base of the PR and between 98849e8 and e4bdf5a.

📒 Files selected for processing (12)
  • simpler_setup/tools/swimlane_converter.py
  • src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
  • src/a5/platform/include/common/l2_swimlane_profiling.h
  • src/a5/platform/shared/host/l2_swimlane_collector.cpp
  • src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py

Comment on lines +103 to +119
"aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles,
end_cycles, receive_to_start_cycles], ...],
"aicpu_tasks": [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...],
"aicpu_scheduler_phases": [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ],
"aicpu_orchestrator_phases": [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ]
}

aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles
is a uint32 delta = AICore-side `start_time − receive_time`, where
receive_time is captured immediately after AICore's
`read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task
dcci + ack pair). Lets DFX split per-task head_OH into the
AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware-
bound) and the AICore-local dcci + ack cost (receive_time → start_time,
software-tunable). Archived v2 JSON without this column still parses;
the field is exposed as 0 for those.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Unicode minus sign in docstring.

Line 111 contains a Unicode MINUS SIGN (−, U+2212) instead of HYPHEN-MINUS (-, U+002D) in the expression start_time − receive_time. This can cause copy-paste issues and is flagged by Ruff RUF002.

🔧 Proposed fix
-    is a uint32 delta = AICore-side `start_time − receive_time`, where
+    is a uint32 delta = AICore-side `start_time - receive_time`, where
Suggested change
  "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles,
  end_cycles, receive_to_start_cycles], ...],
  "aicpu_tasks": [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...],
  "aicpu_scheduler_phases": [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ],
  "aicpu_orchestrator_phases": [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ]
  }

  aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles
- is a uint32 delta = AICore-side `start_time − receive_time`, where
+ is a uint32 delta = AICore-side `start_time - receive_time`, where
  receive_time is captured immediately after AICore's
  `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task
  dcci + ack pair). Lets DFX split per-task head_OH into the
  AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware-
  bound) and the AICore-local dcci + ack cost (receive_time → start_time,
  software-tunable). Archived v2 JSON without this column still parses;
  the field is exposed as 0 for those.
🧰 Tools
🪛 Ruff (0.15.15)

[warning] 111-111: Docstring contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF002)


@hw-native-sys-bot hw-native-sys-bot force-pushed the feat/aicore-receive-time branch 2 times, most recently from dafc19c to 3ebad46 Compare June 9, 2026 09:40
Core change
-----------
Capture per-task `receive_time` on AICore immediately after
`read_reg(DATA_MAIN_BASE)` returns a new task_id, BEFORE the per-task
`dcci(payload, ENTIRE_DATA_CACHE) + write_reg(COND, ACK)` pair that
precedes start_time. Stored as a 32-bit delta `start_time - receive_time`
in the AICore record (size unchanged, 2 records per cache line).

Splits the per-task head_OH into two physically distinct halves so DFX
can attribute each:

- `propagation_us` = receive_time - dispatch_ts: AICPU->AICore-ready
  delivery (NoC + any speculation overshoot).
- `local_setup_us` = start_time - receive_time: AICore-local
  critical-path prep (dcci + ack on the common path; ack-only on the
  speculative-hit path).

Semantic alignment with speculative early-dispatch (hw-native-sys#1079)
---------------------------------------------------------
For not_ready==1 (speculative pre-staging) the dcci ran during the
dependency-wait spin, off the critical path. receive_time is re-stamped
at the moment the doorbell match exits the gate, so propagation absorbs
both the original NoC delivery AND any speculation overshoot, while
local_setup stays the pure ack-on-critical-path cost. Common-path
emits unchanged.

Visualization
-------------
- Setup bar at level 1: swimlane_converter emits a `setup` sub-bar
  (ts=receive_time, dur=local_setup_us) directly before each kernel
  bar on Worker View. >=1-cycle filter suppresses warm-cache zero-width
  bars. Base_time tracking includes receive_time so the first cold task
  no longer renders at a negative offset.
- Flow arrows (deps.json -> Worker View) now land at receive_time
  instead of start_time so the gap between arrow tip and kernel start
  visually equals local_setup.
- New `--enable-swimlane-overhead` flag (pytest + standalone +
  conftest + L3 subprocess forwarding) opt-in for the 8 Overhead
  Analysis counter tracks from PR hw-native-sys#1039.

Scheduler-phase model cleanup
-----------------------------
- Drop PR hw-native-sys#1079 debug overlays: `Scan` (per-pass MMIO COND scan) and
  `Poll` (activity-fill attribution) were emitted at level>=3 but
  carried no actionable signal — "scheduler is polling when there's
  nothing to do" is the steady state, not a finding. Total Perfetto
  event count on qwen3 level=4 drops 51,381 -> 2,952.
- `Prestage` -> `EarlyDispatch` (PR hw-native-sys#1079 internal jargon -> the
  feature name from its PR title). enum value, function name
  (`try_speculative_early_dispatch`), queue field, and converter color
  key all renamed consistently.
- `Fanout` -> `Resolve`. The phase covers `on_task_complete`'s
  consumer-release walk (decrement consumer fanin, push newly-ready,
  ring speculative doorbells). "Fanout" overloaded the graph-theory
  term; "Resolve" names the action.
- `on_mixed_task_complete` -> `on_task_complete`. The function fires
  for every task that completes (MIX or single-subtask), the "mixed"
  modifier was historical.
- `bool mixed_complete` -> `task_complete` at the caller.
- New `DummyTask` phase kind, emitted once per dummy in `dummy_drain`.
  Converter routes it to Worker View pid=4 DUMMY_T{thread} lanes so
  DAG fence/barrier nodes (no AICore presence) are visually present.
  The accompanying Resolve bar covers the consumer-release work.
- Resolve emit >=1µs filter: drops the ~88% of tasks whose
  consumer-release walk is sub-microsecond, leaving only the
  broadcast/reduction Resolves that carry signal. `tasks_processed`
  now carries the real consumer-walk count plumbed back from
  `on_task_complete` (non-PROFILING return type uint32_t;
  PROFILING uses `CompletionStats::fanout_edges`).
- Resolve emit from dummy_drain too (previously a measurement blind
  spot — work happened, no bar).

pid renumber + Process rename
-----------------------------
pid is now in pipeline order (top -> bottom in Perfetto), with
sort_index == pid for self-evident layout:

  pid=1  AICPU Orchestrator   submit envelope (earliest)
  pid=2  AICPU Scheduler      Complete/Dispatch/Release/Resolve/EarlyDispatch
  pid=3  Scheduler View       AICPU-eye dispatch->finish per worker
  pid=4  Worker View          AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest)

Was: pid=1 AICore View / pid=2 AICPU View — renamed to Worker View /
Scheduler View since AICPU also serves as worker for dummy tasks. No
compat shim for old captures.

Enum cleanup
------------
`L2SwimlaneSchedPhaseKind` collapsed from 8 (with reserved Poll/Scan)
to 6 sequential:
  0 Complete, 1 Dispatch, 2 Release, 3 Resolve, 4 EarlyDispatch,
  5 DummyTask.

a5 mirror
---------
Renames mirrored to a5 tensormap_and_ringbuffer (Fanout -> Resolve,
on_mixed_task_complete -> on_task_complete, mixed_complete ->
task_complete) for cross-arch symmetry. a5 has no PR hw-native-sys#1079
speculative path so the not_ready re-stamp and Scan/Poll/Prestage
cleanup are a2a3-only.

Hot-path cost
-------------
- One extra get_sys_cnt_aicore() per task in the AICore executor's
  task-arrived branch (cycle MSR read, negligible vs the existing
  per-task dcci(payload, ENTIRE_DATA_CACHE)).
- 1µs Resolve filter trades phase records for signal — saves emit
  bandwidth at level>=3.

Verified
--------
- a2a3 onboard build + a5 onboard build via `pip install`
- ST: TestL2Swimlane + TestL2SwimlaneMixed + TestDummyTask
  + `--enable-swimlane-overhead` opt-in path
- qwen3 decode_layer level=1 and level=4 end-to-end PASS, swimlane
  artifacts reviewed (setup bars + Resolve bars + DUMMY lanes visible
  where expected; Scan/Poll bars absent)
- swimlane_converter `pid` mapping in output matches spec
@hw-native-sys-bot hw-native-sys-bot force-pushed the feat/aicore-receive-time branch from 3ebad46 to 5bebbd2 Compare June 22, 2026 03:53
@hw-native-sys-bot hw-native-sys-bot changed the title Add: AICore receive_time DFX field — split head_OH into NoC + dcci/ack Add: AICore receive_time DFX field + swimlane phase model cleanup Jun 22, 2026