From be420f7fb03b4758a669467fcff7aa9e1efada3e Mon Sep 17 00:00:00 2001
From: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Date: Mon, 8 Jun 2026 12:13:32 +0800
Subject: [PATCH] Add: AICore receive_time DFX field + swimlane phase model
 cleanup
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Core change
-----------
Capture per-task `receive_time` on AICore immediately after
`read_reg(DATA_MAIN_BASE)` returns a new task_id, BEFORE the per-task
`dcci(payload, ENTIRE_DATA_CACHE) + write_reg(COND, ACK)` pair that
precedes start_time. Stored as a 32-bit delta `start_time - receive_time`
in the AICore record (size unchanged, 2 records per cache line).

Splits the per-task head_OH into two physically distinct halves so DFX
can attribute each:

- `propagation_us` = receive_time - dispatch_ts: AICPU->AICore-ready
  delivery (NoC + any speculation overshoot).
- `local_setup_us` = start_time - receive_time: AICore-local
  critical-path prep (dcci + ack on the common path; ack-only on the
  speculative-hit path).

Semantic alignment with speculative early-dispatch (#1079)
---------------------------------------------------------
For not_ready==1 (speculative pre-staging) the dcci ran during the
dependency-wait spin, off the critical path. receive_time is re-stamped
at the moment the doorbell match exits the gate, so propagation absorbs
both the original NoC delivery AND any speculation overshoot, while
local_setup stays the pure ack-on-critical-path cost. Common-path
emits unchanged.

Visualization
-------------
- Setup bar at level 1: swimlane_converter emits a `setup` sub-bar
  (ts=receive_time, dur=local_setup_us) directly before each kernel
  bar on Worker View. >=1-cycle filter suppresses warm-cache zero-width
  bars. Base_time tracking includes receive_time so the first cold task
  no longer renders at a negative offset.
- Flow arrows (deps.json -> Worker View) now land at receive_time
  instead of start_time so the gap between arrow tip and kernel start
  visually equals local_setup.
- New `--enable-swimlane-overhead` flag (pytest + standalone +
  conftest + L3 subprocess forwarding) opt-in for the 8 Overhead
  Analysis counter tracks from PR #1039.

Scheduler-phase model cleanup
-----------------------------
- Drop PR #1079 debug overlays: `Scan` (per-pass MMIO COND scan) and
  `Poll` (activity-fill attribution) were emitted at level>=3 but
  carried no actionable signal — "scheduler is polling when there's
  nothing to do" is the steady state, not a finding. Total Perfetto
  event count on qwen3 level=4 drops 51,381 -> 2,952.
- `Prestage` -> `EarlyDispatch` (PR #1079 internal jargon -> the
  feature name from its PR title). enum value, function name
  (`try_speculative_early_dispatch`), queue field, and converter color
  key all renamed consistently.
- `Fanout` -> `Resolve`. The phase covers `on_task_complete`'s
  consumer-release walk (decrement consumer fanin, push newly-ready,
  ring speculative doorbells). "Fanout" overloaded the graph-theory
  term; "Resolve" names the action.
- `on_mixed_task_complete` -> `on_task_complete`. The function fires
  for every task that completes (MIX or single-subtask), the "mixed"
  modifier was historical.
- `bool mixed_complete` -> `task_complete` at the caller.
- New `DummyTask` phase kind, emitted once per dummy in `dummy_drain`.
  Converter routes it to Worker View pid=4 DUMMY_T{thread} lanes so
  DAG fence/barrier nodes (no AICore presence) are visually present.
  The accompanying Resolve bar covers the consumer-release work.
- Resolve emit >=1µs filter: drops the ~88% of tasks whose
  consumer-release walk is sub-microsecond, leaving only the
  broadcast/reduction Resolves that carry signal. `tasks_processed`
  now carries the real consumer-walk count plumbed back from
  `on_task_complete` (non-PROFILING return type uint32_t;
  PROFILING uses `CompletionStats::fanout_edges`).
- Resolve emit from dummy_drain too (previously a measurement blind
  spot — work happened, no bar).

pid renumber + Process rename
-----------------------------
pid is now in pipeline order (top -> bottom in Perfetto), with
sort_index == pid for self-evident layout:

  pid=1  AICPU Orchestrator   submit envelope (earliest)
  pid=2  AICPU Scheduler      Complete/Dispatch/Release/Resolve/EarlyDispatch
  pid=3  Scheduler View       AICPU-eye dispatch->finish per worker
  pid=4  Worker View          AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest)

Was: pid=1 AICore View / pid=2 AICPU View — renamed to Worker View /
Scheduler View since AICPU also serves as worker for dummy tasks. No
compat shim for old captures.

Enum cleanup
------------
`L2SwimlaneSchedPhaseKind` collapsed from 8 (with reserved Poll/Scan)
to 6 sequential:
  0 Complete, 1 Dispatch, 2 Release, 3 Resolve, 4 EarlyDispatch,
  5 DummyTask.

a5 mirror
---------
Renames mirrored to a5 tensormap_and_ringbuffer (Fanout/Resolve, on
_task_complete, mixed_complete) for cross-arch symmetry. a5 has no
PR #1079 speculative path so the not_ready re-stamp and Scan/Poll/
Prestage cleanup are a2a3-only.

Hot-path cost
-------------
- One extra get_sys_cnt_aicore() per task in the AICore executor's
  task-arrived branch (cycle MSR read, negligible vs the existing
  per-task dcci(payload, ENTIRE_DATA_CACHE)).
- 1µs Resolve filter trades phase records for signal — saves emit
  bandwidth at level>=3.

Verified
--------
- a2a3 onboard build + a5 onboard build via `pip install`
- ST: TestL2Swimlane + TestL2SwimlaneMixed + TestDummyTask
  + `--enable-swimlane-overhead` opt-in path
- qwen3 decode_layer level=1 and level=4 end-to-end PASS, swimlane
  artifacts reviewed (setup bars + Resolve bars + DUMMY lanes visible
  where expected; Scan/Poll bars absent)
- swimlane_converter `pid` mapping in output matches spec
---
 conftest.py                                   |   8 +
 docs/dfx/l2-swimlane-profiling.md             |  19 +-
 .../2026-06-aicore-cold-start-warmup.md       |  86 +++-
 simpler_setup/scene_test.py                   |  33 +-
 simpler_setup/tools/swimlane_converter.py     | 368 ++++++++++++++----
 .../aicore/l2_swimlane_collector_aicore.h     |  15 +-
 .../include/common/l2_swimlane_profiling.h    |  54 ++-
 .../shared/host/l2_swimlane_collector.cpp     |  18 +-
 .../aicore/aicore_executor.cpp                |  12 +-
 .../aicore/aicore_executor.cpp                |  31 +-
 .../docs/RUNTIME_LOGIC.md                     |   2 +-
 .../docs/SUBMIT_BY_CLUSTER.md                 |   2 +-
 .../docs/device_log_profiling.md              |   2 +-
 .../runtime/aicore_completion_mailbox.h       |   2 +-
 .../runtime/pto_orchestrator.cpp              |   2 +-
 .../runtime/scheduler/pto_scheduler.h         |  43 +-
 .../scheduler/scheduler_completion.cpp        |  69 ++--
 .../runtime/scheduler/scheduler_context.h     |   2 +-
 .../runtime/scheduler/scheduler_dispatch.cpp  | 146 ++++---
 .../runtime/scheduler/scheduler_types.h       |   6 -
 .../aicore/l2_swimlane_collector_aicore.h     |  15 +-
 .../include/common/l2_swimlane_profiling.h    |  25 +-
 .../shared/host/l2_swimlane_collector.cpp     |   4 +-
 .../aicore/aicore_executor.cpp                |  12 +-
 .../aicore/aicore_executor.cpp                |  11 +-
 .../docs/RUNTIME_LOGIC.md                     |   2 +-
 .../docs/SUBMIT_BY_CLUSTER.md                 |   2 +-
 .../docs/device_log_profiling.md              |   2 +-
 .../runtime/aicore_completion_mailbox.h       |   2 +-
 .../runtime/pto_orchestrator.cpp              |   2 +-
 .../runtime/scheduler/pto_scheduler.h         |  12 +-
 .../scheduler/scheduler_completion.cpp        |  10 +-
 .../runtime/scheduler/scheduler_dispatch.cpp  |   4 +-
 .../dfx/l2_swimlane/_swimlane_validate.py     |  14 +-
 tests/ut/cpp/a2a3/test_wiring.cpp             |   6 +-
 tests/ut/cpp/a5/test_wiring.cpp               |   6 +-
 36 files changed, 745 insertions(+), 304 deletions(-)

diff --git a/conftest.py b/conftest.py
index 084911aa2..da4a114d9 100644
--- a/conftest.py
+++ b/conftest.py
@@ -187,6 +187,14 @@ def pytest_addoption(parser):
         default=False,
         help="Enable per-scope peak collection and emit <output_prefix>/scope_stats.jsonl (per-scope ring-fill peaks).",
     )
+    parser.addoption(
+        "--enable-swimlane-overhead",
+        action="store_true",
+        default=False,
+        help="Add the 8 Overhead Analysis counter tracks (per-engine "
+        "idle/ready/overhead + system all/has overhead) to the swimlane JSON. "
+        "Requires --enable-l2-swimlane + deps.json (re-run with --enable-dep-gen if absent).",
+    )
     parser.addoption(
         "--pto-isa-commit",
         action="store",
diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md
index 1420e652f..3cc22134e 100644
--- a/docs/dfx/l2-swimlane-profiling.md
+++ b/docs/dfx/l2-swimlane-profiling.md
@@ -34,12 +34,15 @@ available.
   [`dep_gen`](dep_gen.md) at post-process time; see
   [§3.5](#35-dependency-arrows-from-dep_gen).
 - **AICPU scheduler phases** — per-iteration breakdown into
-  `complete` / `dispatch`. Idle iterations no longer emit a record
-  on a2a3; the host tooling reconstructs idle spans from the gap
-  between consecutive work records on the same thread. Legacy
-  captures (and a5) may still carry `scan` / `idle` records — both
-  are silently dropped by the parser (idle is double-painted
-  by the gap reconstruction; `scan` was never emitted in a2a3).
+  `complete` / `dispatch` / `release` / `resolve` (Complete sub-bar
+  for the consumer-release walk) / `early_dispatch` (speculative
+  pre-staging). Idle iterations no longer emit a record on a2a3;
+  the host tooling reconstructs idle spans from the gap between
+  consecutive work records on the same thread. Legacy captures may
+  carry `scan` / `poll` / `idle` / `fanout` / `prestage` — current
+  a2a3 builds no longer emit them (PR #1079's Scan/Poll debug
+  overlay was removed; Fanout was renamed Resolve and now also
+  filters out <1 µs walks; Prestage was renamed EarlyDispatch).
 - **Orchestrator submit envelope** — one record per `submit_task()`
   / `alloc_tensors()` call covering the whole submit's
   `[start, end]` window (`orch_submit` phase). Per-sub-step
@@ -208,7 +211,7 @@ Phase records (per scheduler thread, level >= 3 for
 | Field | Meaning |
 | ----- | ------- |
 | `start_time_us` / `end_time_us` | Phase start / end timestamps in microseconds (reader-side cycle→µs conversion) |
-| `phase` | Lowercase phase name. Scheduler: `complete` / `dispatch` (`scan` / `idle` may appear in legacy captures and a5; both are dropped by the parser). Orchestrator: `orch_submit` — one record per `submit_task()` / `alloc_tensors()` call spanning its full `[start, end]` window. Legacy per-sub-step strings (`orch_sync` / `orch_alloc` / `orch_params` / `orch_lookup` / `orch_insert` / `orch_fanin`) may appear in old captures. |
+| `phase` | Lowercase phase name. Scheduler: `complete` / `dispatch` / `release` / `resolve` / `early_dispatch`. Legacy (still parsed for old captures): `scan` / `poll` / `fanout` / `prestage` / `idle` — current a2a3 builds no longer emit Scan/Poll (PR #1079 debug overlay removed) or Prestage (renamed to EarlyDispatch); Fanout was renamed to Resolve. Orchestrator: `orch_submit` — one record per `submit_task()` / `alloc_tensors()` call spanning its full `[start, end]` window. Legacy per-sub-step strings (`orch_sync` / `orch_alloc` / `orch_params` / `orch_lookup` / `orch_insert` / `orch_fanin`) may appear in old captures. |
 | `loop_iter` (scheduler) / `submit_idx` (orchestrator) | Iteration / submit-call counter for the producing thread |
 | `tasks_processed` (scheduler) / `task_id` (orchestrator) | Phase-specific union field |
 | `pop_hit` / `pop_miss` (dispatch only) | Ready-queue pop deltas since the previous dispatch emit |
@@ -557,7 +560,7 @@ bounded only by how fast the host drains — not by the per-core buffer
 sum.
 
 **Measured impact.** Hardware bench on a2a3 paged_attention_unroll
-Case1 with swimlane=4: rotation design delivers sched −4 µs / orch −19 µs
+Case1 with swimlane=4: rotation design delivers sched -4 µs / orch -19 µs
 vs the upstream/main baseline, comparable to the no-rotation predecessor
 (which had this PR's earlier commit; the rotation adds about 3 µs
 sched overhead per session as price for unbounded session length).
diff --git a/docs/investigations/2026-06-aicore-cold-start-warmup.md b/docs/investigations/2026-06-aicore-cold-start-warmup.md
index f656a2912..43a28d26b 100644
--- a/docs/investigations/2026-06-aicore-cold-start-warmup.md
+++ b/docs/investigations/2026-06-aicore-cold-start-warmup.md
@@ -45,7 +45,7 @@ cores per task, each task busy-waits 50 µs in `get_sys_cnt()`. Run on
 a2a3 onboard via `task-submit --device auto --device-num 1 --run "...
 --enable-l2-swimlane 2"`. Compared three runs, all with the
 batched-publish dispatch optimization (PR #989) and the eager swimlane
-head resolve (also in #989). Head-OH per task = `start_time_us −
+head resolve (also in #989). Head-OH per task = `start_time_us -
 dispatch_time_us`, joined per `(core_id, reg_task_id)` from the level-2
 `l2_swimlane_records.json`.
 
@@ -59,7 +59,7 @@ Per-task head-OH (us) across 72 cores:
 | Eager head resolve, no warmup | 0.36 | 2.61 | 5.62 | 0.35 | 0.39 | 0.23 |
 | Eager + warmup | **0.32** | **2.29** | 5.64 | 0.36 | 0.36 | 0.31 |
 
-Warmup vs eager-only: t0 avg −0.32 µs (12% reduction), t0 min −0.04 µs
+Warmup vs eager-only: t0 avg -0.32 µs (12% reduction), t0 min -0.04 µs
 (into noise), **t0 max unchanged** (within run-to-run variance).
 Steady-state t1–t3 unchanged. t0 max for the slowest cores is essentially
 the same as without warmup.
@@ -67,8 +67,8 @@ the same as without warmup.
 Per-core breakdown for the warmup run reveals the cost is **bimodal**:
 
 ```text
-Δ = t0 − avg(t1..t3) over 72 cores
-min = −0.03 µs  (warmup fully effective; t0 ≈ steady state)
+Δ = t0 - avg(t1..t3) over 72 cores
+min = -0.03 µs  (warmup fully effective; t0 ≈ steady state)
 p25 = 0.53 µs
 p50 = 1.77 µs
 p75 = 3.39 µs
@@ -139,10 +139,88 @@ property of the dispatch path that future profiling work should expect:
   the first task of a run should not assume software warmup will fix
   it.
 
+## Follow-up — receive_time DFX field + what propagation/local_setup actually mean
+
+PR #1004 adds an AICore-side `receive_time` capture marking the moment
+AICPU's full "task is ready to execute" signal landed on the core. With
+it the swimlane reports two new per-task fields:
+
+- `local_setup_us` = start_time - receive_time (AICore-side critical-path
+  prep between "task genuinely ready" and "kernel starts").
+- `propagation_us` = receive_time - dispatch_ts (AICPU→AICore-ready
+  delivery, including any scheduling-side waits).
+
+Where receive_time is stamped depends on the dispatch path:
+
+- **Common path (`not_ready == 0`)**: stamped immediately after
+  `read_reg(DATA_MAIN_BASE)` returns the new task_id — that read IS the
+  ready signal. `local_setup_us` ≈ dcci + ack cost on AICore.
+- **Speculative early-dispatch (`not_ready == 1`, PR #1079)**: the
+  initial task_id only stages the work; the real ready signal is the
+  doorbell on `DATA_MAIN_BASE` high32 that AICPU rings once the
+  dependencies resolve. receive_time is re-stamped at the moment the
+  doorbell match exits the gate-wait spin. The per-task dcci ran during
+  the spin (overlapped with the dependency wait, off the critical path),
+  so `local_setup_us` ≈ ack-only cost; `propagation_us` absorbs both the
+  original NoC delivery AND any speculation overshoot.
+
+The cleanest reading **only applies to IDLE-phase dispatches** (where
+AICPU writes the new task_id to a core whose RUNNING slot is empty —
+core was actually idle and polling). In that case:
+
+- `propagation_us` ≈ AICPU→AICore SoC delivery + AICore's next poll iter
+  picking up the new value. Hardware-bound at the ns scale per the
+  cold-start measurement above.
+- `local_setup_us` ≈ dcci + ack cost on AICore. Empirically 0–0.02 µs on
+  warm cache, 0.5–1 µs on cold first task.
+
+For **PENDING-phase dispatches** (AICPU pre-loads next task_id while
+RUNNING slot is still busy), `dispatch_ts` is captured at pre-load time,
+which can be tens of µs before the prior task FINs. AICore picks up the
+pending task immediately after prior FIN. So:
+
+- `propagation_us` here is **not** AICPU→AICore delay. It is mostly
+  "remaining exec time of the prior task at the moment of pre-load",
+  i.e. a measurement artifact of dual-buffer scheduling, not a physical
+  delivery delay.
+- `local_setup_us` interpretation still holds (dcci + ack on AICore for
+  the common path, ack-only for the speculative path).
+
+How to know which kind of dispatch a record represents: compare
+`dispatch_ts` against the prior task's `end_time` on the same core
+(both available in the swimlane records). `dispatch_ts ≥ prior.end` →
+IDLE-phase, propagation reflects real NoC delivery. `dispatch_ts <
+prior.end` → PENDING-phase, propagation is a pre-load offset.
+
+Level-3 capture on `vector_example` (5 tasks, all IDLE-dispatched):
+
+| Task | head_OH | propagation | local_setup |
+| ---- | ------: | ----------: | ----------: |
+| Cold first on core 5 | 2.12 µs | 1.20 µs | 0.92 µs |
+| Warm on core 5 | 0.16 µs | 0.16 µs | 0.00 µs |
+| Cold first on core 6 | 1.66 µs | 0.80 µs | 0.86 µs |
+| Warm on core 5 | 0.34 µs | 0.34 µs | 0.00 µs |
+| Warm on core 5 | 0.32 µs | 0.32 µs | 0.00 µs |
+
+Cold tasks pay both halves; warm tasks have local_setup at cycle-zero
+(dcci on already-invalidated cache is free) and propagation dominates.
+**This pattern is specific to the IDLE-phase regime.** Mid-workload
+traces (qwen3 decode_layer) include both IDLE and PENDING dispatches
+interleaved, and aggregating `propagation_us` without separating them
+yields confusing distributions — the "long tail" in such aggregates is
+usually PENDING pre-load offset, not slow NoC.
+
+Future warmup proposals targeting `local_setup_us` can now be evaluated
+directly against captured data. Proposals targeting `propagation_us`
+should first filter to IDLE-dispatched records to avoid attacking a
+measurement artifact.
+
 ## References
 
 - PR #989: dispatch path batched publish (where the eager swimlane head
   resolve lives).
+- PR #1004: AICore receive_time DFX field (the measurement that
+  confirmed the split above).
 - PR #988: `spmd_serial_chain_mix` example used for measurement.
 - Issue #545 comment #2: SPMD dispatch stagger.
 - `.claude/rules/...` DFX priority guidance.
diff --git a/simpler_setup/scene_test.py b/simpler_setup/scene_test.py
index 9bb8d486d..ff0d842b9 100644
--- a/simpler_setup/scene_test.py
+++ b/simpler_setup/scene_test.py
@@ -679,12 +679,18 @@ def _build_output_prefix(case_label: str) -> Path:
 def _run_swimlane_converter(
     input_path: Path | None = None,
     func_names_path: Path | None = None,
+    enable_overhead: bool = False,
 ) -> None:
     """Invoke the bundled swimlane converter as a subprocess.
 
     When ``input_path`` is given, the converter derives its output filename from
     the input's timestamp (see ``swimlane_converter._resolve_output_path``).
     Without it, the converter auto-selects the latest ``l2_swimlane_records_*.json``.
+
+    ``enable_overhead`` forwards the converter's ``--overhead`` flag — adds the
+    8 Overhead Analysis counter tracks (per-engine idle/ready/overhead + system
+    all/has overhead) under the AICPU Scheduler process. Needs deps.json; the
+    converter silently no-ops if deps is absent.
     """
     import logging  # noqa: PLC0415
     import subprocess  # noqa: PLC0415
@@ -695,6 +701,8 @@ def _run_swimlane_converter(
         cmd.append(str(input_path))
     if func_names_path is not None:
         cmd += ["--func-names", str(func_names_path)]
+    if enable_overhead:
+        cmd.append("--overhead")
     try:
         result = subprocess.run(cmd, check=True, capture_output=True, text=True)
         if result.stdout:
@@ -716,6 +724,7 @@ def _convert_case_swimlane(
     case_label: str,
     output_prefix: Path,
     callable_spec: dict | None = None,
+    enable_overhead: bool = False,
 ) -> None:
     """Post-case: invoke the swimlane converter on the perf file the runtime
     just wrote into ``<output_prefix>/l2_swimlane_records.json``. No diff/rename
@@ -736,7 +745,7 @@ def _convert_case_swimlane(
         safe_label = _sanitize_for_filename(case_label)
         func_names_path = _dump_name_map(mapping, output_prefix / f"name_map_{safe_label}.json")
 
-    _run_swimlane_converter(input_path=perf_file, func_names_path=func_names_path)
+    _run_swimlane_converter(input_path=perf_file, func_names_path=func_names_path, enable_overhead=enable_overhead)
 
 
 def _run_deps_viewer(
@@ -837,6 +846,7 @@ def run_class_cases(  # noqa: PLR0913 -- shared layer-5 entry; kwargs mirror CLI
     enable_dep_gen,
     enable_scope_stats,
     enable_device_log_timing=False,
+    enable_swimlane_overhead=False,
 ):
     """Execute a pre-filtered list of cases for one class (layers 5-6).
 
@@ -887,7 +897,12 @@ def run_class_cases(  # noqa: PLR0913 -- shared layer-5 entry; kwargs mirror CLI
             )
         finally:
             if enable_l2_swimlane:
-                _convert_case_swimlane(case_label, prefix, callable_spec=callable_spec)
+                _convert_case_swimlane(
+                    case_label,
+                    prefix,
+                    callable_spec=callable_spec,
+                    enable_overhead=enable_swimlane_overhead,
+                )
             if enable_dep_gen:
                 _graph_case_dep_gen(case_label, prefix, callable_spec=callable_spec)
             if enable_scope_stats:
@@ -1354,6 +1369,7 @@ def test_run(self, st_platform, st_worker, request):
         # device-log timing is cheap (PTO2_PROFILING markers, one block/round)
         # so unlike the heavy diagnostics it is NOT disabled when --rounds > 1.
         enable_device_log_timing = request.config.getoption("--enable-device-log-timing", default=False)
+        enable_swimlane_overhead = request.config.getoption("--enable-swimlane-overhead", default=False)
         if rounds > 1:
             if enable_l2_swimlane:
                 logger.warning("Profiling disabled: --rounds > 1")
@@ -1409,6 +1425,7 @@ def test_run(self, st_platform, st_worker, request):
             enable_dep_gen=enable_dep_gen,
             enable_scope_stats=enable_scope_stats,
             enable_device_log_timing=enable_device_log_timing,
+            enable_swimlane_overhead=enable_swimlane_overhead,
         )
 
     # ------------------------------------------------------------------
@@ -1507,6 +1524,15 @@ def run_module(module_name):  # noqa: PLR0912, PLR0915 -- CLI parsing + dispatch
             help="Enable per-scope peak collection and emit <output_prefix>/scope_stats/scope_stats.jsonl "
             "(per-scope ring-fill peaks).",
         )
+        parser.add_argument(
+            "--enable-swimlane-overhead",
+            action="store_true",
+            default=False,
+            help="Add the 8 Overhead Analysis counter tracks (per-engine "
+            "idle/ready/overhead + system all/has overhead) to the swimlane "
+            "JSON. Requires --enable-l2-swimlane + deps.json (re-run with "
+            "--enable-dep-gen if absent).",
+        )
         parser.add_argument(
             "--runtime",
             default=None,
@@ -1722,6 +1748,7 @@ def run_module(module_name):  # noqa: PLR0912, PLR0915 -- CLI parsing + dispatch
                                 enable_dep_gen=args.enable_dep_gen,
                                 enable_scope_stats=args.enable_scope_stats,
                                 enable_device_log_timing=args.enable_device_log_timing,
+                                enable_swimlane_overhead=args.enable_swimlane_overhead,
                             )
                             print("PASSED")
                         except Exception as e:  # noqa: BLE001
@@ -1767,6 +1794,8 @@ def _dispatch_test_phases_standalone(module_name, selected_by_cls, args):  # noq
         common.append("--enable-scope-stats")
     if args.enable_device_log_timing:
         common.append("--enable-device-log-timing")
+    if args.enable_swimlane_overhead:
+        common.append("--enable-swimlane-overhead")
 
     # ----- L3 phase: one subprocess per class (not per case).
     # The child's _create_standalone_worker allocates max(cls.CASES.device_count)
diff --git a/simpler_setup/tools/swimlane_converter.py b/simpler_setup/tools/swimlane_converter.py
index 55e95ee9f..d46dabb08 100644
--- a/simpler_setup/tools/swimlane_converter.py
+++ b/simpler_setup/tools/swimlane_converter.py
@@ -119,12 +119,23 @@ def read_perf_data(filepath):  # noqa: PLR0912, PLR0915
             "core_types": ["aic"|"aiv", ...],   # indexed by core_id
             "core_to_thread": [<int>, ...]      # optional (level >= 3)
           },
-          "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles, end_cycles], ...],
+          "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles,
+                            end_cycles, receive_to_start_cycles], ...],
           "aicpu_tasks":  [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...],
           "aicpu_scheduler_phases":     [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ],
           "aicpu_orchestrator_phases":  [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ]
         }
 
+    aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles
+    is a uint32 delta = AICore-side `start_time - receive_time`, where
+    receive_time is captured immediately after AICore's
+    `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task
+    dcci + ack pair). Lets DFX split per-task head_OH into the
+    AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware-
+    bound) and the AICore-local dcci + ack cost (receive_time → start_time,
+    software-tunable). Archived v2 JSON without this column still parses;
+    the field is exposed as 0 for those.
+
     Returns a dict shaped for `generate_chrome_trace_json`,
     `print_task_statistics`, and `sched_overhead_analysis`: `tasks`,
     `aicpu_scheduler_phases`, `aicpu_orchestrator_phases`,
@@ -171,13 +182,19 @@ def read_perf_data(filepath):  # noqa: PLR0912, PLR0915
     # same PTO2 task_token_raw to the same core (SPMD over-subscription, MIX
     # cluster spread) each get their own reg_task_id, so this key is unique
     # per dispatch even when task_token_raw collides.
-    aicore_lookup: dict[tuple[int, int], tuple[int, int, int]] = {}
+    #
+    # `*rest` makes v2 rows (5 cols, no receive_to_start_cycles) and v3 rows
+    # (6 cols) both parse — archived JSON from before the receive_time split
+    # still loads with r2s_cycles defaulting to 0.
+    aicore_lookup: dict[tuple[int, int], tuple[int, int, int, int]] = {}
     for row in aicore_rows:
-        core_id, task_token_raw, reg_task_id, start_cycles, end_cycles = row
+        core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, *rest = row
+        r2s_cycles = int(rest[0]) if rest else 0
         aicore_lookup[(int(core_id), int(reg_task_id))] = (
             int(task_token_raw),
             int(start_cycles),
             int(end_cycles),
+            r2s_cycles,
         )
 
     # base_time = min non-zero timestamp across every stream that will be
@@ -189,9 +206,17 @@ def _track(v):
         if v > 0 and (base_time_cycles is None or v < base_time_cycles):
             base_time_cycles = v
 
-    for _, _, _, s, e in aicore_rows:
-        _track(int(s))
-        _track(int(e))
+    for row in aicore_rows:
+        # Column count varies (v2: 5, v3: 6); only the timing columns matter
+        # for base_time tracking. For v3, the per-task receive_time =
+        # start_cycles - receive_to_start_cycles is earlier than start_cycles
+        # itself; track it so the v3 setup-bar (ts = receive_time) doesn't
+        # land at a negative offset relative to the kernel-bar base.
+        start_c = int(row[3])
+        end_c = int(row[4])
+        r2s_c = int(row[5]) if len(row) > 5 else 0
+        _track(start_c - r2s_c)
+        _track(end_c)
     for _, _, d, f in aicpu_rows:
         _track(int(d))
         _track(int(f))
@@ -231,9 +256,15 @@ def _core_type(core_id):
             if ac is None:
                 unmatched_per_core[core_id] += 1
                 continue
-            task_token_raw, start_cycles, end_cycles = ac
+            task_token_raw, start_cycles, end_cycles, r2s_cycles = ac
             start_us = _to_us(start_cycles)
             end_us = _to_us(end_cycles)
+            dispatch_us = _to_us(int(dispatch_cycles))
+            # receive_to_start delta is in cycles; convert via the same
+            # cycles_to_us_factor that drives the absolute timestamps. No
+            # base_time subtraction — this is a delta.
+            local_setup_us = r2s_cycles * cycles_to_us_factor
+            receive_us = start_us - local_setup_us
             tasks.append(
                 {
                     "task_id": task_token_raw,
@@ -244,19 +275,24 @@ def _core_type(core_id):
                     "start_time_us": start_us,
                     "end_time_us": end_us,
                     "duration_us": end_us - start_us,
-                    "dispatch_time_us": _to_us(int(dispatch_cycles)),
+                    "dispatch_time_us": dispatch_us,
                     "finish_time_us": _to_us(int(finish_cycles)),
+                    "receive_time_us": receive_us,
+                    "local_setup_us": local_setup_us,
+                    "propagation_us": receive_us - dispatch_us,
                 }
             )
     elif level == 1:
         # AICORE_TIMING fallback: AICPU records are absent (complete_task
         # bypassed). The AICore stream alone is the source of truth.
         for row in aicore_rows:
-            core_id, task_token_raw, _reg_task_id, start_cycles, end_cycles = row
+            core_id, task_token_raw, _reg_task_id, start_cycles, end_cycles, *rest = row
+            r2s_cycles = int(rest[0]) if rest else 0
             core_id = int(core_id)
             task_token_raw = int(task_token_raw)
             start_us = _to_us(int(start_cycles))
             end_us = _to_us(int(end_cycles))
+            local_setup_us = r2s_cycles * cycles_to_us_factor
             tasks.append(
                 {
                     "task_id": task_token_raw,
@@ -269,6 +305,9 @@ def _core_type(core_id):
                     "duration_us": end_us - start_us,
                     "dispatch_time_us": 0.0,
                     "finish_time_us": 0.0,
+                    "receive_time_us": start_us - local_setup_us,
+                    "local_setup_us": local_setup_us,
+                    # propagation_us requires AICPU dispatch_ts; absent at level 1.
                 }
             )
 
@@ -537,6 +576,8 @@ def print_task_statistics(tasks, func_id_to_name=None):
             "durations": [],
             "head_overheads": [],
             "tail_overheads": [],
+            "propagations": [],  # dispatch_ts → AICore receive_time (NoC + FFTS)
+            "local_setups": [],  # receive_time → start_time (dcci + ack on AICore)
             "latencies": [],
             "total_exec_time": 0.0,
             "total_latency": 0.0,
@@ -567,6 +608,13 @@ def print_task_statistics(tasks, func_id_to_name=None):
             tail_overhead = finish_time - end_time
             func_stats[func_id]["tail_overheads"].append(tail_overhead)
 
+            # Head OH split (v3 schema only — falls back to absent when the
+            # AICore record came from a pre-receive_time build).
+            if "propagation_us" in task:
+                func_stats[func_id]["propagations"].append(task["propagation_us"])
+            if "local_setup_us" in task:
+                func_stats[func_id]["local_setups"].append(task["local_setup_us"])
+
             # Latency: finish_time_us - dispatch_time_us
             latency = finish_time - dispatch_time
             func_stats[func_id]["latencies"].append(latency)
@@ -580,15 +628,17 @@ def print_task_statistics(tasks, func_id_to_name=None):
             max_finish_time = max(max_finish_time, finish_time)
 
     # Print statistics
-    print("\n" + "=" * 110)
+    print("\n" + "=" * 140)
     print("Task Statistics by Function")
     print("  Exec = kernel time on AICore; Latency = dispatch->finish (incl. head OH + Exec + tail OH)")
-    print("=" * 110)
+    print("  Head OH split (v3): Prop = NoC propagation (dispatch_ts→AICore receive); Local = dcci+ack (receive→start)")
+    print("=" * 140)
     print(
         f"{'Func_ID':<8} {'Func_Name':<12} {'Count':>5}   {'Avg Exec(us)':>12}  "
-        f"{'Avg Latency(us)':>15}  {'Exec%':>6}   {'Avg Head OH(us)':>15}  {'Avg Tail OH(us)':>15}"
+        f"{'Avg Latency(us)':>15}  {'Exec%':>6}   {'Avg Head OH(us)':>15}  {'Avg Tail OH(us)':>15}  "
+        f"{'Avg Prop(us)':>12}  {'Avg Local(us)':>13}"
     )
-    print("-" * 110)
+    print("-" * 140)
 
     # Sort by func_id for consistent output
     total_count = 0
@@ -619,17 +669,25 @@ def print_task_statistics(tasks, func_id_to_name=None):
             sum(stats["tail_overheads"]) / len(stats["tail_overheads"]) if stats["tail_overheads"] else 0
         )
         avg_latency = stats["total_latency"] / count if count > 0 else 0
+        # `None` (not NaN) signals "no v3 receive_time data on this func" so
+        # the print line below renders a dash. NaN would force ruff's
+        # PLR0124 self-compare idiom.
+        avg_propagation = sum(stats["propagations"]) / len(stats["propagations"]) if stats["propagations"] else None
+        avg_local_setup = sum(stats["local_setups"]) / len(stats["local_setups"]) if stats["local_setups"] else None
 
         # Calculate execution ratio: total_exec_time / total_latency
         exec_ratio = (stats["total_exec_time"] / stats["total_latency"] * 100) if stats["total_latency"] > 0 else 0
 
+        prop_str = f"{avg_propagation:>12.2f}" if avg_propagation is not None else f"{'-':>12}"
+        local_str = f"{avg_local_setup:>13.2f}" if avg_local_setup is not None else f"{'-':>13}"
         print(
             f"{func_id:<8} {func_name:<12} {count:>5}   {avg_duration:>12.2f}  {avg_latency:>15.2f}  "
-            f"{exec_ratio:>5.1f}%   {avg_head_overhead:>15.2f}  {avg_tail_overhead:>15.2f}"
+            f"{exec_ratio:>5.1f}%   {avg_head_overhead:>15.2f}  {avg_tail_overhead:>15.2f}  "
+            f"{prop_str}  {local_str}"
         )
 
     # Print total row
-    print("-" * 110)
+    print("-" * 140)
 
     # Calculate total latency (sum of all latencies)
     total_latency_sum = sum(stats["total_latency"] for stats in func_stats.values())
@@ -656,11 +714,11 @@ def print_task_statistics(tasks, func_id_to_name=None):
     print("=" * 110)
 
 
-def build_overhead_counter_events(tasks, deps_edges, pid=3):  # noqa: PLR0912
+def build_overhead_counter_events(tasks, deps_edges, pid=2):  # noqa: PLR0912
     """Per-engine + system overhead counter tracks for the Perfetto trace.
 
     Emits 8 counter (``"ph":"C"``) series under the AICPU Scheduler process
-    (``pid=3``) — it is scheduler-overhead analysis, so it lives in the sched
+    (``pid=2``) — it is scheduler-overhead analysis, so it lives in the sched
     group — overlaid alongside the scheduler's own tracks and the AICore task
     bars to see, at every instant, why time is or is not wasted (see
     docs/dfx/sched-overhead-model.md):
@@ -755,7 +813,7 @@ def _u64(x):
         "oh_all_overhead": 9121,
         "oh_has_overhead": 9122,
     }
-    # No process metadata: pid=3 is the AICPU Scheduler process (named/sorted
+    # No process metadata: pid=2 is the AICPU Scheduler process (named/sorted
     # elsewhere). Emitting it here would override that — these counters just
     # join the scheduler group as extra tracks (names prefixed "oh_").
     events = []
@@ -825,10 +883,10 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
         core_to_thread: Optional list mapping core_id (index) to scheduler thread index (-1 = unassigned)
 
     Generates processes in the trace:
-        - pid=1 "AICore View": start_time_us to end_time_us (kernel execution)
-        - pid=2 "AICPU View": dispatch_time_us to finish_time_us (AICPU perspective)
-        - pid=3 "AICPU Scheduler": scheduler phase bars (l2_swimlane_level >= 3)
-        - pid=4 "AICPU Orchestrator": orchestrator phase bars or summary (l2_swimlane_level >= 4)
+        - pid=4 "Worker View": start_time_us to end_time_us (kernel execution)
+        - pid=3 "Scheduler View": dispatch_time_us to finish_time_us (AICPU perspective)
+        - pid=2 "AICPU Scheduler": scheduler phase bars (l2_swimlane_level >= 3)
+        - pid=1 "AICPU Orchestrator": orchestrator phase bars or summary (l2_swimlane_level >= 4)
     """
     if verbose:
         print("Generating Chrome Trace JSON...")
@@ -851,7 +909,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
     # Recover func_id for AICORE_TIMING (level=1) records, which the host
     # emits as func_id=-1. Resolve once here against dep_gen's per-task
     # kernel_ids[3] (picking the subslot by core_type) and write it back onto
-    # the task, so every downstream consumer — AICore View, AICPU View, and
+    # the task, so every downstream consumer — Worker View, Scheduler View, and
     # event-hints — sees the same real func_id. See
     # resolve_func_id_from_kernel_map() for the AIV0-vs-AIV1 tie-break and the
     # host-side contract.
@@ -865,20 +923,25 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
     # Step 2: Generate JSON events
     events = []
 
-    # Metadata event: Process names and sort order
-    # Display order: Orchestrator (pid=4) → Scheduler (pid=3) → AICPU View (pid=2) → AICore View (pid=1)
-    events.append({"args": {"name": "AICore View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 1})
-    events.append({"args": {"sort_index": 4}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 1})
+    # Metadata event: Process names and sort order.
+    # pid is renumbered in pipeline order (top → bottom in Perfetto):
+    #   pid=1  AICPU Orchestrator  (submits tasks — earliest)
+    #   pid=2  AICPU Scheduler     (pops ready, dispatches, completes)
+    #   pid=3  Scheduler View      (AICPU-eye view of each worker's dispatch→finish)
+    #   pid=4  Worker View         (physical AIC/AIV + DUMMY virtual workers — latest)
+    # sort_index intentionally equals pid so JSON ordering is self-evident.
+    events.append({"args": {"name": "Worker View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 4})
+    events.append({"args": {"sort_index": 4}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 4})
 
     # Check if any task has AICPU timestamps
     has_aicpu_data = any(task.get("dispatch_time_us", 0) >= 0 and task.get("finish_time_us", 0) > 0 for task in tasks)
 
     if has_aicpu_data:
         events.append(
-            {"args": {"name": "AICPU View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 2}
+            {"args": {"name": "Scheduler View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 3}
         )
         events.append(
-            {"args": {"sort_index": 3}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 2}
+            {"args": {"sort_index": 3}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 3}
         )
 
     # Metadata events: Thread names (one per core)
@@ -894,7 +957,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
         core_type_str = (core_type or "unknown").upper()
         thread_name = f"{core_type_str}_{core_id}"
         events.append(
-            {"args": {"name": thread_name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 1, "tid": tid}
+            {"args": {"name": thread_name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 4, "tid": tid}
         )
 
     # Duration events (Complete events "X")
@@ -912,6 +975,11 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
             for succ in succs:
                 fanin_map[succ].append(pred)
 
+    # One clock period at the AICore counter resolution. The device counter
+    # ticks at 50 MHz on a2a3 (20 ns) — sub-period local_setup intervals are
+    # noise (warm dcci is cycle-zero), so don't emit invisible zero-width bars.
+    _SETUP_MIN_US = 0.02
+
     for task in tasks:
         tid = core_to_tid[task["core_id"]]
         ts = task["start_time_us"]
@@ -932,6 +1000,35 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
         fanout_str = f"{len(fanout_ids)}: [" + ", ".join(format_task_display(x) for x in fanout_ids) + "]"
         fanin_str = f"{len(fanin_ids)}: [" + ", ".join(format_task_display(x) for x in fanin_ids) + "]"
 
+        # Setup bar (receive → start): visualises AICore-local critical-path
+        # prep (dcci + ack for the common path, ack-only for speculation-hit
+        # path — see docs/investigations/2026-06-aicore-cold-start-warmup.md).
+        # Same `name` across all setup bars → Perfetto auto-assigns one
+        # consistent color, distinct from kernel bars. Emit only when above
+        # one clock period; warm-cache dcci+ack lands sub-period and would
+        # render as an invisible zero-width slice.
+        local_setup_us = task.get("local_setup_us", 0.0) or 0.0
+        receive_time_us = task.get("receive_time_us")
+        if local_setup_us >= _SETUP_MIN_US and receive_time_us is not None:
+            events.append(
+                {
+                    "args": {
+                        "event-hint": f"setup: dcci+ack for Task:{tdisp} CoreId:{task['core_id']}",
+                        "local_setup_us": local_setup_us,
+                        "taskId": task["task_id"],
+                    },
+                    "cat": "event",
+                    "id": event_id,
+                    "name": "setup",
+                    "ph": "X",
+                    "pid": 4,
+                    "tid": tid,
+                    "ts": receive_time_us,
+                    "dur": local_setup_us,
+                }
+            )
+            event_id += 1
+
         events.append(
             {
                 "args": {
@@ -945,7 +1042,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                 "id": event_id,
                 "name": task_name,
                 "ph": "X",
-                "pid": 1,
+                "pid": 4,
                 "tid": tid,
                 "ts": ts,
                 "dur": dur,
@@ -956,7 +1053,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
         task_to_event_id[(task["task_id"], task["core_id"])] = event_id
         event_id += 1
 
-    # AICPU View duration events (dispatch_time to finish_time)
+    # Scheduler View duration events (dispatch_time to finish_time)
     # Assign overlapping tasks on the same core to different tids so Perfetto
     # renders each bar on its own row (Perfetto requires strict nesting on a tid).
     if has_aicpu_data:
@@ -992,7 +1089,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                 task_to_aicpu_tid[(task["task_id"], task["core_id"])] = tid
                 aicpu_tid_set.add(tid)
 
-        # Thread name metadata for AICPU View (one entry per unique tid used)
+        # Thread name metadata for Scheduler View (one entry per unique tid used)
         for core_id, base_tid in core_to_tid.items():
             ct_list = _core_aicpu_tasks.get(core_id)
             core_type_str = ct_list[0]["core_type"].upper() if ct_list else "unknown"
@@ -1005,7 +1102,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                         "cat": "__metadata",
                         "name": "thread_name",
                         "ph": "M",
-                        "pid": 2,
+                        "pid": 3,
                         "tid": base_tid,
                     }
                 )
@@ -1018,7 +1115,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                         "cat": "__metadata",
                         "name": "thread_name",
                         "ph": "M",
-                        "pid": 2,
+                        "pid": 3,
                         "tid": overflow_tid,
                     }
                 )
@@ -1052,7 +1149,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                     "id": event_id,
                     "name": task_name,
                     "ph": "X",
-                    "pid": 2,
+                    "pid": 3,
                     "tid": tid,
                     "ts": dispatch_us,
                     "dur": aicpu_dur,
@@ -1094,13 +1191,20 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
 
             for succ_task in task_map[succ_task_id]:
                 dst_tid = core_to_tid[succ_task["core_id"]]
-                dst_ts_start = succ_task["start_time_us"]
+                # Land the dependency arrow at the consumer's `receive_time` —
+                # the instant AICore saw the dispatch signal on DATA_MAIN_BASE.
+                # The gap between this anchor and the kernel's `start_time` is
+                # the per-task local_setup (dcci + ack), now visible as the
+                # "setup" sub-bar PR #1004 adds. Fall back to start_time for
+                # captures without v3 receive_time (older swimlane records or
+                # AICPU-only joins).
+                dst_ts_start = succ_task.get("receive_time_us") or succ_task["start_time_us"]
                 dst_event_id = task_to_event_id[(succ_task["task_id"], succ_task["core_id"])]
 
-                # Happens-before violation: producer outlived consumer's start.
-                # Real time order broke the data dependency the graph asserted;
-                # the runtime got away with it (consumer presumably re-read fresh
-                # data) but it's a smell — surface it.
+                # Happens-before violation: producer outlived consumer's
+                # receive. Real time order broke the data dependency the graph
+                # asserted; the runtime got away with it (consumer presumably
+                # re-read fresh data) but it's a smell — surface it.
                 hb_violated = src_ts_end > dst_ts_start
                 flow_name = "hb_violation" if hb_violated else "dependency"
                 if hb_violated:
@@ -1113,7 +1217,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                         "id": flow_id,
                         "name": flow_name,
                         "ph": "s",
-                        "pid": 1,
+                        "pid": 4,
                         "tid": src_tid,
                         "ts": flow_start_us,
                         "bind_id": src_event_id,
@@ -1126,7 +1230,7 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
                         "id": flow_id,
                         "name": flow_name,
                         "ph": "f",
-                        "pid": 1,
+                        "pid": 4,
                         "tid": dst_tid,
                         "ts": dst_ts_start,
                         "bp": "e",
@@ -1147,10 +1251,10 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
     if scheduler_phases:
         # Process metadata
         events.append(
-            {"args": {"name": "AICPU Scheduler"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 3}
+            {"args": {"name": "AICPU Scheduler"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 2}
         )
         events.append(
-            {"args": {"sort_index": 2}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 3}
+            {"args": {"sort_index": 2}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 2}
         )
 
         # Phase color mapping. The Perfetto sched lane only renders the
@@ -1162,11 +1266,11 @@ def generate_chrome_trace_json(  # noqa: PLR0912, PLR0913, PLR0915
         phase_colors = {
             "complete": "good",  # green
             "dispatch": "terrible",  # red
-            "poll": "yellow",  # completion-poll that scanned but retired nothing
             "release": "olive",  # deferred-release drain (on_task_release work)
-            "fanout": "vsync_highlight_color",  # nested in complete: fanin + doorbells
-            "scan": "thread_state_iowait",  # one running-core MMIO COND scan pass
-            "prestage": "rail_animation",  # speculative early-dispatch (Hook 1) staging
+            "resolve": "vsync_highlight_color",  # nested in complete: walk consumer list + doorbells
+            "early_dispatch": "rail_animation",  # speculative early-dispatch (Hook 1) staging
+            # Rendered separately on Worker View; this color is a fallback if it ever lands on Sched.
+            "dummy_task": "grey",
         }
 
         # Per-complete subtask-finish counts surface "how many AICore FINs
@@ -1220,6 +1324,28 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                     continue
                 finishes_per_complete[id(t_comp)] += 1
 
+        # Worker View (pid=4) DUMMY lanes: one per scheduler thread, used by
+        # the `dummy_task` phase below to surface DAG fence / barrier nodes.
+        # Tids 19000+thread_idx keep them visually grouped after the physical
+        # AIC/AIV lanes (which sit at 10000+core_id*10).
+        DUMMY_TID_BASE = 19000  # noqa: N806
+        for thread_idx in range(len(scheduler_phases)):
+            events.append(
+                {
+                    "args": {"name": f"DUMMY_T{thread_idx}"},
+                    "cat": "__metadata",
+                    "name": "thread_name",
+                    "ph": "M",
+                    "pid": 4,
+                    "tid": DUMMY_TID_BASE + thread_idx,
+                }
+            )
+
+        # Dummy task X event width — start/end on the device coincide
+        # (identification is a single MSR read); render as a 0.02 µs sliver
+        # so Perfetto picks it up instead of collapsing it to a hairline.
+        DUMMY_BAR_MIN_DUR_US = 0.02  # noqa: N806
+
         for thread_idx, thread_records in enumerate(scheduler_phases):
             tid = 3000 + thread_idx
 
@@ -1230,7 +1356,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                     "cat": "__metadata",
                     "name": "thread_name",
                     "ph": "M",
-                    "pid": 3,
+                    "pid": 2,
                     "tid": tid,
                 }
             )
@@ -1239,9 +1365,37 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
             # that otherwise hide inside an idle stretch (poll = completion-scan
             # that retired nothing; release = on_task_release drain). Genuine
             # spin emits no record and shows as a blank gap.
+            #
+            # `dummy_task` is special: it does NOT live on the sched track —
+            # it represents a DAG fence node briefly inhabiting the AICPU as a
+            # virtual worker, so we route it to Worker View (pid=4) DUMMY_T{thr}.
             for record in thread_records:
                 phase = record.get("phase", "unknown")
-                if phase not in ("complete", "dispatch", "poll", "release", "fanout", "scan", "prestage"):
+                if phase == "dummy_task":
+                    start_us = record["start_time_us"]
+                    end_us = record["end_time_us"]
+                    dur = max(end_us - start_us, DUMMY_BAR_MIN_DUR_US)
+                    task_id_low32 = record.get("tasks_processed", 0)
+                    events.append(
+                        {
+                            "args": {
+                                "phase": "dummy_task",
+                                "loop_iter": record.get("loop_iter", 0),
+                                "dummy_task_id_low32": task_id_low32,
+                                "event-hint": f"dummy(t{task_id_low32})",
+                            },
+                            "cat": "event",
+                            "cname": "grey",
+                            "name": f"dummy(t{task_id_low32})",
+                            "ph": "X",
+                            "pid": 4,
+                            "tid": DUMMY_TID_BASE + thread_idx,
+                            "ts": start_us,
+                            "dur": dur,
+                        }
+                    )
+                    continue
+                if phase not in ("complete", "dispatch", "release", "resolve", "early_dispatch"):
                     continue
                 start_us = record["start_time_us"]
                 end_us = record["end_time_us"]
@@ -1300,7 +1454,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "cname": phase_colors.get(phase, "generic_work"),
                         "name": display_name,
                         "ph": "X",
-                        "pid": 3,
+                        "pid": 2,
                         "tid": tid,
                         "ts": start_us,
                         "dur": dur,
@@ -1314,8 +1468,9 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                 # also breaks Perfetto's rate calc (divide-by-zero → NULL).
                 # Track name carries thread index so it reads standalone
                 # even with the thread tree collapsed. Only complete/dispatch
-                # carry real queue depths; poll/release/fanout zero-fill them,
-                # so skip their counter samples to avoid spurious 0 dips.
+                # carry real queue depths; release/resolve/early_dispatch zero-
+                # fill them, so skip their counter samples to avoid spurious 0
+                # dips.
                 if phase not in ("complete", "dispatch"):
                     continue
                 if not depths_valid:
@@ -1327,7 +1482,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "cat": "queue",
                         "name": local_track_name,
                         "ph": "C",
-                        "pid": 3,
+                        "pid": 2,
                         "tid": tid,
                         "ts": end_us,
                     }
@@ -1343,7 +1498,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "cat": "queue",
                         "name": "shared_ready_queue",
                         "ph": "C",
-                        "pid": 3,
+                        "pid": 2,
                         "tid": 3999,
                         "ts": end_us,
                     }
@@ -1359,7 +1514,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                 "cat": "__metadata",
                 "name": "thread_name",
                 "ph": "M",
-                "pid": 3,
+                "pid": 2,
                 "tid": 3999,
             }
         )
@@ -1369,7 +1524,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                 "cat": "__metadata",
                 "name": "thread_sort_index",
                 "ph": "M",
-                "pid": 3,
+                "pid": 2,
                 "tid": 3999,
             }
         )
@@ -1384,10 +1539,10 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
         # Process metadata
         orch_process_label = f"AICPU {orchestrator_name}" if orchestrator_name else "AICPU Orchestrator"
         events.append(
-            {"args": {"name": orch_process_label}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 4}
+            {"args": {"name": orch_process_label}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 1}
         )
         events.append(
-            {"args": {"sort_index": 1}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 4}
+            {"args": {"sort_index": 1}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 1}
         )
 
         # Thread name metadata for each orchestrator thread
@@ -1395,7 +1550,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
             tid = 4000 + orch_idx
             name = f"Orch_{orch_idx}"
             events.append(
-                {"args": {"name": name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 4, "tid": tid}
+                {"args": {"name": name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 1, "tid": tid}
             )
 
         # Per-task orchestrator phase bars. As of PR-X the device folds
@@ -1413,6 +1568,37 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
             "orch_fanin": "rail_animation",
         }
 
+        # Build the regular-task and dummy-task id sets so the orch-phase loop
+        # below can identify which orch_submit envelopes belong to
+        # `alloc_tensors()` calls (no AICore record, no DummyTask phase — purely
+        # inline-completed by the orchestrator on host CPU). Those get a
+        # parallel ALLOC bar on Worker View pid=4 so the DAG node is visible.
+        regular_task_ids = {int(t.get("task_id", -1)) for t in tasks}
+        dummy_low32 = set()
+        if scheduler_phases:
+            for thread_records in scheduler_phases:
+                for rec in thread_records:
+                    if rec.get("phase") == "dummy_task":
+                        dummy_low32.add(rec.get("tasks_processed", 0))
+
+        # ALLOC lane metadata on Worker View (pid=4). One ALLOC_T{orch_idx}
+        # lane per orchestrator thread (mirror DUMMY_T{thread_idx} convention).
+        # Tids 18000+orch_idx keep ALLOC lanes grouped just before the DUMMY
+        # lanes (tids 19000+) so the Worker View reads:
+        #   AIC_* / AIV_* / ALLOC_T* / DUMMY_T*
+        ALLOC_TID_BASE = 18000  # noqa: N806
+        for orch_idx in range(len(orchestrator_phases)):
+            events.append(
+                {
+                    "args": {"name": f"ALLOC_T{orch_idx}"},
+                    "cat": "__metadata",
+                    "name": "thread_name",
+                    "ph": "M",
+                    "pid": 4,
+                    "tid": ALLOC_TID_BASE + orch_idx,
+                }
+            )
+
         for orch_idx, thread_records in enumerate(orchestrator_phases):
             tid = 4000 + orch_idx
             for record in thread_records:
@@ -1438,14 +1624,42 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                     "cname": orch_phase_colors.get(phase, "generic_work"),
                     "name": label,
                     "ph": "X",
-                    "pid": 4,
+                    "pid": 1,
                     "tid": tid,
                     "ts": start_us,
                     "dur": dur,
                 }
                 events.append(event)
 
-    # AICPU View fanout arrows (duplicate AICore View flow events using AICPU timestamps)
+                # Classification: orch_submit envelopes whose task_id is NOT in
+                # the AICore record set AND NOT in DummyTask phases are
+                # alloc_tensors() calls. Render a parallel "alloc(...)" bar on
+                # Worker View pid=4 ALLOC_T{orch_idx} so DAG nodes for alloc
+                # tasks are visible (matches the DUMMY_T treatment).
+                if phase == "orch_submit" and task_id >= 0:
+                    is_regular = task_id in regular_task_ids
+                    task_low32 = task_id & 0xFFFFFFFF
+                    is_dummy = task_low32 in dummy_low32
+                    if not is_regular and not is_dummy:
+                        events.append(
+                            {
+                                "args": {
+                                    "phase": "alloc",
+                                    "task_id": task_id,
+                                    "event-hint": f"alloc({format_task_display(task_id)})",
+                                },
+                                "cat": "event",
+                                "cname": "olive",
+                                "name": f"alloc({format_task_display(task_id)})",
+                                "ph": "X",
+                                "pid": 4,
+                                "tid": ALLOC_TID_BASE + orch_idx,
+                                "ts": start_us,
+                                "dur": max(dur, 0.02),
+                            }
+                        )
+
+    # Scheduler View fanout arrows (duplicate Worker View flow events using AICPU timestamps)
     if has_aicpu_data:
         for task in tasks:
             src_finish_us = task.get("finish_time_us", 0)
@@ -1480,7 +1694,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "id": flow_id,
                         "name": aicpu_flow_name,
                         "ph": "s",
-                        "pid": 2,
+                        "pid": 3,
                         "tid": src_tid,
                         "ts": src_finish_us - 0.01,
                     }
@@ -1493,7 +1707,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "id": flow_id,
                         "name": aicpu_flow_name,
                         "ph": "f",
-                        "pid": 2,
+                        "pid": 3,
                         "tid": dst_tid,
                         "ts": dst_dispatch_us,
                         "bp": "e",
@@ -1604,7 +1818,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                     "id": flow_id,
                     "name": "complete",
                     "ph": "s",
-                    "pid": 1,
+                    "pid": 4,
                     "tid": src_tid,
                     "ts": last_end_us - FLOW_EPSILON_US,
                 }
@@ -1615,7 +1829,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                     "id": flow_id,
                     "name": "complete",
                     "ph": "f",
-                    "pid": 3,
+                    "pid": 2,
                     "tid": dst_tid,
                     "ts": comp["start_time_us"],
                     "bp": "e",
@@ -1699,7 +1913,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                             "id": flow_id,
                             "name": "complete→ready",
                             "ph": "s",
-                            "pid": 3,
+                            "pid": 2,
                             "tid": src_tid,
                             # Anchor inside the complete phase X event so
                             # clicking the complete block surfaces this arrow.
@@ -1712,7 +1926,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                             "id": flow_id,
                             "name": "complete→ready",
                             "ph": "f",
-                            "pid": 3,
+                            "pid": 2,
                             "tid": 3000 + d_thr,
                             "ts": d_us,
                             "bp": "e",
@@ -1770,14 +1984,14 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                 core_tid = core_to_tid[task["core_id"]]
                 aicpu_tid = task_to_aicpu_tid.get((task["task_id"], task["core_id"]), core_tid)
 
-                # Flow: scheduler DISPATCH → AICore View task start
+                # Flow: scheduler DISPATCH → Worker View task start
                 events.append(
                     {
                         "cat": "flow",
                         "id": flow_id,
                         "name": "dispatch",
                         "ph": "s",
-                        "pid": 3,
+                        "pid": 2,
                         "tid": sched_tid,
                         "ts": dispatch_us,
                     }
@@ -1788,7 +2002,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "id": flow_id,
                         "name": "dispatch",
                         "ph": "f",
-                        "pid": 1,
+                        "pid": 4,
                         "tid": core_tid,
                         "ts": task["start_time_us"],
                         "bp": "e",
@@ -1796,7 +2010,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                 )
                 flow_id += 1
 
-                # Flow: scheduler DISPATCH → AICPU View task start
+                # Flow: scheduler DISPATCH → Scheduler View task start
                 aicpu_eid = task_to_aicpu_event_id.get((task["task_id"], task["core_id"]))
                 events.append(
                     {
@@ -1804,7 +2018,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "id": flow_id,
                         "name": "dispatch",
                         "ph": "s",
-                        "pid": 3,
+                        "pid": 2,
                         "tid": sched_tid,
                         "ts": dispatch_us,
                     }
@@ -1814,7 +2028,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                     "id": flow_id,
                     "name": "dispatch",
                     "ph": "f",
-                    "pid": 2,
+                    "pid": 3,
                     "tid": aicpu_tid,
                     "ts": dispatch_us,
                     "bp": "e",
@@ -1883,7 +2097,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "id": flow_id,
                         "name": flow_name,
                         "ph": "s",
-                        "pid": 4,
+                        "pid": 1,
                         "tid": orch_tid,
                         "ts": anchor_us,
                     }
@@ -1894,7 +2108,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float):
                         "id": flow_id,
                         "name": flow_name,
                         "ph": "f",
-                        "pid": 3,
+                        "pid": 2,
                         "tid": sched_tid,
                         "ts": dispatch_us,
                         "bp": "e",
diff --git a/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h b/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
index 39450f33c..79c9252f0 100644
--- a/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
+++ b/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
@@ -97,12 +97,20 @@ struct L2SwimlaneAicoreLocalState {
  *                        the same core — each dispatch needs its own AICore
  *                        record matched to its own AICPU record, which
  *                        task_token_raw alone cannot disambiguate.
- * @param start_time      Start timestamp (get_sys_cnt)
+ * @param receive_time    Timestamp captured immediately after `read_reg(DATA_MAIN_BASE)`
+ *                        returns the new task_id (before dcci+ack). Stored as a
+ *                        32-bit delta `start_time - receive_time` in the record;
+ *                        host recovers `receive_time = start_time - delta`.
+ *                        Lets DFX split head_OH into the unfixable
+ *                        AICPU→AICore NoC propagation (dispatch_ts → receive_time)
+ *                        and the AICore-local dcci+ack cost (receive_time → start_time).
+ * @param start_time      Start timestamp (get_sys_cnt) — post-dcci+ack, just
+ *                        before kernel `execute_task`.
  * @param end_time        End timestamp
  */
 __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_record_task(
     __gm__ L2SwimlaneActiveHead *head, L2SwimlaneAicoreLocalState *local, uint64_t task_token_raw, uint32_t reg_task_id,
-    uint64_t start_time, uint64_t end_time
+    uint64_t receive_time, uint64_t start_time, uint64_t end_time
 ) {
     // Re-fetch head channel each task; cheap relative to the
     // baseline `dcci(payload, ENTIRE_DATA_CACHE)` we already pay per task.
@@ -132,6 +140,9 @@ __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_
     record->end_time = end_time;
     record->task_token_raw = task_token_raw;
     record->reg_task_id = reg_task_id;
+    // 32-bit delta; receive_time always precedes start_time on the same core
+    // (sys_cnt is monotonic per AICore), so the subtraction can never wrap.
+    record->receive_to_start_cycles = static_cast<uint32_t>(start_time - receive_time);
     local->slot_within_buf = slot + 1;
 
     // Flush record to GM so host can read it after the buffer is enqueued.
diff --git a/src/a2a3/platform/include/common/l2_swimlane_profiling.h b/src/a2a3/platform/include/common/l2_swimlane_profiling.h
index 892140c61..4ed94a588 100644
--- a/src/a2a3/platform/include/common/l2_swimlane_profiling.h
+++ b/src/a2a3/platform/include/common/l2_swimlane_profiling.h
@@ -158,16 +158,25 @@ static_assert(sizeof(L2SwimlaneAicpuTaskRecord) == 32, "L2SwimlaneAicpuTaskRecor
  *   reg_task_id, giving a clean 1:1 join even when multiple dispatches
  *   of the same task land on the same core.
  *
- * Layout: 24B identity/timing + 4B reg_task_id + 4B pad → 32B (half a
- * cache line). Two records pack into one cache line so AICore's per-task
- * store is at most a single line commit + dcci.
+ * Layout: 24B identity/timing + 4B reg_task_id + 4B receive_to_start delta →
+ * 32B (half a cache line). Two records pack into one cache line so AICore's
+ * per-task store is at most a single line commit + dcci.
+ *
+ * receive_to_start_cycles isolates the AICore-side dcci+ack cost from the
+ * AICPU→AICore NoC propagation. AICore captures receive_time right after
+ * `read_reg(DATA_MAIN_BASE)` returns the new task_id (before dcci+ack), and
+ * start_time after them. Host derives:
+ *   - receive_time  = start_time - receive_to_start_cycles
+ *   - propagation   = receive_time - dispatch_ts (AICPU view)
+ *   - local_setup   = receive_to_start_cycles    (dcci + ack)
+ * Delta fits in 32 bits at any platform clock (50 MHz @ 32-bit ≈ 85 s).
  */
 struct L2SwimlaneAicoreTaskRecord {
-    uint64_t start_time;      // Task start timestamp (get_sys_cnt)
-    uint64_t end_time;        // Task end timestamp
-    uint64_t task_token_raw;  // PTO2TaskId::raw — identity (NOT join key)
-    uint32_t reg_task_id;     // Per-core dispatch token — host join key vs AICPU stream
-    uint32_t _pad;
+    uint64_t start_time;               // Post-dcci+ack timestamp (kernel begins next)
+    uint64_t end_time;                 // Post-kernel timestamp
+    uint64_t task_token_raw;           // PTO2TaskId::raw — identity (NOT join key)
+    uint32_t reg_task_id;              // Per-core dispatch token — host join key vs AICPU stream
+    uint32_t receive_to_start_cycles;  // start_time - receive_time (AICore-local dcci + ack cost)
 } __attribute__((aligned(32)));
 
 static_assert(sizeof(L2SwimlaneAicoreTaskRecord) == 32, "L2SwimlaneAicoreTaskRecord must be 32B");
@@ -471,19 +480,22 @@ static_assert(sizeof(L2SwimlaneDataHeader) % 64 == 0, "L2SwimlaneDataHeader must
 enum class L2SwimlaneSchedPhaseKind : uint32_t {
     Complete = 0,  // Process completed tasks (fanin traversal)
     Dispatch = 1,  // Dispatch ready tasks to idle cores
-    // Surfaced for loop iterations that produce no Complete/Dispatch bar but
-    // still do real work — otherwise hidden inside the "idle" stretch.
-    Poll = 2,     // Completion poll scanned running cores but retired nothing
-    Release = 3,  // Deferred-release drain (on_task_release fanout/refcount work)
-    // Nested child kinds — rendered inside their parent Complete/Dispatch bar
-    // (Perfetto stacks time-contained events on the same lane).
-    Fanout = 4,  // on_mixed_task_complete: fanin release + consumer doorbells
-    Scan = 5,    // one check_running_cores pass — tasks_processed = cores' MMIO COND read
-    // Speculative early-dispatch (Hook 1): building + publishing a flagged
-    // producer's consumer's gated blocks. Its own bar so a long staging pass is
-    // NOT mislabeled Poll — it runs on the producer's owner thread and the
-    // payload-build cost is real (~us/subtask), so it must be visible directly.
-    Prestage = 6,  // try_speculative_prestage — tasks_processed = blocks staged
+    Release = 2,   // Deferred-release drain (on_task_release work)
+    // The following are nested inside their parent Complete/Dispatch bar
+    // (Perfetto stacks time-contained events on the same lane), except
+    // DummyTask which the converter routes to Worker View pid=4.
+    Resolve = 3,        // on_task_complete: walk consumer list, decrement fanin,
+                        // push newly-ready successors, ring doorbells for
+                        // speculative hits. tasks_processed = # consumers visited.
+    EarlyDispatch = 4,  // try_speculative_early_dispatch — speculative pre-staging
+                        // of a flagged producer's consumer's gated blocks.
+                        // tasks_processed = blocks staged this pass.
+    DummyTask = 5,      // Dummy (DAG fence / barrier node) identity marker
+                        // emitted once per dummy in dummy_drain. No AICore
+                        // presence; the converter renders this as a thin bar on
+                        // Worker View's DUMMY_T{thread} lane. tasks_processed =
+                        // task_token_raw low 32 bits. The surrounding Resolve
+                        // bar captures the consumer-release work that follows.
 };
 
 /** Index layout of the queue-depth snapshot arrays below: AIC=0, AIV=1, MIX=2.
diff --git a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp b/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
index 088ee645e..5f4459660 100644
--- a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
+++ b/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
@@ -764,7 +764,7 @@ int L2SwimlaneCollector::export_swimlane_json() {
     // file size). Column order is documented in the schema comment at the top
     // of swimlane_converter.py's v2 reader.
     //
-    //   aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles]
+    //   aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, receive_to_start_cycles]
     //   aicpu_tasks:  [core_id, reg_task_id, dispatch_cycles, finish_cycles]
     {
         // copy_aicore_buffer already drops r.start_time == 0 slots when
@@ -776,7 +776,7 @@ int L2SwimlaneCollector::export_swimlane_json() {
             for (const auto &r : collected_aicore_records_[core_idx]) {
                 if (!first) outfile << ",";
                 outfile << "\n    [" << core_idx << ", " << r.task_token_raw << ", " << r.reg_task_id << ", "
-                        << r.start_time << ", " << r.end_time << "]";
+                        << r.start_time << ", " << r.end_time << ", " << r.receive_to_start_cycles << "]";
                 first = false;
                 total++;
             }
@@ -813,16 +813,14 @@ int L2SwimlaneCollector::export_swimlane_json() {
                 return "complete";
             case L2SwimlaneSchedPhaseKind::Dispatch:
                 return "dispatch";
-            case L2SwimlaneSchedPhaseKind::Poll:
-                return "poll";
             case L2SwimlaneSchedPhaseKind::Release:
                 return "release";
-            case L2SwimlaneSchedPhaseKind::Fanout:
-                return "fanout";
-            case L2SwimlaneSchedPhaseKind::Scan:
-                return "scan";
-            case L2SwimlaneSchedPhaseKind::Prestage:
-                return "prestage";
+            case L2SwimlaneSchedPhaseKind::Resolve:
+                return "resolve";
+            case L2SwimlaneSchedPhaseKind::EarlyDispatch:
+                return "early_dispatch";
+            case L2SwimlaneSchedPhaseKind::DummyTask:
+                return "dummy_task";
             }
             return "unknown";
         };
diff --git a/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp b/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
index 9c50b986e..30417ae8e 100644
--- a/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
+++ b/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
@@ -84,6 +84,16 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
         }
 
         {
+            // receive_time captures the instant DATA_MAIN_BASE returned a new
+            // task_id, BEFORE the ack write. Paired with start_time (captured
+            // after task_ptr resolve) it lets DFX split head_OH into the
+            // AICPU→AICore NoC propagation (dispatch_ts → receive_time,
+            // hardware-bound) and the AICore-local ack + task_ptr resolve
+            // (receive_time → start_time). host_build_graph has no per-task
+            // dcci so the local-setup span is naturally tighter than the
+            // tensormap_and_ringbuffer runtime; the field still records it.
+            uint64_t receive_time = get_sys_cnt_aicore();
+
             uint32_t actual_task_id = task_id;
             write_reg(RegId::COND, MAKE_ACK_VALUE(actual_task_id));
 
@@ -118,7 +128,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
                 // dispatch token coincide and a single value covers both.
                 l2_swimlane_aicore_record_task(
                     l2_swimlane_head, &l2_swimlane_local, static_cast<uint64_t>(actual_task_id),
-                    static_cast<uint32_t>(actual_task_id), start_time, end_time
+                    static_cast<uint32_t>(actual_task_id), receive_time, start_time, end_time
                 );
             }
 
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
index 54d3a997c..f10737969 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
@@ -133,6 +133,29 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
         }
 
         {
+            // receive_time marks the moment AICPU's full "task is ready to
+            // execute" signal landed on this core. Paired with start_time
+            // (captured after the per-task dcci + ack pair) it lets DFX split
+            // head_OH into the AICPU→AICore-ready propagation (dispatch_ts →
+            // receive_time, hardware + scheduling-bound) and the AICore-local
+            // critical-path prep (receive_time → start_time, software-tunable).
+            // Stored in the record as a 32-bit delta `start_time - receive_time`.
+            //
+            // For the common path (not_ready == 0) the new task_id on
+            // DATA_MAIN_BASE is itself the ready signal, so receive_time is
+            // stamped immediately and local_setup covers dcci + ack.
+            //
+            // For the speculative early-dispatch path (not_ready == 1) the
+            // dcci ran BEFORE the dependency-wait spin, so its cost is hidden
+            // behind the doorbell-wait — not on the critical path between
+            // "task genuinely ready" and "kernel begins". receive_time is
+            // re-stamped after the doorbell arrives, so propagation absorbs
+            // both the original NoC delivery AND any speculation overshoot,
+            // while local_setup stays the pure ack-on-critical-path cost. This
+            // makes local_setup the clean "AICore prep we can't hide" figure
+            // for both paths.
+            uint64_t receive_time = get_sys_cnt_aicore();
+
             uint32_t task_id = reg_val;  // Decode: register holds task_id directly
 
             // Select dual-buffer slot: same bit as AICPU used when writing payload
@@ -172,6 +195,12 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
                     write_reg(RegId::COND, AICORE_EXITED_VALUE);
                     break;
                 }
+                // Re-stamp receive_time at the moment the doorbell landed: the
+                // dcci above ran during the speculative-staging window
+                // (overlapped with the dependency wait, off the critical path).
+                // Propagation now absorbs the speculation overshoot; local_setup
+                // = start - receive stays the pure ack-on-critical-path cost.
+                receive_time = get_sys_cnt_aicore();
             }
 
             write_reg(RegId::COND, MAKE_ACK_VALUE(task_id));
@@ -221,7 +250,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
                 uint64_t end_time = get_sys_cnt_aicore();
                 uint64_t task_token_raw = exec_payload->local_context.async_ctx.task_token.raw;
                 l2_swimlane_aicore_record_task(
-                    l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, start_time, end_time
+                    l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, receive_time, start_time, end_time
                 );
             }
         }
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
index 38bbf0d53..a3cc143c6 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
@@ -501,7 +501,7 @@ Each scheduler thread runs a tight loop with two main phases:
 **Phase 1 — Completion Handling**:
 
 - Poll register `COND` on each managed core
-- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_mixed_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark
+- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark
 
 **Phase 2 — Dispatch**:
 
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
index fabdde65f..8cba7e90c 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
@@ -128,7 +128,7 @@ Queueing key is normalized resource shape (not raw slot label).
 1. Fanin release/readiness remains dependency-correct and graph-level.
 2. Two-stage completion:
    - `on_subtask_complete(task_id, subslot)`
-   - `on_mixed_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks`
+   - `on_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks`
 3. Downstream release is triggered once per mixed task completion, not once per subslot.
 
 ## 9. Executor Ownership and Numbering
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
index f2bd0aaf6..af661d440 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
@@ -109,7 +109,7 @@ The scheduler loop runs four phases each iteration. Each phase's time is accumul
 
 | Phase | What it does | Inline stats |
 | ----- | ------------ | ------------ |
-| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_mixed_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release |
+| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release |
 | **scan** | Updates the perf profiling header with latest scheduler state | — |
 | **dispatch** | For each idle core, pops a task from the shape-based ready queue via `get_ready_task(shape)`, builds the dispatch payload, and writes the task to the core's handshake register | `pop`: `hit` = successful pops (task dispatched), `miss` = empty queue pops, `hit_rate` = hit/(hit+miss) |
 | **idle** | Scheduler loop iteration where no progress was made (no completions, no dispatches) | — |
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
index d67626662..0f73a043a 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
@@ -39,7 +39,7 @@ static_assert(
 // observation flattened from a DeferredCompletionEntry. TASK_NORMAL_DONE
 // carries the slot_state pointer in `addr` so the consumer can finalize the
 // AsyncWaitEntry.slot_state binding for tasks whose conditions arrived
-// before the FIN thread saw mixed_complete. New kinds may be added in future
+// before the FIN thread saw task_complete. New kinds may be added in future
 // without growing the message — the `_pad[5]` slack is reserved for
 // kind-specific payload extension.
 #define MSG_KIND_CONDITION 0u
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index 0445c798a..a72b91016 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -889,7 +889,7 @@ TaskOutputTensors PTO2OrchestratorState::alloc_tensors(const Arg &args) {
     if (prepared.slot_state != nullptr) {
         // Hidden alloc tasks complete inline in the orchestrator before any
         // consumer can exist, so they have no fanout to notify and no worker
-        // subtasks to retire. Running the full on_mixed_task_complete path
+        // subtasks to retire. Running the full on_task_complete path
         // would only pay unnecessary fanout_lock / traversal overhead here.
         // The generic slot initialization done in prepare_task() is still
         // required so scope_end can release the producer-side reference and
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
index f143bbaf3..f5213dca7 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
@@ -912,7 +912,7 @@ struct PTO2SchedulerState {
     // fanout). When that producer is thread-local (e.g. a 16-block AIV op filling one
     // thread's cores), the other threads never see the consumer and its blocks on
     // their cores can't pre-stage. The first claimer pushes the partially-staged
-    // consumer here; every idle thread's prestage pass pops one, stages a range onto
+    // consumer here; every idle thread's early_dispatch pass pops one, stages a range onto
     // ITS OWN cores (range-claim via next_block_idx), and re-pushes if blocks remain
     // — exactly mirroring how a partially-dispatched SPMD task is re-pushed to the
     // ready queue (scheduler_dispatch: pop -> claim -> re-push). A stale/released
@@ -974,7 +974,7 @@ struct PTO2SchedulerState {
     }
 
     // Collects consumers released via the speculative-doorbell path during a
-    // single on_mixed_task_complete fanout walk, so their dispatch_fanin
+    // single on_task_complete fanout walk, so their dispatch_fanin
     // propagation runs AFTER the walk — never between two siblings' doorbells.
     struct SpecReleaseSink {
         static constexpr int CAP = 32;
@@ -1149,16 +1149,22 @@ struct PTO2SchedulerState {
 
     /**
      * Two-stage completion: second stage.
-     * Called exactly once when all subtasks of a mixed task are done
-     * (i.e., on_subtask_complete returned true).
-     * Handles fanout notification, fanin release, and self-consumption check.
+     * Called exactly once when all subtasks of a task are done (i.e.,
+     * on_subtask_complete returned true). Walks the consumer (fanout) list,
+     * decrements each consumer's fanin, pushes newly-ready ones, and rings
+     * doorbells for speculative hits.
+     *
+     * Non-PROFILING returns the consumer-walk count (= edges traversed). The
+     * Resolve swimlane bar reads it to label the bar with how many successors
+     * actually got resolved. PROFILING returns the richer CompletionStats
+     * whose `fanout_edges` carries the same number.
      */
 #if PTO2_SCHED_PROFILING
     CompletionStats
 #else
-    void
+    uint32_t
 #endif
-    on_mixed_task_complete(
+    on_task_complete(
         PTO2TaskSlotState &slot_state,
 #if PTO2_SCHED_PROFILING
         int thread_idx,
@@ -1168,6 +1174,8 @@ struct PTO2SchedulerState {
     ) {
 #if PTO2_SCHED_PROFILING
         CompletionStats stats = {0, 0, 0, true};
+#else
+        uint32_t consumer_walk_count = 0;
 #endif
 #if PTO2_SCHED_PROFILING
         extern uint64_t g_sched_lock_cycle[], g_sched_fanout_cycle[];
@@ -1202,7 +1210,7 @@ struct PTO2SchedulerState {
         //
         // Safe on silicon: the producer's slot is already COMPLETED here — every
         // SPMD block has FIN'd AND dcci-flushed its output to HBM before
-        // on_mixed_task_complete runs — so a released consumer never reads stale
+        // on_task_complete runs — so a released consumer never reads stale
         // producer output. (Batching used to align the released wave, but pushed
         // every doorbell to the end of the walk, defeating the whole point of
         // speculative early-dispatch: minimal producer-end -> consumer-start.)
@@ -1221,6 +1229,7 @@ struct PTO2SchedulerState {
                 stats.tasks_enqueued++;
             }
 #else
+            consumer_walk_count++;
             release_fanin_and_check_ready(consumer_slot, local_bufs, &rel_sink);
 #endif
             current = current->next;
@@ -1234,6 +1243,8 @@ struct PTO2SchedulerState {
         g_sched_push_wait_cycle[thread_idx] += push_wait;
         PTO2_SCHED_CYCLE_LAP(g_sched_fanout_cycle[thread_idx]);
         return stats;
+#else
+        return consumer_walk_count;
 #endif
     }
 
@@ -1312,7 +1323,7 @@ struct PTO2SchedulerState {
 
 // try_inline_complete_locked: short-circuit NotDeferred completions seen during
 // drain so they don't grow entries[]. Defined here (not in pto_async_wait.h)
-// because PTO2SchedulerState's on_mixed_task_complete signature is only known
+// because PTO2SchedulerState's on_task_complete signature is only known
 // after its full definition above.
 //
 // When the deferred_release_slot_states[] buffer is full, drain it via
@@ -1321,10 +1332,12 @@ struct PTO2SchedulerState {
 // rates don't surface as ASYNC_WAIT_OVERFLOW errors.
 inline bool
 AsyncWaitList::try_inline_complete_locked(AsyncWaitList::DrainCompletionSink &sink, PTO2TaskSlotState &slot_state) {
+    // Return value (CompletionStats / consumer-walk count) discarded:
+    // async-wait drain path has no Resolve swimlane bar attached.
 #if PTO2_SCHED_PROFILING
-    sink.sched->on_mixed_task_complete(slot_state, sink.thread_idx, sink.local_bufs);
+    (void)sink.sched->on_task_complete(slot_state, sink.thread_idx, sink.local_bufs);
 #else
-    sink.sched->on_mixed_task_complete(slot_state, sink.local_bufs);
+    (void)sink.sched->on_task_complete(slot_state, sink.local_bufs);
 #endif
     if (*sink.deferred_release_count >= sink.deferred_release_capacity) {
         while (*sink.deferred_release_count > 0) {
@@ -1401,10 +1414,12 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete(
         }
 
         if (entry.normal_done && entry.waiting_completion_count <= 0) {
+            // Return value (CompletionStats / consumer-walk count) discarded:
+            // deferred-completion drain has no Resolve swimlane bar attached.
 #if PTO2_SCHED_PROFILING
-            sched->on_mixed_task_complete(*entry.slot_state, thread_idx, local_bufs);
+            (void)sched->on_task_complete(*entry.slot_state, thread_idx, local_bufs);
 #else
-            sched->on_mixed_task_complete(*entry.slot_state, local_bufs);
+            (void)sched->on_task_complete(*entry.slot_state, local_bufs);
 #endif
             // Drain deferred_release in place when the buffer fills — same
             // overflow-drain idiom used by complete_slot_task's inline path
@@ -1439,7 +1454,7 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete(
 
 #if PTO2_SCHED_PROFILING
 struct PTO2SchedProfilingData {
-    // Sub-phase cycle breakdown within on_mixed_task_complete
+    // Sub-phase cycle breakdown within on_task_complete
     uint64_t lock_cycle;           // lock_fanout + state store + unlock
     uint64_t fanout_cycle;         // fanout traversal
     uint64_t fanin_cycle;          // fanin traversal
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
index 91f89d7aa..37c73cb6a 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
@@ -132,7 +132,7 @@ void SchedulerContext::complete_slot_task(
         if (cond_count > 0) {
             // Publish "this task is deferred" before on_subtask_complete so the
             // acq_rel fetch_add inside on_subtask_complete makes the flag
-            // visible to whichever subtask sees mixed_complete=true (which may
+            // visible to whichever subtask sees task_complete=true (which may
             // be this thread or a later one).
             slot_state.any_subtask_deferred.store(true, std::memory_order_release);
 
@@ -147,17 +147,17 @@ void SchedulerContext::complete_slot_task(
         }
     }
 
-    bool mixed_complete = sched_->on_subtask_complete(slot_state);
+    bool task_complete = sched_->on_subtask_complete(slot_state);
 
 #if PTO2_PROFILING
     // Sub-block retire that did not finish the slot: record it so the poll
     // iteration becomes visible on the scheduler lane (the SPMD harvest tail).
-    if (!mixed_complete && l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) {
+    if (!task_complete && l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) {
         l2_swimlane.phase_subretire_count++;
     }
 #endif
 
-    if (mixed_complete && slot_state.payload != nullptr &&
+    if (task_complete && slot_state.payload != nullptr &&
         slot_state.any_subtask_deferred.load(std::memory_order_acquire)) {
         // Some subtask of this task registered conditions; finish the
         // registration by handing the slot_state off to the consumer.
@@ -168,7 +168,7 @@ void SchedulerContext::complete_slot_task(
         defer_completion_to_consumer = true;
     }
 
-    if (mixed_complete && !defer_completion_to_consumer) {
+    if (task_complete && !defer_completion_to_consumer) {
 #if PTO2_PROFILING
         if (is_dump_tensor_enabled()) {
             dump_tensors_for_task<PTO2_SUBTASK_SLOT_COUNT>(
@@ -183,25 +183,41 @@ void SchedulerContext::complete_slot_task(
         }
 #endif
 #if PTO2_PROFILING
-        // Time the fanout (fanin release + consumer doorbells) so it renders as
-        // a child bar nested inside this iteration's Complete bar — surfacing how
-        // much of a long complete is the 50-consumer walk + doorbell vs scan.
-        uint64_t fanout_t0 = (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) ? get_sys_cnt_aicpu() : 0;
+        // Time Resolve (walk the consumer list, decrement each consumer's
+        // fanin, push the newly-ready ones, ring doorbells for speculative
+        // hits) so it renders as a child bar nested inside this iteration's
+        // Complete bar. The 1 µs floor below filters out the ~88% of tasks
+        // with 1-2 consumers (~500 ns Resolve) so only the long broadcast /
+        // reduction walks stand out on the lane.
+        uint64_t resolve_t0 = (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) ? get_sys_cnt_aicpu() : 0;
 #endif
+        // [[maybe_unused]] silences -Werror=unused-but-set-variable on the
+        // profiling-flags-smoke build path where PTO2_PROFILING is OFF and
+        // the Resolve emit below is excluded.
+        [[maybe_unused]] uint32_t consumers_resolved = 0;
 #if PTO2_SCHED_PROFILING
         // SCHED_PROFILING variant takes thread_idx for its per-thread atomic
         // counter side-effects (g_sched_*_atomic_count[thread_idx], consumed
-        // by the otc_* log lines). Its return value is unused.
-        (void)sched_->on_mixed_task_complete(slot_state, thread_idx, local_bufs);
+        // by the otc_* log lines). It returns CompletionStats whose
+        // `fanout_edges` is the consumer-walk count.
+        consumers_resolved = sched_->on_task_complete(slot_state, thread_idx, local_bufs).fanout_edges;
 #else
-        sched_->on_mixed_task_complete(slot_state, local_bufs);
+        consumers_resolved = sched_->on_task_complete(slot_state, local_bufs);
 #endif
 #if PTO2_PROFILING
-        if (fanout_t0 != 0) {
-            l2_swimlane_aicpu_record_sched_phase(
-                thread_idx, L2SwimlaneSchedPhaseKind::Fanout, fanout_t0, get_sys_cnt_aicpu(),
-                l2_swimlane.sched_loop_count, /*tasks_processed=*/0
-            );
+        if (resolve_t0 != 0) {
+            uint64_t resolve_t1 = get_sys_cnt_aicpu();
+            // Filter: drop Resolve bars under 1 µs so the lane shows only
+            // resolves that did meaningful work (high consumer counts or
+            // doorbells). 50 cycles @ 50 MHz = 1 µs (PLATFORM_PROF_SYS_CNT_FREQ
+            // is the device sys-cnt frequency).
+            constexpr uint64_t RESOLVE_EMIT_MIN_CYCLES = PLATFORM_PROF_SYS_CNT_FREQ / 1'000'000;  // 1 µs
+            if (resolve_t1 - resolve_t0 >= RESOLVE_EMIT_MIN_CYCLES) {
+                l2_swimlane_aicpu_record_sched_phase(
+                    thread_idx, L2SwimlaneSchedPhaseKind::Resolve, resolve_t0, resolve_t1, l2_swimlane.sched_loop_count,
+                    consumers_resolved
+                );
+            }
         }
         l2_swimlane.phase_complete_count++;
 #endif
@@ -283,14 +299,6 @@ void SchedulerContext::check_running_cores_for_completion(
 ) {
 #if PTO2_SCHED_PROFILING
     auto &l2_swimlane = sched_l2_swimlane_[thread_idx];
-#endif
-#if PTO2_PROFILING
-    // Time this whole pass and count cores whose MMIO COND we actually read
-    // (skip-gated cores excluded) — emit a Scan bar so the per-pass MMIO cost
-    // and effective re-poll cadence are visible directly.
-    bool scan_record = l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES;
-    uint64_t scan_t0 = scan_record ? get_sys_cnt_aicpu() : 0;
-    uint32_t scan_cores = 0;
 #endif
     CoreTracker &tracker = core_trackers_[thread_idx];
     auto running_core_states = tracker.get_all_running_cores();
@@ -317,9 +325,6 @@ void SchedulerContext::check_running_cores_for_completion(
         // --- Judgment phase: read register, derive transition ---
         // Use the precomputed cond_ptr (resolved once in handshake) to skip
         // the reg_offset switch and reg_addr addition on every poll.
-#if PTO2_PROFILING
-        scan_cores++;  // counts cores whose MMIO COND we actually read this pass
-#endif
         uint64_t reg_val = static_cast<uint64_t>(*core.cond_ptr);
         // ARM64 allows Device-nGnRnE -> Normal-cacheable load reorder; the
         // rmb() pins any AICore-published cacheable reads downstream of the
@@ -425,14 +430,6 @@ void SchedulerContext::check_running_cores_for_completion(
             made_progress = true;
         }
     }
-#if PTO2_PROFILING
-    if (scan_record && scan_cores > 0) {
-        l2_swimlane_aicpu_record_sched_phase(
-            thread_idx, L2SwimlaneSchedPhaseKind::Scan, scan_t0, get_sys_cnt_aicpu(),
-            sched_l2_swimlane_[thread_idx].sched_loop_count, scan_cores
-        );
-    }
-#endif
 }
 
 // =============================================================================
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
index 1ede7375d..3dcf4d7f5 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
@@ -275,7 +275,7 @@ class SchedulerContext {
     // cores spare, pre-stage the consumers of any RUNNING flagged producer onto
     // those cores with not_ready=1 (gated). Touches no dependency state — the
     // task is released by the doorbell at its normal ready-pop (Hook 2).
-    int32_t try_speculative_prestage(int32_t thread_idx);
+    int32_t try_speculative_early_dispatch(int32_t thread_idx);
 
     // Stage the already-claimed range [start, start+count) of consumer `c` onto
     // thread_idx's idle (RUNNING slot) then pending (gated-pending, promote-on-FIN)
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
index b4dadb1e0..64b054754 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
@@ -620,7 +620,7 @@ int32_t SchedulerContext::stage_consumer_blocks(
     CoreTracker &tracker = core_trackers_[thread_idx];
     // Stamp the real pre-stage time (NOT 0) so the swimlane shows these blocks
     // dispatched during the producer's run, not at trace start.
-    uint64_t prestage_ts = get_sys_cnt_aicpu();
+    uint64_t early_dispatch_ts = get_sys_cnt_aicpu();
     uint64_t my_cores[PTO2_SPEC_CORE_MASK_WORDS] = {0};  // cores this thread gated (for self-ring)
     int32_t staged = 0;
     int32_t block = start;
@@ -642,7 +642,7 @@ int32_t SchedulerContext::stage_consumer_blocks(
         if (n == 0) return;
         wmb();
         for (int i = 0; i < n; i++) {
-            publish_subtask_to_core(handles[i], prestage_ts);
+            publish_subtask_to_core(handles[i], early_dispatch_ts);
             int32_t cid = tracker.get_core_id_by_offset(handles[i].core_offset);
             sched_->spec_doorbell_table[cid].addr = handles[i].reg_addr;
             sched_->spec_doorbell_table[cid].token = handles[i].reg_task_id;
@@ -677,8 +677,8 @@ int32_t SchedulerContext::stage_consumer_blocks(
 // EVENT-DRIVEN by propagate_dispatch_fanin (a flagged producer's dispatch bumps its
 // consumers' dispatch_fanin; reaching fanin_count enqueues the consumer) — there is
 // no per-iteration PULL scan here anymore. This pass only DRAINS the queue.
-// Returns the number of blocks staged this pass (for the Prestage swimlane bar).
-int32_t SchedulerContext::try_speculative_prestage(int32_t thread_idx) {
+// Returns the number of blocks staged this pass (for the EarlyDispatch swimlane bar).
+int32_t SchedulerContext::try_speculative_early_dispatch(int32_t thread_idx) {
     constexpr int PTO2_EARLY_DISPATCH_DRAIN_MAX = 8;  // bounded pops per pass
     CoreTracker &tracker = core_trackers_[thread_idx];
     int32_t total_staged = 0;
@@ -874,18 +874,6 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
     if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) {
         capture_phase_end(phase_start_local, phase_start_shared);
     }
-    // Flush the open activity-fill run (Poll/Idle) as a phase record. Called
-    // when a Complete/Dispatch bar takes over, when the fill kind changes, and
-    // at loop exit — so every iteration is attributed and the lane has no blanks.
-    auto flush_activity_fill = [&]() {
-        if (l2_swimlane.fill_kind != 0) {
-            l2_swimlane_aicpu_record_sched_phase(
-                thread_idx, static_cast<L2SwimlaneSchedPhaseKind>(l2_swimlane.fill_kind - 1), l2_swimlane.fill_start,
-                l2_swimlane.fill_end, l2_swimlane.sched_loop_count, /*tasks_processed=*/0
-            );
-            l2_swimlane.fill_kind = 0;
-        }
-    };
 #endif
 
     // Wall-clock timestamp of the last completed task on this thread.
@@ -904,13 +892,11 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
         CYCLE_COUNT_START();
         l2_swimlane.sched_loop_count++;
         uint64_t _t0_phase = _t0;
-        // Activity-fill bookkeeping: an iteration that emits no Complete/Dispatch
-        // bar and scanned running cores (retired nothing) is filled with a Poll
-        // bar; genuine spin stays blank. Release is NOT classified here — it is a
-        // distinct operation emitted with its own span in the idle branch below.
-        uint64_t iter_start_ts = _t0;
-        uint64_t fill_complete_at_start = l2_swimlane.sched_complete_cycle;
-        bool emitted_phase_bar = false;
+        // Release is the only "no Complete/Dispatch bar" attribution we keep —
+        // emitted with its own span in the idle branch below. Iterations that
+        // only scan/poll show as blank gaps; the per-loop Poll/Scan bars (PR
+        // #1079 debug overlay) were removed since "scheduler is polling when
+        // there's nothing to do" carries no actionable signal.
         // Per-iter lazy shared-queue snapshot: first phase emit in this iter
         // pays the atomic-load cost, subsequent emits in the same iter reuse
         // the cached value. Reset here so we re-sample exactly once per iter
@@ -1021,7 +1007,6 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
                 _t0_phase = _t1;
                 l2_swimlane.phase_complete_count = 0;
                 l2_swimlane.phase_subretire_count = 0;
-                emitted_phase_bar = true;
             }
         }
 #endif
@@ -1060,10 +1045,52 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
             int dummy_got = sched_->dummy_ready_queue.pop_batch(dummy_batch, DUMMY_DRAIN_BATCH);
             for (int di = 0; di < dummy_got; di++) {
                 PTO2TaskSlotState &dummy_slot = *dummy_batch[di];
+
+                // ----- DummyTask phase: dummy "task" identity marker. --------
+                // The dummy has no AICore presence — start ≈ end (1 cycle
+                // wide, just "we identified it"). Converter renders this on
+                // Worker View's DUMMY_T{thread} lane so the DAG node is
+                // visually present. tasks_processed = task_token low 32 bits
+                // (= local_id within ring) so deps.json flow arrows can land.
+                // The Resolve work that follows is emitted separately below.
+#if PTO2_PROFILING
+                if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) {
+                    uint64_t dummy_marker_t = get_sys_cnt_aicpu();
+                    uint32_t dummy_id_low32 = static_cast<uint32_t>(dummy_slot.task->task_id.raw & 0xFFFFFFFFu);
+                    l2_swimlane_aicpu_record_sched_phase(
+                        thread_idx, L2SwimlaneSchedPhaseKind::DummyTask, dummy_marker_t, dummy_marker_t,
+                        sched_l2_swimlane_[thread_idx].sched_loop_count, dummy_id_low32
+                    );
+                }
+#endif
+
+                // ----- Resolve work: walk this dummy's consumer list. ------
+                // Same 1 µs filter as the main-path Resolve emit suppresses
+                // dummies whose consumer release runs sub-microsecond.
+#if PTO2_PROFILING
+                uint64_t dummy_resolve_t0 =
+                    (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) ? get_sys_cnt_aicpu() : 0;
+#endif
+                // [[maybe_unused]] silences -Werror=unused-but-set-variable on
+                // the profiling-flags-smoke build path where PTO2_PROFILING is
+                // OFF and the Resolve emit below is excluded.
+                [[maybe_unused]] uint32_t dummy_consumers = 0;
 #if PTO2_SCHED_PROFILING
-                sched_->on_mixed_task_complete(dummy_slot, thread_idx, local_bufs);
+                dummy_consumers = sched_->on_task_complete(dummy_slot, thread_idx, local_bufs).fanout_edges;
 #else
-                sched_->on_mixed_task_complete(dummy_slot, local_bufs);
+                dummy_consumers = sched_->on_task_complete(dummy_slot, local_bufs);
+#endif
+#if PTO2_PROFILING
+                if (dummy_resolve_t0 != 0) {
+                    uint64_t dummy_resolve_t1 = get_sys_cnt_aicpu();
+                    constexpr uint64_t RESOLVE_EMIT_MIN_CYCLES = PLATFORM_PROF_SYS_CNT_FREQ / 1'000'000;  // 1 µs
+                    if (dummy_resolve_t1 - dummy_resolve_t0 >= RESOLVE_EMIT_MIN_CYCLES) {
+                        l2_swimlane_aicpu_record_sched_phase(
+                            thread_idx, L2SwimlaneSchedPhaseKind::Resolve, dummy_resolve_t0, dummy_resolve_t1,
+                            sched_l2_swimlane_[thread_idx].sched_loop_count, dummy_consumers
+                        );
+                    }
+                }
 #endif
                 // Dummy tasks have no subtasks to retire and no fanout pre-conditions
                 // beyond their own producers; release self-reference so the slot can
@@ -1104,31 +1131,31 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
             if (sched_->ready_queues[s].size() > 0 || local_bufs[s].count > 0) any_ready_work = true;
         }
 #if PTO2_PROFILING
-        bool prestage_record = l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES;
-        uint64_t prestage_t0 = prestage_record ? get_sys_cnt_aicpu() : 0;
+        bool early_dispatch_record = l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES;
+        uint64_t early_dispatch_t0 = early_dispatch_record ? get_sys_cnt_aicpu() : 0;
 #endif
-        // Skip speculative prestage under PMU: dispatch_ready_tasks already withholds
-        // PENDING dispatch when pmu_active to preserve single-issue PMU windows, and
-        // prestaging gated work into idle/pending slots would perturb the same windows.
-        [[maybe_unused]] int32_t prestaged = (pmu_active || any_ready_work) ? 0 : try_speculative_prestage(thread_idx);
+        // Skip speculative early-dispatch under PMU: dispatch_ready_tasks already
+        // withholds PENDING dispatch when pmu_active to preserve single-issue PMU
+        // windows, and staging gated work into idle/pending slots would perturb the
+        // same windows.
+        [[maybe_unused]] int32_t staged_count =
+            (pmu_active || any_ready_work) ? 0 : try_speculative_early_dispatch(thread_idx);
 #if PTO2_PROFILING
-        // Emit a Prestage bar so a staging-dominated iteration shows as Prestage,
-        // not mislabeled Poll (the cheap scan also ran, so the activity-fill would
-        // otherwise charge the whole us-scale staging to Poll).
-        if (prestage_record && prestaged > 0) {
+        // Emit an EarlyDispatch bar so a staging-dominated iteration is attributed
+        // to early-dispatch rather than disappearing into a blank gap.
+        if (early_dispatch_record && staged_count > 0) {
             l2_swimlane_aicpu_record_sched_phase(
-                thread_idx, L2SwimlaneSchedPhaseKind::Prestage, prestage_t0, get_sys_cnt_aicpu(),
-                sched_l2_swimlane_[thread_idx].sched_loop_count, static_cast<uint32_t>(prestaged)
+                thread_idx, L2SwimlaneSchedPhaseKind::EarlyDispatch, early_dispatch_t0, get_sys_cnt_aicpu(),
+                sched_l2_swimlane_[thread_idx].sched_loop_count, static_cast<uint32_t>(staged_count)
             );
             // prepare_block_for_dispatch bumped phase_dispatch_count while staging;
-            // those blocks belong to this Prestage bar, so clear the counter before it
-            // leaks into the next Dispatch bar.
+            // those blocks belong to this EarlyDispatch bar, so clear the counter
+            // before it leaks into the next Dispatch bar.
             sched_l2_swimlane_[thread_idx].phase_dispatch_count = 0;
-            emitted_phase_bar = true;  // suppress the Poll/idle fill for this iteration
         }
 #endif
 
-        // Second completion poll. dispatch_ready_tasks + try_speculative_prestage
+        // Second completion poll. dispatch_ready_tasks + try_speculative_early_dispatch
         // above can take several us in a busy window; a producer block that FINs
         // during them would otherwise wait for the NEXT iteration's top-of-loop
         // Phase-1 poll (the ~7us detection latency that delays a flagged
@@ -1191,7 +1218,6 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
                 l2_swimlane.phase_dispatch_count = 0;
                 l2_swimlane.pop_hit_at_last_emit = l2_swimlane.pop_hit;
                 l2_swimlane.pop_miss_at_last_emit = l2_swimlane.pop_miss;
-                emitted_phase_bar = true;
             }
         }
 #endif
@@ -1290,42 +1316,8 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
             }
 #endif
         }
-
-#if PTO2_PROFILING
-        // Activity-fill: attribute an iteration that emitted no Complete/Dispatch
-        // bar to the real operation that dominated it — Release (on_task_release
-        // drain) or Poll (completion scan that retired nothing). Genuine spin
-        // (neither did work) emits nothing and shows as a blank gap. Consecutive
-        // same-kind iterations coalesce into one bar.
-        if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) {
-            if (emitted_phase_bar) {
-                flush_activity_fill();
-            } else {
-                uint64_t dc = l2_swimlane.sched_complete_cycle - fill_complete_at_start;
-                // Poll = scanned running cores, retired nothing; else genuine
-                // spin (blank). Release is emitted separately in the idle branch.
-                uint32_t kc = (dc > 0) ? (static_cast<uint32_t>(L2SwimlaneSchedPhaseKind::Poll) + 1) : 0;
-                if (kc == 0) {
-                    flush_activity_fill();  // genuine spin — leave a blank gap
-                } else if (l2_swimlane.fill_kind == kc) {
-                    l2_swimlane.fill_end = _t1;
-                } else {
-                    flush_activity_fill();
-                    l2_swimlane.fill_kind = kc;
-                    l2_swimlane.fill_start = iter_start_ts;
-                    l2_swimlane.fill_end = _t1;
-                }
-            }
-        }
-#endif
     }
 
-#if PTO2_PROFILING
-    // Close the final open activity-fill run so the trailing idle/poll stretch
-    // before loop exit is attributed.
-    flush_activity_fill();
-#endif
-
     // Drain any entries left in the deferred-release batch. The in-loop flush
     // only fires on idle iterations and on buffer-full; a loop exit while the
     // last iteration made progress can leave entries un-released. Drop them
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
index ae42ec9fe..e4cd72272 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
@@ -388,12 +388,6 @@ struct alignas(64) SchedL2SwimlaneCounters {
     uint64_t pop_miss{0};
     uint64_t pop_hit_at_last_emit{0};
     uint64_t pop_miss_at_last_emit{0};
-    // Activity-fill: coalesced segment covering iterations that emit no
-    // Complete/Dispatch bar, so the scheduler lane tiles with no blanks.
-    // fill_kind: 0 = none open, else L2SwimlaneSchedPhaseKind value + 1.
-    uint32_t fill_kind{0};
-    uint64_t fill_start{0};
-    uint64_t fill_end{0};
 #if PTO2_SCHED_PROFILING
     uint32_t phase_wiring_count{0};
     uint64_t complete_probe_count{0};
diff --git a/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h b/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
index a61f6a570..fbfcca415 100644
--- a/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
+++ b/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
@@ -98,12 +98,20 @@ struct L2SwimlaneAicoreLocalState {
  *                        the same core — each dispatch needs its own AICore
  *                        record matched to its own AICPU record, which
  *                        task_token_raw alone cannot disambiguate.
- * @param start_time      Start timestamp (get_sys_cnt)
+ * @param receive_time    Timestamp captured immediately after `read_reg(DATA_MAIN_BASE)`
+ *                        returns the new task_id (before dcci+ack). Stored as a
+ *                        32-bit delta `start_time - receive_time` in the record;
+ *                        host recovers `receive_time = start_time - delta`.
+ *                        Lets DFX split head_OH into the unfixable
+ *                        AICPU→AICore NoC propagation (dispatch_ts → receive_time)
+ *                        and the AICore-local dcci+ack cost (receive_time → start_time).
+ * @param start_time      Start timestamp (get_sys_cnt) — post-dcci+ack, just
+ *                        before kernel `execute_task`.
  * @param end_time        End timestamp
  */
 __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_record_task(
     __gm__ L2SwimlaneActiveHead *head, L2SwimlaneAicoreLocalState *local, uint64_t task_token_raw, uint32_t reg_task_id,
-    uint64_t start_time, uint64_t end_time
+    uint64_t receive_time, uint64_t start_time, uint64_t end_time
 ) {
     // Re-fetch head channel each task; cheap relative to the
     // baseline `dcci(payload, ENTIRE_DATA_CACHE)` we already pay per task.
@@ -133,6 +141,9 @@ __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_
     record->end_time = end_time;
     record->task_token_raw = task_token_raw;
     record->reg_task_id = reg_task_id;
+    // 32-bit delta; receive_time always precedes start_time on the same core
+    // (sys_cnt is monotonic per AICore), so the subtraction can never wrap.
+    record->receive_to_start_cycles = static_cast<uint32_t>(start_time - receive_time);
     local->slot_within_buf = slot + 1;
 
     // Flush record to GM so host can read it after the buffer is enqueued.
diff --git a/src/a5/platform/include/common/l2_swimlane_profiling.h b/src/a5/platform/include/common/l2_swimlane_profiling.h
index 9b33101c4..ccdb0b545 100644
--- a/src/a5/platform/include/common/l2_swimlane_profiling.h
+++ b/src/a5/platform/include/common/l2_swimlane_profiling.h
@@ -158,16 +158,25 @@ static_assert(sizeof(L2SwimlaneAicpuTaskRecord) == 32, "L2SwimlaneAicpuTaskRecor
  *   reg_task_id, giving a clean 1:1 join even when multiple dispatches
  *   of the same task land on the same core.
  *
- * Layout: 24B identity/timing + 4B reg_task_id + 4B pad → 32B (half a
- * cache line). Two records pack into one cache line so AICore's per-task
- * store is at most a single line commit + dcci.
+ * Layout: 24B identity/timing + 4B reg_task_id + 4B receive_to_start delta →
+ * 32B (half a cache line). Two records pack into one cache line so AICore's
+ * per-task store is at most a single line commit + dcci.
+ *
+ * receive_to_start_cycles isolates the AICore-side dcci+ack cost from the
+ * AICPU→AICore NoC propagation. AICore captures receive_time right after
+ * `read_reg(DATA_MAIN_BASE)` returns the new task_id (before dcci+ack), and
+ * start_time after them. Host derives:
+ *   - receive_time  = start_time - receive_to_start_cycles
+ *   - propagation   = receive_time - dispatch_ts (AICPU view)
+ *   - local_setup   = receive_to_start_cycles    (dcci + ack)
+ * Delta fits in 32 bits at any platform clock (1 GHz @ 32-bit ≈ 4.3 s).
  */
 struct L2SwimlaneAicoreTaskRecord {
-    uint64_t start_time;      // Task start timestamp (get_sys_cnt)
-    uint64_t end_time;        // Task end timestamp
-    uint64_t task_token_raw;  // PTO2TaskId::raw — identity (NOT join key)
-    uint32_t reg_task_id;     // Per-core dispatch token — host join key vs AICPU stream
-    uint32_t _pad;
+    uint64_t start_time;               // Post-dcci+ack timestamp (kernel begins next)
+    uint64_t end_time;                 // Post-kernel timestamp
+    uint64_t task_token_raw;           // PTO2TaskId::raw — identity (NOT join key)
+    uint32_t reg_task_id;              // Per-core dispatch token — host join key vs AICPU stream
+    uint32_t receive_to_start_cycles;  // start_time - receive_time (AICore-local dcci + ack cost)
 } __attribute__((aligned(32)));
 
 static_assert(sizeof(L2SwimlaneAicoreTaskRecord) == 32, "L2SwimlaneAicoreTaskRecord must be 32B");
diff --git a/src/a5/platform/shared/host/l2_swimlane_collector.cpp b/src/a5/platform/shared/host/l2_swimlane_collector.cpp
index e07ad9ba0..b2dba1a85 100644
--- a/src/a5/platform/shared/host/l2_swimlane_collector.cpp
+++ b/src/a5/platform/shared/host/l2_swimlane_collector.cpp
@@ -764,7 +764,7 @@ int L2SwimlaneCollector::export_swimlane_json() {
     // file size). Column order is documented in the schema comment at the top
     // of swimlane_converter.py's v2 reader.
     //
-    //   aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles]
+    //   aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, receive_to_start_cycles]
     //   aicpu_tasks:  [core_id, reg_task_id, dispatch_cycles, finish_cycles]
     {
         // copy_aicore_buffer already drops r.start_time == 0 slots when
@@ -776,7 +776,7 @@ int L2SwimlaneCollector::export_swimlane_json() {
             for (const auto &r : collected_aicore_records_[core_idx]) {
                 if (!first) outfile << ",";
                 outfile << "\n    [" << core_idx << ", " << r.task_token_raw << ", " << r.reg_task_id << ", "
-                        << r.start_time << ", " << r.end_time << "]";
+                        << r.start_time << ", " << r.end_time << ", " << r.receive_to_start_cycles << "]";
                 first = false;
                 total++;
             }
diff --git a/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp b/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
index 76dfb97e4..65b30fb50 100644
--- a/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
+++ b/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
@@ -87,6 +87,16 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
         }
 
         {
+            // receive_time captures the instant DATA_MAIN_BASE returned a new
+            // task_id, BEFORE the ack write. Paired with start_time (captured
+            // after task_ptr resolve) it lets DFX split head_OH into the
+            // AICPU→AICore NoC propagation (dispatch_ts → receive_time,
+            // hardware-bound) and the AICore-local ack + task_ptr resolve
+            // (receive_time → start_time). host_build_graph has no per-task
+            // dcci so the local-setup span is naturally tighter than the
+            // tensormap_and_ringbuffer runtime; the field still records it.
+            uint64_t receive_time = get_sys_cnt_aicore();
+
             uint32_t actual_task_id = task_id;
             write_reg(RegId::COND, MAKE_ACK_VALUE(actual_task_id));
 
@@ -121,7 +131,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
                 uint64_t end_time = get_sys_cnt_aicore();
                 l2_swimlane_aicore_record_task(
                     l2_swimlane_head, &l2_swimlane_local, static_cast<uint64_t>(actual_task_id),
-                    static_cast<uint32_t>(actual_task_id), start_time, end_time
+                    static_cast<uint32_t>(actual_task_id), receive_time, start_time, end_time
                 );
             }
 
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp b/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
index c252dd5ac..21e79b3ed 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
+++ b/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
@@ -130,6 +130,15 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
         }
 
         {
+            // receive_time is captured the instant DATA_MAIN_BASE returned a
+            // new task_id, BEFORE the per-task dcci + ack pair. Paired with
+            // start_time (captured after dcci + ack) it lets DFX split head_OH
+            // into the AICPU→AICore NoC propagation (dispatch_ts → receive_time,
+            // hardware-bound) and the AICore-local dcci+ack cost
+            // (receive_time → start_time, software-tunable). Stored in the
+            // record as a 32-bit delta `start_time - receive_time`.
+            uint64_t receive_time = get_sys_cnt_aicore();
+
             uint32_t task_id = reg_val;  // Decode: register holds task_id directly
 
             // First-task lazy resolve of the rotation channel.
@@ -172,7 +181,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in
                 uint64_t end_time = get_sys_cnt_aicore();
                 uint64_t task_token_raw = exec_payload->local_context.async_ctx.task_token.raw;
                 l2_swimlane_aicore_record_task(
-                    l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, start_time, end_time
+                    l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, receive_time, start_time, end_time
                 );
             }
 
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
index 76f73da1b..e34c46ed8 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
+++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
@@ -501,7 +501,7 @@ Each scheduler thread runs a tight loop with two main phases:
 **Phase 1 — Completion Handling**:
 
 - Poll register `COND` on each managed core
-- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_mixed_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark
+- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark
 
 **Phase 2 — Dispatch**:
 
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
index fabdde65f..8cba7e90c 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
+++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
@@ -128,7 +128,7 @@ Queueing key is normalized resource shape (not raw slot label).
 1. Fanin release/readiness remains dependency-correct and graph-level.
 2. Two-stage completion:
    - `on_subtask_complete(task_id, subslot)`
-   - `on_mixed_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks`
+   - `on_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks`
 3. Downstream release is triggered once per mixed task completion, not once per subslot.
 
 ## 9. Executor Ownership and Numbering
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
index f2bd0aaf6..af661d440 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
+++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
@@ -109,7 +109,7 @@ The scheduler loop runs four phases each iteration. Each phase's time is accumul
 
 | Phase | What it does | Inline stats |
 | ----- | ------------ | ------------ |
-| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_mixed_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release |
+| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release |
 | **scan** | Updates the perf profiling header with latest scheduler state | — |
 | **dispatch** | For each idle core, pops a task from the shape-based ready queue via `get_ready_task(shape)`, builds the dispatch payload, and writes the task to the core's handshake register | `pop`: `hit` = successful pops (task dispatched), `miss` = empty queue pops, `hit_rate` = hit/(hit+miss) |
 | **idle** | Scheduler loop iteration where no progress was made (no completions, no dispatches) | — |
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h b/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
index 5c07462ea..f914bfddf 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
+++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
@@ -39,7 +39,7 @@ static_assert(
 // observation flattened from a DeferredCompletionEntry. TASK_NORMAL_DONE
 // carries the slot_state pointer in `addr` so the consumer can finalize the
 // AsyncWaitEntry.slot_state binding for tasks whose conditions arrived
-// before the FIN thread saw mixed_complete. New kinds may be added in future
+// before the FIN thread saw task_complete. New kinds may be added in future
 // without growing the message — the `_pad[5]` slack is reserved for
 // kind-specific payload extension.
 #define MSG_KIND_CONDITION 0u
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index e95234757..f3e5359ee 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -894,7 +894,7 @@ TaskOutputTensors PTO2OrchestratorState::alloc_tensors(const Arg &args) {
     if (prepared.slot_state != nullptr) {
         // Hidden alloc tasks complete inline in the orchestrator before any
         // consumer can exist, so they have no fanout to notify and no worker
-        // subtasks to retire. Running the full on_mixed_task_complete path
+        // subtasks to retire. Running the full on_task_complete path
         // would only pay unnecessary fanout_lock / traversal overhead here.
         // The generic slot initialization done in prepare_task() is still
         // required so scope_end can release the producer-side reference and
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
index 67ce8afa6..e5123fb2a 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
+++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
@@ -986,7 +986,7 @@ struct PTO2SchedulerState {
 #else
     void
 #endif
-    on_mixed_task_complete(
+    on_task_complete(
         PTO2TaskSlotState &slot_state,
 #if PTO2_SCHED_PROFILING
         int thread_idx,
@@ -1124,9 +1124,9 @@ struct PTO2SchedulerState {
 inline bool
 AsyncWaitList::try_inline_complete_locked(AsyncWaitList::DrainCompletionSink &sink, PTO2TaskSlotState &slot_state) {
 #if PTO2_SCHED_PROFILING
-    sink.sched->on_mixed_task_complete(slot_state, sink.thread_idx, sink.local_bufs);
+    sink.sched->on_task_complete(slot_state, sink.thread_idx, sink.local_bufs);
 #else
-    sink.sched->on_mixed_task_complete(slot_state, sink.local_bufs);
+    sink.sched->on_task_complete(slot_state, sink.local_bufs);
 #endif
     if (*sink.deferred_release_count >= sink.deferred_release_capacity) {
         while (*sink.deferred_release_count > 0) {
@@ -1204,9 +1204,9 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete(
 
         if (entry.normal_done && entry.waiting_completion_count <= 0) {
 #if PTO2_SCHED_PROFILING
-            sched->on_mixed_task_complete(*entry.slot_state, thread_idx, local_bufs);
+            sched->on_task_complete(*entry.slot_state, thread_idx, local_bufs);
 #else
-            sched->on_mixed_task_complete(*entry.slot_state, local_bufs);
+            sched->on_task_complete(*entry.slot_state, local_bufs);
 #endif
             if (deferred_release_count >= deferred_release_capacity) {
                 while (deferred_release_count > 0) {
@@ -1236,7 +1236,7 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete(
 
 #if PTO2_SCHED_PROFILING
 struct PTO2SchedProfilingData {
-    // Sub-phase cycle breakdown within on_mixed_task_complete
+    // Sub-phase cycle breakdown within on_task_complete
     uint64_t lock_cycle;           // lock_fanout + state store + unlock
     uint64_t fanout_cycle;         // fanout traversal
     uint64_t fanin_cycle;          // fanin traversal
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
index ca0854716..5893e5013 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
+++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
@@ -124,9 +124,9 @@ void SchedulerContext::complete_slot_task(
         }
     }
 
-    bool mixed_complete = sched_->on_subtask_complete(slot_state);
+    bool task_complete = sched_->on_subtask_complete(slot_state);
 
-    if (mixed_complete && slot_state.payload != nullptr &&
+    if (task_complete && slot_state.payload != nullptr &&
         slot_state.any_subtask_deferred.load(std::memory_order_acquire)) {
         while (!mailbox->try_push_normal_done(slot_state.task->task_id, reinterpret_cast<uint64_t>(&slot_state))) {
             sched_->async_wait_list.mpsc_skipped_count.fetch_add(1, std::memory_order_relaxed);
@@ -135,7 +135,7 @@ void SchedulerContext::complete_slot_task(
         defer_completion_to_consumer = true;
     }
 
-    if (mixed_complete && !defer_completion_to_consumer) {
+    if (task_complete && !defer_completion_to_consumer) {
 #if PTO2_PROFILING
         if (is_dump_tensor_enabled()) {
             dump_tensors_for_task<PTO2_SUBTASK_SLOT_COUNT>(
@@ -153,9 +153,9 @@ void SchedulerContext::complete_slot_task(
         // SCHED_PROFILING variant takes thread_idx for its per-thread atomic
         // counter side-effects (g_sched_*_atomic_count[thread_idx], consumed
         // by the otc_* log lines). Its return value is unused.
-        (void)sched_->on_mixed_task_complete(slot_state, thread_idx, local_bufs);
+        (void)sched_->on_task_complete(slot_state, thread_idx, local_bufs);
 #else
-        sched_->on_mixed_task_complete(slot_state, local_bufs);
+        sched_->on_task_complete(slot_state, local_bufs);
 #endif
 #if PTO2_PROFILING
         l2_swimlane.phase_complete_count++;
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
index f4335c5ac..cd252c9ae 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
+++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
@@ -750,9 +750,9 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_
             for (int di = 0; di < dummy_got; di++) {
                 PTO2TaskSlotState &dummy_slot = *dummy_batch[di];
 #if PTO2_SCHED_PROFILING
-                sched_->on_mixed_task_complete(dummy_slot, thread_idx, local_bufs);
+                sched_->on_task_complete(dummy_slot, thread_idx, local_bufs);
 #else
-                sched_->on_mixed_task_complete(dummy_slot, local_bufs);
+                sched_->on_task_complete(dummy_slot, local_bufs);
 #endif
                 // Dummy tasks have no subtasks to retire and no fanout pre-conditions
                 // beyond their own producers; release self-reference so the slot can
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py
index 680ca1ef4..13efeadd2 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py
@@ -31,7 +31,19 @@
 from simpler_setup.scene_test import _outputs_dir, _sanitize_for_filename
 from simpler_setup.tools.swimlane_converter import read_perf_data
 
-_REQUIRED_TASK_FIELDS = ("task_id", "func_id", "core_id", "core_type", "start_time_us", "end_time_us")
+_REQUIRED_TASK_FIELDS = (
+    "task_id",
+    "func_id",
+    "core_id",
+    "core_type",
+    "start_time_us",
+    "end_time_us",
+    # receive_time_us / local_setup_us are populated unconditionally by the
+    # AICore-side capture (v3 schema). propagation_us requires AICPU dispatch_ts
+    # and is therefore only present at level≥2 — not in this required-set.
+    "receive_time_us",
+    "local_setup_us",
+)
 
 
 def validate_perf_artifact(case_label: str, *, expected_task_count: int | None = None) -> None:
diff --git a/tests/ut/cpp/a2a3/test_wiring.cpp b/tests/ut/cpp/a2a3/test_wiring.cpp
index 0d45762b2..9acb8aacf 100644
--- a/tests/ut/cpp/a2a3/test_wiring.cpp
+++ b/tests/ut/cpp/a2a3/test_wiring.cpp
@@ -13,7 +13,7 @@
  *
  * 1. wire_task()         — fanout wiring, early-finished detection,
  *                          fanin_count initialization, ready push
- * 2. on_mixed_task_complete() — COMPLETED transition, fanout traversal,
+ * 2. on_task_complete() — COMPLETED transition, fanout traversal,
  *                               consumer fanin release
  * 3. on_task_release()   — fanin traversal, producer release,
  *                          self-CONSUMED check
@@ -241,7 +241,7 @@ TEST_F(WiringTest, WireTaskMixedProducerStates) {
 }
 
 // =============================================================================
-// on_mixed_task_complete: notifies consumers via fanout chain
+// on_task_complete: notifies consumers via fanout chain
 // =============================================================================
 
 TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) {
@@ -274,7 +274,7 @@ TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) {
     dep_entries[1].next = &dep_entries[0];
     producer.fanout_head = &dep_entries[1];
 
-    sched.on_mixed_task_complete(producer);
+    sched.on_task_complete(producer);
 
     // Producer should be COMPLETED
     EXPECT_EQ(producer.task_state.load(), PTO2_TASK_COMPLETED);
diff --git a/tests/ut/cpp/a5/test_wiring.cpp b/tests/ut/cpp/a5/test_wiring.cpp
index faffc6a9d..6c8904f81 100644
--- a/tests/ut/cpp/a5/test_wiring.cpp
+++ b/tests/ut/cpp/a5/test_wiring.cpp
@@ -13,7 +13,7 @@
  *
  * 1. wire_task()         — fanout wiring, early-finished detection,
  *                          fanin_count initialization, ready push
- * 2. on_mixed_task_complete() — COMPLETED transition, fanout traversal,
+ * 2. on_task_complete() — COMPLETED transition, fanout traversal,
  *                               consumer fanin release
  * 3. on_task_release()   — fanin traversal, producer release,
  *                          self-CONSUMED check
@@ -231,7 +231,7 @@ TEST_F(WiringTest, WireTaskMixedProducerStates) {
 }
 
 // =============================================================================
-// on_mixed_task_complete: notifies consumers via fanout chain
+// on_task_complete: notifies consumers via fanout chain
 // =============================================================================
 
 TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) {
@@ -264,7 +264,7 @@ TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) {
     dep_entries[1].next = &dep_entries[0];
     producer.fanout_head = &dep_entries[1];
 
-    sched.on_mixed_task_complete(producer);
+    sched.on_task_complete(producer);
 
     // Producer should be COMPLETED
     EXPECT_EQ(producer.task_state.load(), PTO2_TASK_COMPLETED);