From be420f7fb03b4758a669467fcff7aa9e1efada3e Mon Sep 17 00:00:00 2001 From: Chao Wang <26245345+ChaoWao@users.noreply.github.com> Date: Mon, 8 Jun 2026 12:13:32 +0800 Subject: [PATCH] Add: AICore receive_time DFX field + swimlane phase model cleanup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Core change ----------- Capture per-task `receive_time` on AICore immediately after `read_reg(DATA_MAIN_BASE)` returns a new task_id, BEFORE the per-task `dcci(payload, ENTIRE_DATA_CACHE) + write_reg(COND, ACK)` pair that precedes start_time. Stored as a 32-bit delta `start_time - receive_time` in the AICore record (size unchanged, 2 records per cache line). Splits the per-task head_OH into two physically distinct halves so DFX can attribute each: - `propagation_us` = receive_time - dispatch_ts: AICPU->AICore-ready delivery (NoC + any speculation overshoot). - `local_setup_us` = start_time - receive_time: AICore-local critical-path prep (dcci + ack on the common path; ack-only on the speculative-hit path). Semantic alignment with speculative early-dispatch (#1079) --------------------------------------------------------- For not_ready==1 (speculative pre-staging) the dcci ran during the dependency-wait spin, off the critical path. receive_time is re-stamped at the moment the doorbell match exits the gate, so propagation absorbs both the original NoC delivery AND any speculation overshoot, while local_setup stays the pure ack-on-critical-path cost. Common-path emits unchanged. Visualization ------------- - Setup bar at level 1: swimlane_converter emits a `setup` sub-bar (ts=receive_time, dur=local_setup_us) directly before each kernel bar on Worker View. >=1-cycle filter suppresses warm-cache zero-width bars. Base_time tracking includes receive_time so the first cold task no longer renders at a negative offset. - Flow arrows (deps.json -> Worker View) now land at receive_time instead of start_time so the gap between arrow tip and kernel start visually equals local_setup. - New `--enable-swimlane-overhead` flag (pytest + standalone + conftest + L3 subprocess forwarding) opt-in for the 8 Overhead Analysis counter tracks from PR #1039. Scheduler-phase model cleanup ----------------------------- - Drop PR #1079 debug overlays: `Scan` (per-pass MMIO COND scan) and `Poll` (activity-fill attribution) were emitted at level>=3 but carried no actionable signal — "scheduler is polling when there's nothing to do" is the steady state, not a finding. Total Perfetto event count on qwen3 level=4 drops 51,381 -> 2,952. - `Prestage` -> `EarlyDispatch` (PR #1079 internal jargon -> the feature name from its PR title). enum value, function name (`try_speculative_early_dispatch`), queue field, and converter color key all renamed consistently. - `Fanout` -> `Resolve`. The phase covers `on_task_complete`'s consumer-release walk (decrement consumer fanin, push newly-ready, ring speculative doorbells). "Fanout" overloaded the graph-theory term; "Resolve" names the action. - `on_mixed_task_complete` -> `on_task_complete`. The function fires for every task that completes (MIX or single-subtask), the "mixed" modifier was historical. - `bool mixed_complete` -> `task_complete` at the caller. - New `DummyTask` phase kind, emitted once per dummy in `dummy_drain`. Converter routes it to Worker View pid=4 DUMMY_T{thread} lanes so DAG fence/barrier nodes (no AICore presence) are visually present. The accompanying Resolve bar covers the consumer-release work. - Resolve emit >=1µs filter: drops the ~88% of tasks whose consumer-release walk is sub-microsecond, leaving only the broadcast/reduction Resolves that carry signal. `tasks_processed` now carries the real consumer-walk count plumbed back from `on_task_complete` (non-PROFILING return type uint32_t; PROFILING uses `CompletionStats::fanout_edges`). - Resolve emit from dummy_drain too (previously a measurement blind spot — work happened, no bar). pid renumber + Process rename ----------------------------- pid is now in pipeline order (top -> bottom in Perfetto), with sort_index == pid for self-evident layout: pid=1 AICPU Orchestrator submit envelope (earliest) pid=2 AICPU Scheduler Complete/Dispatch/Release/Resolve/EarlyDispatch pid=3 Scheduler View AICPU-eye dispatch->finish per worker pid=4 Worker View AIC_0..23 + AIV_24..71 + DUMMY_T0..N (latest) Was: pid=1 AICore View / pid=2 AICPU View — renamed to Worker View / Scheduler View since AICPU also serves as worker for dummy tasks. No compat shim for old captures. Enum cleanup ------------ `L2SwimlaneSchedPhaseKind` collapsed from 8 (with reserved Poll/Scan) to 6 sequential: 0 Complete, 1 Dispatch, 2 Release, 3 Resolve, 4 EarlyDispatch, 5 DummyTask. a5 mirror --------- Renames mirrored to a5 tensormap_and_ringbuffer (Fanout/Resolve, on _task_complete, mixed_complete) for cross-arch symmetry. a5 has no PR #1079 speculative path so the not_ready re-stamp and Scan/Poll/ Prestage cleanup are a2a3-only. Hot-path cost ------------- - One extra get_sys_cnt_aicore() per task in the AICore executor's task-arrived branch (cycle MSR read, negligible vs the existing per-task dcci(payload, ENTIRE_DATA_CACHE)). - 1µs Resolve filter trades phase records for signal — saves emit bandwidth at level>=3. Verified -------- - a2a3 onboard build + a5 onboard build via `pip install` - ST: TestL2Swimlane + TestL2SwimlaneMixed + TestDummyTask + `--enable-swimlane-overhead` opt-in path - qwen3 decode_layer level=1 and level=4 end-to-end PASS, swimlane artifacts reviewed (setup bars + Resolve bars + DUMMY lanes visible where expected; Scan/Poll bars absent) - swimlane_converter `pid` mapping in output matches spec --- conftest.py | 8 + docs/dfx/l2-swimlane-profiling.md | 19 +- .../2026-06-aicore-cold-start-warmup.md | 86 +++- simpler_setup/scene_test.py | 33 +- simpler_setup/tools/swimlane_converter.py | 368 ++++++++++++++---- .../aicore/l2_swimlane_collector_aicore.h | 15 +- .../include/common/l2_swimlane_profiling.h | 54 ++- .../shared/host/l2_swimlane_collector.cpp | 18 +- .../aicore/aicore_executor.cpp | 12 +- .../aicore/aicore_executor.cpp | 31 +- .../docs/RUNTIME_LOGIC.md | 2 +- .../docs/SUBMIT_BY_CLUSTER.md | 2 +- .../docs/device_log_profiling.md | 2 +- .../runtime/aicore_completion_mailbox.h | 2 +- .../runtime/pto_orchestrator.cpp | 2 +- .../runtime/scheduler/pto_scheduler.h | 43 +- .../scheduler/scheduler_completion.cpp | 69 ++-- .../runtime/scheduler/scheduler_context.h | 2 +- .../runtime/scheduler/scheduler_dispatch.cpp | 146 ++++--- .../runtime/scheduler/scheduler_types.h | 6 - .../aicore/l2_swimlane_collector_aicore.h | 15 +- .../include/common/l2_swimlane_profiling.h | 25 +- .../shared/host/l2_swimlane_collector.cpp | 4 +- .../aicore/aicore_executor.cpp | 12 +- .../aicore/aicore_executor.cpp | 11 +- .../docs/RUNTIME_LOGIC.md | 2 +- .../docs/SUBMIT_BY_CLUSTER.md | 2 +- .../docs/device_log_profiling.md | 2 +- .../runtime/aicore_completion_mailbox.h | 2 +- .../runtime/pto_orchestrator.cpp | 2 +- .../runtime/scheduler/pto_scheduler.h | 12 +- .../scheduler/scheduler_completion.cpp | 10 +- .../runtime/scheduler/scheduler_dispatch.cpp | 4 +- .../dfx/l2_swimlane/_swimlane_validate.py | 14 +- tests/ut/cpp/a2a3/test_wiring.cpp | 6 +- tests/ut/cpp/a5/test_wiring.cpp | 6 +- 36 files changed, 745 insertions(+), 304 deletions(-) diff --git a/conftest.py b/conftest.py index 084911aa2..da4a114d9 100644 --- a/conftest.py +++ b/conftest.py @@ -187,6 +187,14 @@ def pytest_addoption(parser): default=False, help="Enable per-scope peak collection and emit /scope_stats.jsonl (per-scope ring-fill peaks).", ) + parser.addoption( + "--enable-swimlane-overhead", + action="store_true", + default=False, + help="Add the 8 Overhead Analysis counter tracks (per-engine " + "idle/ready/overhead + system all/has overhead) to the swimlane JSON. " + "Requires --enable-l2-swimlane + deps.json (re-run with --enable-dep-gen if absent).", + ) parser.addoption( "--pto-isa-commit", action="store", diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md index 1420e652f..3cc22134e 100644 --- a/docs/dfx/l2-swimlane-profiling.md +++ b/docs/dfx/l2-swimlane-profiling.md @@ -34,12 +34,15 @@ available. [`dep_gen`](dep_gen.md) at post-process time; see [§3.5](#35-dependency-arrows-from-dep_gen). - **AICPU scheduler phases** — per-iteration breakdown into - `complete` / `dispatch`. Idle iterations no longer emit a record - on a2a3; the host tooling reconstructs idle spans from the gap - between consecutive work records on the same thread. Legacy - captures (and a5) may still carry `scan` / `idle` records — both - are silently dropped by the parser (idle is double-painted - by the gap reconstruction; `scan` was never emitted in a2a3). + `complete` / `dispatch` / `release` / `resolve` (Complete sub-bar + for the consumer-release walk) / `early_dispatch` (speculative + pre-staging). Idle iterations no longer emit a record on a2a3; + the host tooling reconstructs idle spans from the gap between + consecutive work records on the same thread. Legacy captures may + carry `scan` / `poll` / `idle` / `fanout` / `prestage` — current + a2a3 builds no longer emit them (PR #1079's Scan/Poll debug + overlay was removed; Fanout was renamed Resolve and now also + filters out <1 µs walks; Prestage was renamed EarlyDispatch). - **Orchestrator submit envelope** — one record per `submit_task()` / `alloc_tensors()` call covering the whole submit's `[start, end]` window (`orch_submit` phase). Per-sub-step @@ -208,7 +211,7 @@ Phase records (per scheduler thread, level >= 3 for | Field | Meaning | | ----- | ------- | | `start_time_us` / `end_time_us` | Phase start / end timestamps in microseconds (reader-side cycle→µs conversion) | -| `phase` | Lowercase phase name. Scheduler: `complete` / `dispatch` (`scan` / `idle` may appear in legacy captures and a5; both are dropped by the parser). Orchestrator: `orch_submit` — one record per `submit_task()` / `alloc_tensors()` call spanning its full `[start, end]` window. Legacy per-sub-step strings (`orch_sync` / `orch_alloc` / `orch_params` / `orch_lookup` / `orch_insert` / `orch_fanin`) may appear in old captures. | +| `phase` | Lowercase phase name. Scheduler: `complete` / `dispatch` / `release` / `resolve` / `early_dispatch`. Legacy (still parsed for old captures): `scan` / `poll` / `fanout` / `prestage` / `idle` — current a2a3 builds no longer emit Scan/Poll (PR #1079 debug overlay removed) or Prestage (renamed to EarlyDispatch); Fanout was renamed to Resolve. Orchestrator: `orch_submit` — one record per `submit_task()` / `alloc_tensors()` call spanning its full `[start, end]` window. Legacy per-sub-step strings (`orch_sync` / `orch_alloc` / `orch_params` / `orch_lookup` / `orch_insert` / `orch_fanin`) may appear in old captures. | | `loop_iter` (scheduler) / `submit_idx` (orchestrator) | Iteration / submit-call counter for the producing thread | | `tasks_processed` (scheduler) / `task_id` (orchestrator) | Phase-specific union field | | `pop_hit` / `pop_miss` (dispatch only) | Ready-queue pop deltas since the previous dispatch emit | @@ -557,7 +560,7 @@ bounded only by how fast the host drains — not by the per-core buffer sum. **Measured impact.** Hardware bench on a2a3 paged_attention_unroll -Case1 with swimlane=4: rotation design delivers sched −4 µs / orch −19 µs +Case1 with swimlane=4: rotation design delivers sched -4 µs / orch -19 µs vs the upstream/main baseline, comparable to the no-rotation predecessor (which had this PR's earlier commit; the rotation adds about 3 µs sched overhead per session as price for unbounded session length). diff --git a/docs/investigations/2026-06-aicore-cold-start-warmup.md b/docs/investigations/2026-06-aicore-cold-start-warmup.md index f656a2912..43a28d26b 100644 --- a/docs/investigations/2026-06-aicore-cold-start-warmup.md +++ b/docs/investigations/2026-06-aicore-cold-start-warmup.md @@ -45,7 +45,7 @@ cores per task, each task busy-waits 50 µs in `get_sys_cnt()`. Run on a2a3 onboard via `task-submit --device auto --device-num 1 --run "... --enable-l2-swimlane 2"`. Compared three runs, all with the batched-publish dispatch optimization (PR #989) and the eager swimlane -head resolve (also in #989). Head-OH per task = `start_time_us − +head resolve (also in #989). Head-OH per task = `start_time_us - dispatch_time_us`, joined per `(core_id, reg_task_id)` from the level-2 `l2_swimlane_records.json`. @@ -59,7 +59,7 @@ Per-task head-OH (us) across 72 cores: | Eager head resolve, no warmup | 0.36 | 2.61 | 5.62 | 0.35 | 0.39 | 0.23 | | Eager + warmup | **0.32** | **2.29** | 5.64 | 0.36 | 0.36 | 0.31 | -Warmup vs eager-only: t0 avg −0.32 µs (12% reduction), t0 min −0.04 µs +Warmup vs eager-only: t0 avg -0.32 µs (12% reduction), t0 min -0.04 µs (into noise), **t0 max unchanged** (within run-to-run variance). Steady-state t1–t3 unchanged. t0 max for the slowest cores is essentially the same as without warmup. @@ -67,8 +67,8 @@ the same as without warmup. Per-core breakdown for the warmup run reveals the cost is **bimodal**: ```text -Δ = t0 − avg(t1..t3) over 72 cores -min = −0.03 µs (warmup fully effective; t0 ≈ steady state) +Δ = t0 - avg(t1..t3) over 72 cores +min = -0.03 µs (warmup fully effective; t0 ≈ steady state) p25 = 0.53 µs p50 = 1.77 µs p75 = 3.39 µs @@ -139,10 +139,88 @@ property of the dispatch path that future profiling work should expect: the first task of a run should not assume software warmup will fix it. +## Follow-up — receive_time DFX field + what propagation/local_setup actually mean + +PR #1004 adds an AICore-side `receive_time` capture marking the moment +AICPU's full "task is ready to execute" signal landed on the core. With +it the swimlane reports two new per-task fields: + +- `local_setup_us` = start_time - receive_time (AICore-side critical-path + prep between "task genuinely ready" and "kernel starts"). +- `propagation_us` = receive_time - dispatch_ts (AICPU→AICore-ready + delivery, including any scheduling-side waits). + +Where receive_time is stamped depends on the dispatch path: + +- **Common path (`not_ready == 0`)**: stamped immediately after + `read_reg(DATA_MAIN_BASE)` returns the new task_id — that read IS the + ready signal. `local_setup_us` ≈ dcci + ack cost on AICore. +- **Speculative early-dispatch (`not_ready == 1`, PR #1079)**: the + initial task_id only stages the work; the real ready signal is the + doorbell on `DATA_MAIN_BASE` high32 that AICPU rings once the + dependencies resolve. receive_time is re-stamped at the moment the + doorbell match exits the gate-wait spin. The per-task dcci ran during + the spin (overlapped with the dependency wait, off the critical path), + so `local_setup_us` ≈ ack-only cost; `propagation_us` absorbs both the + original NoC delivery AND any speculation overshoot. + +The cleanest reading **only applies to IDLE-phase dispatches** (where +AICPU writes the new task_id to a core whose RUNNING slot is empty — +core was actually idle and polling). In that case: + +- `propagation_us` ≈ AICPU→AICore SoC delivery + AICore's next poll iter + picking up the new value. Hardware-bound at the ns scale per the + cold-start measurement above. +- `local_setup_us` ≈ dcci + ack cost on AICore. Empirically 0–0.02 µs on + warm cache, 0.5–1 µs on cold first task. + +For **PENDING-phase dispatches** (AICPU pre-loads next task_id while +RUNNING slot is still busy), `dispatch_ts` is captured at pre-load time, +which can be tens of µs before the prior task FINs. AICore picks up the +pending task immediately after prior FIN. So: + +- `propagation_us` here is **not** AICPU→AICore delay. It is mostly + "remaining exec time of the prior task at the moment of pre-load", + i.e. a measurement artifact of dual-buffer scheduling, not a physical + delivery delay. +- `local_setup_us` interpretation still holds (dcci + ack on AICore for + the common path, ack-only for the speculative path). + +How to know which kind of dispatch a record represents: compare +`dispatch_ts` against the prior task's `end_time` on the same core +(both available in the swimlane records). `dispatch_ts ≥ prior.end` → +IDLE-phase, propagation reflects real NoC delivery. `dispatch_ts < +prior.end` → PENDING-phase, propagation is a pre-load offset. + +Level-3 capture on `vector_example` (5 tasks, all IDLE-dispatched): + +| Task | head_OH | propagation | local_setup | +| ---- | ------: | ----------: | ----------: | +| Cold first on core 5 | 2.12 µs | 1.20 µs | 0.92 µs | +| Warm on core 5 | 0.16 µs | 0.16 µs | 0.00 µs | +| Cold first on core 6 | 1.66 µs | 0.80 µs | 0.86 µs | +| Warm on core 5 | 0.34 µs | 0.34 µs | 0.00 µs | +| Warm on core 5 | 0.32 µs | 0.32 µs | 0.00 µs | + +Cold tasks pay both halves; warm tasks have local_setup at cycle-zero +(dcci on already-invalidated cache is free) and propagation dominates. +**This pattern is specific to the IDLE-phase regime.** Mid-workload +traces (qwen3 decode_layer) include both IDLE and PENDING dispatches +interleaved, and aggregating `propagation_us` without separating them +yields confusing distributions — the "long tail" in such aggregates is +usually PENDING pre-load offset, not slow NoC. + +Future warmup proposals targeting `local_setup_us` can now be evaluated +directly against captured data. Proposals targeting `propagation_us` +should first filter to IDLE-dispatched records to avoid attacking a +measurement artifact. + ## References - PR #989: dispatch path batched publish (where the eager swimlane head resolve lives). +- PR #1004: AICore receive_time DFX field (the measurement that + confirmed the split above). - PR #988: `spmd_serial_chain_mix` example used for measurement. - Issue #545 comment #2: SPMD dispatch stagger. - `.claude/rules/...` DFX priority guidance. diff --git a/simpler_setup/scene_test.py b/simpler_setup/scene_test.py index 9bb8d486d..ff0d842b9 100644 --- a/simpler_setup/scene_test.py +++ b/simpler_setup/scene_test.py @@ -679,12 +679,18 @@ def _build_output_prefix(case_label: str) -> Path: def _run_swimlane_converter( input_path: Path | None = None, func_names_path: Path | None = None, + enable_overhead: bool = False, ) -> None: """Invoke the bundled swimlane converter as a subprocess. When ``input_path`` is given, the converter derives its output filename from the input's timestamp (see ``swimlane_converter._resolve_output_path``). Without it, the converter auto-selects the latest ``l2_swimlane_records_*.json``. + + ``enable_overhead`` forwards the converter's ``--overhead`` flag — adds the + 8 Overhead Analysis counter tracks (per-engine idle/ready/overhead + system + all/has overhead) under the AICPU Scheduler process. Needs deps.json; the + converter silently no-ops if deps is absent. """ import logging # noqa: PLC0415 import subprocess # noqa: PLC0415 @@ -695,6 +701,8 @@ def _run_swimlane_converter( cmd.append(str(input_path)) if func_names_path is not None: cmd += ["--func-names", str(func_names_path)] + if enable_overhead: + cmd.append("--overhead") try: result = subprocess.run(cmd, check=True, capture_output=True, text=True) if result.stdout: @@ -716,6 +724,7 @@ def _convert_case_swimlane( case_label: str, output_prefix: Path, callable_spec: dict | None = None, + enable_overhead: bool = False, ) -> None: """Post-case: invoke the swimlane converter on the perf file the runtime just wrote into ``/l2_swimlane_records.json``. No diff/rename @@ -736,7 +745,7 @@ def _convert_case_swimlane( safe_label = _sanitize_for_filename(case_label) func_names_path = _dump_name_map(mapping, output_prefix / f"name_map_{safe_label}.json") - _run_swimlane_converter(input_path=perf_file, func_names_path=func_names_path) + _run_swimlane_converter(input_path=perf_file, func_names_path=func_names_path, enable_overhead=enable_overhead) def _run_deps_viewer( @@ -837,6 +846,7 @@ def run_class_cases( # noqa: PLR0913 -- shared layer-5 entry; kwargs mirror CLI enable_dep_gen, enable_scope_stats, enable_device_log_timing=False, + enable_swimlane_overhead=False, ): """Execute a pre-filtered list of cases for one class (layers 5-6). @@ -887,7 +897,12 @@ def run_class_cases( # noqa: PLR0913 -- shared layer-5 entry; kwargs mirror CLI ) finally: if enable_l2_swimlane: - _convert_case_swimlane(case_label, prefix, callable_spec=callable_spec) + _convert_case_swimlane( + case_label, + prefix, + callable_spec=callable_spec, + enable_overhead=enable_swimlane_overhead, + ) if enable_dep_gen: _graph_case_dep_gen(case_label, prefix, callable_spec=callable_spec) if enable_scope_stats: @@ -1354,6 +1369,7 @@ def test_run(self, st_platform, st_worker, request): # device-log timing is cheap (PTO2_PROFILING markers, one block/round) # so unlike the heavy diagnostics it is NOT disabled when --rounds > 1. enable_device_log_timing = request.config.getoption("--enable-device-log-timing", default=False) + enable_swimlane_overhead = request.config.getoption("--enable-swimlane-overhead", default=False) if rounds > 1: if enable_l2_swimlane: logger.warning("Profiling disabled: --rounds > 1") @@ -1409,6 +1425,7 @@ def test_run(self, st_platform, st_worker, request): enable_dep_gen=enable_dep_gen, enable_scope_stats=enable_scope_stats, enable_device_log_timing=enable_device_log_timing, + enable_swimlane_overhead=enable_swimlane_overhead, ) # ------------------------------------------------------------------ @@ -1507,6 +1524,15 @@ def run_module(module_name): # noqa: PLR0912, PLR0915 -- CLI parsing + dispatch help="Enable per-scope peak collection and emit /scope_stats/scope_stats.jsonl " "(per-scope ring-fill peaks).", ) + parser.add_argument( + "--enable-swimlane-overhead", + action="store_true", + default=False, + help="Add the 8 Overhead Analysis counter tracks (per-engine " + "idle/ready/overhead + system all/has overhead) to the swimlane " + "JSON. Requires --enable-l2-swimlane + deps.json (re-run with " + "--enable-dep-gen if absent).", + ) parser.add_argument( "--runtime", default=None, @@ -1722,6 +1748,7 @@ def run_module(module_name): # noqa: PLR0912, PLR0915 -- CLI parsing + dispatch enable_dep_gen=args.enable_dep_gen, enable_scope_stats=args.enable_scope_stats, enable_device_log_timing=args.enable_device_log_timing, + enable_swimlane_overhead=args.enable_swimlane_overhead, ) print("PASSED") except Exception as e: # noqa: BLE001 @@ -1767,6 +1794,8 @@ def _dispatch_test_phases_standalone(module_name, selected_by_cls, args): # noq common.append("--enable-scope-stats") if args.enable_device_log_timing: common.append("--enable-device-log-timing") + if args.enable_swimlane_overhead: + common.append("--enable-swimlane-overhead") # ----- L3 phase: one subprocess per class (not per case). # The child's _create_standalone_worker allocates max(cls.CASES.device_count) diff --git a/simpler_setup/tools/swimlane_converter.py b/simpler_setup/tools/swimlane_converter.py index 55e95ee9f..d46dabb08 100644 --- a/simpler_setup/tools/swimlane_converter.py +++ b/simpler_setup/tools/swimlane_converter.py @@ -119,12 +119,23 @@ def read_perf_data(filepath): # noqa: PLR0912, PLR0915 "core_types": ["aic"|"aiv", ...], # indexed by core_id "core_to_thread": [, ...] # optional (level >= 3) }, - "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles, end_cycles], ...], + "aicore_tasks": [[core_id, task_token_raw, reg_task_id, start_cycles, + end_cycles, receive_to_start_cycles], ...], "aicpu_tasks": [[core_id, reg_task_id, dispatch_cycles, finish_cycles], ...], "aicpu_scheduler_phases": [ [ {kind, start_cycles, end_cycles, ...}, ... ], ... ], "aicpu_orchestrator_phases": [ [ {submit_idx, task_id, start_cycles, end_cycles}, ... ], ... ] } + aicore_tasks columns (v3 schema): the trailing receive_to_start_cycles + is a uint32 delta = AICore-side `start_time - receive_time`, where + receive_time is captured immediately after AICore's + `read_reg(DATA_MAIN_BASE)` returns the new task_id (before the per-task + dcci + ack pair). Lets DFX split per-task head_OH into the + AICPU→AICore NoC propagation (dispatch_ts → receive_time, hardware- + bound) and the AICore-local dcci + ack cost (receive_time → start_time, + software-tunable). Archived v2 JSON without this column still parses; + the field is exposed as 0 for those. + Returns a dict shaped for `generate_chrome_trace_json`, `print_task_statistics`, and `sched_overhead_analysis`: `tasks`, `aicpu_scheduler_phases`, `aicpu_orchestrator_phases`, @@ -171,13 +182,19 @@ def read_perf_data(filepath): # noqa: PLR0912, PLR0915 # same PTO2 task_token_raw to the same core (SPMD over-subscription, MIX # cluster spread) each get their own reg_task_id, so this key is unique # per dispatch even when task_token_raw collides. - aicore_lookup: dict[tuple[int, int], tuple[int, int, int]] = {} + # + # `*rest` makes v2 rows (5 cols, no receive_to_start_cycles) and v3 rows + # (6 cols) both parse — archived JSON from before the receive_time split + # still loads with r2s_cycles defaulting to 0. + aicore_lookup: dict[tuple[int, int], tuple[int, int, int, int]] = {} for row in aicore_rows: - core_id, task_token_raw, reg_task_id, start_cycles, end_cycles = row + core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, *rest = row + r2s_cycles = int(rest[0]) if rest else 0 aicore_lookup[(int(core_id), int(reg_task_id))] = ( int(task_token_raw), int(start_cycles), int(end_cycles), + r2s_cycles, ) # base_time = min non-zero timestamp across every stream that will be @@ -189,9 +206,17 @@ def _track(v): if v > 0 and (base_time_cycles is None or v < base_time_cycles): base_time_cycles = v - for _, _, _, s, e in aicore_rows: - _track(int(s)) - _track(int(e)) + for row in aicore_rows: + # Column count varies (v2: 5, v3: 6); only the timing columns matter + # for base_time tracking. For v3, the per-task receive_time = + # start_cycles - receive_to_start_cycles is earlier than start_cycles + # itself; track it so the v3 setup-bar (ts = receive_time) doesn't + # land at a negative offset relative to the kernel-bar base. + start_c = int(row[3]) + end_c = int(row[4]) + r2s_c = int(row[5]) if len(row) > 5 else 0 + _track(start_c - r2s_c) + _track(end_c) for _, _, d, f in aicpu_rows: _track(int(d)) _track(int(f)) @@ -231,9 +256,15 @@ def _core_type(core_id): if ac is None: unmatched_per_core[core_id] += 1 continue - task_token_raw, start_cycles, end_cycles = ac + task_token_raw, start_cycles, end_cycles, r2s_cycles = ac start_us = _to_us(start_cycles) end_us = _to_us(end_cycles) + dispatch_us = _to_us(int(dispatch_cycles)) + # receive_to_start delta is in cycles; convert via the same + # cycles_to_us_factor that drives the absolute timestamps. No + # base_time subtraction — this is a delta. + local_setup_us = r2s_cycles * cycles_to_us_factor + receive_us = start_us - local_setup_us tasks.append( { "task_id": task_token_raw, @@ -244,19 +275,24 @@ def _core_type(core_id): "start_time_us": start_us, "end_time_us": end_us, "duration_us": end_us - start_us, - "dispatch_time_us": _to_us(int(dispatch_cycles)), + "dispatch_time_us": dispatch_us, "finish_time_us": _to_us(int(finish_cycles)), + "receive_time_us": receive_us, + "local_setup_us": local_setup_us, + "propagation_us": receive_us - dispatch_us, } ) elif level == 1: # AICORE_TIMING fallback: AICPU records are absent (complete_task # bypassed). The AICore stream alone is the source of truth. for row in aicore_rows: - core_id, task_token_raw, _reg_task_id, start_cycles, end_cycles = row + core_id, task_token_raw, _reg_task_id, start_cycles, end_cycles, *rest = row + r2s_cycles = int(rest[0]) if rest else 0 core_id = int(core_id) task_token_raw = int(task_token_raw) start_us = _to_us(int(start_cycles)) end_us = _to_us(int(end_cycles)) + local_setup_us = r2s_cycles * cycles_to_us_factor tasks.append( { "task_id": task_token_raw, @@ -269,6 +305,9 @@ def _core_type(core_id): "duration_us": end_us - start_us, "dispatch_time_us": 0.0, "finish_time_us": 0.0, + "receive_time_us": start_us - local_setup_us, + "local_setup_us": local_setup_us, + # propagation_us requires AICPU dispatch_ts; absent at level 1. } ) @@ -537,6 +576,8 @@ def print_task_statistics(tasks, func_id_to_name=None): "durations": [], "head_overheads": [], "tail_overheads": [], + "propagations": [], # dispatch_ts → AICore receive_time (NoC + FFTS) + "local_setups": [], # receive_time → start_time (dcci + ack on AICore) "latencies": [], "total_exec_time": 0.0, "total_latency": 0.0, @@ -567,6 +608,13 @@ def print_task_statistics(tasks, func_id_to_name=None): tail_overhead = finish_time - end_time func_stats[func_id]["tail_overheads"].append(tail_overhead) + # Head OH split (v3 schema only — falls back to absent when the + # AICore record came from a pre-receive_time build). + if "propagation_us" in task: + func_stats[func_id]["propagations"].append(task["propagation_us"]) + if "local_setup_us" in task: + func_stats[func_id]["local_setups"].append(task["local_setup_us"]) + # Latency: finish_time_us - dispatch_time_us latency = finish_time - dispatch_time func_stats[func_id]["latencies"].append(latency) @@ -580,15 +628,17 @@ def print_task_statistics(tasks, func_id_to_name=None): max_finish_time = max(max_finish_time, finish_time) # Print statistics - print("\n" + "=" * 110) + print("\n" + "=" * 140) print("Task Statistics by Function") print(" Exec = kernel time on AICore; Latency = dispatch->finish (incl. head OH + Exec + tail OH)") - print("=" * 110) + print(" Head OH split (v3): Prop = NoC propagation (dispatch_ts→AICore receive); Local = dcci+ack (receive→start)") + print("=" * 140) print( f"{'Func_ID':<8} {'Func_Name':<12} {'Count':>5} {'Avg Exec(us)':>12} " - f"{'Avg Latency(us)':>15} {'Exec%':>6} {'Avg Head OH(us)':>15} {'Avg Tail OH(us)':>15}" + f"{'Avg Latency(us)':>15} {'Exec%':>6} {'Avg Head OH(us)':>15} {'Avg Tail OH(us)':>15} " + f"{'Avg Prop(us)':>12} {'Avg Local(us)':>13}" ) - print("-" * 110) + print("-" * 140) # Sort by func_id for consistent output total_count = 0 @@ -619,17 +669,25 @@ def print_task_statistics(tasks, func_id_to_name=None): sum(stats["tail_overheads"]) / len(stats["tail_overheads"]) if stats["tail_overheads"] else 0 ) avg_latency = stats["total_latency"] / count if count > 0 else 0 + # `None` (not NaN) signals "no v3 receive_time data on this func" so + # the print line below renders a dash. NaN would force ruff's + # PLR0124 self-compare idiom. + avg_propagation = sum(stats["propagations"]) / len(stats["propagations"]) if stats["propagations"] else None + avg_local_setup = sum(stats["local_setups"]) / len(stats["local_setups"]) if stats["local_setups"] else None # Calculate execution ratio: total_exec_time / total_latency exec_ratio = (stats["total_exec_time"] / stats["total_latency"] * 100) if stats["total_latency"] > 0 else 0 + prop_str = f"{avg_propagation:>12.2f}" if avg_propagation is not None else f"{'-':>12}" + local_str = f"{avg_local_setup:>13.2f}" if avg_local_setup is not None else f"{'-':>13}" print( f"{func_id:<8} {func_name:<12} {count:>5} {avg_duration:>12.2f} {avg_latency:>15.2f} " - f"{exec_ratio:>5.1f}% {avg_head_overhead:>15.2f} {avg_tail_overhead:>15.2f}" + f"{exec_ratio:>5.1f}% {avg_head_overhead:>15.2f} {avg_tail_overhead:>15.2f} " + f"{prop_str} {local_str}" ) # Print total row - print("-" * 110) + print("-" * 140) # Calculate total latency (sum of all latencies) total_latency_sum = sum(stats["total_latency"] for stats in func_stats.values()) @@ -656,11 +714,11 @@ def print_task_statistics(tasks, func_id_to_name=None): print("=" * 110) -def build_overhead_counter_events(tasks, deps_edges, pid=3): # noqa: PLR0912 +def build_overhead_counter_events(tasks, deps_edges, pid=2): # noqa: PLR0912 """Per-engine + system overhead counter tracks for the Perfetto trace. Emits 8 counter (``"ph":"C"``) series under the AICPU Scheduler process - (``pid=3``) — it is scheduler-overhead analysis, so it lives in the sched + (``pid=2``) — it is scheduler-overhead analysis, so it lives in the sched group — overlaid alongside the scheduler's own tracks and the AICore task bars to see, at every instant, why time is or is not wasted (see docs/dfx/sched-overhead-model.md): @@ -755,7 +813,7 @@ def _u64(x): "oh_all_overhead": 9121, "oh_has_overhead": 9122, } - # No process metadata: pid=3 is the AICPU Scheduler process (named/sorted + # No process metadata: pid=2 is the AICPU Scheduler process (named/sorted # elsewhere). Emitting it here would override that — these counters just # join the scheduler group as extra tracks (names prefixed "oh_"). events = [] @@ -825,10 +883,10 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 core_to_thread: Optional list mapping core_id (index) to scheduler thread index (-1 = unassigned) Generates processes in the trace: - - pid=1 "AICore View": start_time_us to end_time_us (kernel execution) - - pid=2 "AICPU View": dispatch_time_us to finish_time_us (AICPU perspective) - - pid=3 "AICPU Scheduler": scheduler phase bars (l2_swimlane_level >= 3) - - pid=4 "AICPU Orchestrator": orchestrator phase bars or summary (l2_swimlane_level >= 4) + - pid=4 "Worker View": start_time_us to end_time_us (kernel execution) + - pid=3 "Scheduler View": dispatch_time_us to finish_time_us (AICPU perspective) + - pid=2 "AICPU Scheduler": scheduler phase bars (l2_swimlane_level >= 3) + - pid=1 "AICPU Orchestrator": orchestrator phase bars or summary (l2_swimlane_level >= 4) """ if verbose: print("Generating Chrome Trace JSON...") @@ -851,7 +909,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 # Recover func_id for AICORE_TIMING (level=1) records, which the host # emits as func_id=-1. Resolve once here against dep_gen's per-task # kernel_ids[3] (picking the subslot by core_type) and write it back onto - # the task, so every downstream consumer — AICore View, AICPU View, and + # the task, so every downstream consumer — Worker View, Scheduler View, and # event-hints — sees the same real func_id. See # resolve_func_id_from_kernel_map() for the AIV0-vs-AIV1 tie-break and the # host-side contract. @@ -865,20 +923,25 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 # Step 2: Generate JSON events events = [] - # Metadata event: Process names and sort order - # Display order: Orchestrator (pid=4) → Scheduler (pid=3) → AICPU View (pid=2) → AICore View (pid=1) - events.append({"args": {"name": "AICore View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 1}) - events.append({"args": {"sort_index": 4}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 1}) + # Metadata event: Process names and sort order. + # pid is renumbered in pipeline order (top → bottom in Perfetto): + # pid=1 AICPU Orchestrator (submits tasks — earliest) + # pid=2 AICPU Scheduler (pops ready, dispatches, completes) + # pid=3 Scheduler View (AICPU-eye view of each worker's dispatch→finish) + # pid=4 Worker View (physical AIC/AIV + DUMMY virtual workers — latest) + # sort_index intentionally equals pid so JSON ordering is self-evident. + events.append({"args": {"name": "Worker View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 4}) + events.append({"args": {"sort_index": 4}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 4}) # Check if any task has AICPU timestamps has_aicpu_data = any(task.get("dispatch_time_us", 0) >= 0 and task.get("finish_time_us", 0) > 0 for task in tasks) if has_aicpu_data: events.append( - {"args": {"name": "AICPU View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 2} + {"args": {"name": "Scheduler View"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 3} ) events.append( - {"args": {"sort_index": 3}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 2} + {"args": {"sort_index": 3}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 3} ) # Metadata events: Thread names (one per core) @@ -894,7 +957,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 core_type_str = (core_type or "unknown").upper() thread_name = f"{core_type_str}_{core_id}" events.append( - {"args": {"name": thread_name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 1, "tid": tid} + {"args": {"name": thread_name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 4, "tid": tid} ) # Duration events (Complete events "X") @@ -912,6 +975,11 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 for succ in succs: fanin_map[succ].append(pred) + # One clock period at the AICore counter resolution. The device counter + # ticks at 50 MHz on a2a3 (20 ns) — sub-period local_setup intervals are + # noise (warm dcci is cycle-zero), so don't emit invisible zero-width bars. + _SETUP_MIN_US = 0.02 + for task in tasks: tid = core_to_tid[task["core_id"]] ts = task["start_time_us"] @@ -932,6 +1000,35 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 fanout_str = f"{len(fanout_ids)}: [" + ", ".join(format_task_display(x) for x in fanout_ids) + "]" fanin_str = f"{len(fanin_ids)}: [" + ", ".join(format_task_display(x) for x in fanin_ids) + "]" + # Setup bar (receive → start): visualises AICore-local critical-path + # prep (dcci + ack for the common path, ack-only for speculation-hit + # path — see docs/investigations/2026-06-aicore-cold-start-warmup.md). + # Same `name` across all setup bars → Perfetto auto-assigns one + # consistent color, distinct from kernel bars. Emit only when above + # one clock period; warm-cache dcci+ack lands sub-period and would + # render as an invisible zero-width slice. + local_setup_us = task.get("local_setup_us", 0.0) or 0.0 + receive_time_us = task.get("receive_time_us") + if local_setup_us >= _SETUP_MIN_US and receive_time_us is not None: + events.append( + { + "args": { + "event-hint": f"setup: dcci+ack for Task:{tdisp} CoreId:{task['core_id']}", + "local_setup_us": local_setup_us, + "taskId": task["task_id"], + }, + "cat": "event", + "id": event_id, + "name": "setup", + "ph": "X", + "pid": 4, + "tid": tid, + "ts": receive_time_us, + "dur": local_setup_us, + } + ) + event_id += 1 + events.append( { "args": { @@ -945,7 +1042,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 "id": event_id, "name": task_name, "ph": "X", - "pid": 1, + "pid": 4, "tid": tid, "ts": ts, "dur": dur, @@ -956,7 +1053,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 task_to_event_id[(task["task_id"], task["core_id"])] = event_id event_id += 1 - # AICPU View duration events (dispatch_time to finish_time) + # Scheduler View duration events (dispatch_time to finish_time) # Assign overlapping tasks on the same core to different tids so Perfetto # renders each bar on its own row (Perfetto requires strict nesting on a tid). if has_aicpu_data: @@ -992,7 +1089,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 task_to_aicpu_tid[(task["task_id"], task["core_id"])] = tid aicpu_tid_set.add(tid) - # Thread name metadata for AICPU View (one entry per unique tid used) + # Thread name metadata for Scheduler View (one entry per unique tid used) for core_id, base_tid in core_to_tid.items(): ct_list = _core_aicpu_tasks.get(core_id) core_type_str = ct_list[0]["core_type"].upper() if ct_list else "unknown" @@ -1005,7 +1102,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 "cat": "__metadata", "name": "thread_name", "ph": "M", - "pid": 2, + "pid": 3, "tid": base_tid, } ) @@ -1018,7 +1115,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 "cat": "__metadata", "name": "thread_name", "ph": "M", - "pid": 2, + "pid": 3, "tid": overflow_tid, } ) @@ -1052,7 +1149,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 "id": event_id, "name": task_name, "ph": "X", - "pid": 2, + "pid": 3, "tid": tid, "ts": dispatch_us, "dur": aicpu_dur, @@ -1094,13 +1191,20 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 for succ_task in task_map[succ_task_id]: dst_tid = core_to_tid[succ_task["core_id"]] - dst_ts_start = succ_task["start_time_us"] + # Land the dependency arrow at the consumer's `receive_time` — + # the instant AICore saw the dispatch signal on DATA_MAIN_BASE. + # The gap between this anchor and the kernel's `start_time` is + # the per-task local_setup (dcci + ack), now visible as the + # "setup" sub-bar PR #1004 adds. Fall back to start_time for + # captures without v3 receive_time (older swimlane records or + # AICPU-only joins). + dst_ts_start = succ_task.get("receive_time_us") or succ_task["start_time_us"] dst_event_id = task_to_event_id[(succ_task["task_id"], succ_task["core_id"])] - # Happens-before violation: producer outlived consumer's start. - # Real time order broke the data dependency the graph asserted; - # the runtime got away with it (consumer presumably re-read fresh - # data) but it's a smell — surface it. + # Happens-before violation: producer outlived consumer's + # receive. Real time order broke the data dependency the graph + # asserted; the runtime got away with it (consumer presumably + # re-read fresh data) but it's a smell — surface it. hb_violated = src_ts_end > dst_ts_start flow_name = "hb_violation" if hb_violated else "dependency" if hb_violated: @@ -1113,7 +1217,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 "id": flow_id, "name": flow_name, "ph": "s", - "pid": 1, + "pid": 4, "tid": src_tid, "ts": flow_start_us, "bind_id": src_event_id, @@ -1126,7 +1230,7 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 "id": flow_id, "name": flow_name, "ph": "f", - "pid": 1, + "pid": 4, "tid": dst_tid, "ts": dst_ts_start, "bp": "e", @@ -1147,10 +1251,10 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 if scheduler_phases: # Process metadata events.append( - {"args": {"name": "AICPU Scheduler"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 3} + {"args": {"name": "AICPU Scheduler"}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 2} ) events.append( - {"args": {"sort_index": 2}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 3} + {"args": {"sort_index": 2}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 2} ) # Phase color mapping. The Perfetto sched lane only renders the @@ -1162,11 +1266,11 @@ def generate_chrome_trace_json( # noqa: PLR0912, PLR0913, PLR0915 phase_colors = { "complete": "good", # green "dispatch": "terrible", # red - "poll": "yellow", # completion-poll that scanned but retired nothing "release": "olive", # deferred-release drain (on_task_release work) - "fanout": "vsync_highlight_color", # nested in complete: fanin + doorbells - "scan": "thread_state_iowait", # one running-core MMIO COND scan pass - "prestage": "rail_animation", # speculative early-dispatch (Hook 1) staging + "resolve": "vsync_highlight_color", # nested in complete: walk consumer list + doorbells + "early_dispatch": "rail_animation", # speculative early-dispatch (Hook 1) staging + # Rendered separately on Worker View; this color is a fallback if it ever lands on Sched. + "dummy_task": "grey", } # Per-complete subtask-finish counts surface "how many AICore FINs @@ -1220,6 +1324,28 @@ def _find_containing_complete(thread_idx: int, finish_us: float): continue finishes_per_complete[id(t_comp)] += 1 + # Worker View (pid=4) DUMMY lanes: one per scheduler thread, used by + # the `dummy_task` phase below to surface DAG fence / barrier nodes. + # Tids 19000+thread_idx keep them visually grouped after the physical + # AIC/AIV lanes (which sit at 10000+core_id*10). + DUMMY_TID_BASE = 19000 # noqa: N806 + for thread_idx in range(len(scheduler_phases)): + events.append( + { + "args": {"name": f"DUMMY_T{thread_idx}"}, + "cat": "__metadata", + "name": "thread_name", + "ph": "M", + "pid": 4, + "tid": DUMMY_TID_BASE + thread_idx, + } + ) + + # Dummy task X event width — start/end on the device coincide + # (identification is a single MSR read); render as a 0.02 µs sliver + # so Perfetto picks it up instead of collapsing it to a hairline. + DUMMY_BAR_MIN_DUR_US = 0.02 # noqa: N806 + for thread_idx, thread_records in enumerate(scheduler_phases): tid = 3000 + thread_idx @@ -1230,7 +1356,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cat": "__metadata", "name": "thread_name", "ph": "M", - "pid": 3, + "pid": 2, "tid": tid, } ) @@ -1239,9 +1365,37 @@ def _find_containing_complete(thread_idx: int, finish_us: float): # that otherwise hide inside an idle stretch (poll = completion-scan # that retired nothing; release = on_task_release drain). Genuine # spin emits no record and shows as a blank gap. + # + # `dummy_task` is special: it does NOT live on the sched track — + # it represents a DAG fence node briefly inhabiting the AICPU as a + # virtual worker, so we route it to Worker View (pid=4) DUMMY_T{thr}. for record in thread_records: phase = record.get("phase", "unknown") - if phase not in ("complete", "dispatch", "poll", "release", "fanout", "scan", "prestage"): + if phase == "dummy_task": + start_us = record["start_time_us"] + end_us = record["end_time_us"] + dur = max(end_us - start_us, DUMMY_BAR_MIN_DUR_US) + task_id_low32 = record.get("tasks_processed", 0) + events.append( + { + "args": { + "phase": "dummy_task", + "loop_iter": record.get("loop_iter", 0), + "dummy_task_id_low32": task_id_low32, + "event-hint": f"dummy(t{task_id_low32})", + }, + "cat": "event", + "cname": "grey", + "name": f"dummy(t{task_id_low32})", + "ph": "X", + "pid": 4, + "tid": DUMMY_TID_BASE + thread_idx, + "ts": start_us, + "dur": dur, + } + ) + continue + if phase not in ("complete", "dispatch", "release", "resolve", "early_dispatch"): continue start_us = record["start_time_us"] end_us = record["end_time_us"] @@ -1300,7 +1454,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cname": phase_colors.get(phase, "generic_work"), "name": display_name, "ph": "X", - "pid": 3, + "pid": 2, "tid": tid, "ts": start_us, "dur": dur, @@ -1314,8 +1468,9 @@ def _find_containing_complete(thread_idx: int, finish_us: float): # also breaks Perfetto's rate calc (divide-by-zero → NULL). # Track name carries thread index so it reads standalone # even with the thread tree collapsed. Only complete/dispatch - # carry real queue depths; poll/release/fanout zero-fill them, - # so skip their counter samples to avoid spurious 0 dips. + # carry real queue depths; release/resolve/early_dispatch zero- + # fill them, so skip their counter samples to avoid spurious 0 + # dips. if phase not in ("complete", "dispatch"): continue if not depths_valid: @@ -1327,7 +1482,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cat": "queue", "name": local_track_name, "ph": "C", - "pid": 3, + "pid": 2, "tid": tid, "ts": end_us, } @@ -1343,7 +1498,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cat": "queue", "name": "shared_ready_queue", "ph": "C", - "pid": 3, + "pid": 2, "tid": 3999, "ts": end_us, } @@ -1359,7 +1514,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cat": "__metadata", "name": "thread_name", "ph": "M", - "pid": 3, + "pid": 2, "tid": 3999, } ) @@ -1369,7 +1524,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cat": "__metadata", "name": "thread_sort_index", "ph": "M", - "pid": 3, + "pid": 2, "tid": 3999, } ) @@ -1384,10 +1539,10 @@ def _find_containing_complete(thread_idx: int, finish_us: float): # Process metadata orch_process_label = f"AICPU {orchestrator_name}" if orchestrator_name else "AICPU Orchestrator" events.append( - {"args": {"name": orch_process_label}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 4} + {"args": {"name": orch_process_label}, "cat": "__metadata", "name": "process_name", "ph": "M", "pid": 1} ) events.append( - {"args": {"sort_index": 1}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 4} + {"args": {"sort_index": 1}, "cat": "__metadata", "name": "process_sort_index", "ph": "M", "pid": 1} ) # Thread name metadata for each orchestrator thread @@ -1395,7 +1550,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): tid = 4000 + orch_idx name = f"Orch_{orch_idx}" events.append( - {"args": {"name": name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 4, "tid": tid} + {"args": {"name": name}, "cat": "__metadata", "name": "thread_name", "ph": "M", "pid": 1, "tid": tid} ) # Per-task orchestrator phase bars. As of PR-X the device folds @@ -1413,6 +1568,37 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "orch_fanin": "rail_animation", } + # Build the regular-task and dummy-task id sets so the orch-phase loop + # below can identify which orch_submit envelopes belong to + # `alloc_tensors()` calls (no AICore record, no DummyTask phase — purely + # inline-completed by the orchestrator on host CPU). Those get a + # parallel ALLOC bar on Worker View pid=4 so the DAG node is visible. + regular_task_ids = {int(t.get("task_id", -1)) for t in tasks} + dummy_low32 = set() + if scheduler_phases: + for thread_records in scheduler_phases: + for rec in thread_records: + if rec.get("phase") == "dummy_task": + dummy_low32.add(rec.get("tasks_processed", 0)) + + # ALLOC lane metadata on Worker View (pid=4). One ALLOC_T{orch_idx} + # lane per orchestrator thread (mirror DUMMY_T{thread_idx} convention). + # Tids 18000+orch_idx keep ALLOC lanes grouped just before the DUMMY + # lanes (tids 19000+) so the Worker View reads: + # AIC_* / AIV_* / ALLOC_T* / DUMMY_T* + ALLOC_TID_BASE = 18000 # noqa: N806 + for orch_idx in range(len(orchestrator_phases)): + events.append( + { + "args": {"name": f"ALLOC_T{orch_idx}"}, + "cat": "__metadata", + "name": "thread_name", + "ph": "M", + "pid": 4, + "tid": ALLOC_TID_BASE + orch_idx, + } + ) + for orch_idx, thread_records in enumerate(orchestrator_phases): tid = 4000 + orch_idx for record in thread_records: @@ -1438,14 +1624,42 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "cname": orch_phase_colors.get(phase, "generic_work"), "name": label, "ph": "X", - "pid": 4, + "pid": 1, "tid": tid, "ts": start_us, "dur": dur, } events.append(event) - # AICPU View fanout arrows (duplicate AICore View flow events using AICPU timestamps) + # Classification: orch_submit envelopes whose task_id is NOT in + # the AICore record set AND NOT in DummyTask phases are + # alloc_tensors() calls. Render a parallel "alloc(...)" bar on + # Worker View pid=4 ALLOC_T{orch_idx} so DAG nodes for alloc + # tasks are visible (matches the DUMMY_T treatment). + if phase == "orch_submit" and task_id >= 0: + is_regular = task_id in regular_task_ids + task_low32 = task_id & 0xFFFFFFFF + is_dummy = task_low32 in dummy_low32 + if not is_regular and not is_dummy: + events.append( + { + "args": { + "phase": "alloc", + "task_id": task_id, + "event-hint": f"alloc({format_task_display(task_id)})", + }, + "cat": "event", + "cname": "olive", + "name": f"alloc({format_task_display(task_id)})", + "ph": "X", + "pid": 4, + "tid": ALLOC_TID_BASE + orch_idx, + "ts": start_us, + "dur": max(dur, 0.02), + } + ) + + # Scheduler View fanout arrows (duplicate Worker View flow events using AICPU timestamps) if has_aicpu_data: for task in tasks: src_finish_us = task.get("finish_time_us", 0) @@ -1480,7 +1694,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": aicpu_flow_name, "ph": "s", - "pid": 2, + "pid": 3, "tid": src_tid, "ts": src_finish_us - 0.01, } @@ -1493,7 +1707,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": aicpu_flow_name, "ph": "f", - "pid": 2, + "pid": 3, "tid": dst_tid, "ts": dst_dispatch_us, "bp": "e", @@ -1604,7 +1818,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "complete", "ph": "s", - "pid": 1, + "pid": 4, "tid": src_tid, "ts": last_end_us - FLOW_EPSILON_US, } @@ -1615,7 +1829,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "complete", "ph": "f", - "pid": 3, + "pid": 2, "tid": dst_tid, "ts": comp["start_time_us"], "bp": "e", @@ -1699,7 +1913,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "complete→ready", "ph": "s", - "pid": 3, + "pid": 2, "tid": src_tid, # Anchor inside the complete phase X event so # clicking the complete block surfaces this arrow. @@ -1712,7 +1926,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "complete→ready", "ph": "f", - "pid": 3, + "pid": 2, "tid": 3000 + d_thr, "ts": d_us, "bp": "e", @@ -1770,14 +1984,14 @@ def _find_containing_complete(thread_idx: int, finish_us: float): core_tid = core_to_tid[task["core_id"]] aicpu_tid = task_to_aicpu_tid.get((task["task_id"], task["core_id"]), core_tid) - # Flow: scheduler DISPATCH → AICore View task start + # Flow: scheduler DISPATCH → Worker View task start events.append( { "cat": "flow", "id": flow_id, "name": "dispatch", "ph": "s", - "pid": 3, + "pid": 2, "tid": sched_tid, "ts": dispatch_us, } @@ -1788,7 +2002,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "dispatch", "ph": "f", - "pid": 1, + "pid": 4, "tid": core_tid, "ts": task["start_time_us"], "bp": "e", @@ -1796,7 +2010,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): ) flow_id += 1 - # Flow: scheduler DISPATCH → AICPU View task start + # Flow: scheduler DISPATCH → Scheduler View task start aicpu_eid = task_to_aicpu_event_id.get((task["task_id"], task["core_id"])) events.append( { @@ -1804,7 +2018,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "dispatch", "ph": "s", - "pid": 3, + "pid": 2, "tid": sched_tid, "ts": dispatch_us, } @@ -1814,7 +2028,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": "dispatch", "ph": "f", - "pid": 2, + "pid": 3, "tid": aicpu_tid, "ts": dispatch_us, "bp": "e", @@ -1883,7 +2097,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": flow_name, "ph": "s", - "pid": 4, + "pid": 1, "tid": orch_tid, "ts": anchor_us, } @@ -1894,7 +2108,7 @@ def _find_containing_complete(thread_idx: int, finish_us: float): "id": flow_id, "name": flow_name, "ph": "f", - "pid": 3, + "pid": 2, "tid": sched_tid, "ts": dispatch_us, "bp": "e", diff --git a/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h b/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h index 39450f33c..79c9252f0 100644 --- a/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h +++ b/src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h @@ -97,12 +97,20 @@ struct L2SwimlaneAicoreLocalState { * the same core — each dispatch needs its own AICore * record matched to its own AICPU record, which * task_token_raw alone cannot disambiguate. - * @param start_time Start timestamp (get_sys_cnt) + * @param receive_time Timestamp captured immediately after `read_reg(DATA_MAIN_BASE)` + * returns the new task_id (before dcci+ack). Stored as a + * 32-bit delta `start_time - receive_time` in the record; + * host recovers `receive_time = start_time - delta`. + * Lets DFX split head_OH into the unfixable + * AICPU→AICore NoC propagation (dispatch_ts → receive_time) + * and the AICore-local dcci+ack cost (receive_time → start_time). + * @param start_time Start timestamp (get_sys_cnt) — post-dcci+ack, just + * before kernel `execute_task`. * @param end_time End timestamp */ __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_record_task( __gm__ L2SwimlaneActiveHead *head, L2SwimlaneAicoreLocalState *local, uint64_t task_token_raw, uint32_t reg_task_id, - uint64_t start_time, uint64_t end_time + uint64_t receive_time, uint64_t start_time, uint64_t end_time ) { // Re-fetch head channel each task; cheap relative to the // baseline `dcci(payload, ENTIRE_DATA_CACHE)` we already pay per task. @@ -132,6 +140,9 @@ __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_ record->end_time = end_time; record->task_token_raw = task_token_raw; record->reg_task_id = reg_task_id; + // 32-bit delta; receive_time always precedes start_time on the same core + // (sys_cnt is monotonic per AICore), so the subtraction can never wrap. + record->receive_to_start_cycles = static_cast(start_time - receive_time); local->slot_within_buf = slot + 1; // Flush record to GM so host can read it after the buffer is enqueued. diff --git a/src/a2a3/platform/include/common/l2_swimlane_profiling.h b/src/a2a3/platform/include/common/l2_swimlane_profiling.h index 892140c61..4ed94a588 100644 --- a/src/a2a3/platform/include/common/l2_swimlane_profiling.h +++ b/src/a2a3/platform/include/common/l2_swimlane_profiling.h @@ -158,16 +158,25 @@ static_assert(sizeof(L2SwimlaneAicpuTaskRecord) == 32, "L2SwimlaneAicpuTaskRecor * reg_task_id, giving a clean 1:1 join even when multiple dispatches * of the same task land on the same core. * - * Layout: 24B identity/timing + 4B reg_task_id + 4B pad → 32B (half a - * cache line). Two records pack into one cache line so AICore's per-task - * store is at most a single line commit + dcci. + * Layout: 24B identity/timing + 4B reg_task_id + 4B receive_to_start delta → + * 32B (half a cache line). Two records pack into one cache line so AICore's + * per-task store is at most a single line commit + dcci. + * + * receive_to_start_cycles isolates the AICore-side dcci+ack cost from the + * AICPU→AICore NoC propagation. AICore captures receive_time right after + * `read_reg(DATA_MAIN_BASE)` returns the new task_id (before dcci+ack), and + * start_time after them. Host derives: + * - receive_time = start_time - receive_to_start_cycles + * - propagation = receive_time - dispatch_ts (AICPU view) + * - local_setup = receive_to_start_cycles (dcci + ack) + * Delta fits in 32 bits at any platform clock (50 MHz @ 32-bit ≈ 85 s). */ struct L2SwimlaneAicoreTaskRecord { - uint64_t start_time; // Task start timestamp (get_sys_cnt) - uint64_t end_time; // Task end timestamp - uint64_t task_token_raw; // PTO2TaskId::raw — identity (NOT join key) - uint32_t reg_task_id; // Per-core dispatch token — host join key vs AICPU stream - uint32_t _pad; + uint64_t start_time; // Post-dcci+ack timestamp (kernel begins next) + uint64_t end_time; // Post-kernel timestamp + uint64_t task_token_raw; // PTO2TaskId::raw — identity (NOT join key) + uint32_t reg_task_id; // Per-core dispatch token — host join key vs AICPU stream + uint32_t receive_to_start_cycles; // start_time - receive_time (AICore-local dcci + ack cost) } __attribute__((aligned(32))); static_assert(sizeof(L2SwimlaneAicoreTaskRecord) == 32, "L2SwimlaneAicoreTaskRecord must be 32B"); @@ -471,19 +480,22 @@ static_assert(sizeof(L2SwimlaneDataHeader) % 64 == 0, "L2SwimlaneDataHeader must enum class L2SwimlaneSchedPhaseKind : uint32_t { Complete = 0, // Process completed tasks (fanin traversal) Dispatch = 1, // Dispatch ready tasks to idle cores - // Surfaced for loop iterations that produce no Complete/Dispatch bar but - // still do real work — otherwise hidden inside the "idle" stretch. - Poll = 2, // Completion poll scanned running cores but retired nothing - Release = 3, // Deferred-release drain (on_task_release fanout/refcount work) - // Nested child kinds — rendered inside their parent Complete/Dispatch bar - // (Perfetto stacks time-contained events on the same lane). - Fanout = 4, // on_mixed_task_complete: fanin release + consumer doorbells - Scan = 5, // one check_running_cores pass — tasks_processed = cores' MMIO COND read - // Speculative early-dispatch (Hook 1): building + publishing a flagged - // producer's consumer's gated blocks. Its own bar so a long staging pass is - // NOT mislabeled Poll — it runs on the producer's owner thread and the - // payload-build cost is real (~us/subtask), so it must be visible directly. - Prestage = 6, // try_speculative_prestage — tasks_processed = blocks staged + Release = 2, // Deferred-release drain (on_task_release work) + // The following are nested inside their parent Complete/Dispatch bar + // (Perfetto stacks time-contained events on the same lane), except + // DummyTask which the converter routes to Worker View pid=4. + Resolve = 3, // on_task_complete: walk consumer list, decrement fanin, + // push newly-ready successors, ring doorbells for + // speculative hits. tasks_processed = # consumers visited. + EarlyDispatch = 4, // try_speculative_early_dispatch — speculative pre-staging + // of a flagged producer's consumer's gated blocks. + // tasks_processed = blocks staged this pass. + DummyTask = 5, // Dummy (DAG fence / barrier node) identity marker + // emitted once per dummy in dummy_drain. No AICore + // presence; the converter renders this as a thin bar on + // Worker View's DUMMY_T{thread} lane. tasks_processed = + // task_token_raw low 32 bits. The surrounding Resolve + // bar captures the consumer-release work that follows. }; /** Index layout of the queue-depth snapshot arrays below: AIC=0, AIV=1, MIX=2. diff --git a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp b/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp index 088ee645e..5f4459660 100644 --- a/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp +++ b/src/a2a3/platform/shared/host/l2_swimlane_collector.cpp @@ -764,7 +764,7 @@ int L2SwimlaneCollector::export_swimlane_json() { // file size). Column order is documented in the schema comment at the top // of swimlane_converter.py's v2 reader. // - // aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles] + // aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, receive_to_start_cycles] // aicpu_tasks: [core_id, reg_task_id, dispatch_cycles, finish_cycles] { // copy_aicore_buffer already drops r.start_time == 0 slots when @@ -776,7 +776,7 @@ int L2SwimlaneCollector::export_swimlane_json() { for (const auto &r : collected_aicore_records_[core_idx]) { if (!first) outfile << ","; outfile << "\n [" << core_idx << ", " << r.task_token_raw << ", " << r.reg_task_id << ", " - << r.start_time << ", " << r.end_time << "]"; + << r.start_time << ", " << r.end_time << ", " << r.receive_to_start_cycles << "]"; first = false; total++; } @@ -813,16 +813,14 @@ int L2SwimlaneCollector::export_swimlane_json() { return "complete"; case L2SwimlaneSchedPhaseKind::Dispatch: return "dispatch"; - case L2SwimlaneSchedPhaseKind::Poll: - return "poll"; case L2SwimlaneSchedPhaseKind::Release: return "release"; - case L2SwimlaneSchedPhaseKind::Fanout: - return "fanout"; - case L2SwimlaneSchedPhaseKind::Scan: - return "scan"; - case L2SwimlaneSchedPhaseKind::Prestage: - return "prestage"; + case L2SwimlaneSchedPhaseKind::Resolve: + return "resolve"; + case L2SwimlaneSchedPhaseKind::EarlyDispatch: + return "early_dispatch"; + case L2SwimlaneSchedPhaseKind::DummyTask: + return "dummy_task"; } return "unknown"; }; diff --git a/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp b/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp index 9c50b986e..30417ae8e 100644 --- a/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp +++ b/src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp @@ -84,6 +84,16 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in } { + // receive_time captures the instant DATA_MAIN_BASE returned a new + // task_id, BEFORE the ack write. Paired with start_time (captured + // after task_ptr resolve) it lets DFX split head_OH into the + // AICPU→AICore NoC propagation (dispatch_ts → receive_time, + // hardware-bound) and the AICore-local ack + task_ptr resolve + // (receive_time → start_time). host_build_graph has no per-task + // dcci so the local-setup span is naturally tighter than the + // tensormap_and_ringbuffer runtime; the field still records it. + uint64_t receive_time = get_sys_cnt_aicore(); + uint32_t actual_task_id = task_id; write_reg(RegId::COND, MAKE_ACK_VALUE(actual_task_id)); @@ -118,7 +128,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in // dispatch token coincide and a single value covers both. l2_swimlane_aicore_record_task( l2_swimlane_head, &l2_swimlane_local, static_cast(actual_task_id), - static_cast(actual_task_id), start_time, end_time + static_cast(actual_task_id), receive_time, start_time, end_time ); } diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp index 54d3a997c..f10737969 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp @@ -133,6 +133,29 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in } { + // receive_time marks the moment AICPU's full "task is ready to + // execute" signal landed on this core. Paired with start_time + // (captured after the per-task dcci + ack pair) it lets DFX split + // head_OH into the AICPU→AICore-ready propagation (dispatch_ts → + // receive_time, hardware + scheduling-bound) and the AICore-local + // critical-path prep (receive_time → start_time, software-tunable). + // Stored in the record as a 32-bit delta `start_time - receive_time`. + // + // For the common path (not_ready == 0) the new task_id on + // DATA_MAIN_BASE is itself the ready signal, so receive_time is + // stamped immediately and local_setup covers dcci + ack. + // + // For the speculative early-dispatch path (not_ready == 1) the + // dcci ran BEFORE the dependency-wait spin, so its cost is hidden + // behind the doorbell-wait — not on the critical path between + // "task genuinely ready" and "kernel begins". receive_time is + // re-stamped after the doorbell arrives, so propagation absorbs + // both the original NoC delivery AND any speculation overshoot, + // while local_setup stays the pure ack-on-critical-path cost. This + // makes local_setup the clean "AICore prep we can't hide" figure + // for both paths. + uint64_t receive_time = get_sys_cnt_aicore(); + uint32_t task_id = reg_val; // Decode: register holds task_id directly // Select dual-buffer slot: same bit as AICPU used when writing payload @@ -172,6 +195,12 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in write_reg(RegId::COND, AICORE_EXITED_VALUE); break; } + // Re-stamp receive_time at the moment the doorbell landed: the + // dcci above ran during the speculative-staging window + // (overlapped with the dependency wait, off the critical path). + // Propagation now absorbs the speculation overshoot; local_setup + // = start - receive stays the pure ack-on-critical-path cost. + receive_time = get_sys_cnt_aicore(); } write_reg(RegId::COND, MAKE_ACK_VALUE(task_id)); @@ -221,7 +250,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in uint64_t end_time = get_sys_cnt_aicore(); uint64_t task_token_raw = exec_payload->local_context.async_ctx.task_token.raw; l2_swimlane_aicore_record_task( - l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, start_time, end_time + l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, receive_time, start_time, end_time ); } } diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md index 38bbf0d53..a3cc143c6 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md @@ -501,7 +501,7 @@ Each scheduler thread runs a tight loop with two main phases: **Phase 1 — Completion Handling**: - Poll register `COND` on each managed core -- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_mixed_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark +- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark **Phase 2 — Dispatch**: diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md index fabdde65f..8cba7e90c 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md @@ -128,7 +128,7 @@ Queueing key is normalized resource shape (not raw slot label). 1. Fanin release/readiness remains dependency-correct and graph-level. 2. Two-stage completion: - `on_subtask_complete(task_id, subslot)` - - `on_mixed_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks` + - `on_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks` 3. Downstream release is triggered once per mixed task completion, not once per subslot. ## 9. Executor Ownership and Numbering diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md index f2bd0aaf6..af661d440 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md @@ -109,7 +109,7 @@ The scheduler loop runs four phases each iteration. Each phase's time is accumul | Phase | What it does | Inline stats | | ----- | ------------ | ------------ | -| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_mixed_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release | +| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release | | **scan** | Updates the perf profiling header with latest scheduler state | — | | **dispatch** | For each idle core, pops a task from the shape-based ready queue via `get_ready_task(shape)`, builds the dispatch payload, and writes the task to the core's handshake register | `pop`: `hit` = successful pops (task dispatched), `miss` = empty queue pops, `hit_rate` = hit/(hit+miss) | | **idle** | Scheduler loop iteration where no progress was made (no completions, no dispatches) | — | diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h index d67626662..0f73a043a 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h @@ -39,7 +39,7 @@ static_assert( // observation flattened from a DeferredCompletionEntry. TASK_NORMAL_DONE // carries the slot_state pointer in `addr` so the consumer can finalize the // AsyncWaitEntry.slot_state binding for tasks whose conditions arrived -// before the FIN thread saw mixed_complete. New kinds may be added in future +// before the FIN thread saw task_complete. New kinds may be added in future // without growing the message — the `_pad[5]` slack is reserved for // kind-specific payload extension. #define MSG_KIND_CONDITION 0u diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 0445c798a..a72b91016 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -889,7 +889,7 @@ TaskOutputTensors PTO2OrchestratorState::alloc_tensors(const Arg &args) { if (prepared.slot_state != nullptr) { // Hidden alloc tasks complete inline in the orchestrator before any // consumer can exist, so they have no fanout to notify and no worker - // subtasks to retire. Running the full on_mixed_task_complete path + // subtasks to retire. Running the full on_task_complete path // would only pay unnecessary fanout_lock / traversal overhead here. // The generic slot initialization done in prepare_task() is still // required so scope_end can release the producer-side reference and diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h index f143bbaf3..f5213dca7 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h @@ -912,7 +912,7 @@ struct PTO2SchedulerState { // fanout). When that producer is thread-local (e.g. a 16-block AIV op filling one // thread's cores), the other threads never see the consumer and its blocks on // their cores can't pre-stage. The first claimer pushes the partially-staged - // consumer here; every idle thread's prestage pass pops one, stages a range onto + // consumer here; every idle thread's early_dispatch pass pops one, stages a range onto // ITS OWN cores (range-claim via next_block_idx), and re-pushes if blocks remain // — exactly mirroring how a partially-dispatched SPMD task is re-pushed to the // ready queue (scheduler_dispatch: pop -> claim -> re-push). A stale/released @@ -974,7 +974,7 @@ struct PTO2SchedulerState { } // Collects consumers released via the speculative-doorbell path during a - // single on_mixed_task_complete fanout walk, so their dispatch_fanin + // single on_task_complete fanout walk, so their dispatch_fanin // propagation runs AFTER the walk — never between two siblings' doorbells. struct SpecReleaseSink { static constexpr int CAP = 32; @@ -1149,16 +1149,22 @@ struct PTO2SchedulerState { /** * Two-stage completion: second stage. - * Called exactly once when all subtasks of a mixed task are done - * (i.e., on_subtask_complete returned true). - * Handles fanout notification, fanin release, and self-consumption check. + * Called exactly once when all subtasks of a task are done (i.e., + * on_subtask_complete returned true). Walks the consumer (fanout) list, + * decrements each consumer's fanin, pushes newly-ready ones, and rings + * doorbells for speculative hits. + * + * Non-PROFILING returns the consumer-walk count (= edges traversed). The + * Resolve swimlane bar reads it to label the bar with how many successors + * actually got resolved. PROFILING returns the richer CompletionStats + * whose `fanout_edges` carries the same number. */ #if PTO2_SCHED_PROFILING CompletionStats #else - void + uint32_t #endif - on_mixed_task_complete( + on_task_complete( PTO2TaskSlotState &slot_state, #if PTO2_SCHED_PROFILING int thread_idx, @@ -1168,6 +1174,8 @@ struct PTO2SchedulerState { ) { #if PTO2_SCHED_PROFILING CompletionStats stats = {0, 0, 0, true}; +#else + uint32_t consumer_walk_count = 0; #endif #if PTO2_SCHED_PROFILING extern uint64_t g_sched_lock_cycle[], g_sched_fanout_cycle[]; @@ -1202,7 +1210,7 @@ struct PTO2SchedulerState { // // Safe on silicon: the producer's slot is already COMPLETED here — every // SPMD block has FIN'd AND dcci-flushed its output to HBM before - // on_mixed_task_complete runs — so a released consumer never reads stale + // on_task_complete runs — so a released consumer never reads stale // producer output. (Batching used to align the released wave, but pushed // every doorbell to the end of the walk, defeating the whole point of // speculative early-dispatch: minimal producer-end -> consumer-start.) @@ -1221,6 +1229,7 @@ struct PTO2SchedulerState { stats.tasks_enqueued++; } #else + consumer_walk_count++; release_fanin_and_check_ready(consumer_slot, local_bufs, &rel_sink); #endif current = current->next; @@ -1234,6 +1243,8 @@ struct PTO2SchedulerState { g_sched_push_wait_cycle[thread_idx] += push_wait; PTO2_SCHED_CYCLE_LAP(g_sched_fanout_cycle[thread_idx]); return stats; +#else + return consumer_walk_count; #endif } @@ -1312,7 +1323,7 @@ struct PTO2SchedulerState { // try_inline_complete_locked: short-circuit NotDeferred completions seen during // drain so they don't grow entries[]. Defined here (not in pto_async_wait.h) -// because PTO2SchedulerState's on_mixed_task_complete signature is only known +// because PTO2SchedulerState's on_task_complete signature is only known // after its full definition above. // // When the deferred_release_slot_states[] buffer is full, drain it via @@ -1321,10 +1332,12 @@ struct PTO2SchedulerState { // rates don't surface as ASYNC_WAIT_OVERFLOW errors. inline bool AsyncWaitList::try_inline_complete_locked(AsyncWaitList::DrainCompletionSink &sink, PTO2TaskSlotState &slot_state) { + // Return value (CompletionStats / consumer-walk count) discarded: + // async-wait drain path has no Resolve swimlane bar attached. #if PTO2_SCHED_PROFILING - sink.sched->on_mixed_task_complete(slot_state, sink.thread_idx, sink.local_bufs); + (void)sink.sched->on_task_complete(slot_state, sink.thread_idx, sink.local_bufs); #else - sink.sched->on_mixed_task_complete(slot_state, sink.local_bufs); + (void)sink.sched->on_task_complete(slot_state, sink.local_bufs); #endif if (*sink.deferred_release_count >= sink.deferred_release_capacity) { while (*sink.deferred_release_count > 0) { @@ -1401,10 +1414,12 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete( } if (entry.normal_done && entry.waiting_completion_count <= 0) { + // Return value (CompletionStats / consumer-walk count) discarded: + // deferred-completion drain has no Resolve swimlane bar attached. #if PTO2_SCHED_PROFILING - sched->on_mixed_task_complete(*entry.slot_state, thread_idx, local_bufs); + (void)sched->on_task_complete(*entry.slot_state, thread_idx, local_bufs); #else - sched->on_mixed_task_complete(*entry.slot_state, local_bufs); + (void)sched->on_task_complete(*entry.slot_state, local_bufs); #endif // Drain deferred_release in place when the buffer fills — same // overflow-drain idiom used by complete_slot_task's inline path @@ -1439,7 +1454,7 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete( #if PTO2_SCHED_PROFILING struct PTO2SchedProfilingData { - // Sub-phase cycle breakdown within on_mixed_task_complete + // Sub-phase cycle breakdown within on_task_complete uint64_t lock_cycle; // lock_fanout + state store + unlock uint64_t fanout_cycle; // fanout traversal uint64_t fanin_cycle; // fanin traversal diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp index 91f89d7aa..37c73cb6a 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp @@ -132,7 +132,7 @@ void SchedulerContext::complete_slot_task( if (cond_count > 0) { // Publish "this task is deferred" before on_subtask_complete so the // acq_rel fetch_add inside on_subtask_complete makes the flag - // visible to whichever subtask sees mixed_complete=true (which may + // visible to whichever subtask sees task_complete=true (which may // be this thread or a later one). slot_state.any_subtask_deferred.store(true, std::memory_order_release); @@ -147,17 +147,17 @@ void SchedulerContext::complete_slot_task( } } - bool mixed_complete = sched_->on_subtask_complete(slot_state); + bool task_complete = sched_->on_subtask_complete(slot_state); #if PTO2_PROFILING // Sub-block retire that did not finish the slot: record it so the poll // iteration becomes visible on the scheduler lane (the SPMD harvest tail). - if (!mixed_complete && l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) { + if (!task_complete && l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) { l2_swimlane.phase_subretire_count++; } #endif - if (mixed_complete && slot_state.payload != nullptr && + if (task_complete && slot_state.payload != nullptr && slot_state.any_subtask_deferred.load(std::memory_order_acquire)) { // Some subtask of this task registered conditions; finish the // registration by handing the slot_state off to the consumer. @@ -168,7 +168,7 @@ void SchedulerContext::complete_slot_task( defer_completion_to_consumer = true; } - if (mixed_complete && !defer_completion_to_consumer) { + if (task_complete && !defer_completion_to_consumer) { #if PTO2_PROFILING if (is_dump_tensor_enabled()) { dump_tensors_for_task( @@ -183,25 +183,41 @@ void SchedulerContext::complete_slot_task( } #endif #if PTO2_PROFILING - // Time the fanout (fanin release + consumer doorbells) so it renders as - // a child bar nested inside this iteration's Complete bar — surfacing how - // much of a long complete is the 50-consumer walk + doorbell vs scan. - uint64_t fanout_t0 = (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) ? get_sys_cnt_aicpu() : 0; + // Time Resolve (walk the consumer list, decrement each consumer's + // fanin, push the newly-ready ones, ring doorbells for speculative + // hits) so it renders as a child bar nested inside this iteration's + // Complete bar. The 1 µs floor below filters out the ~88% of tasks + // with 1-2 consumers (~500 ns Resolve) so only the long broadcast / + // reduction walks stand out on the lane. + uint64_t resolve_t0 = (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) ? get_sys_cnt_aicpu() : 0; #endif + // [[maybe_unused]] silences -Werror=unused-but-set-variable on the + // profiling-flags-smoke build path where PTO2_PROFILING is OFF and + // the Resolve emit below is excluded. + [[maybe_unused]] uint32_t consumers_resolved = 0; #if PTO2_SCHED_PROFILING // SCHED_PROFILING variant takes thread_idx for its per-thread atomic // counter side-effects (g_sched_*_atomic_count[thread_idx], consumed - // by the otc_* log lines). Its return value is unused. - (void)sched_->on_mixed_task_complete(slot_state, thread_idx, local_bufs); + // by the otc_* log lines). It returns CompletionStats whose + // `fanout_edges` is the consumer-walk count. + consumers_resolved = sched_->on_task_complete(slot_state, thread_idx, local_bufs).fanout_edges; #else - sched_->on_mixed_task_complete(slot_state, local_bufs); + consumers_resolved = sched_->on_task_complete(slot_state, local_bufs); #endif #if PTO2_PROFILING - if (fanout_t0 != 0) { - l2_swimlane_aicpu_record_sched_phase( - thread_idx, L2SwimlaneSchedPhaseKind::Fanout, fanout_t0, get_sys_cnt_aicpu(), - l2_swimlane.sched_loop_count, /*tasks_processed=*/0 - ); + if (resolve_t0 != 0) { + uint64_t resolve_t1 = get_sys_cnt_aicpu(); + // Filter: drop Resolve bars under 1 µs so the lane shows only + // resolves that did meaningful work (high consumer counts or + // doorbells). 50 cycles @ 50 MHz = 1 µs (PLATFORM_PROF_SYS_CNT_FREQ + // is the device sys-cnt frequency). + constexpr uint64_t RESOLVE_EMIT_MIN_CYCLES = PLATFORM_PROF_SYS_CNT_FREQ / 1'000'000; // 1 µs + if (resolve_t1 - resolve_t0 >= RESOLVE_EMIT_MIN_CYCLES) { + l2_swimlane_aicpu_record_sched_phase( + thread_idx, L2SwimlaneSchedPhaseKind::Resolve, resolve_t0, resolve_t1, l2_swimlane.sched_loop_count, + consumers_resolved + ); + } } l2_swimlane.phase_complete_count++; #endif @@ -283,14 +299,6 @@ void SchedulerContext::check_running_cores_for_completion( ) { #if PTO2_SCHED_PROFILING auto &l2_swimlane = sched_l2_swimlane_[thread_idx]; -#endif -#if PTO2_PROFILING - // Time this whole pass and count cores whose MMIO COND we actually read - // (skip-gated cores excluded) — emit a Scan bar so the per-pass MMIO cost - // and effective re-poll cadence are visible directly. - bool scan_record = l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES; - uint64_t scan_t0 = scan_record ? get_sys_cnt_aicpu() : 0; - uint32_t scan_cores = 0; #endif CoreTracker &tracker = core_trackers_[thread_idx]; auto running_core_states = tracker.get_all_running_cores(); @@ -317,9 +325,6 @@ void SchedulerContext::check_running_cores_for_completion( // --- Judgment phase: read register, derive transition --- // Use the precomputed cond_ptr (resolved once in handshake) to skip // the reg_offset switch and reg_addr addition on every poll. -#if PTO2_PROFILING - scan_cores++; // counts cores whose MMIO COND we actually read this pass -#endif uint64_t reg_val = static_cast(*core.cond_ptr); // ARM64 allows Device-nGnRnE -> Normal-cacheable load reorder; the // rmb() pins any AICore-published cacheable reads downstream of the @@ -425,14 +430,6 @@ void SchedulerContext::check_running_cores_for_completion( made_progress = true; } } -#if PTO2_PROFILING - if (scan_record && scan_cores > 0) { - l2_swimlane_aicpu_record_sched_phase( - thread_idx, L2SwimlaneSchedPhaseKind::Scan, scan_t0, get_sys_cnt_aicpu(), - sched_l2_swimlane_[thread_idx].sched_loop_count, scan_cores - ); - } -#endif } // ============================================================================= diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h index 1ede7375d..3dcf4d7f5 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h @@ -275,7 +275,7 @@ class SchedulerContext { // cores spare, pre-stage the consumers of any RUNNING flagged producer onto // those cores with not_ready=1 (gated). Touches no dependency state — the // task is released by the doorbell at its normal ready-pop (Hook 2). - int32_t try_speculative_prestage(int32_t thread_idx); + int32_t try_speculative_early_dispatch(int32_t thread_idx); // Stage the already-claimed range [start, start+count) of consumer `c` onto // thread_idx's idle (RUNNING slot) then pending (gated-pending, promote-on-FIN) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp index b4dadb1e0..64b054754 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp @@ -620,7 +620,7 @@ int32_t SchedulerContext::stage_consumer_blocks( CoreTracker &tracker = core_trackers_[thread_idx]; // Stamp the real pre-stage time (NOT 0) so the swimlane shows these blocks // dispatched during the producer's run, not at trace start. - uint64_t prestage_ts = get_sys_cnt_aicpu(); + uint64_t early_dispatch_ts = get_sys_cnt_aicpu(); uint64_t my_cores[PTO2_SPEC_CORE_MASK_WORDS] = {0}; // cores this thread gated (for self-ring) int32_t staged = 0; int32_t block = start; @@ -642,7 +642,7 @@ int32_t SchedulerContext::stage_consumer_blocks( if (n == 0) return; wmb(); for (int i = 0; i < n; i++) { - publish_subtask_to_core(handles[i], prestage_ts); + publish_subtask_to_core(handles[i], early_dispatch_ts); int32_t cid = tracker.get_core_id_by_offset(handles[i].core_offset); sched_->spec_doorbell_table[cid].addr = handles[i].reg_addr; sched_->spec_doorbell_table[cid].token = handles[i].reg_task_id; @@ -677,8 +677,8 @@ int32_t SchedulerContext::stage_consumer_blocks( // EVENT-DRIVEN by propagate_dispatch_fanin (a flagged producer's dispatch bumps its // consumers' dispatch_fanin; reaching fanin_count enqueues the consumer) — there is // no per-iteration PULL scan here anymore. This pass only DRAINS the queue. -// Returns the number of blocks staged this pass (for the Prestage swimlane bar). -int32_t SchedulerContext::try_speculative_prestage(int32_t thread_idx) { +// Returns the number of blocks staged this pass (for the EarlyDispatch swimlane bar). +int32_t SchedulerContext::try_speculative_early_dispatch(int32_t thread_idx) { constexpr int PTO2_EARLY_DISPATCH_DRAIN_MAX = 8; // bounded pops per pass CoreTracker &tracker = core_trackers_[thread_idx]; int32_t total_staged = 0; @@ -874,18 +874,6 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) { capture_phase_end(phase_start_local, phase_start_shared); } - // Flush the open activity-fill run (Poll/Idle) as a phase record. Called - // when a Complete/Dispatch bar takes over, when the fill kind changes, and - // at loop exit — so every iteration is attributed and the lane has no blanks. - auto flush_activity_fill = [&]() { - if (l2_swimlane.fill_kind != 0) { - l2_swimlane_aicpu_record_sched_phase( - thread_idx, static_cast(l2_swimlane.fill_kind - 1), l2_swimlane.fill_start, - l2_swimlane.fill_end, l2_swimlane.sched_loop_count, /*tasks_processed=*/0 - ); - l2_swimlane.fill_kind = 0; - } - }; #endif // Wall-clock timestamp of the last completed task on this thread. @@ -904,13 +892,11 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ CYCLE_COUNT_START(); l2_swimlane.sched_loop_count++; uint64_t _t0_phase = _t0; - // Activity-fill bookkeeping: an iteration that emits no Complete/Dispatch - // bar and scanned running cores (retired nothing) is filled with a Poll - // bar; genuine spin stays blank. Release is NOT classified here — it is a - // distinct operation emitted with its own span in the idle branch below. - uint64_t iter_start_ts = _t0; - uint64_t fill_complete_at_start = l2_swimlane.sched_complete_cycle; - bool emitted_phase_bar = false; + // Release is the only "no Complete/Dispatch bar" attribution we keep — + // emitted with its own span in the idle branch below. Iterations that + // only scan/poll show as blank gaps; the per-loop Poll/Scan bars (PR + // #1079 debug overlay) were removed since "scheduler is polling when + // there's nothing to do" carries no actionable signal. // Per-iter lazy shared-queue snapshot: first phase emit in this iter // pays the atomic-load cost, subsequent emits in the same iter reuse // the cached value. Reset here so we re-sample exactly once per iter @@ -1021,7 +1007,6 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ _t0_phase = _t1; l2_swimlane.phase_complete_count = 0; l2_swimlane.phase_subretire_count = 0; - emitted_phase_bar = true; } } #endif @@ -1060,10 +1045,52 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ int dummy_got = sched_->dummy_ready_queue.pop_batch(dummy_batch, DUMMY_DRAIN_BATCH); for (int di = 0; di < dummy_got; di++) { PTO2TaskSlotState &dummy_slot = *dummy_batch[di]; + + // ----- DummyTask phase: dummy "task" identity marker. -------- + // The dummy has no AICore presence — start ≈ end (1 cycle + // wide, just "we identified it"). Converter renders this on + // Worker View's DUMMY_T{thread} lane so the DAG node is + // visually present. tasks_processed = task_token low 32 bits + // (= local_id within ring) so deps.json flow arrows can land. + // The Resolve work that follows is emitted separately below. +#if PTO2_PROFILING + if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) { + uint64_t dummy_marker_t = get_sys_cnt_aicpu(); + uint32_t dummy_id_low32 = static_cast(dummy_slot.task->task_id.raw & 0xFFFFFFFFu); + l2_swimlane_aicpu_record_sched_phase( + thread_idx, L2SwimlaneSchedPhaseKind::DummyTask, dummy_marker_t, dummy_marker_t, + sched_l2_swimlane_[thread_idx].sched_loop_count, dummy_id_low32 + ); + } +#endif + + // ----- Resolve work: walk this dummy's consumer list. ------ + // Same 1 µs filter as the main-path Resolve emit suppresses + // dummies whose consumer release runs sub-microsecond. +#if PTO2_PROFILING + uint64_t dummy_resolve_t0 = + (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) ? get_sys_cnt_aicpu() : 0; +#endif + // [[maybe_unused]] silences -Werror=unused-but-set-variable on + // the profiling-flags-smoke build path where PTO2_PROFILING is + // OFF and the Resolve emit below is excluded. + [[maybe_unused]] uint32_t dummy_consumers = 0; #if PTO2_SCHED_PROFILING - sched_->on_mixed_task_complete(dummy_slot, thread_idx, local_bufs); + dummy_consumers = sched_->on_task_complete(dummy_slot, thread_idx, local_bufs).fanout_edges; #else - sched_->on_mixed_task_complete(dummy_slot, local_bufs); + dummy_consumers = sched_->on_task_complete(dummy_slot, local_bufs); +#endif +#if PTO2_PROFILING + if (dummy_resolve_t0 != 0) { + uint64_t dummy_resolve_t1 = get_sys_cnt_aicpu(); + constexpr uint64_t RESOLVE_EMIT_MIN_CYCLES = PLATFORM_PROF_SYS_CNT_FREQ / 1'000'000; // 1 µs + if (dummy_resolve_t1 - dummy_resolve_t0 >= RESOLVE_EMIT_MIN_CYCLES) { + l2_swimlane_aicpu_record_sched_phase( + thread_idx, L2SwimlaneSchedPhaseKind::Resolve, dummy_resolve_t0, dummy_resolve_t1, + sched_l2_swimlane_[thread_idx].sched_loop_count, dummy_consumers + ); + } + } #endif // Dummy tasks have no subtasks to retire and no fanout pre-conditions // beyond their own producers; release self-reference so the slot can @@ -1104,31 +1131,31 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ if (sched_->ready_queues[s].size() > 0 || local_bufs[s].count > 0) any_ready_work = true; } #if PTO2_PROFILING - bool prestage_record = l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES; - uint64_t prestage_t0 = prestage_record ? get_sys_cnt_aicpu() : 0; + bool early_dispatch_record = l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES; + uint64_t early_dispatch_t0 = early_dispatch_record ? get_sys_cnt_aicpu() : 0; #endif - // Skip speculative prestage under PMU: dispatch_ready_tasks already withholds - // PENDING dispatch when pmu_active to preserve single-issue PMU windows, and - // prestaging gated work into idle/pending slots would perturb the same windows. - [[maybe_unused]] int32_t prestaged = (pmu_active || any_ready_work) ? 0 : try_speculative_prestage(thread_idx); + // Skip speculative early-dispatch under PMU: dispatch_ready_tasks already + // withholds PENDING dispatch when pmu_active to preserve single-issue PMU + // windows, and staging gated work into idle/pending slots would perturb the + // same windows. + [[maybe_unused]] int32_t staged_count = + (pmu_active || any_ready_work) ? 0 : try_speculative_early_dispatch(thread_idx); #if PTO2_PROFILING - // Emit a Prestage bar so a staging-dominated iteration shows as Prestage, - // not mislabeled Poll (the cheap scan also ran, so the activity-fill would - // otherwise charge the whole us-scale staging to Poll). - if (prestage_record && prestaged > 0) { + // Emit an EarlyDispatch bar so a staging-dominated iteration is attributed + // to early-dispatch rather than disappearing into a blank gap. + if (early_dispatch_record && staged_count > 0) { l2_swimlane_aicpu_record_sched_phase( - thread_idx, L2SwimlaneSchedPhaseKind::Prestage, prestage_t0, get_sys_cnt_aicpu(), - sched_l2_swimlane_[thread_idx].sched_loop_count, static_cast(prestaged) + thread_idx, L2SwimlaneSchedPhaseKind::EarlyDispatch, early_dispatch_t0, get_sys_cnt_aicpu(), + sched_l2_swimlane_[thread_idx].sched_loop_count, static_cast(staged_count) ); // prepare_block_for_dispatch bumped phase_dispatch_count while staging; - // those blocks belong to this Prestage bar, so clear the counter before it - // leaks into the next Dispatch bar. + // those blocks belong to this EarlyDispatch bar, so clear the counter + // before it leaks into the next Dispatch bar. sched_l2_swimlane_[thread_idx].phase_dispatch_count = 0; - emitted_phase_bar = true; // suppress the Poll/idle fill for this iteration } #endif - // Second completion poll. dispatch_ready_tasks + try_speculative_prestage + // Second completion poll. dispatch_ready_tasks + try_speculative_early_dispatch // above can take several us in a busy window; a producer block that FINs // during them would otherwise wait for the NEXT iteration's top-of-loop // Phase-1 poll (the ~7us detection latency that delays a flagged @@ -1191,7 +1218,6 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ l2_swimlane.phase_dispatch_count = 0; l2_swimlane.pop_hit_at_last_emit = l2_swimlane.pop_hit; l2_swimlane.pop_miss_at_last_emit = l2_swimlane.pop_miss; - emitted_phase_bar = true; } } #endif @@ -1290,42 +1316,8 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ } #endif } - -#if PTO2_PROFILING - // Activity-fill: attribute an iteration that emitted no Complete/Dispatch - // bar to the real operation that dominated it — Release (on_task_release - // drain) or Poll (completion scan that retired nothing). Genuine spin - // (neither did work) emits nothing and shows as a blank gap. Consecutive - // same-kind iterations coalesce into one bar. - if (l2_swimlane_level_ >= L2SwimlaneLevel::SCHED_PHASES) { - if (emitted_phase_bar) { - flush_activity_fill(); - } else { - uint64_t dc = l2_swimlane.sched_complete_cycle - fill_complete_at_start; - // Poll = scanned running cores, retired nothing; else genuine - // spin (blank). Release is emitted separately in the idle branch. - uint32_t kc = (dc > 0) ? (static_cast(L2SwimlaneSchedPhaseKind::Poll) + 1) : 0; - if (kc == 0) { - flush_activity_fill(); // genuine spin — leave a blank gap - } else if (l2_swimlane.fill_kind == kc) { - l2_swimlane.fill_end = _t1; - } else { - flush_activity_fill(); - l2_swimlane.fill_kind = kc; - l2_swimlane.fill_start = iter_start_ts; - l2_swimlane.fill_end = _t1; - } - } - } -#endif } -#if PTO2_PROFILING - // Close the final open activity-fill run so the trailing idle/poll stretch - // before loop exit is attributed. - flush_activity_fill(); -#endif - // Drain any entries left in the deferred-release batch. The in-loop flush // only fires on idle iterations and on buffer-full; a loop exit while the // last iteration made progress can leave entries un-released. Drop them diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h index ae42ec9fe..e4cd72272 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h @@ -388,12 +388,6 @@ struct alignas(64) SchedL2SwimlaneCounters { uint64_t pop_miss{0}; uint64_t pop_hit_at_last_emit{0}; uint64_t pop_miss_at_last_emit{0}; - // Activity-fill: coalesced segment covering iterations that emit no - // Complete/Dispatch bar, so the scheduler lane tiles with no blanks. - // fill_kind: 0 = none open, else L2SwimlaneSchedPhaseKind value + 1. - uint32_t fill_kind{0}; - uint64_t fill_start{0}; - uint64_t fill_end{0}; #if PTO2_SCHED_PROFILING uint32_t phase_wiring_count{0}; uint64_t complete_probe_count{0}; diff --git a/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h b/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h index a61f6a570..fbfcca415 100644 --- a/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h +++ b/src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h @@ -98,12 +98,20 @@ struct L2SwimlaneAicoreLocalState { * the same core — each dispatch needs its own AICore * record matched to its own AICPU record, which * task_token_raw alone cannot disambiguate. - * @param start_time Start timestamp (get_sys_cnt) + * @param receive_time Timestamp captured immediately after `read_reg(DATA_MAIN_BASE)` + * returns the new task_id (before dcci+ack). Stored as a + * 32-bit delta `start_time - receive_time` in the record; + * host recovers `receive_time = start_time - delta`. + * Lets DFX split head_OH into the unfixable + * AICPU→AICore NoC propagation (dispatch_ts → receive_time) + * and the AICore-local dcci+ack cost (receive_time → start_time). + * @param start_time Start timestamp (get_sys_cnt) — post-dcci+ack, just + * before kernel `execute_task`. * @param end_time End timestamp */ __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_record_task( __gm__ L2SwimlaneActiveHead *head, L2SwimlaneAicoreLocalState *local, uint64_t task_token_raw, uint32_t reg_task_id, - uint64_t start_time, uint64_t end_time + uint64_t receive_time, uint64_t start_time, uint64_t end_time ) { // Re-fetch head channel each task; cheap relative to the // baseline `dcci(payload, ENTIRE_DATA_CACHE)` we already pay per task. @@ -133,6 +141,9 @@ __aicore__ __attribute__((always_inline)) static inline void l2_swimlane_aicore_ record->end_time = end_time; record->task_token_raw = task_token_raw; record->reg_task_id = reg_task_id; + // 32-bit delta; receive_time always precedes start_time on the same core + // (sys_cnt is monotonic per AICore), so the subtraction can never wrap. + record->receive_to_start_cycles = static_cast(start_time - receive_time); local->slot_within_buf = slot + 1; // Flush record to GM so host can read it after the buffer is enqueued. diff --git a/src/a5/platform/include/common/l2_swimlane_profiling.h b/src/a5/platform/include/common/l2_swimlane_profiling.h index 9b33101c4..ccdb0b545 100644 --- a/src/a5/platform/include/common/l2_swimlane_profiling.h +++ b/src/a5/platform/include/common/l2_swimlane_profiling.h @@ -158,16 +158,25 @@ static_assert(sizeof(L2SwimlaneAicpuTaskRecord) == 32, "L2SwimlaneAicpuTaskRecor * reg_task_id, giving a clean 1:1 join even when multiple dispatches * of the same task land on the same core. * - * Layout: 24B identity/timing + 4B reg_task_id + 4B pad → 32B (half a - * cache line). Two records pack into one cache line so AICore's per-task - * store is at most a single line commit + dcci. + * Layout: 24B identity/timing + 4B reg_task_id + 4B receive_to_start delta → + * 32B (half a cache line). Two records pack into one cache line so AICore's + * per-task store is at most a single line commit + dcci. + * + * receive_to_start_cycles isolates the AICore-side dcci+ack cost from the + * AICPU→AICore NoC propagation. AICore captures receive_time right after + * `read_reg(DATA_MAIN_BASE)` returns the new task_id (before dcci+ack), and + * start_time after them. Host derives: + * - receive_time = start_time - receive_to_start_cycles + * - propagation = receive_time - dispatch_ts (AICPU view) + * - local_setup = receive_to_start_cycles (dcci + ack) + * Delta fits in 32 bits at any platform clock (1 GHz @ 32-bit ≈ 4.3 s). */ struct L2SwimlaneAicoreTaskRecord { - uint64_t start_time; // Task start timestamp (get_sys_cnt) - uint64_t end_time; // Task end timestamp - uint64_t task_token_raw; // PTO2TaskId::raw — identity (NOT join key) - uint32_t reg_task_id; // Per-core dispatch token — host join key vs AICPU stream - uint32_t _pad; + uint64_t start_time; // Post-dcci+ack timestamp (kernel begins next) + uint64_t end_time; // Post-kernel timestamp + uint64_t task_token_raw; // PTO2TaskId::raw — identity (NOT join key) + uint32_t reg_task_id; // Per-core dispatch token — host join key vs AICPU stream + uint32_t receive_to_start_cycles; // start_time - receive_time (AICore-local dcci + ack cost) } __attribute__((aligned(32))); static_assert(sizeof(L2SwimlaneAicoreTaskRecord) == 32, "L2SwimlaneAicoreTaskRecord must be 32B"); diff --git a/src/a5/platform/shared/host/l2_swimlane_collector.cpp b/src/a5/platform/shared/host/l2_swimlane_collector.cpp index e07ad9ba0..b2dba1a85 100644 --- a/src/a5/platform/shared/host/l2_swimlane_collector.cpp +++ b/src/a5/platform/shared/host/l2_swimlane_collector.cpp @@ -764,7 +764,7 @@ int L2SwimlaneCollector::export_swimlane_json() { // file size). Column order is documented in the schema comment at the top // of swimlane_converter.py's v2 reader. // - // aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles] + // aicore_tasks: [core_id, task_token_raw, reg_task_id, start_cycles, end_cycles, receive_to_start_cycles] // aicpu_tasks: [core_id, reg_task_id, dispatch_cycles, finish_cycles] { // copy_aicore_buffer already drops r.start_time == 0 slots when @@ -776,7 +776,7 @@ int L2SwimlaneCollector::export_swimlane_json() { for (const auto &r : collected_aicore_records_[core_idx]) { if (!first) outfile << ","; outfile << "\n [" << core_idx << ", " << r.task_token_raw << ", " << r.reg_task_id << ", " - << r.start_time << ", " << r.end_time << "]"; + << r.start_time << ", " << r.end_time << ", " << r.receive_to_start_cycles << "]"; first = false; total++; } diff --git a/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp b/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp index 76dfb97e4..65b30fb50 100644 --- a/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp +++ b/src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp @@ -87,6 +87,16 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in } { + // receive_time captures the instant DATA_MAIN_BASE returned a new + // task_id, BEFORE the ack write. Paired with start_time (captured + // after task_ptr resolve) it lets DFX split head_OH into the + // AICPU→AICore NoC propagation (dispatch_ts → receive_time, + // hardware-bound) and the AICore-local ack + task_ptr resolve + // (receive_time → start_time). host_build_graph has no per-task + // dcci so the local-setup span is naturally tighter than the + // tensormap_and_ringbuffer runtime; the field still records it. + uint64_t receive_time = get_sys_cnt_aicore(); + uint32_t actual_task_id = task_id; write_reg(RegId::COND, MAKE_ACK_VALUE(actual_task_id)); @@ -121,7 +131,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in uint64_t end_time = get_sys_cnt_aicore(); l2_swimlane_aicore_record_task( l2_swimlane_head, &l2_swimlane_local, static_cast(actual_task_id), - static_cast(actual_task_id), start_time, end_time + static_cast(actual_task_id), receive_time, start_time, end_time ); } diff --git a/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp b/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp index c252dd5ac..21e79b3ed 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp @@ -130,6 +130,15 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in } { + // receive_time is captured the instant DATA_MAIN_BASE returned a + // new task_id, BEFORE the per-task dcci + ack pair. Paired with + // start_time (captured after dcci + ack) it lets DFX split head_OH + // into the AICPU→AICore NoC propagation (dispatch_ts → receive_time, + // hardware-bound) and the AICore-local dcci+ack cost + // (receive_time → start_time, software-tunable). Stored in the + // record as a 32-bit delta `start_time - receive_time`. + uint64_t receive_time = get_sys_cnt_aicore(); + uint32_t task_id = reg_val; // Decode: register holds task_id directly // First-task lazy resolve of the rotation channel. @@ -172,7 +181,7 @@ __aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime *runtime, in uint64_t end_time = get_sys_cnt_aicore(); uint64_t task_token_raw = exec_payload->local_context.async_ctx.task_token.raw; l2_swimlane_aicore_record_task( - l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, start_time, end_time + l2_swimlane_head, &l2_swimlane_local, task_token_raw, task_id, receive_time, start_time, end_time ); } diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md index 76f73da1b..e34c46ed8 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md +++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md @@ -501,7 +501,7 @@ Each scheduler thread runs a tight loop with two main phases: **Phase 1 — Completion Handling**: - Poll register `COND` on each managed core -- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_mixed_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark +- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, trigger `on_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark **Phase 2 — Dispatch**: diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md index fabdde65f..8cba7e90c 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md +++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md @@ -128,7 +128,7 @@ Queueing key is normalized resource shape (not raw slot label). 1. Fanin release/readiness remains dependency-correct and graph-level. 2. Two-stage completion: - `on_subtask_complete(task_id, subslot)` - - `on_mixed_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks` + - `on_task_complete(task_id)` only when `completed_subtasks == total_required_subtasks` 3. Downstream release is triggered once per mixed task completion, not once per subslot. ## 9. Executor Ownership and Numbering diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md index f2bd0aaf6..af661d440 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md +++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md @@ -109,7 +109,7 @@ The scheduler loop runs four phases each iteration. Each phase's time is accumul | Phase | What it does | Inline stats | | ----- | ------------ | ------------ | -| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_mixed_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release | +| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to increment the completion counter; when `completed_subtasks == total_required_subtasks`, triggers `on_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release | | **scan** | Updates the perf profiling header with latest scheduler state | — | | **dispatch** | For each idle core, pops a task from the shape-based ready queue via `get_ready_task(shape)`, builds the dispatch payload, and writes the task to the core's handshake register | `pop`: `hit` = successful pops (task dispatched), `miss` = empty queue pops, `hit_rate` = hit/(hit+miss) | | **idle** | Scheduler loop iteration where no progress was made (no completions, no dispatches) | — | diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h b/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h index 5c07462ea..f914bfddf 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h @@ -39,7 +39,7 @@ static_assert( // observation flattened from a DeferredCompletionEntry. TASK_NORMAL_DONE // carries the slot_state pointer in `addr` so the consumer can finalize the // AsyncWaitEntry.slot_state binding for tasks whose conditions arrived -// before the FIN thread saw mixed_complete. New kinds may be added in future +// before the FIN thread saw task_complete. New kinds may be added in future // without growing the message — the `_pad[5]` slack is reserved for // kind-specific payload extension. #define MSG_KIND_CONDITION 0u diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index e95234757..f3e5359ee 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -894,7 +894,7 @@ TaskOutputTensors PTO2OrchestratorState::alloc_tensors(const Arg &args) { if (prepared.slot_state != nullptr) { // Hidden alloc tasks complete inline in the orchestrator before any // consumer can exist, so they have no fanout to notify and no worker - // subtasks to retire. Running the full on_mixed_task_complete path + // subtasks to retire. Running the full on_task_complete path // would only pay unnecessary fanout_lock / traversal overhead here. // The generic slot initialization done in prepare_task() is still // required so scope_end can release the producer-side reference and diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h index 67ce8afa6..e5123fb2a 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h @@ -986,7 +986,7 @@ struct PTO2SchedulerState { #else void #endif - on_mixed_task_complete( + on_task_complete( PTO2TaskSlotState &slot_state, #if PTO2_SCHED_PROFILING int thread_idx, @@ -1124,9 +1124,9 @@ struct PTO2SchedulerState { inline bool AsyncWaitList::try_inline_complete_locked(AsyncWaitList::DrainCompletionSink &sink, PTO2TaskSlotState &slot_state) { #if PTO2_SCHED_PROFILING - sink.sched->on_mixed_task_complete(slot_state, sink.thread_idx, sink.local_bufs); + sink.sched->on_task_complete(slot_state, sink.thread_idx, sink.local_bufs); #else - sink.sched->on_mixed_task_complete(slot_state, sink.local_bufs); + sink.sched->on_task_complete(slot_state, sink.local_bufs); #endif if (*sink.deferred_release_count >= sink.deferred_release_capacity) { while (*sink.deferred_release_count > 0) { @@ -1204,9 +1204,9 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete( if (entry.normal_done && entry.waiting_completion_count <= 0) { #if PTO2_SCHED_PROFILING - sched->on_mixed_task_complete(*entry.slot_state, thread_idx, local_bufs); + sched->on_task_complete(*entry.slot_state, thread_idx, local_bufs); #else - sched->on_mixed_task_complete(*entry.slot_state, local_bufs); + sched->on_task_complete(*entry.slot_state, local_bufs); #endif if (deferred_release_count >= deferred_release_capacity) { while (deferred_release_count > 0) { @@ -1236,7 +1236,7 @@ inline AsyncPollResult AsyncWaitList::poll_and_complete( #if PTO2_SCHED_PROFILING struct PTO2SchedProfilingData { - // Sub-phase cycle breakdown within on_mixed_task_complete + // Sub-phase cycle breakdown within on_task_complete uint64_t lock_cycle; // lock_fanout + state store + unlock uint64_t fanout_cycle; // fanout traversal uint64_t fanin_cycle; // fanin traversal diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp index ca0854716..5893e5013 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp @@ -124,9 +124,9 @@ void SchedulerContext::complete_slot_task( } } - bool mixed_complete = sched_->on_subtask_complete(slot_state); + bool task_complete = sched_->on_subtask_complete(slot_state); - if (mixed_complete && slot_state.payload != nullptr && + if (task_complete && slot_state.payload != nullptr && slot_state.any_subtask_deferred.load(std::memory_order_acquire)) { while (!mailbox->try_push_normal_done(slot_state.task->task_id, reinterpret_cast(&slot_state))) { sched_->async_wait_list.mpsc_skipped_count.fetch_add(1, std::memory_order_relaxed); @@ -135,7 +135,7 @@ void SchedulerContext::complete_slot_task( defer_completion_to_consumer = true; } - if (mixed_complete && !defer_completion_to_consumer) { + if (task_complete && !defer_completion_to_consumer) { #if PTO2_PROFILING if (is_dump_tensor_enabled()) { dump_tensors_for_task( @@ -153,9 +153,9 @@ void SchedulerContext::complete_slot_task( // SCHED_PROFILING variant takes thread_idx for its per-thread atomic // counter side-effects (g_sched_*_atomic_count[thread_idx], consumed // by the otc_* log lines). Its return value is unused. - (void)sched_->on_mixed_task_complete(slot_state, thread_idx, local_bufs); + (void)sched_->on_task_complete(slot_state, thread_idx, local_bufs); #else - sched_->on_mixed_task_complete(slot_state, local_bufs); + sched_->on_task_complete(slot_state, local_bufs); #endif #if PTO2_PROFILING l2_swimlane.phase_complete_count++; diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp index f4335c5ac..cd252c9ae 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp @@ -750,9 +750,9 @@ int32_t SchedulerContext::resolve_and_dispatch(Runtime *runtime, int32_t thread_ for (int di = 0; di < dummy_got; di++) { PTO2TaskSlotState &dummy_slot = *dummy_batch[di]; #if PTO2_SCHED_PROFILING - sched_->on_mixed_task_complete(dummy_slot, thread_idx, local_bufs); + sched_->on_task_complete(dummy_slot, thread_idx, local_bufs); #else - sched_->on_mixed_task_complete(dummy_slot, local_bufs); + sched_->on_task_complete(dummy_slot, local_bufs); #endif // Dummy tasks have no subtasks to retire and no fanout pre-conditions // beyond their own producers; release self-reference so the slot can diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py index 680ca1ef4..13efeadd2 100644 --- a/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py +++ b/tests/st/a2a3/tensormap_and_ringbuffer/dfx/l2_swimlane/_swimlane_validate.py @@ -31,7 +31,19 @@ from simpler_setup.scene_test import _outputs_dir, _sanitize_for_filename from simpler_setup.tools.swimlane_converter import read_perf_data -_REQUIRED_TASK_FIELDS = ("task_id", "func_id", "core_id", "core_type", "start_time_us", "end_time_us") +_REQUIRED_TASK_FIELDS = ( + "task_id", + "func_id", + "core_id", + "core_type", + "start_time_us", + "end_time_us", + # receive_time_us / local_setup_us are populated unconditionally by the + # AICore-side capture (v3 schema). propagation_us requires AICPU dispatch_ts + # and is therefore only present at level≥2 — not in this required-set. + "receive_time_us", + "local_setup_us", +) def validate_perf_artifact(case_label: str, *, expected_task_count: int | None = None) -> None: diff --git a/tests/ut/cpp/a2a3/test_wiring.cpp b/tests/ut/cpp/a2a3/test_wiring.cpp index 0d45762b2..9acb8aacf 100644 --- a/tests/ut/cpp/a2a3/test_wiring.cpp +++ b/tests/ut/cpp/a2a3/test_wiring.cpp @@ -13,7 +13,7 @@ * * 1. wire_task() — fanout wiring, early-finished detection, * fanin_count initialization, ready push - * 2. on_mixed_task_complete() — COMPLETED transition, fanout traversal, + * 2. on_task_complete() — COMPLETED transition, fanout traversal, * consumer fanin release * 3. on_task_release() — fanin traversal, producer release, * self-CONSUMED check @@ -241,7 +241,7 @@ TEST_F(WiringTest, WireTaskMixedProducerStates) { } // ============================================================================= -// on_mixed_task_complete: notifies consumers via fanout chain +// on_task_complete: notifies consumers via fanout chain // ============================================================================= TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) { @@ -274,7 +274,7 @@ TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) { dep_entries[1].next = &dep_entries[0]; producer.fanout_head = &dep_entries[1]; - sched.on_mixed_task_complete(producer); + sched.on_task_complete(producer); // Producer should be COMPLETED EXPECT_EQ(producer.task_state.load(), PTO2_TASK_COMPLETED); diff --git a/tests/ut/cpp/a5/test_wiring.cpp b/tests/ut/cpp/a5/test_wiring.cpp index faffc6a9d..6c8904f81 100644 --- a/tests/ut/cpp/a5/test_wiring.cpp +++ b/tests/ut/cpp/a5/test_wiring.cpp @@ -13,7 +13,7 @@ * * 1. wire_task() — fanout wiring, early-finished detection, * fanin_count initialization, ready push - * 2. on_mixed_task_complete() — COMPLETED transition, fanout traversal, + * 2. on_task_complete() — COMPLETED transition, fanout traversal, * consumer fanin release * 3. on_task_release() — fanin traversal, producer release, * self-CONSUMED check @@ -231,7 +231,7 @@ TEST_F(WiringTest, WireTaskMixedProducerStates) { } // ============================================================================= -// on_mixed_task_complete: notifies consumers via fanout chain +// on_task_complete: notifies consumers via fanout chain // ============================================================================= TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) { @@ -264,7 +264,7 @@ TEST_F(WiringTest, OnMixedTaskCompleteNotifiesConsumers) { dep_entries[1].next = &dep_entries[0]; producer.fanout_head = &dep_entries[1]; - sched.on_mixed_task_complete(producer); + sched.on_task_complete(producer); // Producer should be COMPLETED EXPECT_EQ(producer.task_state.load(), PTO2_TASK_COMPLETED);