[2.8] Fix _LogTailProducer EXECUTION_EXCEPTION ERROR at job startup by nvidianz · Pull Request #4558 · NVIDIA/NVFlare

nvidianz · 2026-05-08T14:39:24Z

Summary

At every job start on 2.8, the per-site job log records a misleading line:

_LogTailProducer - ERROR - error from target server: EXECUTION_EXCEPTION

It is a symptom of a cross-process race introduced by the live-log-streaming feature in #4454: the client's JobLogStreamer._on_job_started fires on START_RUN and immediately sends its first chunk, but JobLogReceiver._register also fires on START_RUN in a separate server-side job subprocess. If the client wins the race, the server's stream_runner has no factory for channel='log_streaming' topic='live_log' registered yet and replies EXECUTION_EXCEPTION.

The base BaseChunkProducer.process_replies treats any non-OK reply as fatal: it logs ERROR and returns False, which stream_runner (nvflare/private/stream_runner.py:461-475) then interprets as end-of-stream — so the entire job's live-log stream is torn down on the very first chunk. The user-visible ERROR line is only one symptom; the silent loss of every subsequent log byte is the other.

This breaks any CI/QA pipeline that asserts per-site logs contain no ERROR / Traceback / Exception strings (the bug reports it turning 12/17 multi-study container testcases into failures), and it silently disables the live-log feature for affected jobs.

Fix

Bootstrap-tolerant _LogTailProducer (primary fix) — override process_replies so that until the producer has seen the first OK reply, a uniform EXECUTION_EXCEPTION is treated as a transient miss: the file offset is rolled back to before the failed read so the same bytes will be re-emitted on the next produce(), the failure is logged at debug rather than ERROR, and None is returned to keep the stream alive. After the first OK reply, falls through to the base behavior so genuine receiver failures still surface as ERROR and end the stream. Heartbeats don't move the file pointer, so no seek is attempted for them.
Earlier receiver registration (defense in depth) — JobLogReceiver now registers on ABOUT_TO_START_RUN instead of START_RUN, so in the common case the handler is wired up before the client's first chunk arrives. Cross-process timing means this alone can't fully close the race, hence (1) is the real fix.

Why this shape

Surgical: changes are scoped to _LogTailProducer and JobLogReceiver. BaseChunkProducer and other streamers (FileStreamer, TensorProducer, etc.) are untouched, so genuine EXECUTION_EXCEPTION errors still surface where they should.
Recovers lost bytes via seek instead of just hiding the symptom, so the live-log feature actually works on affected jobs.
Race-free regardless of cross-process event ordering — no reliance on SP_JOB / parent / future deploy modes happening in any particular sequence.
Explicit "first send only" gate via _first_ok_received — no risk of swallowing real receiver crashes mid-job.

Behavior matrix

Scenario	Before	After
First chunk OK	OK	OK
First chunk transient `EXECUTION_EXCEPTION`, then OK	ERROR logged, stream torn down, rest of log lost	bytes re-emitted, stream continues, no ERROR
`EXECUTION_EXCEPTION` after first OK reply	ERROR logged, stream ends	unchanged (ERROR logged, stream ends)
Persistent `EXECUTION_EXCEPTION` (real receiver failure)	ERROR on first chunk, stream ends	second `EXECUTION_EXCEPTION` (post-first-OK gate flips) → ERROR, stream ends

Tests

New tests in tests/unit_test/app_common/streamers/log_streamer_test.py:

test_first_chunk_execution_exception_is_transient_and_retries — bootstrap miss rolls offset back, no ERROR logged, bytes re-emitted on next produce().
test_first_chunk_execution_exception_on_heartbeat_does_not_seek — heartbeats don't trigger seek (last_was_data=False).
test_execution_exception_after_first_ok_still_ends_stream — steady-state strict failure preserved.
test_first_chunk_ok_flips_strict_mode — first OK reply flips the producer into strict mode.

Verified locally

pytest tests/unit_test/app_common/streamers tests/unit_test/app_common/logging -v → 43 passed (Python 3.12.10, pytest 8.3.4) — 4 new + 39 existing, no regressions.
black, isort, flake8 clean on all changed files.

Test plan

CI: license + style + full unit test suite green on the 2.8 base.
Manual: reproduce the bug's POC repro on 2.8 head, confirm grep ERROR /tmp/nvflare/poc/example_project/prod_00/site-1/$JOB_ID/log.txt returns no lines after the fix.

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running unit tests for app_common/streamers/ and app_common/logging/.
In-line docstrings updated.
Documentation updated.

🤖 Generated with Claude Code

NVBugs 6140553. At every job start, the per-site log records: _LogTailProducer - ERROR - error from target server: EXECUTION_EXCEPTION The misleading line is a symptom of a cross-process race: the client's JobLogStreamer fires its first chunk on START_RUN immediately, but the server's JobLogReceiver also registers its stream handler on START_RUN in a separate (server-side) job subprocess. If the client wins the race, the server-side stream_runner has no factory for channel='log_streaming' topic='live_log' yet and replies EXECUTION_EXCEPTION. The base BaseChunkProducer.process_replies treats any non-OK reply as a fatal error: it logs ERROR and returns False, which stream_runner then interprets as end-of-stream — so the entire job's live log streaming is torn down on the very first chunk. The user-visible ERROR line is one symptom; the silent loss of every subsequent log byte is the other. Fix: 1. Bootstrap-tolerant _LogTailProducer (primary fix). Override process_replies so that until we have seen the first OK reply, a uniform EXECUTION_EXCEPTION is treated as a transient miss: the file offset is rolled back to before the failed read so the same bytes will be re-emitted on the next produce(), and the failure is logged at debug rather than ERROR, with None returned to keep the stream alive. After the first OK reply, fall through to the base behavior so genuine receiver failures still surface as ERROR and tear the stream down. Heartbeats don't move the file pointer, so no seek is attempted for them. 2. Earlier receiver registration (defense in depth). JobLogReceiver now registers on ABOUT_TO_START_RUN instead of START_RUN, so in the common case the handler is wired up before the client's first chunk even arrives. Cross-process timing means this alone can't fully close the race, hence (1) is the real fix. Tests: - New: first chunk EXECUTION_EXCEPTION rolls the file offset back and re-emits the same bytes on the next produce() (no data lost, no ERROR logged, stream stays alive). - New: first chunk EXECUTION_EXCEPTION on a heartbeat doesn't attempt a seek (last_was_data=False path). - New: EXECUTION_EXCEPTION after the first OK reply still ends the stream (steady-state strict behavior preserved). - New: an OK reply flips the producer into strict mode so the next failure is no longer treated as a bootstrap miss. - 39 pre-existing tests under tests/unit_test/app_common/{streamers,logging}/ continue to pass unchanged. Verified locally: - pytest tests/unit_test/app_common/streamers tests/unit_test/app_common/logging -> 43 passed (Python 3.12.10, pytest 8.3.4). - black, isort, flake8 clean on all changed files.

greptile-apps · 2026-05-08T14:42:15Z

Greptile Summary

This PR fixes a cross-process startup race in the live-log-streaming feature where JobLogReceiver could receive the client's first chunk before its stream handler was registered, causing a misleading EXECUTION_EXCEPTION ERROR log line and silently tearing down the log stream for the entire job.

JobLogReceiver now registers its handler on ABOUT_TO_START_RUN instead of START_RUN to narrow the race window in the common case.
_LogTailProducer overrides process_replies to treat a uniform EXECUTION_EXCEPTION as transient until the first OK reply is seen, rolling the file offset back so the same bytes are re-emitted on the next produce() rather than lost.
Four new unit tests cover the bootstrap-tolerance paths; however, the case where EXECUTION_EXCEPTION is returned persistently and no OK is ever received is not tested \u2014 in that scenario the stream retries indefinitely rather than surfacing an error.

Confidence Score: 3/5

Safe for the targeted race fix, but a truly misconfigured or permanently broken receiver will silently retry forever instead of emitting an error that would alert an operator.

The primary fix correctly handles the transient race and the ABOUT_TO_START_RUN change is straightforwardly safe. The gap is that _first_ok_received only flips on the first OK reply: if the receiver never replies OK due to a broken config or wrong channel/topic, the transient branch always matches and the stream retries indefinitely. The behavior matrix in the PR description omits this persistent-failure-with-no-OK case entirely.

nvflare/app_common/streamers/log_streamer.py — the unbounded retry path in process_replies needs a miss counter or similar guard before merging.

Important Files Changed

Filename	Overview
nvflare/app_common/streamers/log_streamer.py	Adds bootstrap-tolerant process_replies override to _LogTailProducer; the transient-miss window has no upper bound, so a genuinely broken receiver is silently retried forever rather than surfaced as an error.
nvflare/app_common/logging/job_log_receiver.py	Switches job-level handler registration from START_RUN to ABOUT_TO_START_RUN to narrow the cross-process race window; change is correct and low-risk.
tests/unit_test/app_common/streamers/log_streamer_test.py	Adds four targeted unit tests for the new bootstrap-tolerance logic; the persistent-failure-with-no-OK-ever scenario is not covered by any of the new tests.

Sequence Diagram

sequenceDiagram
    participant C as Client (JobLogStreamer)
    participant S as Server (JobLogReceiver)

    Note over C,S: Before fix — START_RUN race
    C->>S: START_RUN fires → first chunk sent
    Note over S: Handler not yet registered
    S-->>C: EXECUTION_EXCEPTION
    C->>C: BaseChunkProducer: log ERROR, return False
    Note over C,S: Stream torn down, log bytes lost

    Note over C,S: After fix — ABOUT_TO_START_RUN + bootstrap tolerance
    S->>S: ABOUT_TO_START_RUN → register handler (earlier)
    C->>S: START_RUN fires → first chunk sent
    alt Handler already registered (common case)
        S-->>C: ReturnCode.OK
        C->>C: "_first_ok_received = True (strict mode)"
    else Handler still racing (rare)
        S-->>C: EXECUTION_EXCEPTION
        C->>C: process_replies: seek back, return None (keep alive)
        C->>S: retry same chunk
        S-->>C: ReturnCode.OK (handler now ready)
        C->>C: "_first_ok_received = True (strict mode)"
    end
    loop Subsequent chunks
        C->>S: data chunk
        S-->>C: ReturnCode.OK
    end
    C->>S: EOF

_{Reviews (1): Last reviewed commit: "[2.8] Fix _LogTailProducer EXECUTION_EXC..." | Re-trigger Greptile}

greptile-apps · 2026-05-08T14:42:18Z

+        if not self._first_ok_received and replies:
+            transient = all(
+                reply.get_return_code(ReturnCode.OK) == ReturnCode.EXECUTION_EXCEPTION for reply in replies.values()
+            )
+            if transient:
+                if self._last_was_data and self.file is not None and self._last_data_offset is not None:
+                    try:
+                        self.file.seek(self._last_data_offset)
+                    except OSError as e:
+                        self.logger.warning(f"could not seek back after bootstrap miss: {e}")
+                self.logger.debug(
+                    f"transient bootstrap miss on first chunk to {list(replies)}; rolling back and retrying"
+                )
+                return None  # keep producing — receiver should be ready by next send


Persistent receiver failure silently retries forever

_first_ok_received flips to True only on the first OK reply. If the receiver is genuinely broken (misconfigured, wrong channel/topic, never registers), the gate never flips: every EXECUTION_EXCEPTION passes the not self._first_ok_received and transient check and returns None indefinitely, keeping the stream alive and re-seeking the same bytes in an infinite loop. The pre-fix behavior — ERROR on the first non-OK reply, stream ends — would at least surface the failure. The PR's behavior matrix documents "second EXECUTION_EXCEPTION (post-first-OK gate flips) → ERROR, stream ends", but that path requires seeing an OK first; a truly persistent failure with no OK is unhandled.

A simple bound (e.g., _bootstrap_miss_count capped at a small value, say 10) would preserve the retry benefit for transient races while ensuring a genuine receiver failure eventually surfaces as an error.

greptile-apps · 2026-05-08T14:42:19Z

+        # Tolerate exactly one bootstrap miss at job startup. The receiver
+        # registers its handler on START_RUN in the server's job subprocess;


Stale comment: the receiver was changed in this same PR to register on ABOUT_TO_START_RUN, not START_RUN. Leaving the old event name here will mislead future readers trying to understand the race condition.

Suggested change

# Tolerate exactly one bootstrap miss at job startup. The receiver

# registers its handler on START_RUN in the server's job subprocess;

# Tolerate exactly one bootstrap miss at job startup. The receiver

# registers its handler on ABOUT_TO_START_RUN in the server's job subprocess;

nvidianz requested review from IsaacYangSLA, YuanTingHsieh and pcnudde May 8, 2026 14:40

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2.8] Fix _LogTailProducer EXECUTION_EXCEPTION ERROR at job startup#4558

[2.8] Fix _LogTailProducer EXECUTION_EXCEPTION ERROR at job startup#4558
nvidianz wants to merge 1 commit intoNVIDIA:2.8from
nvidianz:2.8-fix-logtail-bootstrap-race

nvidianz commented May 8, 2026

Uh oh!

greptile-apps Bot commented May 8, 2026

Uh oh!

greptile-apps Bot May 8, 2026

Uh oh!

greptile-apps Bot May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		# Tolerate exactly one bootstrap miss at job startup. The receiver
		# registers its handler on START_RUN in the server's job subprocess;

Conversation

nvidianz commented May 8, 2026

Summary

Fix

Why this shape

Behavior matrix

Tests

Verified locally

Test plan

Types of changes

Uh oh!

greptile-apps Bot commented May 8, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant