[2.8] Fix _LogTailProducer EXECUTION_EXCEPTION ERROR at job startup#4558
[2.8] Fix _LogTailProducer EXECUTION_EXCEPTION ERROR at job startup#4558nvidianz wants to merge 1 commit intoNVIDIA:2.8from
Conversation
NVBugs 6140553. At every job start, the per-site log records:
_LogTailProducer - ERROR - error from target server: EXECUTION_EXCEPTION
The misleading line is a symptom of a cross-process race: the client's
JobLogStreamer fires its first chunk on START_RUN immediately, but the
server's JobLogReceiver also registers its stream handler on START_RUN
in a separate (server-side) job subprocess. If the client wins the race,
the server-side stream_runner has no factory for channel='log_streaming'
topic='live_log' yet and replies EXECUTION_EXCEPTION.
The base BaseChunkProducer.process_replies treats any non-OK reply as a
fatal error: it logs ERROR and returns False, which stream_runner then
interprets as end-of-stream — so the entire job's live log streaming is
torn down on the very first chunk. The user-visible ERROR line is one
symptom; the silent loss of every subsequent log byte is the other.
Fix:
1. Bootstrap-tolerant _LogTailProducer (primary fix). Override
process_replies so that until we have seen the first OK reply, a
uniform EXECUTION_EXCEPTION is treated as a transient miss: the file
offset is rolled back to before the failed read so the same bytes
will be re-emitted on the next produce(), and the failure is logged
at debug rather than ERROR, with None returned to keep the stream
alive. After the first OK reply, fall through to the base behavior so
genuine receiver failures still surface as ERROR and tear the stream
down. Heartbeats don't move the file pointer, so no seek is attempted
for them.
2. Earlier receiver registration (defense in depth). JobLogReceiver now
registers on ABOUT_TO_START_RUN instead of START_RUN, so in the
common case the handler is wired up before the client's first chunk
even arrives. Cross-process timing means this alone can't fully
close the race, hence (1) is the real fix.
Tests:
- New: first chunk EXECUTION_EXCEPTION rolls the file offset back and
re-emits the same bytes on the next produce() (no data lost, no ERROR
logged, stream stays alive).
- New: first chunk EXECUTION_EXCEPTION on a heartbeat doesn't attempt a
seek (last_was_data=False path).
- New: EXECUTION_EXCEPTION after the first OK reply still ends the
stream (steady-state strict behavior preserved).
- New: an OK reply flips the producer into strict mode so the next
failure is no longer treated as a bootstrap miss.
- 39 pre-existing tests under tests/unit_test/app_common/{streamers,logging}/
continue to pass unchanged.
Verified locally:
- pytest tests/unit_test/app_common/streamers tests/unit_test/app_common/logging
-> 43 passed (Python 3.12.10, pytest 8.3.4).
- black, isort, flake8 clean on all changed files.
Greptile SummaryThis PR fixes a cross-process startup race in the live-log-streaming feature where
Confidence Score: 3/5Safe for the targeted race fix, but a truly misconfigured or permanently broken receiver will silently retry forever instead of emitting an error that would alert an operator. The primary fix correctly handles the transient race and the ABOUT_TO_START_RUN change is straightforwardly safe. The gap is that _first_ok_received only flips on the first OK reply: if the receiver never replies OK due to a broken config or wrong channel/topic, the transient branch always matches and the stream retries indefinitely. The behavior matrix in the PR description omits this persistent-failure-with-no-OK case entirely. nvflare/app_common/streamers/log_streamer.py — the unbounded retry path in process_replies needs a miss counter or similar guard before merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant C as Client (JobLogStreamer)
participant S as Server (JobLogReceiver)
Note over C,S: Before fix — START_RUN race
C->>S: START_RUN fires → first chunk sent
Note over S: Handler not yet registered
S-->>C: EXECUTION_EXCEPTION
C->>C: BaseChunkProducer: log ERROR, return False
Note over C,S: Stream torn down, log bytes lost
Note over C,S: After fix — ABOUT_TO_START_RUN + bootstrap tolerance
S->>S: ABOUT_TO_START_RUN → register handler (earlier)
C->>S: START_RUN fires → first chunk sent
alt Handler already registered (common case)
S-->>C: ReturnCode.OK
C->>C: "_first_ok_received = True (strict mode)"
else Handler still racing (rare)
S-->>C: EXECUTION_EXCEPTION
C->>C: process_replies: seek back, return None (keep alive)
C->>S: retry same chunk
S-->>C: ReturnCode.OK (handler now ready)
C->>C: "_first_ok_received = True (strict mode)"
end
loop Subsequent chunks
C->>S: data chunk
S-->>C: ReturnCode.OK
end
C->>S: EOF
Reviews (1): Last reviewed commit: "[2.8] Fix _LogTailProducer EXECUTION_EXC..." | Re-trigger Greptile |
| if not self._first_ok_received and replies: | ||
| transient = all( | ||
| reply.get_return_code(ReturnCode.OK) == ReturnCode.EXECUTION_EXCEPTION for reply in replies.values() | ||
| ) | ||
| if transient: | ||
| if self._last_was_data and self.file is not None and self._last_data_offset is not None: | ||
| try: | ||
| self.file.seek(self._last_data_offset) | ||
| except OSError as e: | ||
| self.logger.warning(f"could not seek back after bootstrap miss: {e}") | ||
| self.logger.debug( | ||
| f"transient bootstrap miss on first chunk to {list(replies)}; rolling back and retrying" | ||
| ) | ||
| return None # keep producing — receiver should be ready by next send |
There was a problem hiding this comment.
Persistent receiver failure silently retries forever
_first_ok_received flips to True only on the first OK reply. If the receiver is genuinely broken (misconfigured, wrong channel/topic, never registers), the gate never flips: every EXECUTION_EXCEPTION passes the not self._first_ok_received and transient check and returns None indefinitely, keeping the stream alive and re-seeking the same bytes in an infinite loop. The pre-fix behavior — ERROR on the first non-OK reply, stream ends — would at least surface the failure. The PR's behavior matrix documents "second EXECUTION_EXCEPTION (post-first-OK gate flips) → ERROR, stream ends", but that path requires seeing an OK first; a truly persistent failure with no OK is unhandled.
A simple bound (e.g., _bootstrap_miss_count capped at a small value, say 10) would preserve the retry benefit for transient races while ensuring a genuine receiver failure eventually surfaces as an error.
| # Tolerate exactly one bootstrap miss at job startup. The receiver | ||
| # registers its handler on START_RUN in the server's job subprocess; |
There was a problem hiding this comment.
Stale comment: the receiver was changed in this same PR to register on
ABOUT_TO_START_RUN, not START_RUN. Leaving the old event name here will mislead future readers trying to understand the race condition.
| # Tolerate exactly one bootstrap miss at job startup. The receiver | |
| # registers its handler on START_RUN in the server's job subprocess; | |
| # Tolerate exactly one bootstrap miss at job startup. The receiver | |
| # registers its handler on ABOUT_TO_START_RUN in the server's job subprocess; |
Summary
At every job start on 2.8, the per-site job log records a misleading line:
It is a symptom of a cross-process race introduced by the live-log-streaming feature in #4454: the client's
JobLogStreamer._on_job_startedfires onSTART_RUNand immediately sends its first chunk, butJobLogReceiver._registeralso fires onSTART_RUNin a separate server-side job subprocess. If the client wins the race, the server'sstream_runnerhas no factory forchannel='log_streaming' topic='live_log'registered yet and repliesEXECUTION_EXCEPTION.The base
BaseChunkProducer.process_repliestreats any non-OK reply as fatal: it logsERRORand returnsFalse, whichstream_runner(nvflare/private/stream_runner.py:461-475) then interprets as end-of-stream — so the entire job's live-log stream is torn down on the very first chunk. The user-visible ERROR line is only one symptom; the silent loss of every subsequent log byte is the other.This breaks any CI/QA pipeline that asserts per-site logs contain no
ERROR/Traceback/Exceptionstrings (the bug reports it turning 12/17 multi-study container testcases into failures), and it silently disables the live-log feature for affected jobs.Fix
_LogTailProducer(primary fix) — overrideprocess_repliesso that until the producer has seen the first OK reply, a uniformEXECUTION_EXCEPTIONis treated as a transient miss: the file offset is rolled back to before the failed read so the same bytes will be re-emitted on the nextproduce(), the failure is logged at debug rather than ERROR, andNoneis returned to keep the stream alive. After the first OK reply, falls through to the base behavior so genuine receiver failures still surface as ERROR and end the stream. Heartbeats don't move the file pointer, so no seek is attempted for them.JobLogReceivernow registers onABOUT_TO_START_RUNinstead ofSTART_RUN, so in the common case the handler is wired up before the client's first chunk arrives. Cross-process timing means this alone can't fully close the race, hence (1) is the real fix.Why this shape
_LogTailProducerandJobLogReceiver.BaseChunkProducerand other streamers (FileStreamer,TensorProducer, etc.) are untouched, so genuineEXECUTION_EXCEPTIONerrors still surface where they should.seekinstead of just hiding the symptom, so the live-log feature actually works on affected jobs._first_ok_received— no risk of swallowing real receiver crashes mid-job.Behavior matrix
EXECUTION_EXCEPTION, then OKEXECUTION_EXCEPTIONafter first OK replyEXECUTION_EXCEPTION(real receiver failure)EXECUTION_EXCEPTION(post-first-OK gate flips) → ERROR, stream endsTests
New tests in
tests/unit_test/app_common/streamers/log_streamer_test.py:test_first_chunk_execution_exception_is_transient_and_retries— bootstrap miss rolls offset back, no ERROR logged, bytes re-emitted on nextproduce().test_first_chunk_execution_exception_on_heartbeat_does_not_seek— heartbeats don't trigger seek (last_was_data=False).test_execution_exception_after_first_ok_still_ends_stream— steady-state strict failure preserved.test_first_chunk_ok_flips_strict_mode— first OK reply flips the producer into strict mode.Verified locally
pytest tests/unit_test/app_common/streamers tests/unit_test/app_common/logging -v→ 43 passed (Python 3.12.10, pytest 8.3.4) — 4 new + 39 existing, no regressions.black,isort,flake8clean on all changed files.Test plan
grep ERROR /tmp/nvflare/poc/example_project/prod_00/site-1/$JOB_ID/log.txtreturns no lines after the fix.Types of changes
app_common/streamers/andapp_common/logging/.🤖 Generated with Claude Code