
[2.8] Fix _LogTailProducer EXECUTION_EXCEPTION ERROR at job startup#4558

Open
nvidianz wants to merge 1 commit into NVIDIA:2.8 from nvidianz:2.8-fix-logtail-bootstrap-race

Conversation

@nvidianz
Collaborator

@nvidianz nvidianz commented May 8, 2026

Summary

At every job start on 2.8, the per-site job log records a misleading line:

_LogTailProducer - ERROR - error from target server: EXECUTION_EXCEPTION

It is a symptom of a cross-process race introduced by the live-log-streaming feature in #4454: the client's JobLogStreamer._on_job_started fires on START_RUN and immediately sends its first chunk, but JobLogReceiver._register also fires on START_RUN in a separate server-side job subprocess. If the client wins the race, the server's stream_runner has no factory for channel='log_streaming' topic='live_log' registered yet and replies EXECUTION_EXCEPTION.

The base BaseChunkProducer.process_replies treats any non-OK reply as fatal: it logs ERROR and returns False, which stream_runner (nvflare/private/stream_runner.py:461-475) then interprets as end-of-stream — so the entire job's live-log stream is torn down on the very first chunk. The user-visible ERROR line is only one symptom; the silent loss of every subsequent log byte is the other.

This breaks any CI/QA pipeline that asserts per-site logs contain no ERROR / Traceback / Exception strings (the bug report cites 12 of 17 multi-study container testcases turning into failures), and it silently disables the live-log feature for affected jobs.

Fix

  1. Bootstrap-tolerant _LogTailProducer (primary fix) — override process_replies so that until the producer has seen the first OK reply, a uniform EXECUTION_EXCEPTION is treated as a transient miss: the file offset is rolled back to before the failed read so the same bytes will be re-emitted on the next produce(), the failure is logged at debug rather than ERROR, and None is returned to keep the stream alive. After the first OK reply, falls through to the base behavior so genuine receiver failures still surface as ERROR and end the stream. Heartbeats don't move the file pointer, so no seek is attempted for them.
  2. Earlier receiver registration (defense in depth) — JobLogReceiver now registers on ABOUT_TO_START_RUN instead of START_RUN, so in the common case the handler is wired up before the client's first chunk arrives. Cross-process timing means this alone can't fully close the race, hence (1) is the real fix.

Why this shape

  • Surgical: changes are scoped to _LogTailProducer and JobLogReceiver. BaseChunkProducer and other streamers (FileStreamer, TensorProducer, etc.) are untouched, so genuine EXECUTION_EXCEPTION errors still surface where they should.
  • Recovers lost bytes via seek instead of just hiding the symptom, so the live-log feature actually works on affected jobs.
  • Race-free regardless of cross-process event ordering — no reliance on SP_JOB / parent / future deploy modes happening in any particular sequence.
  • Explicit "first send only" gate via _first_ok_received — no risk of swallowing real receiver crashes mid-job.

Behavior matrix

| Scenario | Before | After |
| --- | --- | --- |
| First chunk OK | OK | OK |
| First chunk transient EXECUTION_EXCEPTION, then OK | ERROR logged, stream torn down, rest of log lost | Bytes re-emitted, stream continues, no ERROR |
| EXECUTION_EXCEPTION after first OK reply | ERROR logged, stream ends | Unchanged (ERROR logged, stream ends) |
| Persistent EXECUTION_EXCEPTION (real receiver failure) | ERROR on first chunk, stream ends | Second EXECUTION_EXCEPTION (post-first-OK gate flips) → ERROR, stream ends |

Tests

New tests in tests/unit_test/app_common/streamers/log_streamer_test.py:

  • test_first_chunk_execution_exception_is_transient_and_retries — bootstrap miss rolls offset back, no ERROR logged, bytes re-emitted on next produce().
  • test_first_chunk_execution_exception_on_heartbeat_does_not_seek — heartbeats don't trigger seek (last_was_data=False).
  • test_execution_exception_after_first_ok_still_ends_stream — steady-state strict failure preserved.
  • test_first_chunk_ok_flips_strict_mode — first OK reply flips the producer into strict mode.
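The first test's core assertion (the same bytes are re-emitted after a bootstrap miss) can be illustrated with a tiny in-memory stand-in; `FakeTailProducer` and its `first_ok` flag are hypothetical simplifications, the real tests live in tests/unit_test/app_common/streamers/log_streamer_test.py.

```python
# Hypothetical, simplified shape of
# test_first_chunk_execution_exception_is_transient_and_retries,
# using an in-memory buffer instead of a real log file.
import io


class FakeTailProducer:
    """Tails a log buffer in chunks; rolls the offset back on a bootstrap miss."""

    def __init__(self, data: bytes):
        self.file = io.BytesIO(data)
        self._first_ok_received = False
        self._last_data_offset = 0

    def produce(self, size: int = 8) -> bytes:
        self._last_data_offset = self.file.tell()  # remember the pre-read offset
        return self.file.read(size)

    def process_replies(self, first_ok: bool):
        if not first_ok and not self._first_ok_received:
            self.file.seek(self._last_data_offset)  # re-emit the same bytes next time
            return None  # transient bootstrap miss: keep the stream alive
        self._first_ok_received = self._first_ok_received or first_ok
        return None if first_ok else False  # strict after the first OK


def test_transient_first_miss_re_emits_same_bytes():
    p = FakeTailProducer(b"hello world log line\n")
    first = p.produce()
    assert p.process_replies(first_ok=False) is None  # miss: rolled back, no error
    assert p.produce() == first  # same bytes re-emitted, nothing lost
```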

Verified locally

  • pytest tests/unit_test/app_common/streamers tests/unit_test/app_common/logging -v → 43 passed (Python 3.12.10, pytest 8.3.4) — 4 new + 39 existing, no regressions.
  • black, isort, flake8 clean on all changed files.

Test plan

  • CI: license + style + full unit test suite green on the 2.8 base.
  • Manual: reproduce the bug's POC repro on 2.8 head, confirm grep ERROR /tmp/nvflare/poc/example_project/prod_00/site-1/$JOB_ID/log.txt returns no lines after the fix.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running unit tests for app_common/streamers/ and app_common/logging/.
  • In-line docstrings updated.
  • Documentation updated.

🤖 Generated with Claude Code

NVBugs 6140553. At every job start, the per-site log records:

  _LogTailProducer - ERROR - error from target server: EXECUTION_EXCEPTION

The misleading line is a symptom of a cross-process race: the client's
JobLogStreamer fires its first chunk on START_RUN immediately, but the
server's JobLogReceiver also registers its stream handler on START_RUN
in a separate (server-side) job subprocess. If the client wins the race,
the server-side stream_runner has no factory for channel='log_streaming'
topic='live_log' yet and replies EXECUTION_EXCEPTION.

The base BaseChunkProducer.process_replies treats any non-OK reply as a
fatal error: it logs ERROR and returns False, which stream_runner then
interprets as end-of-stream — so the entire job's live log streaming is
torn down on the very first chunk. The user-visible ERROR line is one
symptom; the silent loss of every subsequent log byte is the other.

Fix:

1. Bootstrap-tolerant _LogTailProducer (primary fix). Override
   process_replies so that until we have seen the first OK reply, a
   uniform EXECUTION_EXCEPTION is treated as a transient miss: the file
   offset is rolled back to before the failed read so the same bytes
   will be re-emitted on the next produce(), and the failure is logged
   at debug rather than ERROR, with None returned to keep the stream
   alive. After the first OK reply, fall through to the base behavior so
   genuine receiver failures still surface as ERROR and tear the stream
   down. Heartbeats don't move the file pointer, so no seek is attempted
   for them.

2. Earlier receiver registration (defense in depth). JobLogReceiver now
   registers on ABOUT_TO_START_RUN instead of START_RUN, so in the
   common case the handler is wired up before the client's first chunk
   even arrives. Cross-process timing means this alone can't fully
   close the race, hence (1) is the real fix.

Tests:

- New: first chunk EXECUTION_EXCEPTION rolls the file offset back and
  re-emits the same bytes on the next produce() (no data lost, no ERROR
  logged, stream stays alive).
- New: first chunk EXECUTION_EXCEPTION on a heartbeat doesn't attempt a
  seek (last_was_data=False path).
- New: EXECUTION_EXCEPTION after the first OK reply still ends the
  stream (steady-state strict behavior preserved).
- New: an OK reply flips the producer into strict mode so the next
  failure is no longer treated as a bootstrap miss.
- 39 pre-existing tests under tests/unit_test/app_common/{streamers,logging}/
  continue to pass unchanged.

Verified locally:
- pytest tests/unit_test/app_common/streamers tests/unit_test/app_common/logging
  -> 43 passed (Python 3.12.10, pytest 8.3.4).
- black, isort, flake8 clean on all changed files.
@greptile-apps
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR fixes a cross-process startup race in the live-log-streaming feature where JobLogReceiver could receive the client's first chunk before its stream handler was registered, causing a misleading EXECUTION_EXCEPTION ERROR log line and silently tearing down the log stream for the entire job.

  • JobLogReceiver now registers its handler on ABOUT_TO_START_RUN instead of START_RUN to narrow the race window in the common case.
  • _LogTailProducer overrides process_replies to treat a uniform EXECUTION_EXCEPTION as transient until the first OK reply is seen, rolling the file offset back so the same bytes are re-emitted on the next produce() rather than lost.
  • Four new unit tests cover the bootstrap-tolerance paths; however, the case where EXECUTION_EXCEPTION is returned persistently and no OK is ever received is not tested — in that scenario the stream retries indefinitely rather than surfacing an error.

Confidence Score: 3/5

Safe for the targeted race fix, but a truly misconfigured or permanently broken receiver will silently retry forever instead of emitting an error that would alert an operator.

The primary fix correctly handles the transient race and the ABOUT_TO_START_RUN change is straightforwardly safe. The gap is that _first_ok_received only flips on the first OK reply: if the receiver never replies OK due to a broken config or wrong channel/topic, the transient branch always matches and the stream retries indefinitely. The behavior matrix in the PR description omits this persistent-failure-with-no-OK case entirely.

nvflare/app_common/streamers/log_streamer.py — the unbounded retry path in process_replies needs a miss counter or similar guard before merging.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvflare/app_common/streamers/log_streamer.py | Adds bootstrap-tolerant process_replies override to _LogTailProducer; the transient-miss window has no upper bound, so a genuinely broken receiver is silently retried forever rather than surfaced as an error. |
| nvflare/app_common/logging/job_log_receiver.py | Switches job-level handler registration from START_RUN to ABOUT_TO_START_RUN to narrow the cross-process race window; change is correct and low-risk. |
| tests/unit_test/app_common/streamers/log_streamer_test.py | Adds four targeted unit tests for the new bootstrap-tolerance logic; the persistent-failure-with-no-OK-ever scenario is not covered by any of the new tests. |

Sequence Diagram

sequenceDiagram
    participant C as Client (JobLogStreamer)
    participant S as Server (JobLogReceiver)

    Note over C,S: Before fix — START_RUN race
    C->>S: START_RUN fires → first chunk sent
    Note over S: Handler not yet registered
    S-->>C: EXECUTION_EXCEPTION
    C->>C: BaseChunkProducer: log ERROR, return False
    Note over C,S: Stream torn down, log bytes lost

    Note over C,S: After fix — ABOUT_TO_START_RUN + bootstrap tolerance
    S->>S: ABOUT_TO_START_RUN → register handler (earlier)
    C->>S: START_RUN fires → first chunk sent
    alt Handler already registered (common case)
        S-->>C: ReturnCode.OK
        C->>C: "_first_ok_received = True (strict mode)"
    else Handler still racing (rare)
        S-->>C: EXECUTION_EXCEPTION
        C->>C: process_replies: seek back, return None (keep alive)
        C->>S: retry same chunk
        S-->>C: ReturnCode.OK (handler now ready)
        C->>C: "_first_ok_received = True (strict mode)"
    end
    loop Subsequent chunks
        C->>S: data chunk
        S-->>C: ReturnCode.OK
    end
    C->>S: EOF


Comment on lines +354 to +367
    if not self._first_ok_received and replies:
        transient = all(
            reply.get_return_code(ReturnCode.OK) == ReturnCode.EXECUTION_EXCEPTION for reply in replies.values()
        )
        if transient:
            if self._last_was_data and self.file is not None and self._last_data_offset is not None:
                try:
                    self.file.seek(self._last_data_offset)
                except OSError as e:
                    self.logger.warning(f"could not seek back after bootstrap miss: {e}")
            self.logger.debug(
                f"transient bootstrap miss on first chunk to {list(replies)}; rolling back and retrying"
            )
            return None  # keep producing — receiver should be ready by next send

P1 Persistent receiver failure silently retries forever

_first_ok_received flips to True only on the first OK reply. If the receiver is genuinely broken (misconfigured, wrong channel/topic, never registers), the gate never flips: every EXECUTION_EXCEPTION passes the not self._first_ok_received and transient check and returns None indefinitely, keeping the stream alive and re-seeking the same bytes in an infinite loop. The pre-fix behavior — ERROR on the first non-OK reply, stream ends — would at least surface the failure. The PR's behavior matrix documents "second EXECUTION_EXCEPTION (post-first-OK gate flips) → ERROR, stream ends", but that path requires seeing an OK first; a truly persistent failure with no OK is unhandled.

A simple bound (e.g., _bootstrap_miss_count capped at a small value, say 10) would preserve the retry benefit for transient races while ensuring a genuine receiver failure eventually surfaces as an error.
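A minimal sketch of the suggested bound, assuming the cap of 10 misses proposed above; the class and method names are illustrative only, not NVFlare API.

```python
# Hypothetical bounded variant of the bootstrap gate: cap transient
# misses so a permanently broken receiver eventually surfaces as an error.
MAX_BOOTSTRAP_MISSES = 10  # assumed cap, per the review suggestion


class BoundedBootstrapGate:
    def __init__(self):
        self._first_ok_received = False
        self._bootstrap_miss_count = 0

    def on_reply(self, all_execution_exception: bool) -> str:
        """Return 'retry', 'ok', or 'fail' for one batch of replies."""
        if all_execution_exception:
            if self._first_ok_received:
                return "fail"  # strict mode: real receiver failure
            self._bootstrap_miss_count += 1
            if self._bootstrap_miss_count > MAX_BOOTSTRAP_MISSES:
                return "fail"  # receiver never came up: surface the error
            return "retry"  # transient bootstrap miss: keep streaming
        self._first_ok_received = True
        return "ok"
```

This keeps the retry benefit for the startup race while guaranteeing a genuine receiver failure ends the stream within a bounded number of chunks.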

Comment on lines +344 to +345
# Tolerate exactly one bootstrap miss at job startup. The receiver
# registers its handler on START_RUN in the server's job subprocess;

P2 Stale comment: the receiver was changed in this same PR to register on ABOUT_TO_START_RUN, not START_RUN. Leaving the old event name here will mislead future readers trying to understand the race condition.

Suggested change:

    - # Tolerate exactly one bootstrap miss at job startup. The receiver
    - # registers its handler on START_RUN in the server's job subprocess;
    + # Tolerate exactly one bootstrap miss at job startup. The receiver
    + # registers its handler on ABOUT_TO_START_RUN in the server's job subprocess;
