[2.8] Fix StorageException after every job from empty SystemLogStreamer context#4559
[2.8] Fix StorageException after every job from empty SystemLogStreamer context#4559nvidianz wants to merge 1 commit intoNVIDIA:2.8from
Conversation
…er context NVBugs 6151564 / FLARE-2921. After PR NVIDIA#4476 every successful federated job emits this error in the server log: ObjectStreamer - ERROR - exception from stream_done_cb dispatch_stream_done: StorageException: path /tmp/nvflare/jobs-storage/unknown is not a valid directory. Root cause: SystemLogStreamer._on_job_completed runs on CLIENT_PARENT and uploads the per-job error_log.txt snapshot to the server. It hands the upload thread a fresh context from `engine.new_context()`. CLIENT_PARENT serves multiple jobs over its lifetime, so its engine ctx_manager does NOT carry per-job RUN_NUM / IDENTITY_NAME as sticky props -- and the fresh context arrives at the receiver with an empty job_id. JobLogReceiver._get_trusted_stream_identity reads job_id via `peer_ctx.get_job_id(default="unknown")` from the sender-forwarded context. With the empty fresh context, peer_ctx.get_job_id() returns "" and the receiver falls back to the literal "unknown", then asks the job manager to save the log under `.../jobs-storage/unknown`, which fails because it isn't a real path. The bug was dormant before NVIDIA#4476 because StaticFileBuilder. _modify_system_log_streamer used to strip system_log_streamer from deployed client resources.json by default. NVIDIA#4476 removed that filter and changed the default to allow_log_streaming=True, so the dormant buggy code now runs after every job. Fix: In SystemLogStreamer._on_job_completed, explicitly seed RUN_NUM and IDENTITY_NAME on the fresh stream context before handing it to the upload thread. The receiver continues to read job identity from the sender-forwarded peer_ctx as before -- preserving the existing peer-context-as-trust-source security model and leaving JobLogReceiver._get_trusted_stream_identity unchanged. The pre-existing test test_job_log_receiver_uses_trusted_peer_identity_for_storage (which deliberately passes forged stream_ctx values to assert the receiver ignores them) continues to pass unchanged, confirming this fix doesn't open a path-traversal / job-spoofing surface. Why the sender-side fix instead of receiver-side: The bug's "Option A" (read job_id from stream_ctx in the receiver) was considered but rejected. stream_ctx is sender-supplied data, while peer_ctx is authenticated cellnet identity. Routing storage paths based on stream_ctx would let a malicious client spoof another job's id and write log content into that other job's storage directory. The existing _get_trusted_stream_identity name and security test make the trust boundary explicit, so the right place to fix the bug is the sender that handed off an empty context, not the receiver. Tests: - New: test_completed_upload_seeds_run_num_and_identity_on_fresh_context verifies the fresh context handed to the upload thread now carries the real job_id and client_name (regression for the StorageException). - Existing: test_system_log_streamer_uploads_completed_error_log_snapshot and the JobLogReceiver security test continue to pass unchanged. Verified locally: - pytest tests/unit_test/app_common/logging -v -> 13 passed (Python 3.12.10, pytest 8.3.4) - black, isort, flake8 clean on changed files.
Greptile SummaryFixes a
Confidence Score: 5/5Safe to merge — the change is a minimal two-line seed of well-known reserved keys on a fresh context, guarded so it only activates when the engine provides a separate context object, and the security test for receiver-side trust-boundary routing passes unchanged. The fix is targeted at a single root cause, clearly explained in code comments and the PR description. The new regression test faithfully mirrors the production empty-context precondition, and the existing security test continues to pass unmodified, confirming the trust boundary is preserved. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant EP as CLIENT_PARENT Engine
participant SLS as SystemLogStreamer
participant UL as Upload Thread
participant SRV as JobLogReceiver (Server)
EP->>SLS: JOB_COMPLETED event (fl_ctx)
SLS->>EP: engine.new_context()
EP-->>SLS: stream_fl_ctx (empty — no RUN_NUM/IDENTITY_NAME)
Note over SLS: FIX: seed stream_fl_ctx with RUN_NUM=job_id, IDENTITY_NAME=client_name
SLS->>UL: "Thread(target=_stream_completed_log, args=(stream_fl_ctx, ...))"
UL->>SRV: "LogStreamer.stream_log(fl_ctx=stream_fl_ctx, stream_ctx={job_id, client_name})"
Note over SRV: _get_trusted_stream_identity reads peer_ctx (cellnet-authenticated) → correct job path
SRV-->>UL: stream_done_cb (success)
Reviews (1): Last reviewed commit: "[2.8] Fix StorageException after every j..." | Re-trigger Greptile |
| engine = fl_ctx.get_engine() | ||
| stream_fl_ctx = engine.new_context() if engine else fl_ctx | ||
| if stream_fl_ctx is not fl_ctx: | ||
| stream_fl_ctx.put(key=ReservedKey.RUN_NUM, value=job_id, private=True, sticky=False) |
There was a problem hiding this comment.
AuxRunner forwards peer identity with fl_ctx.get_all_public_props(), and the receiver rebuilds peer_ctx from those public props. These new RUN_NUM and IDENTITY_NAME values are written with private=True, so they remain local to the upload thread context and are not included in the peer props sent to JobLogReceiver. The regression test only checks the local context, so it passes while the actual stream transport still sees missing or empty job identity. These should be public, matching FLContextManager.new_context() for job contexts, and the test should assert stream_fl_ctx.get_all_public_props() or exercise the aux/receiver path.
Summary
After PR #4476, every successful federated job emits this error in the server log:
(plus, in some cases, an earlier
no stream processing info registered for log_streaming:live_logline during the run.)Root cause
SystemLogStreamer._on_job_completedruns onCLIENT_PARENTand uploads the per-joberror_log.txtsnapshot to the server. It hands the upload thread a fresh context fromengine.new_context(). CLIENT_PARENT serves multiple jobs over its lifetime, so its engine ctx_manager does NOT carry per-jobRUN_NUM/IDENTITY_NAMEas sticky props — and the fresh context arrives at the receiver with an emptyjob_id.JobLogReceiver._get_trusted_stream_identityreadsjob_idviapeer_ctx.get_job_id(default="unknown")from the sender-forwarded context. With the empty fresh context,peer_ctx.get_job_id()returns""and the receiver falls back to the literal"unknown", then asks the job manager to save the log under.../jobs-storage/unknown, which fails because it isn't a real path.The bug was dormant before #4476:
StaticFileBuilder._modify_system_log_streamerused to stripsystem_log_streamerfrom deployed clientresources.jsonby default. #4476 removed that filter and flipped the default toallow_log_streaming=True, so the dormant buggy code now runs after every job.Fix
In
SystemLogStreamer._on_job_completed, explicitly seedRUN_NUMandIDENTITY_NAMEon the fresh stream context before handing it to the upload thread. The receiver continues to read job identity from the sender-forwardedpeer_ctxas before — preserving the existing peer-context-as-trust-source security model and leavingJobLogReceiver._get_trusted_stream_identityunchanged.Why sender-side instead of receiver-side
The bug originally proposed an "Option A" (read
job_idfromstream_ctxin the receiver) but that approach was rejected after reviewing the receiver's existing security test:tests/unit_test/app_common/logging/job_log_receiver_test.py::test_job_log_receiver_uses_trusted_peer_identity_for_storagedeliberately passes forgedstream_ctxvalues ("../../forged_job","../../forged_client") and asserts the receiver routes storage paths viapeer_ctx(authenticated cellnet identity), notstream_ctx(sender-supplied data). The function name_get_trusted_stream_identityand the test name make the trust boundary explicit.stream_ctx-based routing would let a malicious client spoof another job's id and write log content into that other job's storage directory. Fixing the bug at its real source — the sender that handed off an empty context — preserves the security model and makes the existing security test continue to pass unchanged. Confirmed by running it as part of the test suite below.Tests
test_completed_upload_seeds_run_num_and_identity_on_fresh_context— regression test that verifies the fresh context handed to the upload thread now carries the realjob_idandclient_name.test_job_log_receiver_uses_trusted_peer_identity_for_storage— confirms this fix does not relax the receiver's authentication model.SystemLogStreamertests unchanged.Verified locally
pytest tests/unit_test/app_common/logging -v→ 13 passed (12 existing + 1 new) on Python 3.12.10 / pytest 8.3.4.black,isort,flake8clean on all changed files.Test plan
grep "StorageException.*jobs-storage/unknown" .../server/log.txtreturns no lines after the fix.Types of changes
app_common/logging/.🤖 Generated with Claude Code