Skip auth filter for cellnet protocol-level bye messages#4540
Skip auth filter for cellnet protocol-level bye messages#4540nvshaxie wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR extends the auth-filter bypass in
Confidence Score: 5/5Safe to merge — the change is a minimal, well-bounded auth-filter exemption for a protocol-level lifecycle message that carries no FL data. The new bypass is restricted to direct-neighbor bye messages via the to_cell is not None AND to_cell == destination guard. All three previously-identified attack vectors are addressed. The _peer_goodbye handler evicts based on the network-layer ENDPOINT property rather than any message header, so a crafted bye cannot impersonate a legitimate peer. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant C as Client Cell
participant AF as Auth Filter (validate_auth_headers)
participant PG as _peer_goodbye handler
participant A as agents dict
Note over C: Cell.stop() called
C->>AF: "bye msg (channel=cellnet.channel, TO_CELL=server, DESTINATION=server)"
AF->>AF: "topic==Bye AND channel==CELLNET?"
AF->>AF: to_cell is not None? ✓
AF->>AF: "to_cell == destination? ✓ (direct-neighbor)"
AF-->>PG: return None (bypass auth)
PG->>A: agents.pop(peer_ep.name) via ENDPOINT prop
PG-->>C: ack Message()
Note over C,AF: Forwarded/crafted bye (TO_CELL ≠ DESTINATION)
C->>AF: "crafted bye (TO_CELL=relay, DESTINATION=server)"
AF->>AF: "to_cell != destination → guard fails"
AF-->>C: UNAUTHENTICATED error
Note over C,AF: Both headers absent
C->>AF: crafted bye (no TO_CELL, no DESTINATION)
AF->>AF: to_cell is not None? ✗
AF-->>C: UNAUTHENTICATED error
Reviews (4): Last reviewed commit: "Skip auth filter for cellnet protocol-le..." | Re-trigger Greptile |
| if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or ( | ||
| topic == "bye" and channel == "cellnet.channel" | ||
| ): |
There was a problem hiding this comment.
The
"bye" topic and "cellnet.channel" channel are raw string literals here, but the rest of this function already uses CellChannelTopic.Register, CellChannelTopic.Challenge, and CellChannel.SERVER_MAIN for the same purpose. The underlying values live as private module-level constants _TOPIC_BYE = "bye" and _CHANNEL = "cellnet.channel" in core_cell.py. Adding BYE = "bye" to CellChannelTopic and CELLNET = "cellnet.channel" to CellChannel in defs.py would keep this consistent with the existing pattern and prevent silent drift if the strings ever change.
| if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or ( | |
| topic == "bye" and channel == "cellnet.channel" | |
| ): | |
| if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or ( | |
| topic == CellChannelTopic.Bye and channel == CellChannel.CELLNET | |
| ): |
There was a problem hiding this comment.
This bypass runs before forwarding, so an unauthenticated peer can send channel=cellnet.channel/topic=bye with a different final destination and have the server/relay forward it without FL auth. At the target, _peer_goodbye removes the immediate endpoint from agents, so a crafted forwarded bye can make another cell drop its server/relay agent. Please restrict the bye exemption to direct-neighbor messages, for example by checking the endpoint prop/origin and that TO_CELL == DESTINATION before skipping auth.
49f8ad2 to
0932b28
Compare
|
Thanks for the review @greptile-apps. Addressed the named-constant feedback in the latest force-push:
Left the private |
|
The implementation looks solid. The import chain is clean: The bypass condition reads well: if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or (
topic == CellChannelTopic.Bye and channel == CellChannel.CELLNET
):And the values match what LGTM. Approved. Tip: You can customize Greptile's behavior for this repo with |
0932b28 to
58f02d9
Compare
|
Thanks @YuanTingHsieh — that's a real attack vector I missed. Updated in latest force-push (58f02d9): The bypass now only fires when Verified locally that the legitimate |
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
| if ( | ||
| topic == CellChannelTopic.Bye | ||
| and channel == CellChannel.CELLNET | ||
| and message.get_header(MessageHeaderKey.TO_CELL) == message.get_header(MessageHeaderKey.DESTINATION) | ||
| ): |
There was a problem hiding this comment.
When both
TO_CELL and DESTINATION headers are absent from a crafted message, get_header() returns None for each, and None == None evaluates to True in Python. This means the direct-neighbor guard intended to protect against forwarded-bye attacks can be bypassed by an unauthenticated sender that simply omits both routing headers. While _peer_goodbye determines the agent to evict via request.get_prop(MessagePropKey.ENDPOINT) (the actual connection property, not headers), the auth filter bypass itself should not activate unless a real routing destination exists.
| if ( | |
| topic == CellChannelTopic.Bye | |
| and channel == CellChannel.CELLNET | |
| and message.get_header(MessageHeaderKey.TO_CELL) == message.get_header(MessageHeaderKey.DESTINATION) | |
| ): | |
| to_cell = message.get_header(MessageHeaderKey.TO_CELL) | |
| destination = message.get_header(MessageHeaderKey.DESTINATION) | |
| if ( | |
| topic == CellChannelTopic.Bye | |
| and channel == CellChannel.CELLNET | |
| and to_cell is not None | |
| and to_cell == destination | |
| ): |
When a cell shuts down, Cell.stop() broadcasts an empty Message() with topic="bye" on channel="cellnet.channel" to notify peers. Because cellnet goodbye is a protocol-level lifecycle primitive, it intentionally carries no FL-level auth headers — the same reason Register/Challenge messages already bypass the auth filter. Without an exemption, every shutdown logs "FederatedServer ERROR unauthenticated msg topic='bye' missing client name" in server/log.txt, polluting log analysis and tripping any tool or test that greps for ERROR. This PR exempts cellnet bye from the auth filter, but **only when the message is for the immediate (direct-neighbor) recipient**, i.e. when TO_CELL == DESTINATION. If TO_CELL != DESTINATION the message would be forwarded by the server/relay, which would let an unauthenticated remote peer craft a bye with DESTINATION pointing at a downstream cell whose _peer_goodbye would then drop the immediate endpoint (the server/relay) from its agents dict — disrupting the FL system. Forwarded bye messages fall through to the normal auth check. The receiver-side handler (_peer_goodbye in core_cell.py) only removes the sender from the local agents dict and acks back, so allowing an unauthenticated direct-neighbor bye is safe. Repro before fix: nvflare poc prepare --number_of_clients 2 --force nvflare poc start --exclude admin@nvidia.com nvflare poc stop grep ERROR /tmp/nvflare/poc/example_project/prod_00/server/log.txt # produces one ERROR line: FederatedServer - ERROR - unauthenticated msg # (channel='cellnet.channel' topic='bye') ... missing client name After fix: zero ERROR lines. Signed-off-by: shaxie <shaxie@nvidia.com>
|
Thanks for catching that @greptile-apps. Addressed in latest force-push: Added the explicit |
58f02d9 to
efe3019
Compare
|
The fix looks solid. The The updated condition is clean and the inline comment explains the routing-security rationale well. No further concerns on this thread. |
nvidianz
left a comment
There was a problem hiding this comment.
Since we already have 2.8 branch, we should create a PR against 2.8. When it's merged, cherry-pick it back to main.
| RETURN_ONLY = "return_only" | ||
| EDGE_REQUEST = "edge_request" | ||
| HCI = "hci_channel" | ||
| CELLNET = "cellnet.channel" |
There was a problem hiding this comment.
core_cell.py already defines these as _CHANNEL / _TOPIC_BYE. Let's move all such constants into defs.py (the canonical home) and have core_cell.py import them, to avoid duplicates that can drift.
Summary
nvflare poc stop),Cell.stop()broadcasts a goodbye to all peers usingtopic="bye"onchannel="cellnet.channel"with an emptyMessage(). As a cellnet protocol-level lifecycle primitive, this messageintentionally carries no FL-level auth headers — the same reason
Register/Challengealready bypass the auth filter.FederatedServer ERROR unauthenticated msg (channel='cellnet.channel' topic='bye') missing client nameline in the server log, polluting log analysis. As a sideeffect, the receiver's
_peer_goodbyehandler is never invoked, so agent-dict cleanup waits on heartbeat timeout instead ofexecuting immediately.
Register/Challengebypass invalidate_auth_headers()to cover the symmetric end-of-lifebyemessage.Changes
nvflare/private/fed/authenticator.py: extend the auth-filter bypass to(topic="bye", channel="cellnet.channel")alongside the existing
Register/Challengeexemption.Repro
Before fix:
After fix: zero ERROR lines.
Security
The bye handler
_peer_goodbyeinnvflare/fuel/f3/cellnet/core_cell.pyonly removes the sender from the localagentsdict and acks back — no sensitive operations. Allowing unauthenticated bye through has zero security impact.
Types of changes
./runtest.sh(license + style; full unit tests deferred to CI).