Skip to content

Skip auth filter for cellnet protocol-level bye messages#4540

Open
nvshaxie wants to merge 1 commit intoNVIDIA:mainfrom
nvshaxie:fix/cellnet-bye-unauthenticated-error-log
Open

Skip auth filter for cellnet protocol-level bye messages#4540
nvshaxie wants to merge 1 commit intoNVIDIA:mainfrom
nvshaxie:fix/cellnet-bye-unauthenticated-error-log

Conversation

@nvshaxie
Copy link
Copy Markdown
Contributor

@nvshaxie nvshaxie commented May 7, 2026

Summary

  • During any cell shutdown (e.g. nvflare poc stop), Cell.stop() broadcasts a goodbye to all peers using topic="bye" on
    channel="cellnet.channel" with an empty Message(). As a cellnet protocol-level lifecycle primitive, this message
    intentionally carries no FL-level auth headers — the same reason Register/Challenge already bypass the auth filter.
  • Without this exemption, every shutdown emits a misleading FederatedServer ERROR unauthenticated msg (channel='cellnet.channel' topic='bye') missing client name line in the server log, polluting log analysis. As a side
    effect, the receiver's _peer_goodbye handler is never invoked, so agent-dict cleanup waits on heartbeat timeout instead of
    executing immediately.
  • This PR extends the existing Register/Challenge bypass in validate_auth_headers() to cover the symmetric end-of-life
    bye message.

Changes

  • nvflare/private/fed/authenticator.py: extend the auth-filter bypass to (topic="bye", channel="cellnet.channel")
    alongside the existing Register/Challenge exemption.

Repro

Before fix:

nvflare poc prepare --number_of_clients 2 --force
nvflare poc start --exclude admin@nvidia.com
nvflare poc stop
grep ERROR /tmp/nvflare/poc/example_project/prod_00/server/log.txt
# produces one ERROR line: FederatedServer - ERROR - unauthenticated msg (channel='cellnet.channel' topic='bye') ... missing
client name

After fix: zero ERROR lines.

Security

The bye handler _peer_goodbye in nvflare/fuel/f3/cellnet/core_cell.py only removes the sender from the local agents
dict and acks back — no sensitive operations. Allowing unauthenticated bye through has zero security impact.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh (license + style; full unit tests deferred to CI).
  • In-line docstrings updated.
  • Documentation updated.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR extends the auth-filter bypass in validate_auth_headers() to cover the cellnet protocol-level bye message sent by Cell.stop() during shutdown, eliminating a spurious unauthenticated msg error on every graceful shutdown and enabling immediate _peer_goodbye cleanup instead of waiting for heartbeat timeout.

  • nvflare/fuel/f3/cellnet/defs.py: adds CellChannel.CELLNET = \"cellnet.channel\" and CellChannelTopic.Bye = \"bye\" to replace raw string literals, consistent with the existing Register/Challenge constants.
  • nvflare/private/fed/authenticator.py: adds a guarded bypass for (topic=bye, channel=cellnet.channel) that only fires when TO_CELL is non-None and equals DESTINATION, ensuring the exemption is restricted to direct-neighbor deliveries and rejecting both the "both-headers-absent" None == None bypass vector and forwarded-bye forwarding attacks.

Confidence Score: 5/5

Safe to merge — the change is a minimal, well-bounded auth-filter exemption for a protocol-level lifecycle message that carries no FL data.

The new bypass is restricted to direct-neighbor bye messages via the to_cell is not None AND to_cell == destination guard. All three previously-identified attack vectors are addressed. The _peer_goodbye handler evicts based on the network-layer ENDPOINT property rather than any message header, so a crafted bye cannot impersonate a legitimate peer.

No files require special attention.

Important Files Changed

Filename Overview
nvflare/fuel/f3/cellnet/defs.py Adds CellChannel.CELLNET and CellChannelTopic.Bye constants; values match the private _CHANNEL/_TOPIC_BYE literals in core_cell.py and are re-exported through nvflare.private.defs for use in authenticator.py.
nvflare/private/fed/authenticator.py Adds a narrowly-scoped auth bypass for direct-neighbor bye messages; the to_cell is not None guard prevents the None==None bypass, and to_cell == destination prevents forwarded-bye forwarding attacks; logic is correct across all None/non-None header combinations.

Sequence Diagram

sequenceDiagram
    participant C as Client Cell
    participant AF as Auth Filter (validate_auth_headers)
    participant PG as _peer_goodbye handler
    participant A as agents dict

    Note over C: Cell.stop() called
    C->>AF: "bye msg (channel=cellnet.channel, TO_CELL=server, DESTINATION=server)"
    AF->>AF: "topic==Bye AND channel==CELLNET?"
    AF->>AF: to_cell is not None? ✓
    AF->>AF: "to_cell == destination? ✓ (direct-neighbor)"
    AF-->>PG: return None (bypass auth)
    PG->>A: agents.pop(peer_ep.name) via ENDPOINT prop
    PG-->>C: ack Message()

    Note over C,AF: Forwarded/crafted bye (TO_CELL ≠ DESTINATION)
    C->>AF: "crafted bye (TO_CELL=relay, DESTINATION=server)"
    AF->>AF: "to_cell != destination → guard fails"
    AF-->>C: UNAUTHENTICATED error

    Note over C,AF: Both headers absent
    C->>AF: crafted bye (no TO_CELL, no DESTINATION)
    AF->>AF: to_cell is not None? ✗
    AF-->>C: UNAUTHENTICATED error
Loading

Reviews (4): Last reviewed commit: "Skip auth filter for cellnet protocol-le..." | Re-trigger Greptile

Comment thread nvflare/private/fed/authenticator.py Outdated
Comment on lines +321 to +323
if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or (
topic == "bye" and channel == "cellnet.channel"
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The "bye" topic and "cellnet.channel" channel are raw string literals here, but the rest of this function already uses CellChannelTopic.Register, CellChannelTopic.Challenge, and CellChannel.SERVER_MAIN for the same purpose. The underlying values live as private module-level constants _TOPIC_BYE = "bye" and _CHANNEL = "cellnet.channel" in core_cell.py. Adding BYE = "bye" to CellChannelTopic and CELLNET = "cellnet.channel" to CellChannel in defs.py would keep this consistent with the existing pattern and prevent silent drift if the strings ever change.

Suggested change
if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or (
topic == "bye" and channel == "cellnet.channel"
):
if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or (
topic == CellChannelTopic.Bye and channel == CellChannel.CELLNET
):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bypass runs before forwarding, so an unauthenticated peer can send channel=cellnet.channel/topic=bye with a different final destination and have the server/relay forward it without FL auth. At the target, _peer_goodbye removes the immediate endpoint from agents, so a crafted forwarded bye can make another cell drop its server/relay agent. Please restrict the bye exemption to direct-neighbor messages, for example by checking the endpoint prop/origin and that TO_CELL == DESTINATION before skipping auth.

@nvshaxie nvshaxie force-pushed the fix/cellnet-bye-unauthenticated-error-log branch from 49f8ad2 to 0932b28 Compare May 7, 2026 09:00
@nvshaxie
Copy link
Copy Markdown
Contributor Author

nvshaxie commented May 7, 2026

Thanks for the review @greptile-apps. Addressed the named-constant feedback in the latest force-push:

  • Added CellChannelTopic.Bye = "bye" and CellChannel.CELLNET = "cellnet.channel" in nvflare/fuel/f3/cellnet/defs.py.
  • Updated the bypass in authenticator.py to use CellChannelTopic.Bye and CellChannel.CELLNET instead of raw string literals, matching the existing Register/Challenge pattern.

Left the private _TOPIC_BYE/_CHANNEL in core_cell.py untouched to keep this PR focused; switching them to the new public constants can be a small follow-up if desired.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

The implementation looks solid. The import chain is clean: authenticator.py imports CellChannel/CellChannelTopic from nvflare.private.defs, which re-exports them from nvflare.fuel.f3.cellnet.defs — so the new CellChannel.CELLNET and CellChannelTopic.Bye constants flow through correctly with no additional import changes needed.

The bypass condition reads well:

if (topic in [CellChannelTopic.Register, CellChannelTopic.Challenge] and channel == CellChannel.SERVER_MAIN) or (
    topic == CellChannelTopic.Bye and channel == CellChannel.CELLNET
):

And the values match what core_cell.py uses (_CHANNEL = "cellnet.channel", _TOPIC_BYE = "bye"), so there's no functional regression. The decision to leave core_cell.py's private constants as-is is reasonable scope management — the follow-up migration would be a self-contained cleanup.

LGTM. Approved.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

@pcnudde pcnudde requested a review from nvidianz May 7, 2026 19:33
@nvshaxie nvshaxie force-pushed the fix/cellnet-bye-unauthenticated-error-log branch from 0932b28 to 58f02d9 Compare May 8, 2026 01:36
@nvshaxie
Copy link
Copy Markdown
Contributor Author

nvshaxie commented May 8, 2026

Thanks @YuanTingHsieh — that's a real attack vector I missed. Updated in latest force-push (58f02d9):

The bypass now only fires when TO_CELL == DESTINATION (direct-neighbor). Forwarded bye messages (TO_CELL != DESTINATION) fall through to the normal auth check, so an unauthenticated remote peer can no longer craft a forwarded bye to drop a downstream cell's server/relay agent.

Verified locally that the legitimate nvflare poc stop path (broadcast bye to direct neighbors) still produces 0 ERROR lines in server log.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Comment on lines +331 to +335
if (
topic == CellChannelTopic.Bye
and channel == CellChannel.CELLNET
and message.get_header(MessageHeaderKey.TO_CELL) == message.get_header(MessageHeaderKey.DESTINATION)
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 When both TO_CELL and DESTINATION headers are absent from a crafted message, get_header() returns None for each, and None == None evaluates to True in Python. This means the direct-neighbor guard intended to protect against forwarded-bye attacks can be bypassed by an unauthenticated sender that simply omits both routing headers. While _peer_goodbye determines the agent to evict via request.get_prop(MessagePropKey.ENDPOINT) (the actual connection property, not headers), the auth filter bypass itself should not activate unless a real routing destination exists.

Suggested change
if (
topic == CellChannelTopic.Bye
and channel == CellChannel.CELLNET
and message.get_header(MessageHeaderKey.TO_CELL) == message.get_header(MessageHeaderKey.DESTINATION)
):
to_cell = message.get_header(MessageHeaderKey.TO_CELL)
destination = message.get_header(MessageHeaderKey.DESTINATION)
if (
topic == CellChannelTopic.Bye
and channel == CellChannel.CELLNET
and to_cell is not None
and to_cell == destination
):

When a cell shuts down, Cell.stop() broadcasts an empty Message() with
topic="bye" on channel="cellnet.channel" to notify peers. Because cellnet
goodbye is a protocol-level lifecycle primitive, it intentionally carries
no FL-level auth headers — the same reason Register/Challenge messages
already bypass the auth filter. Without an exemption, every shutdown logs
"FederatedServer ERROR unauthenticated msg topic='bye' missing client
name" in server/log.txt, polluting log analysis and tripping any tool or
test that greps for ERROR.

This PR exempts cellnet bye from the auth filter, but **only when the
message is for the immediate (direct-neighbor) recipient**, i.e. when
TO_CELL == DESTINATION. If TO_CELL != DESTINATION the message would be
forwarded by the server/relay, which would let an unauthenticated remote
peer craft a bye with DESTINATION pointing at a downstream cell whose
_peer_goodbye would then drop the immediate endpoint (the server/relay)
from its agents dict — disrupting the FL system. Forwarded bye messages
fall through to the normal auth check.

The receiver-side handler (_peer_goodbye in core_cell.py) only removes
the sender from the local agents dict and acks back, so allowing an
unauthenticated direct-neighbor bye is safe.

Repro before fix:
  nvflare poc prepare --number_of_clients 2 --force
  nvflare poc start --exclude admin@nvidia.com
  nvflare poc stop
  grep ERROR /tmp/nvflare/poc/example_project/prod_00/server/log.txt
  # produces one ERROR line: FederatedServer - ERROR - unauthenticated msg
  # (channel='cellnet.channel' topic='bye') ... missing client name

After fix: zero ERROR lines.

Signed-off-by: shaxie <shaxie@nvidia.com>
@nvshaxie
Copy link
Copy Markdown
Contributor Author

nvshaxie commented May 8, 2026

Thanks for catching that @greptile-apps. Addressed in latest force-push:

Added the explicit to_cell is not None check so a crafted message that omits both routing headers can no longer reach the bypass via None == None. Local repro confirms the legitimate nvflare poc stop path still produces 0 ERROR lines.

@nvshaxie nvshaxie force-pushed the fix/cellnet-bye-unauthenticated-error-log branch from 58f02d9 to efe3019 Compare May 8, 2026 01:42
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

The fix looks solid. The to_cell is not None guard correctly closes the None == None bypass — a crafted message omitting both TO_CELL and DESTINATION headers will now fall through to the normal auth check rather than getting a free pass.

The updated condition is clean and the inline comment explains the routing-security rationale well. No further concerns on this thread.

Copy link
Copy Markdown
Collaborator

@nvidianz nvidianz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have 2.8 branch, we should create a PR against 2.8. When it's merged, cherry-pick it back to main.

RETURN_ONLY = "return_only"
EDGE_REQUEST = "edge_request"
HCI = "hci_channel"
CELLNET = "cellnet.channel"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core_cell.py already defines these as _CHANNEL / _TOPIC_BYE. Let's move all such constants into defs.py (the canonical home) and have core_cell.py import them, to avoid duplicates that can drift.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants