
Fix swarm CIFAR10 example: SwarmServerConfig.min_clients default + start_task_timeout#4555

Open
nvshaxie wants to merge 2 commits into NVIDIA:main from nvshaxie:fix/swarm-min-clients-and-start-timeout

Conversation

Contributor

@nvshaxie nvshaxie commented May 8, 2026

Fixes # .

Description

The swarm CIFAR10 example fails out of the box on both NVFlare main and 2.7.2 with two distinct issues, both of which need to be fixed for the example to complete a single round.

Summary

Bug 1 — SwarmServerConfig.min_clients defaulted to None

add_swarm() forwards every server_config field — including this None — to SwarmServerController(min_clients=...), overriding the controller's typed default of int = 0. The value then reaches BaseServerCtl.__init__ where if min_clients < 0: raises:

TypeError: '<' not supported between instances of 'NoneType' and 'int'

The example crashes at construction time, before any simulator is launched.

The other Optional[List[str]] fields (aggr_clients, train_clients, result_clients) tolerate None because the controller normalizes them via if not x: x = []. min_clients has no such normalization — its type contract is int, and the documented "0 means all participating clients are required" semantics in BaseServerCtl already covers the "not specified" case.

Fix: change SwarmServerConfig.min_clients default from None to int = 0, matching the type and default of both SwarmServerController.__init__ and BaseServerCtl.__init__.
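The failure mode and the asymmetry with the list fields can be reproduced without NVFlare at all. The snippet below is a minimal stand-in for the path described above; `base_ctl_init` and `normalize_clients` are illustrative names, not the real NVFlare functions:

```python
# Minimal stand-in for the failure path; these are illustrative helpers,
# not the actual NVFlare API.

def base_ctl_init(min_clients):
    # Mirrors the BaseServerCtl.__init__ check; "0 means all participating
    # clients are required".
    if min_clients < 0:  # raises TypeError when min_clients is None
        raise ValueError("min_clients must be >= 0")
    return min_clients

def normalize_clients(clients):
    # Mirrors the list-field normalization: None is coerced to [].
    if not clients:
        clients = []
    return clients

# The Optional[List[str]] fields tolerate None...
assert normalize_clients(None) == []

# ...but min_clients has no such normalization; None < 0 raises TypeError.
try:
    base_ctl_init(None)
except TypeError as e:
    print(e)  # '<' not supported between instances of 'NoneType' and 'int'

# With the fixed default of 0, construction succeeds.
assert base_ctl_init(0) == 0
```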

Bug 2 — start_task_timeout left at 10s default

swarm_script_runner_cifar10.py did not pass start_task_timeout, so the example used the framework default of 10s (nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10). After Bug 1 is fixed, the simulator starts and configures both sites, but swarm_start times out after 10s while site-1 is still initializing CIFAR-10 + PyTorch model, and the run aborts with FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1").

Fix: pass start_task_timeout=300 in the example, matching the pattern already used in examples/advanced/swarm_learning/swarm_pt/job.py (start_task_timeout=1200). Sibling PR #4554 applies the same example-side fix to the cyclic CIFAR10 example.
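Reconstructed from the changes list and sequence diagram in this PR, the example-side fix presumably amounts to a fragment like the following (import path and surrounding arguments are assumptions; only `num_rounds=3` and `start_task_timeout=300` come from the PR):

```python
# Sketch of the change in swarm_script_runner_cifar10.py; import path
# assumed from the changed-files list, other arguments elided.
from nvflare.app_common.ccwf.ccwf_job import SwarmServerConfig

server_config = SwarmServerConfig(
    num_rounds=3,
    # The framework default START_TASK_TIMEOUT = 10 s is too short for a
    # cold CIFAR-10 download + PyTorch model init on the starting client.
    start_task_timeout=300,
)
```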

Changes

  • nvflare/app_common/ccwf/ccwf_job.py: SwarmServerConfig.min_clients default changed from None to int = 0.
  • examples/advanced/job_api/pt/swarm_script_runner_cifar10.py: pass start_task_timeout=300 to SwarmServerConfig.
  • tests/unit_test/app_common/ccwf/test_ccwf_job_config.py: new regression test asserting SwarmServerConfig().min_clients == 0 so the default cannot silently regress to None.
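The new regression test reduces to two assertions. The snippet below demonstrates their shape against a stand-in dataclass; the real test targets the actual SwarmServerConfig in nvflare/app_common/ccwf/ccwf_job.py:

```python
# Stand-in mirroring the fixed default; SwarmServerConfigStandIn is a
# hypothetical substitute for the real SwarmServerConfig class.
from dataclasses import dataclass

@dataclass
class SwarmServerConfigStandIn:
    min_clients: int = 0  # fixed default; was None before this PR

def test_min_clients_defaults_to_zero():
    # Guards against the default silently regressing to None.
    assert SwarmServerConfigStandIn().min_clients == 0

def test_explicit_min_clients_preserved():
    # An explicitly passed value must survive unchanged.
    assert SwarmServerConfigStandIn(min_clients=2).min_clients == 2

test_min_clients_defaults_to_zero()
test_explicit_min_clients_preserved()
```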

Repro

Before fix (on a 256-core / A100X host, both main and 2.7.2):

cd examples/advanced/job_api/pt/
python swarm_script_runner_cifar10.py
# Python exits immediately:
# TypeError: '<' not supported between instances of 'NoneType' and 'int'
# in nvflare/app_common/ccwf/server_ctl.py:161 (BaseServerCtl.__init__).
# Workdir does not even contain a pt_swarm subdir.

After fix (Bug 1 only): the simulator starts, both sites are configured, then:

WFCommServer - INFO - task swarm_start exit with status TaskCompletionStatus.TIMEOUT
ServerRunner - ERROR - Aborting current RUN due to FATAL_SYSTEM_ERROR received:
                       failed to start workflow controller on client site-1

After both fixes: a 3-round swarm CCWF run completes in ~6:45 with no FATAL_SYSTEM_ERROR lines in the resulting pt_swarm/server/log.txt. All 92 existing ccwf unit tests still pass.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally (pytest tests/unit_test/app_common/ccwf/ 92 passed; black + isort + flake8 clean on changed files; full ./runtest.sh deferred to CI).
  • In-line docstrings updated.
  • Documentation updated.

…rt_task_timeout

The swarm CIFAR10 example fails out of the box on both NVFlare main and
2.7.2 with two distinct issues, both of which need to be fixed for the
example to complete a single round.

(1) `SwarmServerConfig.min_clients` defaulted to None.
`add_swarm()` forwards every server_config field — including this None —
to `SwarmServerController(min_clients=...)`, overriding the controller's
own typed default of `int = 0`. The value then reaches
`BaseServerCtl.__init__` where `if min_clients < 0:` raises:
`TypeError: '<' not supported between instances of 'NoneType' and 'int'`
The example crashes at construction time, before any simulator is
launched.

The other Optional-list fields (`aggr_clients`, `train_clients`,
`result_clients`) tolerate None because `SwarmServerController` and
`BaseServerCtl` explicitly normalize them via `if not x: x = []`.
`min_clients` has no such normalization — its type contract is `int`,
and the documented "0 means all participating clients are required"
semantics in `BaseServerCtl` already covers the "not specified" case.

Fix: change `SwarmServerConfig.min_clients` default from `None` to
`int = 0`, matching the type and default of both
`SwarmServerController.__init__` and `BaseServerCtl.__init__`.

(2) `swarm_script_runner_cifar10.py` did not pass start_task_timeout, so
the example used the framework default of 10 s
(`nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10`). On any
host where the starting client takes longer than 10 s to download CIFAR-10
and initialize the PyTorch model after responding to swarm_config, the
server-side `SwarmServerController` times out the swarm_start task and
aborts the run with FATAL_SYSTEM_ERROR ("failed to start workflow
controller on client site-1") even though the client is still progressing.

Fix: pass `start_task_timeout=300` in the example, matching the pattern
already used in `examples/advanced/swarm_learning/swarm_pt/job.py`.

Verified locally: with both fixes applied, the example completes a 3-round
swarm CCWF run on a 256-core / A100X node in ~6:45 with no
FATAL_SYSTEM_ERROR lines in the resulting `pt_swarm/server/log.txt`.

Add a regression unit test asserting `SwarmServerConfig().min_clients == 0`
so the default cannot silently regress to None.

Signed-off-by: shaxie <shaxie@nvidia.com>
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR fixes two distinct bugs that together prevented the swarm CIFAR-10 example from running end-to-end: a TypeError crash at construction time due to min_clients=None, and a swarm_start timeout caused by an insufficient 10-second default during CIFAR-10/PyTorch initialization.

  • ccwf_job.py: SwarmServerConfig.min_clients default changed from None to int = 0, aligning it with SwarmServerController and BaseServerCtl type contracts and eliminating the TypeError on min_clients < 0.
  • swarm_script_runner_cifar10.py: start_task_timeout=300 added to both SwarmServerConfig and CrossSiteEvalConfig, covering both the swarm-start and CSE-start initialization races.
  • test_ccwf_job_config.py: New regression tests guard the min_clients default and verify explicit values pass through unchanged.

Confidence Score: 5/5

All three changes are targeted, mechanical fixes with no side-effects on existing paths; the regression test locks in the corrected default.

The min_clients: int = 0 change is a one-line type-default alignment with zero behavioral impact beyond fixing the crash. The start_task_timeout=300 additions are example-side only and do not touch any library code. The new tests are additive. No existing API contracts are altered.

No files require special attention.

Important Files Changed

  • nvflare/app_common/ccwf/ccwf_job.py: single-line fix; min_clients default changed from None to int = 0, matching the SwarmServerController and BaseServerCtl type contracts and eliminating the TypeError at construction time.
  • examples/advanced/job_api/pt/swarm_script_runner_cifar10.py: passes start_task_timeout=300 to both SwarmServerConfig and CrossSiteEvalConfig, addressing the 10s timeout race against CIFAR-10/PyTorch model initialization on both the swarm and CSE workflows.
  • tests/unit_test/app_common/ccwf/test_ccwf_job_config.py: new regression test file asserting SwarmServerConfig().min_clients == 0 and that an explicit value is preserved; prevents the None default from silently reappearing.

Sequence Diagram

sequenceDiagram
    participant Script as swarm_script_runner_cifar10.py
    participant SSC as SwarmServerConfig
    participant addSwarm as CCWFJob.add_swarm()
    participant SSCtrl as SwarmServerController
    participant BSCtl as BaseServerCtl.__init__

    Script->>SSC: "SwarmServerConfig(num_rounds=3, start_task_timeout=300)"
    Note over SSC: min_clients: int = 0 (fixed from None)
    Script->>addSwarm: add_swarm(server_config, client_config, cse_config)
    addSwarm->>SSCtrl: "SwarmServerController(min_clients=0, ...)"
    SSCtrl->>BSCtl: "BaseServerCtl.__init__(min_clients=0)"
    Note over BSCtl: if min_clients < 0: raise ValueError (no TypeError)

    Note over Script: start_task_timeout=300 swarm_start waits up to 300s
    Note over Script: CrossSiteEvalConfig(start_task_timeout=300) cse_start waits up to 300s

Reviews (2): Last reviewed commit: "Address Greptile review: also bump Cross..."

…eout to 300s

Greptile noted that after fixing swarm_start, the CSE workflow that runs
after swarm completes would still hit the same race on slower hosts:
CrossSiteEvalConfig.start_task_timeout defaults to Constant.START_TASK_TIMEOUT
(10s), and add_cross_site_eval() forwards it into CrossSiteEvalServerController.
On a host where the swarm start_task_timeout was insufficient, the cse_start
task could time out for the same reason — clients still warming dataset/model
state — and abort the CSE workflow with FATAL_SYSTEM_ERROR.

Pass start_task_timeout=300 to CrossSiteEvalConfig in the swarm CIFAR10
example, matching the swarm_start fix above. Defense-in-depth — even though
my local test passed without this change (caches were warm by the time CSE
ran), the framework default is unsafe for the example's stated workload.

Verified: pytest L1_job_api_example_pt_swarm_script_executor_cifar10 still
passes in ~6:48 with this additional fix.

Signed-off-by: shaxie <shaxie@nvidia.com>
Contributor Author

nvshaxie commented May 8, 2026

@greptile-apps Good catch — CrossSiteEvalConfig.start_task_timeout was indeed still at the 10s default. My local test happened to pass (caches were warm by the time CSE ran), but on a slower host or a cold cache the same race would surface on cse_start.

Pushed 80a48d3: pass start_task_timeout=300 to CrossSiteEvalConfig as well, with a comment cross-referencing the swarm_start fix above. Verified the test still passes (~6:48).

Contributor

greptile-apps Bot commented May 8, 2026

The fix in 80a48d3 looks correct and complete. A few observations:

  1. start_task_timeout=300 — addresses exactly the cse_start race I flagged. The comment accurately describes the issue.

  2. eval_task_timeout=300 — this is a nice bonus. CrossSiteEvalConfig.eval_task_timeout defaults to 30 (not the 10s START_TASK_TIMEOUT), so this is a 30→300 bump. On a slow host with 2 clients evaluating each other's models, 30s per eval task could plausibly be tight for CIFAR-10. Worth keeping.

One minor nit: the inline comment on eval_task_timeout says "same race as swarm_start above", but eval_task_timeout governs per-evaluation task completion, not the handshake/startup race. The risk there is more "eval takes longer than 30s" than "client hasn't initialized yet." The comment is close enough for a reader to understand the intent, but if you want precision:

# eval_task_timeout default of 30s can be exceeded on a slow host running
# full CIFAR-10 inference; 300s matches the start_task_timeout budget above.
eval_task_timeout=300,

Not a blocker — the code itself is correct. LGTM on the substance.

@YuanTingHsieh
Collaborator

@nvshaxie the changes in this PR LGTM.
However, the team prefers to create fix PRs against the 2.8 branch first and then cherry-pick them back to main. Can you create the same PR against 2.8 first? Thanks!
