
Fix swarm CIFAR10 example: SwarmServerConfig.min_clients default + start_task_timeout#4555

Open
nvshaxie wants to merge 2 commits into NVIDIA:main from nvshaxie:fix/swarm-min-clients-and-start-timeout

Conversation

Contributor

@nvshaxie nvshaxie commented May 8, 2026

Fixes # .

Description

The swarm CIFAR10 example fails out of the box on both NVFlare main and 2.7.2 with two distinct issues, both of which need to be fixed for the example to complete a single round.

Summary

Bug 1 — SwarmServerConfig.min_clients defaulted to None

add_swarm() forwards every server_config field — including this None — to SwarmServerController(min_clients=...), overriding the controller's typed default of int = 0. The value then reaches BaseServerCtl.__init__ where if min_clients < 0: raises:

TypeError: '<' not supported between instances of 'NoneType' and 'int'

The example crashes at construction time, before any simulator is launched.

The other Optional[List[str]] fields (aggr_clients, train_clients, result_clients) tolerate None because the controller normalizes them via if not x: x = []. min_clients has no such normalization — its type contract is int, and the documented "0 means all participating clients are required" semantics in BaseServerCtl already covers the "not specified" case.

Fix: change SwarmServerConfig.min_clients default from None to int = 0, matching the type and default of both SwarmServerController.__init__ and BaseServerCtl.__init__.
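The failure mode and the asymmetry with the list fields can be reproduced without NVFlare at all. The snippet below is a minimal stand-in for the path described above; `base_ctl_init` and `normalize_clients` are illustrative names, not the real NVFlare functions:

```python
# Minimal stand-in for the failure path; these are illustrative helpers,
# not the actual NVFlare API.

def base_ctl_init(min_clients):
    # Mirrors the BaseServerCtl.__init__ check; "0 means all participating
    # clients are required".
    if min_clients < 0:  # raises TypeError when min_clients is None
        raise ValueError("min_clients must be >= 0")
    return min_clients

def normalize_clients(clients):
    # Mirrors the list-field normalization: None is coerced to [].
    if not clients:
        clients = []
    return clients

# The Optional[List[str]] fields tolerate None...
assert normalize_clients(None) == []

# ...but min_clients has no such normalization; None < 0 raises TypeError.
try:
    base_ctl_init(None)
except TypeError as e:
    print(e)  # '<' not supported between instances of 'NoneType' and 'int'

# With the fixed default of 0, construction succeeds.
assert base_ctl_init(0) == 0
```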

Bug 2 — start_task_timeout left at 10s default

swarm_script_runner_cifar10.py did not pass start_task_timeout, so the example used the framework default of 10s (nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10). After Bug 1 is fixed, the simulator starts and configures both sites, but swarm_start times out after 10s while site-1 is still initializing CIFAR-10 + PyTorch model, and the run aborts with FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1").

Fix: pass start_task_timeout=300 in the example, matching the pattern already used in examples/advanced/swarm_learning/swarm_pt/job.py (start_task_timeout=1200). Sibling PR #4554 applies the same example-side fix to the cyclic CIFAR10 example.
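Reconstructed from the changes list and sequence diagram in this PR, the example-side fix presumably amounts to a fragment like the following (import path and surrounding arguments are assumptions; only `num_rounds=3` and `start_task_timeout=300` come from the PR):

```python
# Sketch of the change in swarm_script_runner_cifar10.py; import path
# assumed from the changed-files list, other arguments elided.
from nvflare.app_common.ccwf.ccwf_job import SwarmServerConfig

server_config = SwarmServerConfig(
    num_rounds=3,
    # The framework default START_TASK_TIMEOUT = 10 s is too short for a
    # cold CIFAR-10 download + PyTorch model init on the starting client.
    start_task_timeout=300,
)
```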

Changes

  • nvflare/app_common/ccwf/ccwf_job.py: SwarmServerConfig.min_clients default changed from None to int = 0.
  • examples/advanced/job_api/pt/swarm_script_runner_cifar10.py: pass start_task_timeout=300 to SwarmServerConfig.
  • tests/unit_test/app_common/ccwf/test_ccwf_job_config.py: new regression test asserting SwarmServerConfig().min_clients == 0 so the default cannot silently regress to None.
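The new regression test reduces to two assertions. The snippet below demonstrates their shape against a stand-in dataclass; the real test targets the actual SwarmServerConfig in nvflare/app_common/ccwf/ccwf_job.py:

```python
# Stand-in mirroring the fixed default; SwarmServerConfigStandIn is a
# hypothetical substitute for the real SwarmServerConfig class.
from dataclasses import dataclass

@dataclass
class SwarmServerConfigStandIn:
    min_clients: int = 0  # fixed default; was None before this PR

def test_min_clients_defaults_to_zero():
    # Guards against the default silently regressing to None.
    assert SwarmServerConfigStandIn().min_clients == 0

def test_explicit_min_clients_preserved():
    # An explicitly passed value must survive unchanged.
    assert SwarmServerConfigStandIn(min_clients=2).min_clients == 2

test_min_clients_defaults_to_zero()
test_explicit_min_clients_preserved()
```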

Repro

Before fix (on a 256-core / A100X host, both main and 2.7.2):

cd examples/advanced/job_api/pt/
python swarm_script_runner_cifar10.py
# Python exits immediately:
# TypeError: '<' not supported between instances of 'NoneType' and 'int'
# in nvflare/app_common/ccwf/server_ctl.py:161 (BaseServerCtl.__init__).
# Workdir does not even contain a pt_swarm subdir.

After fix (Bug 1 only): the simulator starts, both sites are configured, then:

WFCommServer - INFO - task swarm_start exit with status TaskCompletionStatus.TIMEOUT
ServerRunner - ERROR - Aborting current RUN due to FATAL_SYSTEM_ERROR received:
                       failed to start workflow controller on client site-1

After both fixes: a 3-round swarm CCWF run completes in ~6:45 with no FATAL_SYSTEM_ERROR lines in the resulting pt_swarm/server/log.txt. All 92 existing ccwf unit tests still pass.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally (pytest tests/unit_test/app_common/ccwf/ 92 passed; black + isort + flake8 clean on changed files; full ./runtest.sh deferred to CI).
  • In-line docstrings updated.
  • Documentation updated.

…rt_task_timeout

The swarm CIFAR10 example fails out of the box on both NVFlare main and
2.7.2 with two distinct issues, both of which need to be fixed for the
example to complete a single round.

(1) `SwarmServerConfig.min_clients` defaulted to None.
`add_swarm()` forwards every server_config field — including this None —
to `SwarmServerController(min_clients=...)`, overriding the controller's
own typed default of `int = 0`. The value then reaches
`BaseServerCtl.__init__` where `if min_clients < 0:` raises:
`TypeError: '<' not supported between instances of 'NoneType' and 'int'`
The example crashes at construction time, before any simulator is
launched.

The other Optional-list fields (`aggr_clients`, `train_clients`,
`result_clients`) tolerate None because `SwarmServerController` and
`BaseServerCtl` explicitly normalize them via `if not x: x = []`.
`min_clients` has no such normalization — its type contract is `int`,
and the documented "0 means all participating clients are required"
semantics in `BaseServerCtl` already covers the "not specified" case.

Fix: change `SwarmServerConfig.min_clients` default from `None` to
`int = 0`, matching the type and default of both
`SwarmServerController.__init__` and `BaseServerCtl.__init__`.

(2) `swarm_script_runner_cifar10.py` did not pass start_task_timeout, so
the example used the framework default of 10 s
(`nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10`). On any
host where the starting client takes longer than 10 s to download CIFAR-10
and initialize the PyTorch model after responding to swarm_config, the
server-side `SwarmServerController` times out the swarm_start task and
aborts the run with FATAL_SYSTEM_ERROR ("failed to start workflow
controller on client site-1") even though the client is still progressing.

Fix: pass `start_task_timeout=300` in the example, matching the pattern
already used in `examples/advanced/swarm_learning/swarm_pt/job.py`.

Verified locally: with both fixes applied, the example completes a 3-round
swarm CCWF run on a 256-core / A100X node in ~6:45 with no
FATAL_SYSTEM_ERROR lines in the resulting `pt_swarm/server/log.txt`.

Add a regression unit test asserting `SwarmServerConfig().min_clients == 0`
so the default cannot silently regress to None.

Signed-off-by: shaxie <shaxie@nvidia.com>
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR fixes two distinct bugs that together prevented the swarm CIFAR-10 example from running end-to-end: a TypeError crash at construction time due to min_clients=None, and a swarm_start timeout caused by an insufficient 10-second default during CIFAR-10/PyTorch initialization.

  • ccwf_job.py: SwarmServerConfig.min_clients default changed from None to int = 0, aligning it with SwarmServerController and BaseServerCtl type contracts and eliminating the TypeError on min_clients < 0.
  • swarm_script_runner_cifar10.py: start_task_timeout=300 added to both SwarmServerConfig and CrossSiteEvalConfig, covering both the swarm-start and CSE-start initialization races.
  • test_ccwf_job_config.py: New regression tests guard the min_clients default and verify explicit values pass through unchanged.

Confidence Score: 5/5

All three changes are targeted, mechanical fixes with no side-effects on existing paths; the regression test locks in the corrected default.

The min_clients: int = 0 change is a one-line type-default alignment with zero behavioral impact beyond fixing the crash. The start_task_timeout=300 additions are example-side only and do not touch any library code. The new tests are additive. No existing API contracts are altered.

No files require special attention.

Important Files Changed

  • nvflare/app_common/ccwf/ccwf_job.py: single-line fix; min_clients default changed from None to int = 0, matching the SwarmServerController and BaseServerCtl type contracts and eliminating the TypeError at construction time.
  • examples/advanced/job_api/pt/swarm_script_runner_cifar10.py: passes start_task_timeout=300 to both SwarmServerConfig and CrossSiteEvalConfig, addressing the 10s timeout race against CIFAR-10/PyTorch model initialization on both the swarm and CSE workflows.
  • tests/unit_test/app_common/ccwf/test_ccwf_job_config.py: new regression test file asserting SwarmServerConfig().min_clients == 0 and that an explicit value is preserved; prevents the None default from silently reappearing.

Sequence Diagram

sequenceDiagram
    participant Script as swarm_script_runner_cifar10.py
    participant SSC as SwarmServerConfig
    participant addSwarm as CCWFJob.add_swarm()
    participant SSCtrl as SwarmServerController
    participant BSCtl as BaseServerCtl.__init__

    Script->>SSC: "SwarmServerConfig(num_rounds=3, start_task_timeout=300)"
    Note over SSC: min_clients: int = 0 (fixed from None)
    Script->>addSwarm: add_swarm(server_config, client_config, cse_config)
    addSwarm->>SSCtrl: "SwarmServerController(min_clients=0, ...)"
    SSCtrl->>BSCtl: "BaseServerCtl.__init__(min_clients=0)"
    Note over BSCtl: if min_clients < 0: raise ValueError (no TypeError)

    Note over Script: start_task_timeout=300 swarm_start waits up to 300s
    Note over Script: CrossSiteEvalConfig(start_task_timeout=300) cse_start waits up to 300s

Reviews (2): Last reviewed commit: "Address Greptile review: also bump Cross..."

…eout to 300s

Greptile noted that after fixing swarm_start, the CSE workflow that runs
after swarm completes would still hit the same race on slower hosts:
CrossSiteEvalConfig.start_task_timeout defaults to Constant.START_TASK_TIMEOUT
(10s), and add_cross_site_eval() forwards it into CrossSiteEvalServerController.
On a host where the swarm start_task_timeout was insufficient, the cse_start
task could time out for the same reason — clients still warming dataset/model
state — and abort the CSE workflow with FATAL_SYSTEM_ERROR.

Pass start_task_timeout=300 to CrossSiteEvalConfig in the swarm CIFAR10
example, matching the swarm_start fix above. Defense-in-depth — even though
my local test passed without this change (caches were warm by the time CSE
ran), the framework default is unsafe for the example's stated workload.

Verified: pytest L1_job_api_example_pt_swarm_script_executor_cifar10 still
passes in ~6:48 with this additional fix.

Signed-off-by: shaxie <shaxie@nvidia.com>
Contributor Author

nvshaxie commented May 8, 2026

@greptile-apps Good catch — CrossSiteEvalConfig.start_task_timeout was indeed still at the 10s default. My local test happened to pass (caches were warm by the time CSE ran), but on a slower host or a cold cache the same race would surface on cse_start.

Pushed 80a48d3: pass start_task_timeout=300 to CrossSiteEvalConfig as well, with a comment cross-referencing the swarm_start fix above. Verified the test still passes (~6:48).

Contributor

greptile-apps Bot commented May 8, 2026

The fix in 80a48d3 looks correct and complete. A few observations:

  1. start_task_timeout=300 — addresses exactly the cse_start race I flagged. The comment accurately describes the issue.

  2. eval_task_timeout=300 — this is a nice bonus. CrossSiteEvalConfig.eval_task_timeout defaults to 30 (not the 10s START_TASK_TIMEOUT), so this is a 30→300 bump. On a slow host with 2 clients evaluating each other's models, 30s per eval task could plausibly be tight for CIFAR-10. Worth keeping.

One minor nit: the inline comment on eval_task_timeout says "same race as swarm_start above", but eval_task_timeout governs per-evaluation task completion, not the handshake/startup race. The risk there is more "eval takes longer than 30s" than "client hasn't initialized yet." The comment is close enough for a reader to understand the intent, but if you want precision:

# eval_task_timeout default of 30s can be exceeded on a slow host running
# full CIFAR-10 inference; 300s matches the start_task_timeout budget above.
eval_task_timeout=300,

Not a blocker — the code itself is correct. LGTM on the substance.

@YuanTingHsieh
Collaborator

@nvshaxie the changes in this PR LGTM.
However, the team prefers to create fix PRs against the 2.8 branch first and then cherry-pick them back to main. Can you create the same PR against 2.8 first? Thanks!
