Fix swarm CIFAR10 example: SwarmServerConfig.min_clients default + start_task_timeout #4555
nvshaxie wants to merge 2 commits into NVIDIA:main from …rt_task_timeout

Conversation
The swarm CIFAR10 example fails out of the box on both NVFlare main and
2.7.2 with two distinct issues, both of which need to be fixed for the
example to complete a single round.
(1) `SwarmServerConfig.min_clients` defaulted to None.
`add_swarm()` forwards every server_config field — including this None —
to `SwarmServerController(min_clients=...)`, overriding the controller's
own typed default of `int = 0`. The value then reaches
`BaseServerCtl.__init__` where `if min_clients < 0:` raises:
`TypeError: '<' not supported between instances of 'NoneType' and 'int'`
The example crashes at construction time, before any simulator is
launched.
The other Optional-list fields (`aggr_clients`, `train_clients`,
`result_clients`) tolerate None because `SwarmServerController` and
`BaseServerCtl` explicitly normalize them via `if not x: x = []`.
`min_clients` has no such normalization — its type contract is `int`,
and the documented "0 means all participating clients are required"
semantics in `BaseServerCtl` already covers the "not specified" case.
Fix: change `SwarmServerConfig.min_clients` default from `None` to
`int = 0`, matching the type and default of both
`SwarmServerController.__init__` and `BaseServerCtl.__init__`.
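A minimal, self-contained sketch of the failure mode, using toy stand-ins for the config dataclass and `BaseServerCtl`'s validation (the real classes live under `nvflare/app_common/ccwf/`; the names below are illustrative only):

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Toy stand-in for BaseServerCtl.__init__'s validation (not the real class):
def validate_min_clients(min_clients: int = 0) -> int:
    if min_clients < 0:  # with min_clients=None this comparison raises TypeError
        raise ValueError("min_clients must be >= 0")
    return min_clients

# Pre-fix config: default of None is forwarded verbatim by add_swarm(),
# bypassing the controller's own typed default of int = 0.
@dataclass
class BrokenConfig:
    min_clients: Optional[int] = None

# Post-fix config: default matches the controller's type contract.
@dataclass
class FixedConfig:
    min_clients: int = 0

try:
    validate_min_clients(**asdict(BrokenConfig()))
except TypeError as e:
    print("pre-fix crash:", e)  # '<' not supported between 'NoneType' and 'int'

print("post-fix:", validate_min_clients(**asdict(FixedConfig())))  # 0
```

The key point is that keyword forwarding makes the *config's* default, not the controller's, the one that matters.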
(2) `swarm_script_runner_cifar10.py` did not pass start_task_timeout, so
the example used the framework default of 10 s
(`nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10`). On any
host where the starting client takes longer than 10 s to download CIFAR-10
and initialize the PyTorch model after responding to swarm_config, the
server-side `SwarmServerController` times out the swarm_start task and
aborts the run with FATAL_SYSTEM_ERROR ("failed to start workflow
controller on client site-1") even though the client is still progressing.
Fix: pass `start_task_timeout=300` in the example, matching the pattern
already used in `examples/advanced/swarm_learning/swarm_pt/job.py`.
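The race reduces to a simple budget comparison; a toy sketch (hypothetical helper name and timings, not the real `SwarmServerController` logic):

```python
# Toy model: the server waits at most start_task_timeout seconds for the
# starting client to confirm the swarm_start task; otherwise it aborts the
# run with FATAL_SYSTEM_ERROR.
def swarm_start_succeeds(client_init_seconds: float, start_task_timeout: float) -> bool:
    return client_init_seconds <= start_task_timeout

# Hypothetical client that needs ~45s to download CIFAR-10 and build the model:
print(swarm_start_succeeds(45, 10))   # False -> framework default aborts the run
print(swarm_start_succeeds(45, 300))  # True  -> example fix lets the run proceed
```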
Verified locally: with both fixes applied, the example completes a 3-round
swarm CCWF run on a 256-core / A100X node in ~6:45 with no
FATAL_SYSTEM_ERROR lines in the resulting `pt_swarm/server/log.txt`.
Add a regression unit test asserting `SwarmServerConfig().min_clients == 0`
so the default cannot silently regress to None.
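Sketched against a stand-in dataclass (the real test imports `SwarmServerConfig` from `nvflare.app_common.ccwf.ccwf_job`), the regression test can be as small as:

```python
from dataclasses import dataclass

# Stand-in mirroring only the fixed field (real class: SwarmServerConfig in
# nvflare/app_common/ccwf/ccwf_job.py; other fields omitted for brevity):
@dataclass
class SwarmServerConfig:
    min_clients: int = 0  # fixed default: 0, not None

def test_min_clients_defaults_to_zero():
    cfg = SwarmServerConfig()
    assert cfg.min_clients == 0
    assert isinstance(cfg.min_clients, int)

test_min_clients_defaults_to_zero()
```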
Signed-off-by: shaxie <shaxie@nvidia.com>
Greptile Summary

This PR fixes two distinct bugs that together prevented the swarm CIFAR-10 example from running end-to-end: a `None` default on `SwarmServerConfig.min_clients` that crashes controller construction before the simulator launches, and a 10s `start_task_timeout` that aborts the run on hosts where the starting client is still downloading CIFAR-10 and initializing the model.

Confidence Score: 5/5. All three changes are targeted, mechanical fixes with no side-effects on existing paths; the regression test locks in the corrected default. No files require special attention.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Script as swarm_script_runner_cifar10.py
    participant SSC as SwarmServerConfig
    participant addSwarm as CCWFJob.add_swarm()
    participant SSCtrl as SwarmServerController
    participant BSCtl as BaseServerCtl.__init__
    Script->>SSC: "SwarmServerConfig(num_rounds=3, start_task_timeout=300)"
    Note over SSC: min_clients: int = 0 (fixed from None)
    Script->>addSwarm: add_swarm(server_config, client_config, cse_config)
    addSwarm->>SSCtrl: "SwarmServerController(min_clients=0, ...)"
    SSCtrl->>BSCtl: "BaseServerCtl.__init__(min_clients=0)"
    Note over BSCtl: if min_clients < 0: raise ValueError (no TypeError)
    Note over Script: start_task_timeout=300 swarm_start waits up to 300s
    Note over Script: CrossSiteEvalConfig(start_task_timeout=300) cse_start waits up to 300s
```
Reviews (2): Last reviewed commit: "Address Greptile review: also bump Cross…"
…eout to 300s

Greptile noted that after fixing swarm_start, the CSE workflow that runs after swarm completes would still hit the same race on slower hosts: `CrossSiteEvalConfig.start_task_timeout` defaults to `Constant.START_TASK_TIMEOUT` (10s), and `add_cross_site_eval()` forwards it into `CrossSiteEvalServerController`. On a host where the swarm `start_task_timeout` was insufficient, the `cse_start` task could time out for the same reason — clients still warming dataset/model state — and abort the CSE workflow with FATAL_SYSTEM_ERROR.

Fix: pass `start_task_timeout=300` to `CrossSiteEvalConfig` in the swarm CIFAR10 example, matching the swarm_start fix above. Defense-in-depth — even though my local test passed without this change (caches were warm by the time CSE ran), the framework default is unsafe for the example's stated workload.

Verified: pytest L1_job_api_example_pt_swarm_script_executor_cifar10 still passes in ~6:48 with this additional fix.

Signed-off-by: shaxie <shaxie@nvidia.com>
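The CSE side follows the same fallback pattern; a toy sketch of the default resolution (stand-in constant; the real value is `Constant.START_TASK_TIMEOUT` in `nvflare/app_common/ccwf/common.py`):

```python
START_TASK_TIMEOUT = 10  # stand-in for Constant.START_TASK_TIMEOUT (10s default)

def resolve_start_task_timeout(configured=None) -> int:
    # mirrors how an unset config field falls back to the framework constant
    return configured if configured is not None else START_TASK_TIMEOUT

print(resolve_start_task_timeout())     # 10  -> unsafe default for CIFAR-10 warm-up
print(resolve_start_task_timeout(300))  # 300 -> example-side fix
```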
@greptile-apps Good catch — pushed 80a48d3: pass `start_task_timeout=300` to `CrossSiteEvalConfig` as well.
The fix in 80a48d3 looks correct and complete. A few observations:

One minor nit on the inline comment for `eval_task_timeout`:

```python
# eval_task_timeout default of 30s can be exceeded on a slow host running
# full CIFAR-10 inference; 300s matches the start_task_timeout budget above.
eval_task_timeout=300,
```

Not a blocker — the code itself is correct. LGTM on the substance.
@nvshaxie this PR's changes LGTM.
Fixes # .
Description
The swarm CIFAR10 example fails out of the box on both NVFlare `main` and `2.7.2` with two distinct issues, both of which need to be fixed for the example to complete a single round.

Summary
Bug 1 — `SwarmServerConfig.min_clients` defaulted to `None`

`add_swarm()` forwards every `server_config` field — including this `None` — to `SwarmServerController(min_clients=...)`, overriding the controller's typed default of `int = 0`. The value then reaches `BaseServerCtl.__init__` where `if min_clients < 0:` raises:

    TypeError: '<' not supported between instances of 'NoneType' and 'int'

The example crashes at construction time, before any simulator is launched.

The other `Optional[List[str]]` fields (`aggr_clients`, `train_clients`, `result_clients`) tolerate `None` because the controller normalizes them via `if not x: x = []`. `min_clients` has no such normalization — its type contract is `int`, and the documented "0 means all participating clients are required" semantics in `BaseServerCtl` already covers the "not specified" case.

Fix: change `SwarmServerConfig.min_clients` default from `None` to `int = 0`, matching the type and default of both `SwarmServerController.__init__` and `BaseServerCtl.__init__`.

Bug 2 — `start_task_timeout` left at 10s default

`swarm_script_runner_cifar10.py` did not pass `start_task_timeout`, so the example used the framework default of 10s (`nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10`). After Bug 1 is fixed, the simulator starts and configures both sites, but `swarm_start` times out after 10s while site-1 is still initializing CIFAR-10 + PyTorch model, and the run aborts with `FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1")`.

Fix: pass `start_task_timeout=300` in the example, matching the pattern already used in `examples/advanced/swarm_learning/swarm_pt/job.py` (`start_task_timeout=1200`). Sibling PR #4554 applies the same example-side fix to the cyclic CIFAR10 example.

Changes
- `nvflare/app_common/ccwf/ccwf_job.py`: `SwarmServerConfig.min_clients` default `None` → `int = 0`.
- `examples/advanced/job_api/pt/swarm_script_runner_cifar10.py`: pass `start_task_timeout=300` to `SwarmServerConfig`.
- `tests/unit_test/app_common/ccwf/test_ccwf_job_config.py`: new regression test asserting `SwarmServerConfig().min_clients == 0` so the default cannot silently regress to `None`.

Repro
Before fix (on a 256-core / A100X host, both `main` and `2.7.2`):

    TypeError: '<' not supported between instances of 'NoneType' and 'int'

After fix (Bug 1 only): the simulator starts, both sites are configured, then:

    FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1")

After both fixes: a 3-round swarm CCWF run completes in ~6:45 with no `FATAL_SYSTEM_ERROR` lines in the resulting `pt_swarm/server/log.txt`. All 92 existing ccwf unit tests still pass.

Types of changes
`pytest tests/unit_test/app_common/ccwf/` (92 passed; black + isort + flake8 clean on changed files; full `./runtest.sh` deferred to CI).