Set start_task_timeout=300 in cyclic CIFAR10 example to avoid 10s default #4554
nvshaxie wants to merge 1 commit into NVIDIA:main
The default start_task_timeout in the CCWF cyclic example (10s, set in
nvflare/app_common/ccwf/common.py) is shorter than the time the starting
client needs to initialize the CIFAR-10 dataset and the PyTorch model
after responding to cyclic_config. As a result the server-side
CyclicServerController frequently times out the cyclic_start task and
aborts the run with a FATAL_SYSTEM_ERROR ("failed to start workflow
controller on client site-1"), even though the client is making forward
progress.
Set start_task_timeout=300 in the example, matching the pattern already
used in examples/advanced/swarm_learning/swarm_pt/job.py
(start_task_timeout=1200), so the example can complete its first round
on machines where dataset/model init takes longer than 10s.
Apply the same change to the cyclic notebook tutorial under
examples/tutorials/.../07.2.2_cyclic/cyclic_weight_transfer_example.ipynb.
Signed-off-by: shaxie <shaxie@nvidia.com>
Greptile Summary
This PR fixes a long-standing
Confidence Score: 5/5
Safe to merge: the change is a targeted two-line configuration fix that only affects example scripts and a tutorial notebook, with no changes to library code or runtime behaviour. Both changed sites use CyclicServerConfig, which explicitly accepts start_task_timeout and forwards it to CyclicServerController. The value 300s matches the pattern used by analogous examples in the repo, the inline comments accurately describe the root cause, and the notebook cell is kept in sync with the standalone script. No files require special attention.
Sequence Diagram
sequenceDiagram
participant Server as CyclicServerController
participant Client as site-1 (starting client)
Server->>Client: cyclic_config task
Client-->>Server: ack (dataset + model init begins)
Note over Client: CIFAR-10 download<br/>PyTorch model init
Server->>Client: cyclic_start task
Note over Server: waits start_task_timeout<br/>(was 10s now 300s)
alt Before fix (10s default)
Note over Server: timeout fires while client still initializing
Server->>Server: FATAL_SYSTEM_ERROR
else After fix (300s)
Client-->>Server: cyclic_start ack (init complete)
Server->>Client: round 1 learn task
Client-->>Server: round 1 result
Note over Server,Client: rounds 2 to N
end
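The race shown in the diagram can be modeled with a small, self-contained sketch. This is a toy model only, not NVFlare code: `ServerConfigSketch` and `start_outcome` are hypothetical names introduced for illustration, and the 45-second client init time is an arbitrary number, not a measurement. The only value taken from the PR is the 10s default (START_TASK_TIMEOUT = 10) and the proposed 300s override.

```python
from dataclasses import dataclass

# Framework default cited in the PR (START_TASK_TIMEOUT = 10 in
# nvflare/app_common/ccwf/common.py), mirrored here as a plain constant.
START_TASK_TIMEOUT = 10


@dataclass
class ServerConfigSketch:
    """Hypothetical stand-in for a server config; not the NVFlare class."""

    num_rounds: int
    # Omitting this field inherits the 10s default, as the example did.
    start_task_timeout: float = START_TASK_TIMEOUT


def start_outcome(config: ServerConfigSketch, client_init_seconds: float) -> str:
    """Toy rule: the start task succeeds only if the client finishes
    initializing within the server's start_task_timeout."""
    if client_init_seconds <= config.start_task_timeout:
        return "started"
    return "FATAL_SYSTEM_ERROR"


# Illustrative client init time (assumption, not measured).
default_cfg = ServerConfigSketch(num_rounds=3)
patched_cfg = ServerConfigSketch(num_rounds=3, start_task_timeout=300)

print(start_outcome(default_cfg, client_init_seconds=45))  # FATAL_SYSTEM_ERROR
print(start_outcome(patched_cfg, client_init_seconds=45))  # started
```

The sketch only captures the comparison between init time and timeout; the real controller's retry and abort behaviour is not modeled.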
Reviews (1): Last reviewed commit: "Cyclic CIFAR10 example: set start_task_t..."
@nvshaxie the changes in this PR LGTM.
Fixes # .
Description
The CCWF cyclic CIFAR10 example aborts with FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1") because CyclicServerConfig.start_task_timeout is left at the framework default of 10s (nvflare/app_common/ccwf/common.py:81, START_TASK_TIMEOUT = 10), which is shorter than the time the starting client needs to download CIFAR-10 and initialize the PyTorch model after responding to cyclic_config. The same issue is reproducible on both main and 2.7.2, so this is a pre-existing example/default mismatch, not a regression.
Summary
The example does not pass start_task_timeout to CyclicServerConfig, so it inherits the framework default of 10s in nvflare/app_common/ccwf/common.py.
The starting client needs more than 10s to download CIFAR-10 and initialize the PyTorch model after responding to cyclic_config. The server-side CyclicServerController then times out cyclic_start and aborts the run with FATAL_SYSTEM_ERROR, even though the client is making forward progress.
Set start_task_timeout=300 in the example, matching the pattern already used in examples/advanced/swarm_learning/swarm_pt/job.py (start_task_timeout=1200, with the comment "Timeouts sized for large-model loading and LoRA training").
Changes
examples/advanced/job_api/pt/cyclic_cc_script_runner_cifar10.py: pass start_task_timeout=300 to CyclicServerConfig.
examples/tutorials/.../07.2.2_cyclic/cyclic_weight_transfer_example.ipynb: same change in the equivalent notebook cell.
Repro
Before fix (on a 256-core / A100X host, both main and 2.7.2):
After fix: a 3-round CCWF run completes in ~9.5 minutes with no FATAL_SYSTEM_ERROR lines in the resulting pt_cyclic/server/log.txt.
Types of changes
./runtest.sh deferred to CI).
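The Repro check above (no FATAL_SYSTEM_ERROR lines in the server log) could be automated with a small helper along these lines. This is a minimal sketch: `fatal_error_lines` is a hypothetical name, and the helper does a plain substring match without assuming anything about the log's line format. The caller supplies the full path to the log, since the workspace directory that contains pt_cyclic/server/log.txt is not stated in the PR.

```python
from pathlib import Path


def fatal_error_lines(log_path):
    """Return the lines of a log file that mention FATAL_SYSTEM_ERROR.

    An empty list means the run produced no fatal-error lines.
    """
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if "FATAL_SYSTEM_ERROR" in line]
```

For example, calling `fatal_error_lines(workspace / "pt_cyclic/server/log.txt")` with `workspace` set to the run's output directory (an assumption about the local layout) should return an empty list after the fix.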