Set start_task_timeout=300 in cyclic CIFAR10 example to avoid 10s default by nvshaxie · Pull Request #4554 · NVIDIA/NVFlare

nvshaxie · 2026-05-08T07:45:53Z

Fixes # .

Description

The CCWF cyclic CIFAR10 example aborts with FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1") because CyclicServerConfig.start_task_timeout is left at the framework default of 10s (nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10), which is shorter than the time the starting client needs to download CIFAR-10 and initialize the PyTorch model after responding to cyclic_config. The same issue is reproducible on both main and 2.7.2, so this is a pre-existing example/default mismatch, not a regression.

Summary

The cyclic CIFAR10 example does not pass start_task_timeout to CyclicServerConfig, so it inherits the framework default of 10s in nvflare/app_common/ccwf/common.py.
10s is too short for the starting client to download CIFAR-10 and init the PyTorch model after responding to cyclic_config. The server-side CyclicServerController then times out cyclic_start and aborts the run with FATAL_SYSTEM_ERROR, even though the client is making forward progress.
Set start_task_timeout=300 in the example, matching the pattern already used in examples/advanced/swarm_learning/swarm_pt/job.py (start_task_timeout=1200, with the comment "Timeouts sized for large-model loading and LoRA training").
Apply the same change to the cyclic notebook tutorial.
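
The difference between inheriting the default and passing the value explicitly can be illustrated with a stand-in dataclass. This is only a sketch: `ServerConfigSketch` is a hypothetical class, not NVFlare's `CyclicServerConfig`. Only the `start_task_timeout` parameter name, the 10 s default (`START_TASK_TIMEOUT = 10`), and the 3-round run come from this PR.

```python
from dataclasses import dataclass

# Mirrors the default quoted in the PR (nvflare/app_common/ccwf/common.py).
START_TASK_TIMEOUT = 10  # seconds


@dataclass
class ServerConfigSketch:
    """Hypothetical stand-in for a server config; not the NVFlare class."""

    num_rounds: int
    start_task_timeout: float = START_TASK_TIMEOUT


# Omitting the argument silently inherits the 10 s default.
before = ServerConfigSketch(num_rounds=3)

# Passing it explicitly, as the example does after this PR, raises the limit.
after = ServerConfigSketch(num_rounds=3, start_task_timeout=300)

print(before.start_task_timeout, after.start_task_timeout)
```

The real change has the same shape: one extra keyword argument at the call site, with no change to library code.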

Changes

examples/advanced/job_api/pt/cyclic_cc_script_runner_cifar10.py: pass start_task_timeout=300 to CyclicServerConfig.
examples/tutorials/.../07.2.2_cyclic/cyclic_weight_transfer_example.ipynb: same change in the equivalent notebook cell.

Repro

Before fix (on a 256-core / A100X host, both main and 2.7.2):

cd examples/advanced/job_api/pt/
python cyclic_cc_script_runner_cifar10.py
grep "FATAL_SYSTEM_ERROR\|cyclic_start" /tmp/nvflare/jobs/workdir/pt_cyclic/server/log.txt
# produces:
#   WFCommServer - INFO - task cyclic_start exit with status TaskCompletionStatus.TIMEOUT
#   ServerRunner - ERROR - Aborting current RUN due to FATAL_SYSTEM_ERROR received:
#                          failed to start workflow controller on client site-1
# Stable repro across multiple attempts on a clean system.

After fix: a 3-round CCWF run completes in ~9.5 minutes with no FATAL_SYSTEM_ERROR lines in the resulting pt_cyclic/server/log.txt.
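
The `grep` check above could also be written as a small Python helper, for example to assert on the log in a script. This is a minimal sketch: `find_failure_lines` is a hypothetical helper name, the first two sample lines are copied from the repro output above, and the third is invented filler that should not match.

```python
import re
from typing import Iterable, List

# Same two markers the grep command in the repro searches for.
PATTERN = re.compile(r"FATAL_SYSTEM_ERROR|cyclic_start")


def find_failure_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention either marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


sample_log = [
    "WFCommServer - INFO - task cyclic_start exit with status TaskCompletionStatus.TIMEOUT\n",
    "ServerRunner - ERROR - Aborting current RUN due to FATAL_SYSTEM_ERROR received:\n",
    "unrelated line (hypothetical filler, not real NVFlare output)\n",
]

matches = find_failure_lines(sample_log)
for line in matches:
    print(line)
```

To scan a real run, pass an open file handle for the server log path used in the repro instead of `sample_log`.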

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally (black + isort + flake8 on changed files; full ./runtest.sh deferred to CI).
In-line docstrings updated.
Documentation updated (cyclic tutorial notebook).

The CCWF cyclic example default start_task_timeout (10s, set in nvflare/app_common/ccwf/common.py) is shorter than the time the starting client needs to initialize the CIFAR-10 dataset and the PyTorch model after responding to cyclic_config. As a result the server-side CyclicServerController frequently times out the cyclic_start task and aborts the run with a FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1"), even though the client is making forward progress.

Set start_task_timeout=300 in the example, matching the pattern already used in examples/advanced/swarm_learning/swarm_pt/job.py (start_task_timeout=1200), so the example can complete its first round on machines where dataset/model init takes longer than 10s. Apply the same change to the cyclic notebook tutorial under examples/tutorials/.../07.2.2_cyclic/cyclic_weight_transfer_example.ipynb.

Signed-off-by: shaxie <shaxie@nvidia.com>

greptile-apps · 2026-05-08T07:47:36Z

Greptile Summary

This PR fixes a long-standing FATAL_SYSTEM_ERROR in the CCWF cyclic CIFAR-10 example by raising start_task_timeout from the 10 s framework default to 300 s in CyclicServerConfig, giving clients enough time to download and initialize the dataset and PyTorch model before the server aborts the cyclic_start task.

cyclic_cc_script_runner_cifar10.py: Expands the CyclicServerConfig call to pass start_task_timeout=300 with an explanatory comment, consistent with the pattern used in the swarm learning example (start_task_timeout=1200).
cyclic_weight_transfer_example.ipynb: Applies the identical change to the notebook tutorial cell so both entry points stay in sync.

Confidence Score: 5/5

Safe to merge — the change is a targeted two-line configuration fix that only affects example scripts and a tutorial notebook, with no changes to library code or runtime behaviour.

Both changed sites use CyclicServerConfig, which explicitly accepts start_task_timeout and forwards it to CyclicServerController. The value 300s matches the pattern used by analogous examples in the repo, the inline comments accurately describe the root cause, and the notebook cell is kept in sync with the standalone script.

No files require special attention.

Important Files Changed

Filename	Overview
examples/advanced/job_api/pt/cyclic_cc_script_runner_cifar10.py	Adds start_task_timeout=300 to CyclicServerConfig, replacing the silent 10s default that caused FATAL_SYSTEM_ERROR when CIFAR-10 initialization exceeded that window
examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-7_algorithms_and_workflows/07.2_algorithms/07.2.2_cyclic/cyclic_weight_transfer_example.ipynb	Mirrors the same start_task_timeout=300 fix in the notebook tutorial cell, keeping the notebook and the standalone script consistent

Sequence Diagram

sequenceDiagram
    participant Server as CyclicServerController
    participant Client as site-1 (starting client)

    Server->>Client: cyclic_config task
    Client-->>Server: ack (dataset + model init begins)
    Note over Client: CIFAR-10 download<br/>PyTorch model init

    Server->>Client: cyclic_start task
    Note over Server: waits start_task_timeout<br/>(was 10s now 300s)

    alt Before fix (10s default)
        Note over Server: timeout fires while client still initializing
        Server->>Server: FATAL_SYSTEM_ERROR
    else After fix (300s)
        Client-->>Server: cyclic_start ack (init complete)
        Server->>Client: round 1 learn task
        Client-->>Server: round 1 result
        Note over Server,Client: rounds 2 to N
    end
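
The race shown in the diagram can be imitated with the standard library alone. This is a sketch under stated assumptions, not NVFlare code: `run_start_task` is a hypothetical function, a `threading.Timer` stands in for the client finishing initialization, and all durations are scaled down by 1000x (0.05 s represents an assumed 50 s initialization).

```python
import threading


def run_start_task(init_seconds: float, start_task_timeout: float) -> str:
    """Return "OK" if the simulated client acks before the timeout, else "TIMEOUT"."""
    ack = threading.Event()

    # The timer stands in for the client finishing dataset and model init.
    client = threading.Timer(init_seconds, ack.set)
    client.daemon = True
    client.start()

    # The server side waits at most start_task_timeout for the ack.
    acked = ack.wait(timeout=start_task_timeout)
    client.cancel()
    return "OK" if acked else "TIMEOUT"


# Scaled-down stand-ins: 0.01 s for the 10 s default, 0.3 s for the 300 s setting.
with_default = run_start_task(init_seconds=0.05, start_task_timeout=0.01)
with_fix = run_start_task(init_seconds=0.05, start_task_timeout=0.3)

print(with_default, with_fix)
```

In both calls the simulated client does the same work; only the server's patience differs, which is why the PR describes the client as making forward progress when the run is aborted.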

Reviews (1): Last reviewed commit: "Cyclic CIFAR10 example: set start_task_t..."

YuanTingHsieh · 2026-05-08T19:50:59Z

@nvshaxie the changes in this PR LGTM.
However, the team prefers to create fix PRs against the 2.8 branch first and then cherry-pick them back to main. Can you create the same PR against 2.8 first? Thanks!

nvshaxie mentioned this pull request on May 8, 2026, in #4555 (open, 6 tasks): Fix swarm CIFAR10 example: SwarmServerConfig.min_clients default + start_task_timeout

YuanTingHsieh requested a review from holgerroth May 8, 2026 19:50

nvshaxie wants to merge 1 commit into NVIDIA:main from nvshaxie:fix/cyclic-start-task-timeout
