Set start_task_timeout=300 in cyclic CIFAR10 example to avoid 10s default #4554

Open
nvshaxie wants to merge 1 commit into NVIDIA:main from nvshaxie:fix/cyclic-start-task-timeout
nvshaxie (Contributor) commented May 8, 2026

Fixes # .

Description

The CCWF cyclic CIFAR10 example aborts with FATAL_SYSTEM_ERROR ("failed to start workflow controller on client site-1") because CyclicServerConfig.start_task_timeout is left at the framework default of 10s (nvflare/app_common/ccwf/common.py:81 START_TASK_TIMEOUT = 10), which is shorter than the time the starting client needs to download CIFAR-10 and initialize the PyTorch model after responding to cyclic_config. The same issue is reproducible on both main and 2.7.2, so this is a pre-existing example/default mismatch, not a regression.

Summary

  • The cyclic CIFAR10 example does not pass start_task_timeout to CyclicServerConfig, so it inherits the framework default of 10s in nvflare/app_common/ccwf/common.py.
  • 10s is too short for the starting client to download CIFAR-10 and init the PyTorch model after responding to cyclic_config. The server-side CyclicServerController then times out cyclic_start and aborts the run with FATAL_SYSTEM_ERROR, even though the client is making forward progress.
  • Set start_task_timeout=300 in the example, matching the pattern already used in examples/advanced/swarm_learning/swarm_pt/job.py (start_task_timeout=1200, with the comment "Timeouts sized for large-model loading and LoRA training").
  • Apply the same change to the cyclic notebook tutorial.
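
The race described in the summary can be illustrated with a toy model. This is a minimal sketch in plain Python, not NVFlare code: `run_start_task`, `init_seconds`, and the scaled-down durations are all hypothetical stand-ins. It only shows why a wait shorter than the client's initialization time reports a timeout even though the client would eventually finish.

```python
import threading
import time


def run_start_task(init_seconds: float, start_task_timeout: float) -> str:
    """Toy stand-in for the cyclic_start handshake (hypothetical, not NVFlare)."""
    ack = threading.Event()

    def client() -> None:
        # Stands in for the CIFAR-10 download and PyTorch model init.
        time.sleep(init_seconds)
        # The client acknowledges the start task once init is done.
        ack.set()

    worker = threading.Thread(target=client, daemon=True)
    worker.start()

    # Event.wait returns False if the timeout elapses before set() is called,
    # which models the server giving up on the start task.
    status = "OK" if ack.wait(timeout=start_task_timeout) else "TIMEOUT"

    # The client still finishes its work; the server simply stopped waiting.
    worker.join()
    return status


if __name__ == "__main__":
    # Durations scaled down 100x: 0.1 s stands for the 10s default,
    # 3.0 s for the 300s used in this PR, 0.5 s for a slow client init.
    print(run_start_task(init_seconds=0.5, start_task_timeout=0.1))
    print(run_start_task(init_seconds=0.5, start_task_timeout=3.0))
```

With a simulated init of 0.5 s, the first call should report a timeout and the second should succeed, mirroring the before/after behaviour described above.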

Changes

  • examples/advanced/job_api/pt/cyclic_cc_script_runner_cifar10.py: pass start_task_timeout=300 to CyclicServerConfig.
  • examples/tutorials/.../07.2.2_cyclic/cyclic_weight_transfer_example.ipynb: same change in the equivalent notebook cell.

Repro

Before fix (on a 256-core / A100X host, both main and 2.7.2):

cd examples/advanced/job_api/pt/
python cyclic_cc_script_runner_cifar10.py
grep "FATAL_SYSTEM_ERROR\|cyclic_start" /tmp/nvflare/jobs/workdir/pt_cyclic/server/log.txt
# produces:
#   WFCommServer - INFO - task cyclic_start exit with status TaskCompletionStatus.TIMEOUT
#   ServerRunner - ERROR - Aborting current RUN due to FATAL_SYSTEM_ERROR received:
#                          failed to start workflow controller on client site-1
# Stable repro across multiple attempts on a clean system.

After fix: a 3-round CCWF run completes in ~9.5 minutes with no FATAL_SYSTEM_ERROR lines in the resulting pt_cyclic/server/log.txt.
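
For scripted verification, the grep check can be approximated in Python. This is a sketch under assumptions: the helper names `find_start_lines` and `has_fatal_error` are made up for illustration, and the log path is the one quoted in the repro steps. Note that `cyclic_start` lines may still appear in a healthy log, so only `FATAL_SYSTEM_ERROR` is treated as a failure signal here.

```python
import re
from pathlib import Path

# Same two patterns as the grep in the repro steps.
PATTERN = re.compile(r"FATAL_SYSTEM_ERROR|cyclic_start")

# Log location as quoted in the repro steps.
LOG_PATH = Path("/tmp/nvflare/jobs/workdir/pt_cyclic/server/log.txt")


def find_start_lines(log_text: str) -> list:
    """Return every log line that mentions the start task or a fatal error."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


def has_fatal_error(log_text: str) -> bool:
    """True if any log line reports FATAL_SYSTEM_ERROR."""
    return any("FATAL_SYSTEM_ERROR" in line for line in log_text.splitlines())


if __name__ == "__main__":
    if LOG_PATH.exists():
        text = LOG_PATH.read_text(errors="replace")
        for line in find_start_lines(text):
            print(line)
        print("fatal error found" if has_fatal_error(text) else "no fatal error")
    else:
        print(f"log not found: {LOG_PATH}")
```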

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally (black + isort + flake8 on changed files; full ./runtest.sh deferred to CI).
  • In-line docstrings updated.
  • Documentation updated (cyclic tutorial notebook).

The CCWF cyclic example default start_task_timeout (10s, set in
nvflare/app_common/ccwf/common.py) is shorter than the time the starting
client needs to initialize the CIFAR-10 dataset and the PyTorch model
after responding to cyclic_config. As a result the server-side
CyclicServerController frequently times out the cyclic_start task and
aborts the run with a FATAL_SYSTEM_ERROR ("failed to start workflow
controller on client site-1"), even though the client is making forward
progress.

Set start_task_timeout=300 in the example, matching the pattern already
used in examples/advanced/swarm_learning/swarm_pt/job.py
(start_task_timeout=1200), so the example can complete its first round
on machines where dataset/model init takes longer than 10s.

Apply the same change to the cyclic notebook tutorial under
examples/tutorials/.../07.2.2_cyclic/cyclic_weight_transfer_example.ipynb.

Signed-off-by: shaxie <shaxie@nvidia.com>

greptile-apps Bot (Contributor) commented May 8, 2026

Greptile Summary

This PR fixes a long-standing FATAL_SYSTEM_ERROR in the CCWF cyclic CIFAR-10 example by raising start_task_timeout from the 10 s framework default to 300 s in CyclicServerConfig, giving clients enough time to download and initialize the dataset and PyTorch model before the server aborts the cyclic_start task.

  • cyclic_cc_script_runner_cifar10.py: Expands the CyclicServerConfig call to pass start_task_timeout=300 with an explanatory comment, consistent with the pattern used in the swarm learning example (start_task_timeout=1200).
  • cyclic_weight_transfer_example.ipynb: Applies the identical change to the notebook tutorial cell so both entry points stay in sync.

Confidence Score: 5/5

Safe to merge — the change is a targeted two-line configuration fix that only affects example scripts and a tutorial notebook, with no changes to library code or runtime behaviour.

Both changed sites use CyclicServerConfig, which explicitly accepts start_task_timeout and forwards it to CyclicServerController. The value 300s matches the pattern used by analogous examples in the repo, the inline comments accurately describe the root cause, and the notebook cell is kept in sync with the standalone script.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| examples/advanced/job_api/pt/cyclic_cc_script_runner_cifar10.py | Adds start_task_timeout=300 to CyclicServerConfig, replacing the silent 10s default that caused FATAL_SYSTEM_ERROR when CIFAR-10 initialization exceeded that window |
| examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-7_algorithms_and_workflows/07.2_algorithms/07.2.2_cyclic/cyclic_weight_transfer_example.ipynb | Mirrors the same start_task_timeout=300 fix in the notebook tutorial cell, keeping the notebook and the standalone script consistent |

Sequence Diagram

sequenceDiagram
    participant Server as CyclicServerController
    participant Client as site-1 (starting client)

    Server->>Client: cyclic_config task
    Client-->>Server: ack (dataset + model init begins)
    Note over Client: CIFAR-10 download<br/>PyTorch model init

    Server->>Client: cyclic_start task
    Note over Server: waits start_task_timeout<br/>(was 10s now 300s)

    alt Before fix (10s default)
        Note over Server: timeout fires while client still initializing
        Server->>Server: FATAL_SYSTEM_ERROR
    else After fix (300s)
        Client-->>Server: cyclic_start ack (init complete)
        Server->>Client: round 1 learn task
        Client-->>Server: round 1 result
        Note over Server,Client: rounds 2 to N
    end

Reviews (1): Last reviewed commit: "Cyclic CIFAR10 example: set start_task_t..."

YuanTingHsieh (Collaborator) commented

@nvshaxie the changes in this PR LGTM.
However, the team prefers to create fix PRs against the 2.8 branch first and then cherry-pick them back to main. Can you create the same PR against 2.8 first? Thanks!
