Add adaption example to Auto_FL #4560

Open
ZiyueXu77 wants to merge 8 commits into NVIDIA:main from ZiyueXu77:auto_fl

Conversation

@ZiyueXu77
Collaborator

Fixes # .

Description

Add a section and an example showing how the existing auto_fl concept can be adapted to a new task and execution environment.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings May 8, 2026 19:02
Comment thread research/auto-fl-research/vlm_local/client.py Fixed
Comment thread research/auto-fl-research/vlm_local/train_utils.py Fixed
Comment thread research/auto-fl-research/vlm_local/train_utils.py Fixed
Comment thread research/auto-fl-research/vlm_local/train_utils.py Fixed
@greptile-apps
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR adds a vlm_local task profile showing how to adapt the Auto-FL research loop to a 3-site medical VLM federated learning scenario (Qwen3-VL LoRA on VQA-RAD, SLAKE, PathVQA) on a local single-GPU machine, reusing all shared scripts, ledger helpers, and aggregation utilities from the parent directory.

  • New profile files: client.py implements FedProx, FedDyn, and SAM training paths; job.py correctly forwards all client arguments including SAM/FedDyn args; model.py exposes only LoRA adapter tensors for efficient NVFlare aggregation; data/med_vlm_data_utils.py maps sites to VQA datasets.
  • Parent harness changes: extract_score.py prepends token_f1 to the metric key list; run_iteration.sh is parameterized with JOB_SCRIPT and CLIENT_CONTRACT_PATH so profiles can reuse the runner without copying it.
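
Of the three training paths named above, FedProx is the simplest to illustrate: it adds a proximal penalty of (mu / 2) * ||w_local - w_global||^2 to the local loss so that client weights stay close to the global model. The sketch below uses plain Python lists in place of tensors; the function name, the toy values, and `mu=0.1` are assumptions for illustration and are not taken from client.py.

```python
def fedprox_penalty(local_weights, global_weights, mu):
    """Return (mu / 2) * ||w_local - w_global||^2 summed over all named parameters."""
    squared_distance = 0.0
    for name, local_values in local_weights.items():
        global_values = global_weights[name]
        squared_distance += sum(
            (l - g) ** 2 for l, g in zip(local_values, global_values)
        )
    return 0.5 * mu * squared_distance


# Toy adapter with a single two-element "tensor" (hypothetical values).
global_w = {"lora_A": [1.0, 2.0]}
local_w = {"lora_A": [1.5, 1.0]}

# Squared distance is 0.5**2 + 1.0**2 = 1.25, so the penalty is about 0.0625.
penalty = fedprox_penalty(local_w, global_w, mu=0.1)
print(f"proximal penalty: {penalty:.4f}")
```

In an actual training loop this term would be added to the task loss before the backward pass; setting `mu` to 0 recovers plain FedAvg-style local training.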

Confidence Score: 5/5

Safe to merge; changes are additive example files and a small backward-compatible parameterization of the shared runner script.

All changed files are either documentation, a new research example subdirectory, or a minor runner parameterization. Previously flagged bugs (torch_dtype parameter name, missing SAM/FedDyn forwarding) have been corrected. Remaining findings are defensive-coding suggestions with no impact on correctness for the normal execution path.
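
One of the defensive-coding findings concerns the average-loss computation when no micro-steps ran. A minimal sketch of an explicit guard, assuming a hypothetical `average_loss` helper (the PR instead relies on a check in main(), which is also a valid choice):

```python
def average_loss(total_loss, micro_steps):
    """Average the accumulated loss, failing loudly when no steps were run."""
    if micro_steps <= 0:
        # Returning 0.0 here would look like a perfect loss and hide the problem.
        raise ValueError("no micro-steps were executed; cannot average the loss")
    return total_loss / micro_steps


print(average_loss(6.0, 3))
```

Raising instead of returning 0 keeps an empty or misconfigured data loader from being recorded as a successful round.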

vlm_local/data/med_vlm_data_utils.py has a hardcoded Phase_3.1 subdirectory path worth tracking if the VLM_Benchmark repo structure changes.

Important Files Changed

| Filename | Overview |
| --- | --- |
| research/auto-fl-research/vlm_local/client.py | Full NVFlare FL client for 3-site medical VLM LoRA training; implements FedProx, FedDyn, and SAM regularization paths. Uses correct torch_dtype= parameter. SAM/FedDyn args are forwarded from job.py. One silent-zero-division pattern in avg_loss computation when micro_steps=0. |
| research/auto-fl-research/vlm_local/data/med_vlm_data_utils.py | Medical VQA dataset bridge; correctly guards answers[0] with empty-list check. Hardcodes Phase_3.1 subdirectory when extending sys.path, fragile against VLM_Benchmark repo reorganization. |
| research/auto-fl-research/vlm_local/job.py | NVFlare FedAvgRecipe job generator; correctly forwards all client args including sam_rho, sam_eps, and feddyn_alpha through build_train_args. |
| research/auto-fl-research/vlm_local/model.py | Adapter-only state model for NVFlare aggregation; correctly isolates LoRA tensors with sanitized parameter names and guards RNG state around model init. |
| research/auto-fl-research/vlm_local/train_utils.py | Training utilities; correctly raises on empty dataset. Fallback token_f1 catches ImportError only. Silent avg_loss=0 path if micro_steps==0 is guarded by main(). |
| research/auto-fl-research/scripts/run_iteration.sh | Parameterizes JOB_SCRIPT and CLIENT_CONTRACT_PATH so vlm_local and future task profiles can reuse the parent runner without copying it. |
| research/auto-fl-research/scripts/extract_score.py | Prepends token_f1 to METRIC_KEYS so the shared score extractor finds VLM evaluation results before falling back to CIFAR-10 accuracy keys. |
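
The effect of prepending token_f1 to METRIC_KEYS can be pictured as an ordered lookup: the first key found in a metrics record wins. The sketch below is an assumption about that behavior, not the contents of extract_score.py; the `accuracy` and `val_accuracy` names stand in for the CIFAR-10 keys, which are not listed in this PR.

```python
# token_f1 comes first so VLM runs are scored by it; the remaining
# keys are hypothetical placeholders for the existing accuracy keys.
METRIC_KEYS = ["token_f1", "accuracy", "val_accuracy"]


def extract_score(metrics):
    """Return (key, value) for the first known metric key, or None if absent."""
    for key in METRIC_KEYS:
        if key in metrics:
            return key, float(metrics[key])
    return None


print(extract_score({"accuracy": 0.91}))                    # CIFAR-style run
print(extract_score({"token_f1": 0.42, "accuracy": 0.10}))  # VLM run
```

Because the lookup is ordered, an existing run that reports only an accuracy key is scored as before, which is consistent with the change being described as backward compatible.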

Sequence Diagram

sequenceDiagram
    participant Runner as run_iteration.sh
    participant Job as vlm_local/job.py
    participant NVFlare as NVFlare Simulator
    participant Client as vlm_local/client.py
    participant VLM as Qwen3-VL + PEFT LoRA
    participant Server as NVFlare Server

    Runner->>Job: python $JOB_SCRIPT --cross_site_eval [args]
    Job->>Job: resolve_qwen3vl_adapter_shape()
    Job->>NVFlare: FedAvgRecipe.execute(SimEnv)

    loop num_rounds
        NVFlare->>Client: flare.receive() global adapter state
        Client->>VLM: load_state_dict via adapter_state_to_peft_state()
        Client->>VLM: local LoRA training (FedProx / FedDyn / SAM)
        Client->>Client: compute_model_diff(model, global_model)
        Client->>NVFlare: flare.send(ParamsType.DIFF + NUM_STEPS)
        NVFlare->>Server: aggregate DIFF tensors (weighted)
    end

    NVFlare->>Client: flare.is_evaluate() cross-site eval
    Client->>VLM: evaluate_vlm_generative() token_f1
    Client->>NVFlare: flare.send(metrics token_f1)
    NVFlare-->>Job: result_dir
    Job->>Runner: write AUTOFL_RESULT_DIR_FILE sidecar
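
The training loop in the diagram sends a DIFF plus a step count and aggregates the DIFFs with weights. A minimal sketch of that arithmetic follows, assuming the weight is each client's step count and using plain lists in place of tensors; the helper names and site values are hypothetical, and NVFlare's own aggregator may differ in detail.

```python
def state_diff(local_state, global_state):
    """Per-parameter difference: local minus global."""
    return {
        name: [l - g for l, g in zip(local_state[name], global_state[name])]
        for name in global_state
    }


def aggregate_diffs(client_results):
    """Step-weighted mean of diffs; client_results is a list of (diff, num_steps)."""
    total_steps = sum(steps for _, steps in client_results)
    if total_steps <= 0:
        raise ValueError("total step count must be positive")
    first_diff = client_results[0][0]
    return {
        name: [
            sum(diff[name][i] * steps for diff, steps in client_results) / total_steps
            for i in range(len(values))
        ]
        for name, values in first_diff.items()
    }


def apply_diff(global_state, aggregated):
    """Add the aggregated diff to the global state."""
    return {
        name: [g + d for g, d in zip(values, aggregated[name])]
        for name, values in global_state.items()
    }


global_state = {"lora_A": [0.0, 0.0]}
results = [
    (state_diff({"lora_A": [1.0, 2.0]}, global_state), 10),  # hypothetical site-1
    (state_diff({"lora_A": [3.0, 0.0]}, global_state), 30),  # hypothetical site-2
]
new_global = apply_diff(global_state, aggregate_diffs(results))
print(new_global)  # {'lora_A': [2.5, 0.5]}
```

The site with 30 steps pulls the update three times as hard as the site with 10 steps, which is the point of sending the step count alongside the DIFF.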

Reviews (7): Last reviewed commit: "Handle empty VLM validation answers"

Contributor

Copilot AI left a comment

Pull request overview

Adds a new “vlm_local” task profile under research/auto-fl-research/ to demonstrate how the existing Auto-FL NVFlare harness can be adapted to a local single-GPU, 3-site medical VLM LoRA-adapter workflow, while reusing the parent harness scripts/templates/aggregators.

Changes:

  • Introduces a VLM-local profile (vlm_local/) with its own client loop, job generator, adapter-only model state, dataset bridge, metric utilities, and mutation schema.
  • Updates the parent scripts/run_iteration.sh to allow selecting alternate profile entrypoints via JOB_SCRIPT and CLIENT_CONTRACT_PATH.
  • Expands the parent research/auto-fl-research/README.md with documentation on using and adapting the new VLM profile.
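
The adapter-only model state can be pictured as a filter over the full state dict that keeps LoRA tensors and rewrites their names. The sketch below is an assumption about the general idea, not the code in model.py: the `lora_` marker matches how PEFT names LoRA weights, while the `__` separator and the example keys are hypothetical.

```python
SEPARATOR = "__"  # hypothetical replacement for "." in parameter names


def adapter_only_state(full_state, marker="lora_"):
    """Keep only entries whose name contains the LoRA marker, with dots replaced."""
    return {
        name.replace(".", SEPARATOR): value
        for name, value in full_state.items()
        if marker in name
    }


def restore_names(adapter_state):
    """Inverse mapping; assumes original names never contain the separator."""
    return {
        name.replace(SEPARATOR, "."): value
        for name, value in adapter_state.items()
    }


# Illustrative keys; values are placeholders for tensors.
full_state = {
    "layers.0.q_proj.base_layer.weight": "frozen base weight",
    "layers.0.q_proj.lora_A.default.weight": "adapter A",
    "layers.0.q_proj.lora_B.default.weight": "adapter B",
}
adapter_state = adapter_only_state(full_state)
print(sorted(adapter_state))
```

Only the two adapter entries survive, so the server exchanges a small fraction of the model; `restore_names` maps the sanitized names back before they are loaded into the client model.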

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| research/auto-fl-research/vlm_local/train_utils.py | Adds VLM evaluation (generative token-F1) and DIFF helper utilities. |
| research/auto-fl-research/vlm_local/requirements.txt | Defines Python dependencies for running the VLM-local profile. |
| research/auto-fl-research/vlm_local/README.md | Documents how the VLM profile layers onto the parent Auto-FL harness. |
| research/auto-fl-research/vlm_local/program.md | Defines the VLM profile contract, scope, and fixed baseline budget. |
| research/auto-fl-research/vlm_local/mutation_schema.yaml | Constrains the mutation/edit surface for the VLM profile. |
| research/auto-fl-research/vlm_local/model.py | Adds an adapter-only (LoRA) server-side state model for aggregation. |
| research/auto-fl-research/vlm_local/job.py | Adds a Recipe-based job generator for the local 3-site medical VLM simulation. |
| research/auto-fl-research/vlm_local/data/med_vlm_data_utils.py | Adds deterministic site→dataset mapping and dataset/collator wiring to VLM_Benchmark. |
| research/auto-fl-research/vlm_local/data/__init__.py | Declares the VLM profile data package. |
| research/auto-fl-research/vlm_local/client.py | Implements the NVFlare client loop for adapter DIFF training/evaluation on VLM. |
| research/auto-fl-research/scripts/run_iteration.sh | Adds env-overridable job/client paths for profile-based runs. |
| research/auto-fl-research/README.md | Adds documentation for running and adapting the VLM-local profile. |
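
For readers unfamiliar with the token-F1 metric used for generative evaluation, a common formulation is the harmonic mean of token-level precision and recall between a generated answer and a reference. The sketch below assumes lowercase whitespace tokenization and a hypothetical function signature; the normalization used in train_utils.py may differ.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision 2/2, recall 2/3, so F1 is about 0.8.
print(round(token_f1("left lung", "the left lung"), 3))
```

Unlike exact-match accuracy, this gives partial credit for answers that overlap with the reference, which suits free-form VQA answers.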

Comment thread research/auto-fl-research/vlm_local/train_utils.py Outdated
Comment thread research/auto-fl-research/vlm_local/data/med_vlm_data_utils.py Outdated
Comment thread research/auto-fl-research/vlm_local/client.py
Comment thread research/auto-fl-research/vlm_local/client.py
Comment thread research/auto-fl-research/vlm_local/client.py
Comment thread research/auto-fl-research/vlm_local/train_utils.py Outdated
@ZiyueXu77 ZiyueXu77 requested a review from holgerroth May 8, 2026 19:47
Comment thread research/auto-fl-research/README.md Outdated
Comment thread research/auto-fl-research/vlm_local/data/med_vlm_data_utils.py Outdated
@ZiyueXu77 ZiyueXu77 requested a review from holgerroth May 8, 2026 21:43