APPO trains tool-using reasoning agents by branching at procedural decision points.
APPO extends the ARPO agentic RL stack with procedure-aware branching: it scores candidate branch tokens inside a rollout, expands the most informative decision points, and maps branch outcomes back to credit on the original trajectory. Cold-start SFT, datasets, tool cache, and evaluation all follow ARPO.
ARPO branches when tool-call rounds become high-entropy. APPO locates procedure points via a Branching Score:
BS_t = Z(Entropy_t) * Z(FutureValue_t)
FutureValue_t = exp(Σ_{k >= t} γ^(k-t) · (log π_current(a_k|s_k) - log π_rollout(a_k|s_k)))
Branch continuations reuse the same tool loop as initial rollouts, provide reward/advantage signals, and are excluded from actor loss. Only initial rollout tokens are optimized.
Reproducing APPO follows the same three-stage pipeline as ARPO: cold-start SFT (optional) → APPO RL training → evaluation.
This stage helps reproduce the full pipeline. You can skip it if you already have a warm-started checkpoint.
cd LLaMA-Factory
conda create -n sft python=3.10
conda activate sft
pip install -r requirements.txt- Prepare the SFT dataset following ARPO and register it in
LLaMA-Factory/data/dataset_info.json. - Update
LLaMA-Factory/arpo_train_sft/yaml/qwen.yamlwith yourmodel_name_or_path,dataset, andoutput_dir. - Launch SFT:
bash APPO/scripts/sft_train.shAPPO RL builds on the same VERL + tool-use stack as ARPO. Datasets and search cache live under ARPO/.
conda create -n appo python=3.10
conda activate appo
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
cd APPO
pip install -r verl_arpo_entropy/requirements.txt
pip install -e verl_arpo_entropyData — use the same RL datasets as ARPO (see the ARPO README Data Preparation section). In this repo they are under ARPO/rl_datasets/.
Tool API
Replace Bright Data api_key and zone in:
APPO/scripts/config/ppo_trainer.yamlAPPO/scripts/config/ppo_trainer_dr.yamlAPPO/verl_arpo_entropy/verl/workers/rollout/tools/config_example.yaml
Update ACTOR_MODEL_PATH and SAVE_PATH in the launch script, then run:
cd APPO/scripts
# 7B reasoning (Qwen)
bash APPO_7B_Reasoning_1node.sh
# 8B deep search
bash APPO_8B_Deepsearch_1node.sh
# 14B deep search
bash APPO_14B_Deepsearch_1node.shLegacy entrypoints with cluster-specific paths are also available:
bash APPO_1node.sh # Qwen-style
bash APPO_1node_llama.sh # Llama-styleConvert VERL checkpoints to Hugging Face format:
bash APPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.shKey APPO settings
algorithm.adv_estimator=appo
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.mode=sync_with_tool
actor_rollout_ref.rollout.n=16
actor_rollout_ref.rollout.initial_rollouts=8
actor_rollout_ref.rollout.appo_dynamic_branching=True
actor_rollout_ref.rollout.reward_scale_discount=0.9
actor_rollout_ref.actor.policy_loss.loss_mode=future_klEvaluation reuses the ARPO evaluation/ pipeline unchanged.
cd evaluation/vllm_scripts
conda create -n vllm_env python=3.10
conda activate vllm_env
pip install -r requirements.txtEdit model paths in vllm_launch_reasoning_model_cuda4-7.sh and a summarization launch script, then start services:
bash vllm_launch_reasoning_model_cuda4-7.sh
bash vllm_launch_summarize_model_cuda0-3_qwen3_8b.shconda create -n evaluation python=3.10
conda activate evaluation
cd evaluation
pip install -r requirements.txtEdit evaluation/infer_local_sds.sh with your model path, output path, and Bing API credentials, then:
bash evaluation/infer_local_sds.shbash evaluation/deploy_qwen2.5_72B_instruct.sh
bash evaluation/evaluate_passk.shAPPO citation: coming soon.
If you use the ARPO codebase or datasets, please cite:
@article{dong2025arpo,
title = {Agentic Reinforced Policy Optimization},
author = {Guanting Dong and Hangyu Mao and Kai Ma and others},
journal = {CoRR},
volume = {abs/2507.19849},
year = {2025}
}