Abstract
SeeUPO is a critic-free reinforcement learning method that provides convergence guarantees in multi-turn agent interactions by modeling sequential decision-making as sequentially executed bandit problems and updating policies via backward induction.
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single- and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the globally optimal policy under undiscounted conditions, whereas combining PPO with GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously be critic-free and provide convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
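To make the recipe described in the abstract concrete, below is a minimal, illustrative Python sketch of GRAE-style group-relative advantages combined with a turn-by-turn policy update loop in reverse execution order. This is an assumption-laden paraphrase of the abstract, not the authors' implementation; names such as `rollouts_per_turn`, `total_reward`, and `update_policy` are hypothetical placeholders.

```python
import numpy as np

def group_relative_advantage(rewards):
    """Group Relative Advantage Estimation (GRAE), GRPO-style:
    each rollout's advantage is its reward standardized within the sampled group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def sequential_backward_update(policies, rollouts_per_turn, update_policy):
    """Illustrative backward-induction loop (a sketch, not the paper's exact algorithm):
    treat each turn as a bandit problem and update the per-turn policies in reverse
    execution order, so later turns are optimized before earlier ones.

    policies          -- list of per-turn policy objects (hypothetical)
    rollouts_per_turn -- list (indexed by turn) of grouped rollouts, each with a
                         .total_reward attribute (hypothetical)
    update_policy     -- callback applying e.g. a REINFORCE-style step (hypothetical)
    """
    num_turns = len(policies)
    for t in reversed(range(num_turns)):  # turn-by-turn, reverse execution order
        group_rewards = [r.total_reward for r in rollouts_per_turn[t]]
        advantages = group_relative_advantage(group_rewards)
        update_policy(policies[t], rollouts_per_turn[t], advantages)
    return policies
```

Under the abstract's framing, updating the last turn first and sweeping backward is what lets each turn be treated as a bandit problem against already-improved later-turn policies, which is where the claimed monotonic improvement comes from.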
Community
Proposes SeeUPO, a critic-free, convergence-guaranteed, sequence-level RL method for multi-turn agentic interactions, modeled as sequential bandit problems solved with backward induction, which outperforms existing backbone RL algorithms in both performance and training stability.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs (2025)
- MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning (2025)
- AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search (2026)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization (2026)
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates (2026)
- Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning (2026)
- Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR (2026)