arxiv:2601.21363

Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Published on Jan 29 · Submitted by Weidong Huang on Feb 10

Abstract

AI-generated summary: Off-policy Soft Actor-Critic with large-batch updates enables efficient humanoid locomotion policy pretraining, while model-based methods facilitate safe adaptation through deterministic data collection and stochastic exploration within physics-informed world models.

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, the gap between large-scale pretraining and efficient finetuning on humanoids still exists. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is instead confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
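
As a concrete illustration of the pretraining recipe, the sketch below shows off-policy SAC driven with large-batch updates and a high update-to-data (UTD) ratio. This is not the paper's code: the observation and action sizes, network widths, hyperparameters, the fixed entropy temperature, and the random tensors standing in for the parallel simulator are all illustrative assumptions.

```python
# Hypothetical sketch of large-batch, high-UTD SAC pretraining; all values are placeholders.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM = 48, 12              # stand-ins for humanoid observations / joint actions
NUM_ENVS = 4096                        # large-scale parallel simulation
BATCH_SIZE = 4096                      # "large-batch" gradient updates
UTD = 4                                # gradient updates per batch of environment steps
GAMMA, TAU, ALPHA = 0.99, 0.005, 0.2   # fixed entropy temperature for brevity

def mlp(i, o):
    return nn.Sequential(nn.Linear(i, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, o))

class TanhGaussianActor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = mlp(OBS_DIM, 2 * ACT_DIM)

    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, -1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        u = dist.rsample()
        act = torch.tanh(u)
        # log-prob with the tanh change-of-variables correction
        logp = dist.log_prob(u).sum(-1) - torch.log(1 - act.pow(2) + 1e-6).sum(-1)
        return act, logp

actor = TanhGaussianActor()
q1, q2 = mlp(OBS_DIM + ACT_DIM, 1), mlp(OBS_DIM + ACT_DIM, 1)
q1_t, q2_t = mlp(OBS_DIM + ACT_DIM, 1), mlp(OBS_DIM + ACT_DIM, 1)
q1_t.load_state_dict(q1.state_dict())
q2_t.load_state_dict(q2.state_dict())
pi_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def sac_update(obs, act, rew, next_obs, done):
    # Critic update against the entropy-regularized Bellman target.
    with torch.no_grad():
        next_act, next_logp = actor(next_obs)
        nsa = torch.cat([next_obs, next_act], -1)
        target_q = torch.min(q1_t(nsa), q2_t(nsa)).squeeze(-1)
        target = rew + GAMMA * (1 - done) * (target_q - ALPHA * next_logp)
    sa = torch.cat([obs, act], -1)
    q_loss = F.mse_loss(q1(sa).squeeze(-1), target) + F.mse_loss(q2(sa).squeeze(-1), target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor update: maximize min-Q plus policy entropy.
    new_act, logp = actor(obs)
    sa_pi = torch.cat([obs, new_act], -1)
    pi_loss = (ALPHA * logp - torch.min(q1(sa_pi), q2(sa_pi)).squeeze(-1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Polyak-averaged target critics.
    for net, tgt in ((q1, q1_t), (q2, q2_t)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - TAU).add_(TAU * p.data)

# Training loop skeleton: each iteration takes one synchronized step in all parallel
# simulators (random tensors stand in for the physics simulation here), stores the
# transitions, then performs UTD large-batch off-policy updates from the replay data.
replay = []
for it in range(3):
    obs = torch.randn(NUM_ENVS, OBS_DIM)
    with torch.no_grad():
        act, _ = actor(obs)
    next_obs = torch.randn(NUM_ENVS, OBS_DIM)
    rew, done = torch.randn(NUM_ENVS), torch.zeros(NUM_ENVS)
    replay.append((obs, act, rew, next_obs, done))
    for _ in range(UTD):
        o, a, r, no, d = replay[random.randrange(len(replay))]
        idx = torch.randint(NUM_ENVS, (BATCH_SIZE,))
        sac_update(o[idx], a[idx], r[idx], no[idx], d[idx])
```

The structural point is that, because SAC is off-policy, each batch of simulator transitions can be reused for several gradient steps (the UTD ratio), while thousands of parallel environments keep the wall-clock cost of data collection low.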

Community

Paper author · Paper submitter

Real-world Reinforcement Learning on Humanoid Robot

Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

🔗 Project: https://lift-humanoid.github.io/
💻 Code: https://github.com/bigai-ai/LIFT-humanoid

Paper author · Paper submitter

Humanoids can dance and backflip, but they are still "frozen" in time. 🤖

Current Sim2Real reinforcement learning (RL) relies on massive domain randomization: train in the lab, deploy, and pray. But the moment friction changes or hardware wears down, a star athlete becomes a paperweight.

Why is real-world RL so hard?
1️⃣ Safety: Trial & error = broken hardware.
2️⃣ Efficiency: Real-world data is slow and expensive.

At ICLR 2026, we present LIFT:

  1. Pretrain the policy in simulation with off-policy RL (SAC).
  2. Learn a physics-informed world model from pretraining data.
  3. Real-world finetuning: collect data with a deterministic policy while pushing stochastic exploration into the world model under constraints, reducing hardware risk and improving sample efficiency (a minimal sketch follows below).
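
To make step 3 concrete, here is a minimal, self-contained sketch of that split between deterministic data collection and exploration in imagination. Everything in it (DummyRobotEnv, Policy, ToyWorldModel, the step counts, and the imagined-return objective) is an illustrative placeholder rather than the released LIFT implementation; in the actual method the world model is physics-informed and the imagined exploration is constrained.

```python
# Hypothetical sketch of finetuning: deterministic rollouts on hardware,
# stochastic exploration only inside a learned world model.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HORIZON, REAL_STEPS = 48, 12, 5, 64   # illustrative sizes only

class DummyRobotEnv:
    """Stand-in for the real robot: only the reset/step interface matters here."""
    def reset(self):
        return torch.randn(OBS_DIM)
    def step(self, act):
        return torch.randn(OBS_DIM), torch.randn(()), False   # next_obs, reward, done

class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * ACT_DIM))
    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, -1)
        return mean, log_std.clamp(-5, 2).exp()
    def act_deterministic(self, obs):
        mean, _ = self(obs)
        return torch.tanh(mean)          # mean action: the only thing run on hardware
    def act_stochastic(self, obs):
        mean, std = self(obs)
        return torch.tanh(torch.distributions.Normal(mean, std).rsample())

class ToyWorldModel(nn.Module):
    """Toy learned dynamics + reward head; the paper's model is physics-informed."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, OBS_DIM + 1))
    def forward(self, obs, act):
        out = self.net(torch.cat([obs, act], -1))
        return out[..., :OBS_DIM], out[..., OBS_DIM]   # predicted next_obs, reward

policy, model = Policy(), ToyWorldModel()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# 1) Real-world data collection with the deterministic policy: no exploration
#    noise is injected on the hardware.
env, real_states = DummyRobotEnv(), []
obs = env.reset()
for _ in range(REAL_STEPS):
    with torch.no_grad():
        act = policy.act_deterministic(obs)
    next_obs, rew, done = env.step(act)
    real_states.append(obs)
    obs = env.reset() if done else next_obs
# (The world model would be fit / corrected on these real transitions here; omitted.)

# 2) Policy improvement in imagination: branch short stochastic rollouts inside the
#    world model from real starting states and update the policy on imagined returns.
obs = torch.stack(real_states)
imagined_return = torch.zeros(obs.shape[0])
for _ in range(HORIZON):
    act = policy.act_stochastic(obs)     # random exploration happens only in the model
    obs, rew = model(obs, act)
    imagined_return = imagined_return + rew
loss = -imagined_return.mean()           # toy surrogate for the paper's constrained model-based update
opt.zero_grad(); loss.backward(); opt.step()
```

The design point carried over from the description above is the separation of roles: the hardware only ever executes the deterministic policy, while all stochastic action sampling, and with it the exploration risk, stays inside the learned model.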
Paper author · Paper submitter

Real World RL on Humanoid Robot

Towards Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

