From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
Abstract
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state against the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial- and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces mean absolute error by 50% relative to specialized reasoning baselines and delivers significant accuracy gains over 72B-scale general MLLMs. PRIMO R1 also exhibits strong zero-shot generalization to difficult failure-detection tasks, establishing state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy and surpassing closed-source models such as OpenAI o1 by 6.0%.
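The two mechanisms named in the abstract can be sketched in code. The following is a minimal illustration, not the paper's implementation: `build_structured_input` assumes the structured temporal input simply brackets uniformly sampled intermediate frames between the initial- and current-state images, and `outcome_reward` assumes a simple outcome-only reward that decays linearly with the absolute error of the predicted progress value; the function names and the exact sampling/reward choices are hypothetical.

```python
def build_structured_input(frames, num_samples=8):
    """Anchor sampled intermediate frames between the initial and the
    current (latest) frame, mirroring the structured temporal input
    described in the abstract. `frames` is an ordered list of frames."""
    if len(frames) < 2:
        raise ValueError("need at least an initial and a current frame")
    initial, current = frames[0], frames[-1]
    middle = frames[1:-1]
    if middle and num_samples > 0:
        # Uniform stride over the intermediate frames, capped at num_samples.
        step = max(1, len(middle) // num_samples)
        sampled = middle[::step][:num_samples]
    else:
        sampled = []
    return [initial] + sampled + [current]


def outcome_reward(predicted_progress, true_progress):
    """Outcome-based reward for a progress estimate in [0, 1]: 1.0 for a
    perfect prediction, decreasing linearly with absolute error. Only the
    final answer is scored, so the chain-of-thought is shaped indirectly."""
    return max(0.0, 1.0 - abs(predicted_progress - true_progress))
```

For example, `build_structured_input(list(range(10)), num_samples=4)` keeps frame `0` first and frame `9` last, with four evenly strided frames in between, and `outcome_reward(0.7, 0.5)` yields `0.8`.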
Community
This paper proposes PRIMO R1, a model that leverages Reinforcement Learning to elicit the zero-shot reasoning capabilities of Video MLLMs, enabling them to estimate task progress and identify robot execution errors without the need for external reference videos.
PRIMO R1 reduces mean absolute error in progress estimation by 50% and achieves 67.0% accuracy on RoboFail, surpassing closed-source models including OpenAI o1, with model weights already open-sourced on Hugging Face.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models (2026)
- AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models (2026)
- DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models (2026)
- From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification (2026)
- TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models (2026)
- Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models (2026)
- World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy (2026)
Models citing this paper 2
Datasets citing this paper 0
Spaces citing this paper 0