RLVLA: how RL and Vision-Language-Action (VLA) models empower each other
- Dual-System
Each module plays its designated role: the VLA handles high-level reasoning while the RL policy handles low-level control, giving the system enough generalization for training-free execution in unseen environments (minimal sketch after this group)
- VLA-Planner + RL-Controller (vlp-humanoid)
- residual RL: a small RL policy learns additive corrections on top of the frozen VLA's actions (PLD, self-improving VLA)
- VLA as router: dynamically activates diverse RL skill policies (vlp-humanoid)
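A minimal sketch of the two wirings above (router and residual); all interfaces here (`vla`, `rl_skills`, `residual`) are hypothetical stand-ins, not any paper's actual API:

```python
import numpy as np

# Hypothetical interfaces (illustration only):
#   vla(image, instr)        -> (skill_name, base_action)  # slow planner, low Hz
#   rl_skills[name](proprio) -> motor command               # fast controller, high Hz
#   residual(x)              -> additive correction

def route_and_act(vla, rl_skills, image, instr, proprio):
    """VLA as router: the VLA only decides WHICH pretrained RL skill
    to activate; the skill policy itself produces the motor command."""
    skill_name, _ = vla(image, instr)
    return rl_skills[skill_name](proprio)

def residual_act(vla, residual, image, instr, proprio):
    """Residual RL: the frozen VLA proposes a base action and a small
    RL policy learns an additive correction on top of it, keeping
    exploration anchored to the VLA prior."""
    _, base_action = vla(image, instr)
    delta = residual(np.concatenate([proprio, base_action]))
    return base_action + delta
```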
- RL in training pipeline
Use RL to fix the compounding errors and out-of-distribution (OOD) failures that plague imitation-trained VLA models on long-horizon tasks (PPO-style sketch after this group)
- online/offline RL fine-tuning for autoregressive VLA (RL4VLA)
- policy-gradient fine-tuning for diffusion-based VLA (DPPO, which treats the denoising chain as an inner MDP)
- RL fine-tuning for flow-based VLA (πRL: Flow-Noise / Flow-SDE)
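A hedged sketch of the common core of these methods: once per-action log-probabilities exist, a PPO-clip objective can fine-tune the VLA. For autoregressive VLAs they come from the token decoder; for diffusion/flow VLAs (DPPO, πRL) the deterministic denoising chain is first made stochastic (Flow-Noise / Flow-SDE) so the same machinery applies. `vla(obs, action_tokens)` returning per-position logits is an assumed interface, not RL4VLA's actual API:

```python
import torch
import torch.nn.functional as F

def ppo_vla_loss(vla, obs, action_tokens, old_logp, advantages, clip=0.2):
    """One PPO-clip step for an autoregressive VLA (sketch only).
    The action-token decoder is the policy; one sampled token
    sequence counts as one environment action."""
    logits = vla(obs, action_tokens)                    # (B, T, vocab)
    logp_per_token = F.log_softmax(logits, dim=-1)
    logp = logp_per_token.gather(                       # log-prob of the
        -1, action_tokens.unsqueeze(-1)                 # tokens actually
    ).squeeze(-1).sum(dim=-1)                           # sampled -> (B,)

    ratio = torch.exp(logp - old_logp)                  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()        # PPO clip loss
```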
- RL in inference
Use RL machinery at inference time to mitigate hallucinations and block hazardous VLA actions during critical physical-contact phases (best-of-N sketch after this group)
- MCTS-like search: sample candidate actions and select those with higher Q-values (V-VLAPS)
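A best-of-N simplification of the idea above; V-VLAPS itself runs a fuller MCTS-style tree search, and `vla.sample` / `q_fn` are hypothetical interfaces:

```python
import torch

@torch.no_grad()
def q_filtered_action(vla, q_fn, obs, instruction, k=16):
    """Sample K candidate action chunks from the VLA, score each with
    a learned Q-function, and execute only the highest-value one, so
    hallucinated or unsafe actions are filtered before reaching the robot."""
    candidates = [vla.sample(obs, instruction) for _ in range(k)]
    q_values = torch.stack([q_fn(obs, a) for a in candidates])  # (K,)
    return candidates[int(q_values.argmax())]
```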
- VLA empowers RL (the reverse direction)
Here the VLA never controls the robot directly; it only guides RL training (reward-shaping sketch after this group)
- VLA as reward designer (Eureka)
- VLA as world model (RL in latent space)
- VLA as critic
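A sketch of the reward-designer row as direct reward modeling. `vla.score` is a hypothetical method (Eureka itself has the model write reward code that the simulator then executes); the shaping form follows Ng et al.'s potential-based shaping, which leaves optimal policies unchanged:

```python
import torch

@torch.no_grad()
def vla_progress(vla, obs, instruction):
    """VLA as reward model: ask the frozen VLA/VLM how close `obs` is
    to satisfying `instruction` (hypothetical `score` interface)."""
    return vla.score(obs, instruction)   # scalar progress estimate in [0, 1]

def shaping_reward(vla, obs, next_obs, instruction, gamma=0.99):
    """Potential-based shaping: r = gamma * Phi(s') - Phi(s), with the
    VLA's progress estimate as the potential Phi. Densifies the reward
    without changing which policies are optimal."""
    phi_next = vla_progress(vla, next_obs, instruction)
    phi = vla_progress(vla, obs, instruction)
    return gamma * phi_next - phi
```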