LLaVA-OneVision-1.5-RL

Unlocking Multimodal Reasoning via Lightweight Reinforcement Learning

Project Leaders: Changrui Chen, Jiankang Deng
LLaVA-OneVision Community Contributors

We present the Reinforcement Learning (RL) post-training stage of LLaVA-OneVision-1.5. By applying a lightweight RL algorithm (GRPO) on top of the supervised instruct model, we elicit the model's latent reasoning capabilities. Using a curated dataset of only 67K examples obtained through discrepancy-driven selection, the model learns to generate explicit chain-of-thought reasoning traces. This approach significantly boosts performance on complex STEM, Coding, and Reasoning tasks without compromising general visual understanding.

RL Training Data

Discrepancy-Driven Data Selection

We curate the training data by measuring the divergence between Pass@N and Pass@1 performance on diverse benchmarks. A significant gap indicates that the model possesses the latent capability to solve the task, yet its policy distribution fails to reliably assign high probability to the correct reasoning path. Under this lens, RL serves as an elicitation mechanism rather than knowledge injection. This selection paradigm ensures high training efficiency by targeting the model's effective learnable boundary.
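As a concrete illustration, the minimal sketch below keeps only items with a large Pass@N-Pass@1 gap; the rollout count (n = 8), the gap threshold, and the field names are illustrative assumptions on our part, not the released pipeline.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def select_by_discrepancy(items, n=8, min_gap=0.3):
    """Keep items with a large Pass@N - Pass@1 gap: the model solves them in
    some rollouts, but its policy rarely puts the correct path first."""
    kept = []
    for item in items:
        c = item["num_correct"]                      # correct completions among n rollouts (assumed field)
        gap = pass_at_k(n, c, n) - pass_at_k(n, c, 1)
        if gap >= min_gap:                           # latent capability present, policy unreliable
            kept.append(item)
    return kept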

Reward-Based Sampling

To further filter for high-quality training instances, we employ a reward-based sampling strategy: we generate multiple candidate responses for each sample and retain only those samples whose average reward falls within a specified range. This filtering discards both trivial and unsolvable cases, biasing the corpus toward medium-difficulty instances.
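A possible form of this filter is sketched below; the rollout interface (policy.generate), the reward band, and the rollout count are illustrative assumptions rather than the exact values used in training.

def reward_band_filter(dataset, policy, reward_fn, num_rollouts=8, low=0.125, high=0.875):
    """Keep prompts whose mean rollout reward lies inside [low, high]:
    a mean near 0 marks an unsolvable case, a mean near 1 a trivial one."""
    kept = []
    for sample in dataset:
        responses = [policy.generate(sample["prompt"]) for _ in range(num_rollouts)]  # hypothetical API
        rewards = [reward_fn(resp, sample["answer"]) for resp in responses]
        mean_reward = sum(rewards) / len(rewards)
        if low <= mean_reward <= high:               # retain medium-difficulty instances only
            kept.append(sample)
    return kept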

RL Data Distribution

Figure: Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: Answer-only training. (c) Stage 2: Chain-of-thought training.

Reward System

Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules.

Rule-Based Rewards

Domain-specific verification rules are summarized below; a minimal scoring sketch follows the table.

Category | Source | Reward Design Details
STEM | ViRL39K | Choice accuracy & math expression equivalence
Grounding | Ref-L4, VigoRL-SA | IoU between predicted and reference boxes; choice accuracy
Spatial | VigoRL-SAT | Choice accuracy
Counting | PixmoCount | Numeric token equivalence
Coding | WebCode2M, UniSVG | Token/tag overlap; SVG rendering similarity in [0, 1]
OCR | InfoVQA | Text similarity
Diagram | AI2D | Choice accuracy
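
The sketch below shows how a few of these rules can be scored with plain Python. The math-expression and SVG-rendering checks are omitted since they require heavier tooling, and every function name here is our own illustrative choice rather than part of the released code.

import re
from difflib import SequenceMatcher

def choice_reward(pred: str, ref: str) -> float:
    """Multiple-choice accuracy: compare the first option letter found in each string."""
    p, r = re.search(r"[A-E]", pred.upper()), re.search(r"[A-E]", ref.upper())
    return float(p is not None and r is not None and p.group() == r.group())

def count_reward(pred: str, ref: str) -> float:
    """Numeric token equivalence for counting answers."""
    nums = re.findall(r"-?\d+\.?\d*", pred)
    return float(bool(nums) and float(nums[-1]) == float(ref))

def iou_reward(pred_box, ref_box) -> float:
    """IoU between predicted and reference boxes, each given as (x1, y1, x2, y2)."""
    x1, y1 = max(pred_box[0], ref_box[0]), max(pred_box[1], ref_box[1])
    x2, y2 = min(pred_box[2], ref_box[2]), min(pred_box[3], ref_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    union = area(pred_box) + area(ref_box) - inter
    return inter / union if union > 0 else 0.0

def text_reward(pred: str, ref: str) -> float:
    """String similarity for OCR-style free-text answers."""
    return SequenceMatcher(None, pred.strip().lower(), ref.strip().lower()).ratio()

REWARD_FNS = {"choice": choice_reward, "count": count_reward, "bbox": iou_reward, "text": text_reward}

def rule_based_reward(answer_type: str, pred, ref) -> float:
    """Dispatch to the answer-type-specific rule; every rule returns a score in [0, 1]."""
    return REWARD_FNS[answer_type](pred, ref)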

Training Procedure

We employ Group Relative Policy Optimization (GRPO) within the AReaL asynchronous framework to maximize training efficiency. Our optimization objective omits the KL-divergence penalty, relying instead on PPO-style clipping for stability (a simplified sketch of this objective is given at the end of this section). We adopt a two-stage curriculum to better exploit the structure of our RL corpus:

Stage 1: Answer-only RL

Training is restricted to the normal (answer-only) split to reinforce output constraints.

Prompt: "Put ONLY your final answer within <answer></answer>."

This stage stabilizes performance on concise tasks.

Stage 2: Chain-of-Thought RL

Training encourages explicit reasoning traces on long-reasoning data.

Prompt: "Think and solve... within <think></think>..."

This stage unlocks deeper reasoning capabilities.

To prevent forgetting short, perception-heavy skills, we interleave a small proportion of normal-set examples into Stage 2 mini-batches. This mixed-prompt curriculum allows LLaVA-OV-1.5-RL to simultaneously strengthen long-horizon reasoning and maintain strong performance on standard benchmarks.
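
For reference, a simplified token-level sketch of the objective (group-normalized advantages, PPO-style clipping, no KL term) is given below. It is our own minimal rendering, not the AReaL implementation; the tensor layout and the clipping range eps = 0.2 are assumed defaults.

import torch

def grpo_loss(logp_new, logp_old, rewards, response_mask, eps=0.2):
    """
    logp_new, logp_old: (G, T) per-token log-probs under the current / rollout policy
    rewards:            (G,)   one rule-based scalar reward per response in the group
    response_mask:      (G, T) 1 for generated tokens, 0 for padding
    """
    # Group-relative advantage: normalize rewards within the group of G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1)                                   # broadcast over tokens

    # PPO-style clipped surrogate; no KL penalty against a reference policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    per_token_loss = -torch.min(unclipped, clipped)

    # Average over valid response tokens only.
    return (per_token_loss * response_mask).sum() / response_mask.sum()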

Performance of RL Post-training

To validate the effectiveness of our lightweight RL post-training, we compare the RL-enhanced models against their supervised baselines and Qwen2.5-VL-7B.

1. Core Capability Enhancement

As presented in the table below, RL post-training yields consistent gains across the major benchmark suites.
Multimodal Reasoning: dramatic gains on WeMath (+7.9), MathVision (+8.8), and MMMU-Pro Vision (+10.5) in "thinking" mode.
General VQA: performance on benchmarks such as MMBench and DocVQA is maintained or slightly improved.

Task | Benchmark | LLaVA-OV-1.5 8B | LLaVA-OV-1.5-RL 8B (thinking) | LLaVA-OV-1.5-RL 8B (fast) | Qwen2.5-VL 7B | LLaVA-OV-1.5 4B | Qwen2.5-VL 3B | LLaVA-OV 7B
General VQA | MMStar | 67.7 | 68.2↑0.5 | 68.3↑0.6 | 62.5 | 64.9 | 55.9 | 61.7
General VQA | MMBench (en) | 84.1 | 85.7↑1.6 | 85.7↑1.6 | 83.4 | 84.2 | 78.0 | 82.5
General VQA | MMBench (cn) | 81.0 | 84.2↑3.2 | 81.5↑0.5 | 81.6 | 76.9 | 74.6 | 81.4
General VQA | MME-RealWorld (en) | 61.7 | 63.4↑1.7 | 63.3↑1.6 | 57.3 | 61.6 | 51.6 | 57.4
General VQA | MME-RealWorld (cn) | 56.1 | 56.1↑0.0 | 56.3↑0.2 | 51.5 | 49.6 | 45.4 | 54.0
General VQA | SeedBench (image) | 77.3 | 76.7 | 77.6↑0.3 | 77.5 | 76.6 | 74.8 | 75.4
General VQA | CV-Bench | 80.7 | 82.9↑2.2 | 81.1↑0.4 | 80.0 | 77.2 | 71.5 | 77.9
General VQA | SEED-Bench-2-Plus | 69.2 | 69.5↑0.3 | 69.2↑0.0 | 70.9 | 68.9 | 68.6 | 64.9
General VQA | RealWorldQA | 68.1 | 68.4↑0.3 | 70.6↑2.5 | 68.5 | 67.8 | 60.0 | 66.3
General VQA | Avg. | 71.8 | 72.8↑1.0 | 72.6↑0.8 | 72.2 | 72.1 | 66.4 | 71.1
Reasoning | MathVista (mini) | 69.6 | 72.3↑2.7 | 71.8↑2.2 | 68.6 | 67.9 | 60.2 | 58.5
Reasoning | WeMath | 61.5 | 69.4↑7.9 | 60.8 | 61.3 | 62.0 | 45.1 | 44.1
Reasoning | MathVision | 25.6 | 34.4↑8.8 | 26.2↑0.6 | 22.4 | 24.2 | 21.3 | 18.5
Reasoning | MMMU (val) | 55.4 | 58.8↑3.4 | 54.9 | 51.3 | 52.7 | 46.4 | 48.8
Reasoning | MMMU-Pro (standard) | 37.4 | 39.9↑2.5 | 38.0↑0.6 | 36.3 | 35.3 | 31.1 | 28.0
Reasoning | MMMU-Pro (vision) | 25.2 | 35.7↑10.5 | 29.0↑3.8 | 32.8 | 25.4 | 21.3 | 14.3
Reasoning | Avg. | 45.8 | 51.8↑6.0 | 46.8↑1.0 | 45.5 | 44.6 | 37.6 | 35.4
OCR & Chart | ChartQA | 86.5 | 87.4↑0.9 | 87.0↑0.5 | 84.1 | 87.1 | 83.4 | 80.0
OCR & Chart | CharXiv (DQ) | 70.9 | 68.4 | 71.2↑0.3 | 69.8 | 63.8 | 58.2 | 47.6
OCR & Chart | DocVQA | 95.0 | 91.9 | 95.0↑0.0 | 94.9 | 94.4 | 92.7 | 87.2
OCR & Chart | OCRBench | 82.9 | 81.7 | 82.3 | 84.2 | 80.0 | 79.2 | 62.1
OCR & Chart | AI2D (w/ mask) | 84.2 | 83.7 | 84.3↑0.1 | 82.6 | 83.6 | 78.6 | 81.4
OCR & Chart | AI2D (w/o mask) | 94.1 | 93.7 | 93.9 | 93.4 | 93.3 | 90.7 | 90.8
OCR & Chart | InfoVQA | 78.4 | 76.6 | 78.7↑0.3 | 81.7 | 76.1 | 75.6 | 68.8
OCR & Chart | Avg. | 84.6 | 83.3 | 84.6↑0.0 | 84.4 | 82.6 | 79.8 | 74.0
Others | PixmoCount | 62.2 | 65.7↑3.5 | 71.1↑8.9 | 63.3 | 52.2 | 50.9 | 49.3
Others | CountBench | 88.2 | 86.8 | 88.6↑0.4 | 86.4 | 79.8 | 72.5 | 78.4
Others | VL-RewardBench | 47.7 | 44.0 | 49.7↑2.0 | 49.7 | 48.2 | 42.1 | 44.5
Others | V* | 78.0 | 79.1↑1.1 | 78.0↑0.0 | 77.0 | 74.9 | 69.6 | 72.3
Others | Avg. | 69.0 | 66.0 | 71.6↑2.6 | 69.1 | 63.8 | 58.8 | 61.1

Note: ↑ denotes improvement of the LLaVA-OV-1.5-RL (8B) columns over the LLaVA-OV-1.5 (8B) SFT baseline.

2. Extended Capability Analysis

Beyond the core benchmarks, we analyze specific vertical capabilities:
Spatial & Grounding: the "fast" mode significantly enhances fine-grained perception, consistently outperforming the SFT baseline on SAT and Ref-L4.
Coding: the "thinking" mode achieves the highest scores on Design2Code and UniSVG, indicating that chain-of-thought reasoning aids structural code generation.

Performance Comparison Chart

Figure: Performance comparison of LLaVA-OV-1.5 and corresponding RL version on Spatial Reasoning & Grounding and Coding tasks.

Development Roadmap

LLaVA-OneVision-1.5-RL represents the latest chapter in our multimodal research, eliciting reasoning capabilities on top of the robust LLaVA-OV-1.5 foundation.

Stage 1 & 1.5

Pre-training & Mid-training

Large-scale visual-language alignment with an efficient offline sample-packing strategy on 85M multimodal samples. Establishing a robust visual-language foundation.

Stage 2

LLaVA-OneVision-1.5 (SFT)

Visual instruction tuning on 22M carefully curated multimodal instruction-following samples across 7 categories. Achieving strong general visual understanding and task transfer capabilities.

Current Release

LLaVA-OneVision-1.5-RL

Post-Training Reinforcement Learning (This Work)
Applying GRPO with rule-based rewards on 67K curated samples to elicit latent reasoning (Chain-of-Thought). Significant gains in Math, Coding, and complex Reasoning tasks, while maintaining strong performance on General VQA and perception-heavy benchmarks.

Citation

@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},  
  year={2025}
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}

Acknowledgement

We gratefully acknowledge and thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:

  • AReaL: Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
  • sglang: A fast serving framework for large language models and vision language models.
  • lmms-eval: A standardized evaluation framework for Large Multimodal Models.
  • LLaVA: Large Language-and-Vision Assistant.
  • LLaVA-NeXT: Next-generation multi-modal assistant.