LLaVA-OneVision-1.5-RL

We present the Reinforcement Learning (RL) post-training stage of LLaVA-OneVision-1.5. By applying a lightweight RL recipe based on GRPO on top of the supervised instruct model, we elicit latent reasoning capabilities. Using a curated dataset of only 67K examples chosen through discrepancy-driven selection, the model learns to generate explicit chain-of-thought reasoning traces. This approach substantially improves performance on complex STEM, coding, and reasoning tasks without compromising general visual understanding.

RL Training Data

Discrepancy-Driven Data Selection

We curate the training data by measuring the divergence between Pass@N and Pass@1 performance on diverse benchmarks. A significant gap indicates that the model possesses the latent capability to solve the task, yet its policy distribution fails to reliably assign high probability to the correct reasoning path. Under this lens, RL serves as an elicitation mechanism rather than knowledge injection. This selection paradigm ensures high training efficiency by targeting the model's effective learnable boundary.
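As a rough illustration of the selection criterion, the sketch below computes the Pass@N minus Pass@1 gap per benchmark from sampled correctness flags. The function names, the 0.3 threshold, and the toy data are all hypothetical; the source does not specify N, the threshold, or the granularity of the comparison.

```python
from typing import Dict, List

# Each benchmark maps to a list of items; each item holds N correctness flags,
# one per sampled response (hypothetical data layout).
Rollouts = List[List[bool]]

def pass_at_1(rollouts: Rollouts) -> float:
    """Mean per-item fraction of correct samples."""
    return sum(sum(item) / len(item) for item in rollouts) / len(rollouts)

def pass_at_n(rollouts: Rollouts) -> float:
    """Fraction of items solved by at least one of the N samples."""
    return sum(1.0 if any(item) else 0.0 for item in rollouts) / len(rollouts)

def select_benchmarks(data: Dict[str, Rollouts], min_gap: float = 0.3) -> List[str]:
    """Keep benchmarks whose Pass@N - Pass@1 gap reaches min_gap (assumed threshold)."""
    return [name for name, rollouts in data.items()
            if pass_at_n(rollouts) - pass_at_1(rollouts) >= min_gap]

benchmarks = {
    "saturated": [[True] * 4, [True] * 4],           # gap 0.0: already reliable
    "latent": [[True, False, False, False],
               [False, False, True, False]],         # gap 0.75: solvable but unreliable
    "out_of_reach": [[False] * 4, [False] * 4],      # gap 0.0: never solved
}
selected = select_benchmarks(benchmarks)
print(selected)  # ['latent']
```

Only the "latent" group survives: the model can solve it when sampled repeatedly but rarely does so on a single attempt, which is the regime the paragraph above targets.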

Reward-Based Sampling

To further filter for high-quality training instances, we employ a reward-based sampling strategy. We generate multiple candidate responses for each sample and retain only those samples whose average reward falls within a specified range. This filtering discards both trivial and unsolvable cases, biasing the corpus toward medium-difficulty instances.
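A minimal sketch of this filter follows, assuming rewards in [0, 1] and an illustrative retention band of [0.2, 0.8]. The bounds and the function name are assumptions; the source states only that a "specified range" is used.

```python
from typing import Dict, List

def filter_by_mean_reward(rewards_per_sample: Dict[str, List[float]],
                          low: float = 0.2,
                          high: float = 0.8) -> List[str]:
    """Keep samples whose mean candidate reward lies in [low, high] (assumed bounds)."""
    kept = []
    for sample_id, rewards in rewards_per_sample.items():
        mean_reward = sum(rewards) / len(rewards)
        if low <= mean_reward <= high:
            kept.append(sample_id)
    return kept

candidates = {
    "trivial": [1.0, 1.0, 1.0, 1.0],      # mean 1.0: always solved, dropped
    "medium": [1.0, 0.0, 1.0, 0.0],       # mean 0.5: kept
    "unsolvable": [0.0, 0.0, 0.0, 0.0],   # mean 0.0: never solved, dropped
}
kept = filter_by_mean_reward(candidates)
print(kept)  # ['medium']
```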

Figure: Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: Answer-only training. (c) Stage 2: Chain-of-thought training.

Reward System

Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules.

Rule-Based Rewards

Specific verification rules for different domains:

Category	Source	Reward Design Details
STEM	ViRL39K	Choice accuracy & math expression equivalence
Grounding	Ref-L4, VigoRL-SA	IoU between predicted/ref boxes; choice accuracy
Spatial	VigoRL-SAT	Choice accuracy
Counting	PixmoCount	Numeric token equivalence
Coding	WebCode2M, UniSVG	Token/tag overlap; SVG rendering similarity [0,1]
OCR	InfoVQA	Text similarity
Diagram	AI2D	Choice accuracy
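To make a few of these rules concrete, here is a hedged sketch of three of the simpler scorers: choice accuracy, numeric equivalence for counting, and box IoU for grounding. The function names, the corner-format box convention, and the answer normalization are assumptions, not the project's actual implementation; math expression equivalence and SVG rendering similarity are omitted.

```python
import re
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); corner format is assumed

def choice_reward(pred: str, ref: str) -> float:
    """1.0 if the predicted option letter matches the reference, ignoring case and spaces."""
    return 1.0 if pred.strip().upper() == ref.strip().upper() else 0.0

def count_reward(pred: str, ref: str) -> float:
    """1.0 if the first integer found in the prediction equals the reference count."""
    match = re.search(r"-?\d+", pred)
    return 1.0 if match and int(match.group()) == int(ref) else 0.0

def iou_reward(pred: Box, ref: Box) -> float:
    """Intersection-over-union between predicted and reference boxes, in [0, 1]."""
    inter_w = max(0.0, min(pred[2], ref[2]) - max(pred[0], ref[0]))
    inter_h = max(0.0, min(pred[3], ref[3]) - max(pred[1], ref[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_ref = (ref[2] - ref[0]) * (ref[3] - ref[1])
    union = area_pred + area_ref - inter
    return inter / union if union > 0 else 0.0

print(choice_reward(" b ", "B"))               # 1.0
print(count_reward("There are 7 cats", "7"))   # 1.0
print(iou_reward((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

Because every scorer returns a value in [0, 1] computed directly from the task outcome, no learned preference model is needed to rank responses.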

Training Procedure

We use Group Relative Policy Optimization (GRPO) within the AReaL asynchronous framework to maximize training efficiency. Our optimization objective omits the KL divergence penalty and relies instead on PPO-style clipping for stability. A two-stage curriculum exploits the structure of our RL corpus:
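The objective described above can be sketched in a few lines. This is a simplified illustration, not the AReaL implementation: it uses one probability ratio per response rather than per token, and the clip range of 0.2 and all function names are assumptions. The absence of any KL term mirrors the description above.

```python
import math
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_objective(ratios: List[float], rewards: List[float],
                   clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate over one group of responses; no KL penalty is added."""
    advantages = group_relative_advantages(rewards)
    terms = [clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# One prompt, four sampled responses: two correct (reward 1), two incorrect (reward 0).
group_rewards = [1.0, 0.0, 0.0, 1.0]
group_ratios = [1.5, 0.5, 1.0, 1.0]  # new-policy / old-policy probability ratios
advantages = group_relative_advantages(group_rewards)
objective = grpo_objective(group_ratios, group_rewards)
print(advantages)
print(objective)
```

Since advantages are computed relative to the other responses for the same prompt, no separate value network is required, and clipping alone bounds how far a single update can move the policy.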

Stage 1: Answer-only RL

Trains exclusively on the normal split to reinforce output constraints and stabilize performance on concise tasks.

Prompt: Put ONLY your final answer within <answer></answer>.

Stage 2: Chain-of-Thought RL

Encourages explicit reasoning traces on long-reasoning data, unlocking deeper reasoning capabilities.

Prompt: Think and solve... within <think></think>...
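Both prompt styles imply that a scorer must first pull the final answer out of the tagged response. The helper below is a hypothetical sketch of that step; it assumes Stage 2 responses also wrap the final answer in <answer></answer>, which the abbreviated prompt above does not confirm.

```python
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer); either is None when its tag pair is missing."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

print(parse_response("<answer>B</answer>"))                       # (None, 'B')
print(parse_response("<think>2 + 2 = 4</think><answer>4</answer>"))  # ('2 + 2 = 4', '4')
```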

To prevent forgetting short, perception-heavy skills, we interleave a small proportion of normal-set examples into Stage 2 mini-batches. This mixed-prompt curriculum allows LLaVA-OV-1.5-RL to simultaneously strengthen long-horizon reasoning and maintain strong performance on standard benchmarks.
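One way to realize this interleaving is to reserve a fixed fraction of each Stage 2 mini-batch for normal-set examples. The sketch below assumes a fraction of 0.125 and a batch size of 8 purely for illustration; the source says only "a small proportion", and the function and variable names are hypothetical.

```python
import random
from typing import List, Optional, Sequence

def build_stage2_batch(long_reasoning: Sequence[str],
                       normal: Sequence[str],
                       batch_size: int = 8,
                       normal_fraction: float = 0.125,
                       rng: Optional[random.Random] = None) -> List[str]:
    """Sample a mini-batch that is mostly long-reasoning data plus a few normal-set items."""
    rng = rng or random.Random(0)
    n_normal = round(batch_size * normal_fraction)
    n_long = batch_size - n_normal
    batch = rng.sample(list(long_reasoning), n_long) + rng.sample(list(normal), n_normal)
    rng.shuffle(batch)  # avoid a fixed position for normal-set examples
    return batch

long_pool = [f"cot_{i}" for i in range(20)]
normal_pool = [f"normal_{i}" for i in range(20)]
batch = build_stage2_batch(long_pool, normal_pool)
print(batch)
```

In a full pipeline each item would also carry its prompt style, so that normal-set examples keep the answer-only instruction while long-reasoning examples use the chain-of-thought instruction.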

Performance of RL Post-training

To validate the effectiveness of our lightweight RL post-training, we compare the RL-enhanced models against their supervised baselines and Qwen2.5-VL-7B.

1. Core Capability Enhancement

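As presented in the table below, RL post-training improves most General VQA and Reasoning benchmarks over the SFT baseline.
Multimodal Reasoning: "Thinking" mode yields large gains on WeMath (+7.9), MathVision (+8.8), and MMMU-Pro_vision (+10.5).
General VQA: Performance is maintained or slightly improved, for example +1.6 on MMBench_en in both modes. On DocVQA, "fast" mode matches the baseline (95.0), while "thinking" mode drops to 91.9.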

Task	Benchmark	LLaVA-OV-1.5 8B	LLaVA-OV-1.5 RL 8B (thinking)	LLaVA-OV-1.5 RL 8B (fast)	Qwen2.5-VL 7B	LLaVA-OV-1.5 4B	Qwen2.5-VL 3B	LLaVA-OV 7B
General VQA	MMStar	67.7	68.2↑0.5	68.3↑0.6	62.5	64.9	55.9	61.7
	MMBench_en	84.1	85.7↑1.6	85.7↑1.6	83.4	84.2	78.0	82.5
	MMBench_cn	81.0	84.2↑3.2	81.5↑0.5	81.6	76.9	74.6	81.4
	MME-RealWorld_en	61.7	63.4↑1.7	63.3↑1.6	57.3	61.6	51.6	57.4
	MME-RealWorld_cn	56.1	56.1↑0.0	56.3↑0.2	51.5	49.6	45.4	54.0
	SeedBench_image	77.3	76.7	77.6↑0.3	77.5	76.6	74.8	75.4
	CV-Bench	80.7	82.9↑2.2	81.1↑0.4	80.0	77.2	71.5	77.9
	SEED-Bench-2-Plus	69.2	69.5↑0.3	69.2↑0.0	70.9	68.9	68.6	64.9
	RealWorldQA	68.1	68.4↑0.3	70.6↑2.5	68.5	67.8	60.0	66.3
	Avg.	71.8	72.8↑1.0	72.6↑0.8	72.2	72.1	66.4	71.1
Reasoning	MathVista_mini	69.6	72.3↑2.7	71.8↑2.2	68.6	67.9	60.2	58.5
	WeMath	61.5	69.4↑7.9	60.8	61.3	62.0	45.1	44.1
	MathVision	25.6	34.4↑8.8	26.2↑0.6	22.4	24.2	21.3	18.5
	MMMU_val	55.4	58.8↑3.4	54.9	51.3	52.7	46.4	48.8
	MMMU-Pro_standard	37.4	39.9↑2.5	38.0↑0.6	36.3	35.3	31.1	28.0
	MMMU-Pro_vision	25.2	35.7↑10.5	29.0↑3.8	32.8	25.4	21.3	14.3
	Avg.	45.8	51.8↑6.0	46.8↑1.0	45.5	44.6	37.6	35.4
OCR & Chart	ChartQA	86.5	87.4↑0.9	87.0↑0.5	84.1	87.1	83.4	80.0
	CharXiv_DQ	70.9	68.4	71.2↑0.3	69.8	63.8	58.2	47.6
	DocVQA	95.0	91.9	95.0↑0.0	94.9	94.4	92.7	87.2
	OCRBench	82.9	81.7	82.3	84.2	80.0	79.2	62.1
	AI2D_{w M}	84.2	83.7	84.3↑0.1	82.6	83.6	78.6	81.4
	AI2D_{w/o M}	94.1	93.7	93.9	93.4	93.3	90.7	90.8
	InfoVQA	78.4	76.6	78.7↑0.3	81.7	76.1	75.6	68.8
	Avg.	84.6	83.3	84.6↑0.0	84.4	82.6	79.8	74.0
Others	PixmoCount	62.2	65.7↑3.5	71.1↑8.9	63.3	52.2	50.9	49.3
	CountBench	88.2	86.8	88.6↑0.4	86.4	79.8	72.5	78.4
	VL-RewardBench	47.7	44.0	49.7↑2.0	49.7	48.2	42.1	44.5
	V*	78.0	79.1↑1.1	78.0↑0.0	77.0	74.9	69.6	72.3
	Avg.	69.0	66.0	71.6↑2.6	69.1	63.8	58.8	61.1

Note: LLaVA-OV-1.5 RL (8B) results are highlighted. ↑ denotes improvement over the SFT baseline.

2. Extended Capability Analysis

Beyond the core benchmarks, we analyze specific vertical capabilities:
Spatial & Grounding: RL (Fast mode) significantly enhances fine-grained perception, consistently outperforming the baseline on SAT and Ref-L4.
Coding: "Thinking" mode achieves the highest scores on Design2Code and UniSVG, indicating that chain-of-thought reasoning aids structural code generation.

Figure: Performance comparison of LLaVA-OV-1.5 and corresponding RL version on Spatial Reasoning & Grounding and Coding tasks.

Development Roadmap

LLaVA-OneVision-1.5-RL represents the latest chapter in our multimodal research, eliciting reasoning capabilities on top of the robust LLaVA-OV-1.5 foundation.

Stage 1 & 1.5

Pre-training & Mid-training

Large-scale visual-language alignment on 85M multimodal samples, using an efficient offline sample-packing strategy. Establishes a robust visual-language foundation.

Stage 2

LLaVA-OneVision-1.5 (SFT)

Visual instruction tuning on 22M carefully curated multimodal instruction-following samples across 7 categories. Achieving strong general visual understanding and task transfer capabilities.

Current Release