V-VLAPS: Value-Guided Vision-Language-Action Planning and Search

Ke Ren^*†, Ali Salamatian^*†, Kieran Pattison^*, Cyrus Neary

University of British Columbia

^*Equal contribution ^†Project lead

Preprint · ICML 2026 DEMO Workshop

Overview of V-VLAPS. At each MCTS node, the current observation and language instruction are passed through a frozen VLA backbone (Octo) and a lightweight value head (MLP) to produce a scalar value estimate. This value enters the PUCT scoring rule to bias node selection toward higher-return branches (green), down-weighting low-value branches (red).

Abstract

Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distribution shift and long-horizon task structure. Recent VLA-guided planning methods improve execution by using pretrained policies to guide tree search, yet node selection still depends heavily on policy priors and visit-count exploration. Consequently, when the policy favors poor actions, the planner lacks a learned value signal to correct this bias. Prior work has shown that VLA representations encode rollout success and failure information, suggesting that they may also support value estimation during planning. We introduce Value-Guided Vision-Language-Action Planning and Search (V-VLAPS), which augments VLA-guided planning with a lightweight value head trained on offline VLA rollouts to predict Monte Carlo returns. These predictions guide Monte Carlo Tree Search in simulation toward higher-value branches. Across five LIBERO suites, V-VLAPS matches the value-free planning baseline in aggregate at the default search budget, and analysis shows that many hard failures are root-level timeouts where predicted values are weakly separated. With a larger search budget, V-VLAPS improves over the baseline in all task suites, with gains of +6 percentage points on LIBERO-Object and +4 percentage points on LIBERO-10.

Motivation

VLA models like Octo act reactively: at each step they map observations and a language instruction to an action chunk, without reasoning about future consequences. Under distribution shift or in long-horizon tasks, this makes them brittle and leads to hard-to-recover failures. Monte Carlo Tree Search (MCTS) addresses this by simulating candidate action sequences in a simulator before committing to one, but existing VLA-guided search (VLAPS) has no learned estimate of state value. Node selection relies only on the VLA action prior and a visit-count bonus, so if the policy assigns high probability to poor actions, the search has no way to detect this.

The key insight: VLA latent representations already encode task success/failure information. Prior work (SAFE) showed that a small MLP probe on a VLA's frozen features can predict failure. We extend this from passive failure detection to active search guidance. A similar probe architecture is trained to predict Monte Carlo discounted returns, and its output is fed into the PUCT scoring rule to bias the search.
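For reference, a common AlphaZero-style form of the PUCT rule is written out below. The exact form, constants, and normalization used by VLAPS and V-VLAPS are not given in this summary, so the symbols here ($Q$, $c_{\text{puct}}$, $N$, $\pi_{\text{VLA}}$) are assumptions for illustration only.

```latex
\[
\mathrm{PUCT}(s, a) \;=\; Q(s, a) \;+\; c_{\text{puct}} \, \pi_{\text{VLA}}(a \mid s) \, \frac{\sqrt{N(s)}}{1 + N(s, a)}
\]
```

Under this reading, the second term combines the VLA action prior with a visit-count bonus, and the learned value prediction enters through $Q(s, a)$.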

Method

V-VLAPS has three components built on top of VLAPS:

Offline data collection: We roll out the frozen Octo VLA on LIBERO tasks. States from successful episodes are labeled with their Monte Carlo discounted returns, and states from failed episodes are labeled with 0. The resulting state-value pairs form the training set.
Value head training: A lightweight 3-layer MLP (~2.4M params) is trained via MSE regression to map Octo readouts to a scalar value. Training data is rebalanced to avoid the value-zero collapse caused by the heavy skew toward failed episodes.
Value-guided MCTS: The value prediction is added as the Q term in VLAPS's PUCT scoring rule, biasing node selection toward branches the value head predicts as leading to task completion.
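The data side of the first two components can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a sparse reward of 1 on the final step of a successful episode, a hypothetical discount `gamma`, and simple oversampling as the rebalancing scheme (the source does not specify which scheme is used). The function names `value_targets` and `rebalance_indices` are made up for this example.

```python
import numpy as np


def value_targets(num_steps, success, gamma=0.99):
    """Per-step Monte Carlo value targets for one rollout.

    Assumption: a sparse reward of 1 arrives on the last step of a
    successful episode, so the discounted return at step t is
    gamma ** (num_steps - 1 - t). Failed episodes get 0 at every step.
    """
    if not success:
        return np.zeros(num_steps)
    steps_to_go = np.arange(num_steps - 1, -1, -1)
    return gamma ** steps_to_go


def rebalance_indices(targets, rng):
    """Return sample indices with zero and nonzero targets equally represented.

    Assumption: the smaller group is oversampled with replacement until it
    matches the larger one. Other schemes (e.g. downsampling) would also fit
    the description in the text.
    """
    targets = np.asarray(targets)
    zero_idx = np.flatnonzero(targets == 0.0)
    pos_idx = np.flatnonzero(targets > 0.0)
    if len(zero_idx) == 0 or len(pos_idx) == 0:
        return np.arange(len(targets))  # nothing to balance
    n = max(len(zero_idx), len(pos_idx))
    zero_sample = rng.choice(zero_idx, size=n, replace=len(zero_idx) < n)
    pos_sample = rng.choice(pos_idx, size=n, replace=len(pos_idx) < n)
    return np.concatenate([zero_sample, pos_sample])


# One short successful rollout and one longer failed rollout.
targets = np.concatenate([value_targets(3, True), value_targets(9, False)])
train_idx = rebalance_indices(targets, np.random.default_rng(0))
```

The indices in `train_idx` would then select the (Octo readout, target) pairs used for MSE regression, so that zero-valued targets no longer dominate each batch.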

t-SNE projection of Octo last-layer readouts on LIBERO-Object, colored by Monte Carlo value target. Successful and failed rollouts occupy visually distinct regions, validating that VLA representations carry value-relevant information.

Results

We evaluate three conditions (the reactive Octo VLA without planning, VLAPS, and V-VLAPS) on five LIBERO suites, with the two planners run at two search budgets (600s and 1800s per episode, 100 episodes per cell). Both MCTS-based methods substantially outperform the reactive Octo VLA, by about 27 pp on average at the 600s budget (87.4% vs. 60.2%). At the default 600s budget, V-VLAPS matches VLAPS in aggregate (both 87.4%). With the extended 1800s budget, V-VLAPS pulls ahead on every suite (91.6% vs. 88.4% on average), including +6 pp on LIBERO-Object and +4 pp on LIBERO-10, the most difficult suites.

Suite	VLA (no planning)	VLAPS (600s)	V-VLAPS (600s)	VLAPS (1800s)	V-VLAPS (1800s)
libero_object	37	82	85	87	93
libero_spatial	81	96	95	96	97
libero_goal	88	92	93	90	93
libero_10	38	77	75	81	85
libero_90	57	90	89	88	90
Avg.	60.2	87.4	87.4	88.4	91.6

Success rate (%) by method and suite. Each cell aggregates 100 episodes. Bold V-VLAPS cells highlight gains over VLAPS at the 1800s budget.
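The Avg. row and the quoted gains follow directly from the per-suite numbers. The short check below recomputes them from the table; the variable names are illustrative and not part of any released code.

```python
# Success rates (%) copied from the table above.
columns = ("VLA", "VLAPS@600s", "V-VLAPS@600s", "VLAPS@1800s", "V-VLAPS@1800s")
results = {
    "libero_object":  (37, 82, 85, 87, 93),
    "libero_spatial": (81, 96, 95, 96, 97),
    "libero_goal":    (88, 92, 93, 90, 93),
    "libero_10":      (38, 77, 75, 81, 85),
    "libero_90":      (57, 90, 89, 88, 90),
}

# Column-wise mean over the five suites.
averages = {
    name: round(sum(row[i] for row in results.values()) / len(results), 1)
    for i, name in enumerate(columns)
}

# Per-suite gain of V-VLAPS over VLAPS at the 1800s budget, in percentage points.
gains_1800 = {suite: row[4] - row[3] for suite, row in results.items()}

# Average gain of planning (VLAPS at 600s) over the reactive VLA.
planning_gain = round(averages["VLAPS@600s"] - averages["VLA"], 1)

print(averages)
print(gains_1800)
print(planning_gain)
```

The per-suite differences in `gains_1800` are all positive, which is the basis for the statement that V-VLAPS improves on every suite at the larger budget.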

Why doesn't the value head help at 600s? A failure-mode analysis on LIBERO-Object tasks 6–8 shows that most failures at 600s are root-level MCTS timeouts: episodes where the search never executes a single action before hitting the wall-time cap. Near the root, all candidate branches are far from task completion, so the value head assigns near-zero predictions to every branch and barely discriminates between them. The extended budget gives MCTS time to reach deeper states where branch values separate, and that is where V-VLAPS gains.
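The effect of weak value separation can be illustrated with a toy selection step. The sketch below uses a common AlphaZero-style PUCT form, which may differ from the rule implemented in VLAPS, and all priors, visit counts, and value predictions are hypothetical numbers chosen for illustration.

```python
import math


def puct_scores(values, priors, visits, c_puct=1.0):
    """Score each child as Q + c * P * sqrt(N_parent) / (1 + N_child).

    Assumption: the parent visit count is the sum of the child visit counts.
    """
    parent_visits = sum(visits)
    return [
        q + c_puct * p * math.sqrt(parent_visits) / (1 + n)
        for q, p, n in zip(values, priors, visits)
    ]


def select(values, priors, visits, c_puct=1.0):
    """Return the index of the child with the highest PUCT score."""
    scores = puct_scores(values, priors, visits, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)


priors = [0.6, 0.3, 0.1]  # hypothetical VLA prior; child 0 is the policy's favorite
visits = [4, 4, 4]        # equal visit counts, so the bonus depends only on the prior

# Near the root: every prediction is close to zero, so the prior decides.
near_root_values = [0.01, 0.02, 0.015]
# Deeper in the tree: child 1 is predicted to lead toward task completion.
deep_values = [0.05, 0.60, 0.10]

print(select([0.0, 0.0, 0.0], priors, visits))   # value-free choice
print(select(near_root_values, priors, visits))  # weakly separated values
print(select(deep_values, priors, visits))       # well separated values
```

With these numbers, the weakly separated values leave the selected child unchanged from the value-free case, while the well separated values override the policy's preferred branch.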

Method	Failures	Root timeouts	Fraction
VLAPS	18	14	77.8%
V-VLAPS	15	13	86.7%

Failure modes on LIBERO-Object tasks 6–8 at the 600s budget. Most failures are root-level timeouts where MCTS never executes an action chunk.

Citation

@inproceedings{vvlaps,
  title     = {V-VLAPS: Value-Guided Vision-Language-Action Planning and Search},
  author    = {Ren, Ke and Salamatian, Ali and Pattison, Kieran and Neary, Cyrus},
  booktitle = {Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning}
}