w²VLA
Decoupling the Declarative from the Procedural
in Vision-Language-Action Models

w²VLA presentation video.

In-Domain Performance
Seen (Skill, Object) Pairs

All models, including the state-of-the-art baselines $\pi_{0.5}$ and OTTER, perform well when executing skills on the exact objects those skills were paired with during training.

Zero-Shot Skill Transfer
Unseen (Skill, Object) Pairs

When asked to transfer a learned skill to a different object, the baseline models fail catastrophically: they revert to imitating the skill that was paired with the target object during training. w²VLA isolates the procedural intent and transfers the skill successfully.

Overall Model Performance

Breakdown of average performance scores for OTTER, $\pi_{0.5}$, and w²VLA, both in-domain (on seen (skill, object) pairs) and on skill transfer. The metrics "object", "skill", and "task" measure whether the model selected the correct object, executed the correct skill, and fully completed the task, respectively.
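As a rough illustration of how the three metrics could be averaged over a set of rollouts, here is a minimal sketch. The `Rollout` record and the `average_scores` helper are hypothetical names introduced for this example, and the sample data is made up; none of it comes from the w²VLA evaluation code or results. The "task" flag is assumed to be recorded separately for each rollout rather than derived from the other two.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Outcome of one evaluation rollout (hypothetical record format)."""
    object_ok: bool  # the policy selected the correct object
    skill_ok: bool   # the policy executed the commanded skill
    task_ok: bool    # the task was fully completed


def average_scores(rollouts):
    """Return the mean success rate for each of the three metrics."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    n = len(rollouts)
    return {
        "object": sum(r.object_ok for r in rollouts) / n,
        "skill": sum(r.skill_ok for r in rollouts) / n,
        "task": sum(r.task_ok for r in rollouts) / n,
    }


# Made-up example data, not results from the paper.
example = [
    Rollout(object_ok=True, skill_ok=True, task_ok=True),
    Rollout(object_ok=True, skill_ok=False, task_ok=False),
    Rollout(object_ok=False, skill_ok=True, task_ok=False),
    Rollout(object_ok=True, skill_ok=True, task_ok=False),
]
scores = average_scores(example)
print(scores)
```

Keeping the three flags separate makes it possible to tell a failure to pick the right object apart from a failure to execute the right skill, which is the distinction the skill-transfer comparison relies on.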

Robustness to OOD Scenarios
1. Robustness to Distractor Objects

We re-ran the evaluation with 3 to 5 random, visually diverse distractor items added to the scene. Thanks to the isolated "where" conditioning, which is powered by VLM localization heatmaps, w²VLA ignores the clutter and executes the requested skill correctly.
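To give some intuition for why a localization heatmap can make the "where" signal insensitive to clutter, the toy sketch below reads a target location off a 2D heatmap by taking its peak. This is an assumption made purely for illustration: the page does not say how w²VLA consumes its heatmaps, and `peak_location` and both example grids are hypothetical.

```python
def peak_location(heatmap):
    """Return (row, col) of the highest-scoring cell in a 2D heatmap."""
    best_value = None
    best_pos = None
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    if best_pos is None:
        raise ValueError("heatmap is empty")
    return best_pos


# Toy heatmap: the named object sits at the center cell.
clean = [
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.0],
]

# Same scene with two distractors that produce weaker responses.
cluttered = [
    [0.3, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.4],
]

print(peak_location(clean))
print(peak_location(cluttered))
```

In this toy setup the distractors only add weaker responses, so the selected location stays the same. A real system would presumably pass richer information than a single peak to the policy.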

2. Skill Transfer to Unseen Objects

We challenged the policy to interact with completely novel objects (e.g., Coca-Cola cans, strawberries) that were not present in the demonstration dataset. w²VLA retains high accuracy in object selection and faithfully executes the commanded skill.

3. Quantitative Robustness Results

Breakdown of robustness performance scores: (Left) Comparing Scenario 1 baseline against the introduction of visual distractors. (Right) Comparing Scenario 2 baseline against the introduction of completely unseen objects.

BibTeX
Reference
@article{TBA,
    
    
}