w²VLA: Decoupling Declarative from Procedural

w²VLA

Decoupling the Declarative from the Procedural
in Vision-Language-Action Models

Nikolaos Tsagkas¹, Andreas Sochopoulos¹, Chris Xiaoxuan Lu², Oisin Mac Aodha¹, Alexandros Kouris³

ArXiv Code (Coming Soon) Cite

w²VLA presentation video.

In-Domain Performance

Seen (Skill, Object) Pairs

All models, including the state-of-the-art baselines $\pi_{0.5}$ and OTTER, perform remarkably well when executing skills on the same (skill, object) pairs they were trained on. The rollouts below compare the models on each scenario.

Zero-Shot Skill Transfer

Unseen (Skill, Object) Pairs

When asked to transfer a learned skill to a different object, the baseline models fail catastrophically: they revert to imitating the skill that was paired with the target object during training. w²VLA isolates the procedural intent from the object identity and successfully transfers the skill.
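The distinction between seen and unseen (skill, object) pairs can be sketched as a simple set difference. The skill and object names and the training pairing below are hypothetical placeholders, not the actual dataset; the sketch only assumes that each object is demonstrated with a subset of the skills.

```python
from itertools import product

# Hypothetical skill and object names, for illustration only.
skills = ["rotate", "move_back"]
objects = ["banana", "carrot"]

# Assumed training pairing: each object is demonstrated with one skill.
seen_pairs = {("rotate", "banana"), ("move_back", "carrot")}

# Every combination never demonstrated together is a skill-transfer test case.
all_pairs = set(product(skills, objects))
unseen_pairs = sorted(all_pairs - seen_pairs)


def is_zero_shot(skill, obj):
    """True when the skill and the object both appear in training, but never together."""
    trained_skills = {s for s, _ in seen_pairs}
    trained_objects = {o for _, o in seen_pairs}
    return (
        skill in trained_skills
        and obj in trained_objects
        and (skill, obj) not in seen_pairs
    )
```

Under this assumed pairing, a model that ignores the commanded skill and replays the skill it saw with the target object would succeed on every seen pair and fail on every unseen one, which is the failure mode described above.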

Overall Model Performance

Breakdown of average performance scores for OTTER, $\pi_{0.5}$, and w²VLA, both in-domain (seen (skill, object) pairs) and on skill transfer (unseen pairs). The metrics "object", "skill", and "task" measure whether the model selected the correct object, executed the correct skill, and fully completed the task, respectively.
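As a minimal sketch of how the three metrics could be averaged, assume each rollout is annotated with three independent boolean outcomes. The `Rollout` record, the `average_scores` helper, and the example values are hypothetical and are not taken from the reported results.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical per-rollout annotation (field names are assumptions)."""
    object_ok: bool  # the correct object was selected
    skill_ok: bool   # the commanded skill was executed
    task_ok: bool    # the task was fully completed


def average_scores(rollouts):
    """Return the fraction of rollouts that succeeded on each metric."""
    n = len(rollouts)
    if n == 0:
        raise ValueError("need at least one rollout")
    return {
        "object": sum(r.object_ok for r in rollouts) / n,
        "skill": sum(r.skill_ok for r in rollouts) / n,
        "task": sum(r.task_ok for r in rollouts) / n,
    }


# Made-up outcomes, for illustration only.
example = [
    Rollout(True, True, True),
    Rollout(True, False, False),
    Rollout(True, True, False),
    Rollout(False, False, False),
]
scores = average_scores(example)
```

Keeping the three outcomes separate makes it possible to tell a model that picks the right object but performs the wrong skill apart from one that fails to localize the object at all.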

Robustness to OOD Scenarios

1. Robustness to Distractor Objects

We repeated the evaluation with 3 to 5 random, visually diverse distractor objects added to the scene. Thanks to the isolated "where" conditioning, driven by VLM localization heatmaps, w²VLA ignores the clutter and correctly executes the requested skill.

(Move Back, 🍌)

✅ Success

(Move Back, 🥕)

✅ Success

(Rotate, 🍌)

✅ Success

(Rotate, 🍌)

❌ Failure
Picked mid-air

(Rotate, 🥕)

✅ Success

(Rotate, 🥕)

❌ Failure
Grasp slipped

2. Skill Transfer to Unseen Objects

We challenged the policy to interact with completely novel objects (e.g., Coca-Cola cans, strawberries) that were not present in the demonstration dataset. w²VLA retains high accuracy in object selection and faithfully executes the commanded skill.

(Place Bowl, 🥤)

✅ Success

(Place Plate, 🍋)

✅ Success

(Place Bowl, 🥤)

✅ Success

(Place Bowl, 🍓)

✅ Success

(Place Plate, 🥒)

❌ Failure
Failed to pick

(Place Plate, 🥔)

❌ Failure
Missed plate

3. Quantitative Robustness Results

Breakdown of robustness performance scores: (Left) Comparing Scenario 1 baseline against the introduction of visual distractors. (Right) Comparing Scenario 2 baseline against the introduction of completely unseen objects.
