The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise under even minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, improving robustness on out-of-distribution scenes. Our experiments demonstrate significant performance improvements, particularly for PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.
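To make the temporal augmentation concrete, below is a minimal sketch of one plausible instantiation: appending a normalised progress signal to frozen PVR features so that observations from different phases of an episode become distinguishable. This is an illustration under our own assumptions, not necessarily the exact TE module used in the paper; the function name and dimensions are hypothetical.

```python
import numpy as np

def temporally_augment(pvr_features: np.ndarray, horizon: int) -> np.ndarray:
    """pvr_features: (T, D) frozen PVR features for one trajectory of length T."""
    t = np.arange(len(pvr_features), dtype=np.float32)
    # Normalised progress: 0 at the first frame, approaching 1 near task completion.
    progress = (t / max(horizon - 1, 1))[:, None]
    return np.concatenate([pvr_features, progress], axis=1)  # (T, D + 1)
```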
Below we present the 10 tasks selected from Metaworld that we train visuo-motor policies to solve in our experiments. For each task, we collect 25 expert demonstrations with randomised initial conditions (e.g., object positions).
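A minimal sketch of such a demonstration-collection loop is shown below, assuming the Meta-World ML1 API and a classic Gym-style step interface. The task name is an assumed example, and `expert_policy` is a hypothetical placeholder for a scripted expert.

```python
import random
import metaworld

TASK_NAME = "button-press-v2"   # assumed task name; stands in for any of the 10 tasks
NUM_DEMOS = 25
MAX_STEPS = 500

ml1 = metaworld.ML1(TASK_NAME)
env = ml1.train_classes[TASK_NAME]()

def expert_policy(obs):
    # Placeholder for a scripted expert (e.g., a hand-designed controller);
    # sampling random actions here purely to keep the sketch self-contained.
    return env.action_space.sample()

demos = []
for _ in range(NUM_DEMOS):
    env.set_task(random.choice(ml1.train_tasks))  # randomises initial object positions
    obs = env.reset()
    trajectory = []
    for _ in range(MAX_STEPS):
        action = expert_policy(obs)
        obs_next, reward, done, info = env.step(action)  # older Gym-style API assumed
        trajectory.append((obs, action))
        obs = obs_next
        if info.get("success", 0):
            break
    demos.append(trajectory)
```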
Below, we present the same 10 tasks but with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that can act as semantic visual distractors (e.g., flowers or numbers). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows).
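As a rough illustration of the lighting perturbations, the sketch below shifts and re-scales the lights of the underlying MuJoCo model. The field names assume mujoco-py-style access through `env.sim.model`, and the perturbation ranges are placeholder values rather than the ones used in our evaluation; tabletop texture swaps would analogously edit the table's material/texture assets and are omitted here.

```python
import numpy as np

def perturb_lighting(env, rng: np.random.Generator):
    """Randomly shift light position/orientation and brightness in the MuJoCo model."""
    model = env.sim.model
    for i in range(model.nlight):
        model.light_pos[i] += rng.uniform(-0.3, 0.3, size=3)   # move the light source
        model.light_dir[i] += rng.uniform(-0.2, 0.2, size=3)   # re-orient it (casts new shadows)
        model.light_diffuse[i] *= rng.uniform(0.5, 1.5)        # brighten or dim the scene
```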
The following videos show how policies trained on raw PVR features fail to perform well once visual perturbations are introduced into the scene. In contrast, policies trained with our PVR+AFA are more robust to these changes, as they learn to focus on the parts of the scene that matter for the task.
Robustness evaluation of policies trained on PVR features temporally augmented with TE, and on features processed with AFA and then temporally augmented with TE. The evaluation is conducted under perturbed lighting conditions and tabletop texture changes. Subplot (a) shows the average performance per task, while subplot (b) shows the average performance per model.
The following videos illustrate how the CLS token attends to different patches compared to the trained query token of AFA, offering a general sense of what the models prioritize. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (e.g., note that MAE+AFA still allocates some attention to patches containing the robot’s cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on regions relevant to the task.
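For readers who want a concrete picture of how such a trained query token can be pooled over patch features and visualised, here is a minimal PyTorch sketch of learned-query attention pooling. It is not the exact AFA architecture; the class name, dimensionality, and head count are assumptions.

```python
import torch
import torch.nn as nn

class LearnedQueryPooling(nn.Module):
    """Pools frozen PVR patch tokens with a single trained query and exposes its attention map."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # trained query token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, dim) frozen PVR patch features (CLS token excluded)
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, attn_weights = self.attn(q, patch_tokens, patch_tokens, need_weights=True)
        # pooled: (B, 1, dim) task-conditioned feature; attn_weights: (B, 1, N)
        return pooled.squeeze(1), attn_weights.squeeze(1)
```

For a 224x224 image with 16x16 patches, N = 196, so the returned weights can be reshaped to a 14x14 grid and upsampled over the input image to produce heatmaps like the ones shown in the videos.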
@article{tsagkas2025pretrainedvisualrepresentationsfall,
  title={When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning},
  author={Nikolaos Tsagkas and Andreas Sochopoulos and Duolikun Danier and Chris Xiaoxuan Lu and Oisin Mac Aodha},
  journal={arXiv preprint arXiv:2502.03270},
  year={2025}
}