When Pre-trained Visual Representations Fall Short:
Limitations in Visuo-Motor Robot Learning

Abstract

The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise under even minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, enhancing robustness when evaluated on out-of-distribution scenes. We demonstrate significant performance improvements, particularly in PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.
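The temporal augmentation described above can be sketched as follows. This is a minimal illustration, assuming the frozen PVR feature is augmented by concatenating a normalised timestep and a binary task-completion signal; the function name and exact encoding are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def add_temporal_context(pvr_feature, t, horizon):
    """Append a normalised timestep and a task-completion flag to a
    frozen PVR feature vector (hypothetical encoding for illustration)."""
    progress = t / max(horizon - 1, 1)            # 0.0 at start, 1.0 at end
    completed = 1.0 if t >= horizon - 1 else 0.0  # crude "task done" signal
    return np.concatenate([pvr_feature, [progress, completed]])

feat = np.random.randn(768)                 # e.g., a ViT-B CLS feature
aug = add_temporal_context(feat, t=10, horizon=50)
print(aug.shape)                            # (770,)
```

The policy head then consumes the augmented vector, so that otherwise near-identical frames from different phases of the task map to distinct inputs.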

Real-World Experiments


In-Domain Evaluation

Directly leveraging features from Pre-trained Visual Representations (PVRs) proves effective when deploying the trained policy in in-domain scenes (i.e., those that closely match the expert demonstrations used during behaviour cloning).

However, this approach fails under scene perturbations. To address this, we introduce the AFA module, which learns to attend to task-relevant visual cues, improving the policy's robustness.
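Since AFA is described here as a trained query token attending over patch features (see the attention-heatmap section below), its core operation can be sketched as single-query cross-attention over the PVR's patch tokens. This is a sketch under that assumption, in NumPy; the actual AFA architecture (number of heads, projections, training details) may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def afa_pool(patch_feats, query):
    """Aggregate ViT patch features into a single task-focused vector
    via a learned query (sketch; not the paper's exact module).
    patch_feats: (N, D) patch tokens; query: (D,) learned vector."""
    d = patch_feats.shape[1]
    attn = softmax(patch_feats @ query / np.sqrt(d))  # (N,) weights over patches
    return attn @ patch_feats, attn                   # weighted sum + weights

patches = np.random.randn(196, 768)  # 14x14 grid of ViT-B/16 patch tokens
q = np.random.randn(768)             # learnable query, trained with the policy
pooled, attn = afa_pool(patches, q)
```

Training the query jointly with the policy is what lets the attention weights concentrate on task-relevant patches rather than distractors.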


Scene Distractors

We modified the scene by adding distractors near the goal, specifically selecting semantically rich items (e.g., stuffed toys and groceries) that pose strong challenges for PVRs pre-trained on real-world images.

Scene #1

Scene #2

Scene #3

Scene #4


Light Perturbations

We changed the lighting conditions, both by dimming the ceiling lights and by adding a light source near the goal. The reflective surface made this visual change even more pronounced.

Scene #1: Dimmed Lights & Desk Lamp

Scene #2: Lights Off & Desk Lamp

Scene #3: Lights Off


Visually Modifying the Cube

We added a distractor to the cube itself, thus visually modifying the object of interaction. This example is considerably more difficult: given the camera's position and orientation, the emergency button we placed on top of the cube occludes important cues (e.g., part of the green tape on the cube).

Simulation Experiments


Expert Demonstrations

Below we present the 10 tasks selected from Metaworld on which we train visuo-motor policies in our experiments. We extract 25 expert demonstrations per task, with randomised initial conditions (e.g., object positions).
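A demonstration-collection loop of this kind can be sketched as below. The environment and expert-policy interfaces here are hypothetical stand-ins (a dummy environment is included so the loop runs), not the Metaworld API; each reset draws fresh random initial conditions.

```python
import numpy as np

def collect_demos(env, expert_policy, n_demos=25, horizon=100):
    """Roll out a scripted expert to collect (obs, action) trajectories,
    resetting to random initial conditions each episode (sketch;
    env/policy interfaces are illustrative stand-ins)."""
    demos = []
    for _ in range(n_demos):
        obs = env.reset()                  # randomised object positions etc.
        traj = []
        for _ in range(horizon):
            act = expert_policy(obs)
            traj.append((obs, act))
            obs, done = env.step(act)
            if done:
                break
        demos.append(traj)
    return demos

class DummyEnv:
    """Minimal stand-in environment to illustrate the loop."""
    def reset(self):
        self.t = 0
        return np.random.randn(39)         # state observation
    def step(self, act):
        self.t += 1
        return np.random.randn(39), self.t >= 10

demos = collect_demos(DummyEnv(), lambda obs: np.zeros(4), n_demos=25)
```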

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close


Scenes Under Visual Perturbations

Below, we present the same 10 tasks but with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that may serve as semantic visual distractors (e.g., flowers, numbers, etc.). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows).

Tabletop Texture

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close


Light Brightness, Position, Orientation

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close


PVR vs PVR+AFA Under Visual Perturbations

Below, we visualise out-of-domain evaluations of trained policies with and without AFA. As is evident, policies using AFA are substantially more robust to visual perturbations.

Evaluation Under Tabletop Texture Changes

PVR

PVR + AFA (Ours)

Evaluation Under Light Brightness, Position, Orientation Changes

PVR

PVR + AFA (Ours)


Interpreting AFA from Attention Heatmaps

The following videos illustrate how the CLS token attends to different patches compared to the trained query token of AFA, offering insight into what each model prioritises. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (e.g., note that MAE+AFA still allocates some attention to patches containing the robot's cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on regions relevant to the task.
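Heatmaps like these can be produced by reshaping the per-patch attention weights (whether from the CLS token or from AFA's query) into the patch grid and upsampling to image resolution. A minimal sketch, assuming a ViT-B/16 backbone on 224x224 inputs (14x14 = 196 patches); nearest-neighbour upsampling stands in for whatever smoothing the actual videos use.

```python
import numpy as np

def attention_heatmap(attn, grid=14, image_size=224):
    """Turn per-patch attention weights into an image-sized heatmap
    via nearest-neighbour upsampling (visualisation sketch).
    attn: (grid*grid,) non-negative weights."""
    hm = attn.reshape(grid, grid)
    patch = image_size // grid                   # 16 px per patch side
    return np.kron(hm, np.ones((patch, patch)))  # (image_size, image_size)

attn = np.random.rand(196)
attn /= attn.sum()                               # normalised attention weights
heatmap = attention_heatmap(attn)                # overlay this on the frame
```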

DINO: objects

MAE: light

VC-1: texture

DINO + AFA: objects

MAE + AFA: light

VC-1 + AFA: texture

Citation

@article{tsagkas2025pretrainedvisualrepresentationsfall,
  title={When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning},
  author={Nikolaos Tsagkas and Andreas Sochopoulos and Duolikun Danier and Chris Xiaoxuan Lu and Oisin Mac Aodha},
  journal={arXiv preprint arXiv:2502.03270},
  year={2025}
}