When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning

1University of Edinburgh, 2Edinburgh Centre for Robotics, 3UCL

Abstract

The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise even in the presence of minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, enhancing robustness when evaluated on out-of-distribution scenes. Our experiments demonstrate significant performance improvements, particularly in PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.
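To make the two ideas above concrete, below is a minimal PyTorch sketch written under our own illustrative naming (AttentiveFeatureAggregation, add_temporal_cue): a learned query that cross-attends over a frozen PVR's patch tokens to pool task-relevant local features, plus a temporal cue appended to the pooled feature so that observations from different timesteps are no longer entangled. This is a reading of the abstract, not the authors' exact implementation.

```python
# Illustrative sketch only: module and function names are our own.
import torch
import torch.nn as nn


class AttentiveFeatureAggregation(nn.Module):
    """Learned query that cross-attends over frozen PVR patch tokens."""

    def __init__(self, token_dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, token_dim) * 0.02)
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) frozen ViT patch embeddings from the PVR.
        q = self.query.expand(patch_tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return pooled.squeeze(1)  # (B, D) task-focused visual feature


def add_temporal_cue(feature: torch.Tensor, t: torch.Tensor, horizon: int) -> torch.Tensor:
    """Append a normalised progress scalar to disentangle features in time.

    feature: (B, D) visual feature; t: (B,) timestep index.
    """
    progress = (t.float() / horizon).unsqueeze(-1)  # (B, 1) in [0, 1]
    return torch.cat([feature, progress], dim=-1)   # (B, D + 1)
```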

Video Grid

Expert Demonstrations

Below we present the 10 Metaworld tasks that our visuo-motor policies are trained to solve in our experiments. For each task we extract 25 expert demonstrations with random initial conditions (e.g., object positions); a minimal collection sketch follows the task grid below.

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close
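
As referenced above, the sketch below shows one way such demonstrations can be collected with Metaworld's scripted policies. The reset/step return values, the policy class names exported by metaworld.policies, and off-screen rendering all differ across Metaworld versions, so treat these details as assumptions rather than the exact pipeline used in the paper.

```python
import random

import metaworld
from metaworld.policies import SawyerAssemblyV2Policy  # scripted expert for 'assembly-v2'

ml1 = metaworld.ML1('assembly-v2')
env = ml1.train_classes['assembly-v2']()
policy = SawyerAssemblyV2Policy()

demos = []
for _ in range(25):  # 25 demonstrations per task
    # Each sampled task randomises the initial conditions (object/goal poses).
    env.set_task(random.choice(ml1.train_tasks))
    obs = env.reset()  # newer Metaworld versions return (obs, info) instead
    trajectory = []
    for _ in range(env.max_path_length):
        action = policy.get_action(obs)
        # Image observations for the visuo-motor policy would be rendered from a
        # simulator camera here; the render call varies between versions.
        obs, reward, done, info = env.step(action)  # newer versions return 5 values
        trajectory.append((obs, action))
        if info.get('success', 0.0) > 0.5:
            break
    demos.append(trajectory)
```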

Scenes Under Visual Perturbations

Below, we present the same 10 tasks with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that can act as semantic visual distractors (e.g., flowers, numbers). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows); a minimal sketch of the lighting perturbation follows the grids below.

Tabletop Texture

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close

Light Brightness, Position, Orientation

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close
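
The lighting perturbation referenced above can be sketched directly against MuJoCo's model arrays (light_pos, light_dir, light_diffuse, which are standard mjModel fields in the official mujoco Python bindings). How the model handle is obtained from a Metaworld environment, and the texture swap (which additionally requires re-uploading the modified texture to the renderer), are left out; treat this as an illustrative assumption, not the exact perturbation code used here.

```python
import numpy as np


def perturb_lights(model, rng: np.random.Generator,
                   pos_jitter: float = 0.5, brightness_range=(0.5, 1.5)):
    """Randomise brightness, position and orientation of every scene light."""
    for i in range(model.nlight):
        # Shift the light position, changing where shadows are cast.
        model.light_pos[i] += rng.uniform(-pos_jitter, pos_jitter, size=3)
        # Re-aim the light with a random unit direction (orientation change).
        direction = rng.normal(size=3)
        model.light_dir[i] = direction / np.linalg.norm(direction)
        # Scale the diffuse colour to brighten or dim the whole scene.
        model.light_diffuse[i] *= rng.uniform(*brightness_range)
```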

PVR vs PVR+AFA (Ours)

The following videos show how policies trained on raw PVR features fail when visual perturbations are introduced into the scene. In contrast, policies trained with our PVR+AFA features are more robust to these changes, as they learn to focus on the parts of the scene that matter for the task.

Comparison Under Light Changes

PVR Failure
AFA Success

Comparison Under Texture Changes

PVR Failure
AFA Success

Robustness evaluation of policies trained on PVR-extracted features temporally augmented with TE, and on features processed with AFA and temporally augmented with TE. The evaluation is conducted under perturbed lighting conditions and changes to the tabletop texture. Subplot (a) shows the average performance per task; subplot (b) shows the average performance per model.

Interpreting AFA from Attention Heatmaps

The following videos illustrate how the CLS token attends to different patches compared to the trained query token of AFA, offering a general sense of what the models prioritise. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (e.g., note that MAE+AFA still allocates some attention to patches containing the robot's cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on task-relevant regions. A short sketch of how such heatmaps can be computed follows the examples below.

DINO: objects

MAE: light

VC-1: texture

DINO + AFA: objects

MAE + AFA: light

VC-1 + AFA: texture
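
As referenced above, the heatmaps can be obtained by reshaping the per-patch attention weights of a single query (the ViT's CLS token, or the learned AFA query from the earlier sketch) onto the patch grid and upsampling to image resolution. The helper below is illustrative; the name and normalisation choices are our own.

```python
import torch
import torch.nn.functional as F


def attention_heatmap(attn_weights: torch.Tensor, grid_size: int,
                      image_size: int = 224) -> torch.Tensor:
    """Upsample per-patch attention weights to an image-sized heatmap.

    attn_weights: (B, N) attention from a single query (CLS or AFA query)
    over N = grid_size**2 patch tokens.
    """
    heat = attn_weights.reshape(-1, 1, grid_size, grid_size)
    heat = F.interpolate(heat, size=(image_size, image_size),
                         mode='bilinear', align_corners=False).squeeze(1)
    # Normalise each heatmap to [0, 1] for visualisation.
    flat = heat.flatten(1)
    flat = flat - flat.min(dim=1, keepdim=True).values
    flat = flat / (flat.max(dim=1, keepdim=True).values + 1e-8)
    return flat.reshape_as(heat)
```

With the nn.MultiheadAttention module from the earlier sketch, the attention weights it returns (shape (B, 1, N), averaged over heads) can be squeezed along the query dimension and passed in directly.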

BibTeX

@article{tsagkas2025pretrainedvisualrepresentationsfall,
      title={When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning}, 
      author={Nikolaos Tsagkas and Andreas Sochopoulos and Duolikun Danier and Chris Xiaoxuan Lu and Oisin Mac Aodha},
      journal={arXiv preprint arXiv:2502.03270},
      year={2025},
}