When Pre-trained Visual Representations Fall Short:
Limitations in Visuo-Motor Robot Learning
Abstract
The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise even in the presence of minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, enhancing robustness when evaluated on out-of-distribution scenes. We demonstrate significant performance improvements, particularly in PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.
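To make the two proposed additions concrete, below is a minimal, illustrative sketch (in PyTorch) of how frozen PVR features could be augmented with a timestep embedding and a predicted task-progress signal before being passed to the policy head. The module and argument names are ours, and the exact architecture used in the paper may differ.

```python
# Illustrative sketch (not the paper's exact implementation): augmenting a frozen
# PVR feature with a timestep embedding and a predicted task-progress signal, so
# that visually similar frames from different phases of a task map to distinct
# policy inputs.
import torch
import torch.nn as nn


class TemporallyAugmentedPolicy(nn.Module):
    def __init__(self, pvr_dim: int, action_dim: int, time_dim: int = 32, max_len: int = 500):
        super().__init__()
        # Learned embedding for the (discretised) timestep within the trajectory.
        self.time_embed = nn.Embedding(max_len, time_dim)
        # Small head that regresses task progress / completion in [0, 1].
        self.progress_head = nn.Sequential(
            nn.Linear(pvr_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid()
        )
        # Policy head consumes the PVR feature plus the temporal signals.
        self.policy = nn.Sequential(
            nn.Linear(pvr_dim + time_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, pvr_feat: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # pvr_feat: (B, pvr_dim) frozen features; t: (B,) integer timesteps.
        progress = self.progress_head(pvr_feat)                       # (B, 1)
        z = torch.cat([pvr_feat, self.time_embed(t), progress], dim=-1)
        return self.policy(z)                                         # (B, action_dim)
```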
Real-World Experiments
In-Domain Evaluation
Directly leveraging features from Pre-trained Visual Representations (PVRs) proves effective when deploying the trained policy in in-domain scenes (i.e., those that closely match the expert demonstrations used during behavior cloning).
However, this approach fails under scene perturbations. We therefore introduce the AFA module, which learns to attend to task-relevant visual cues and improves the policy's robustness under such changes.
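As a rough illustration of the kind of mechanism AFA refers to, the sketch below implements one plausible realisation: cross-attention from a single learned query token over the patch tokens of a frozen ViT-based PVR, so the policy receives a weighted pooling of task-relevant local features. Names and hyperparameters are illustrative, not the paper's exact design.

```python
# Hedged sketch of a learned-query cross-attention readout over PVR patch tokens,
# in the spirit of the AFA module described above (the exact architecture may differ).
import torch
import torch.nn as nn


class AttentiveFeatureReadout(nn.Module):
    def __init__(self, patch_dim: int, num_heads: int = 8):
        super().__init__()
        # A single trainable query token that learns where to "look".
        self.query = nn.Parameter(torch.randn(1, 1, patch_dim) * 0.02)
        self.attn = nn.MultiheadAttention(patch_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(patch_dim)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N_patches, patch_dim) from a frozen ViT-based PVR.
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        out, attn_weights = self.attn(q, patch_tokens, patch_tokens)
        # attn_weights: (B, 1, N_patches); reshaping it over the patch grid yields
        # the per-patch heatmaps visualised later on this page.
        return self.norm(out.squeeze(1)), attn_weights
```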
Scene Distractors
We modified the scene by adding distractors near the goal, specifically selecting items rich in semantic content that pose a strong challenge for PVRs pre-trained on real-world images (e.g., stuffed toys and groceries).
Scene #1
Scene #2
Scene #3
Scene #4
Light Perturbations
We changed the lighting conditions, both by dimming the ceiling lights and by adding a light source near the goal. The reflective surface made this visual change even more pronounced.
Scene #1: Dimmed Lights & Desk Lamp
Scene #2: Lights Off & Desk Lamp
Scene #3: Lights Off
Visually Modifying the Cube
We added a distractor to the cube itself, visually modifying the object of interaction. This example is considerably more difficult, since the emergency button we placed on top of the cube occludes important cues (e.g., part of the green tape on the cube) given the camera's position and orientation.
Simulation Experiments
Expert Demonstrations
Below we present the 10 tasks selected from Meta-World for which we train visuo-motor policies in our experiments. For each task, we extract 25 expert demonstrations with randomised initial conditions (e.g., object positions); a sketch of how such demonstrations can be collected with scripted expert policies follows the task list.
Assembly
Coffee Pull
Button Press
Pick and Place
Bin Picking
Drawer Open
Peg Insert
Shelf Place
Disassemble
Box Close
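As a reference for how demonstrations of this kind can be gathered, below is a hedged sketch using Meta-World's scripted expert policies. The exact API (reset/step signatures, class names) varies across Meta-World versions, so treat this as a template rather than the script used for the paper.

```python
# Hedged sketch of collecting scripted-expert demonstrations in Meta-World
# (API details vary across Meta-World versions; adapt env/policy names as needed).
import random
import metaworld
from metaworld.policies import SawyerAssemblyV2Policy

NUM_DEMOS = 25

ml1 = metaworld.ML1('assembly-v2')                       # single-task benchmark
env = ml1.train_classes['assembly-v2']()
policy = SawyerAssemblyV2Policy()                        # scripted expert for this task

demos = []
for _ in range(NUM_DEMOS):
    env.set_task(random.choice(ml1.train_tasks))         # randomise initial conditions (e.g., object positions)
    reset_out = env.reset()
    obs = reset_out[0] if isinstance(reset_out, tuple) else reset_out  # gym vs gymnasium reset API
    traj = []
    for _ in range(getattr(env, 'max_path_length', 500)):
        action = policy.get_action(obs)
        step_out = env.step(action)
        next_obs, info = step_out[0], step_out[-1]       # works for 4- and 5-tuple step APIs
        traj.append((obs, action))
        obs = next_obs
        if info.get('success', 0):                       # stop once the expert has solved the task
            break
    demos.append(traj)
```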
Scenes Under Visual Perturbations
Below, we present the same 10 tasks with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that may act as semantic visual distractors (e.g., flowers or numbers). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows).
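For the lighting perturbations, a minimal sketch of how light brightness, position, and orientation can be randomised in a MuJoCo-backed scene is shown below. Field names follow the MuJoCo Python bindings, and the exact access path depends on the environment wrapper; texture randomisation additionally requires swapping texture assets and is omitted here.

```python
# Hedged sketch: perturbing scene lighting in a MuJoCo-backed environment by
# editing the light fields of the compiled model (field names per the official
# MuJoCo Python bindings; exact access paths depend on the simulator wrapper).
import numpy as np


def perturb_lights(model, rng: np.random.Generator):
    for i in range(model.nlight):
        # Jitter light position and orientation.
        model.light_pos[i] += rng.uniform(-0.3, 0.3, size=3)
        direction = model.light_dir[i] + rng.uniform(-0.2, 0.2, size=3)
        model.light_dir[i] = direction / (np.linalg.norm(direction) + 1e-8)
        # Scale brightness via the diffuse colour of the light.
        model.light_diffuse[i] *= rng.uniform(0.3, 1.5)


# Example usage with an already-constructed MuJoCo model (wrapper-dependent):
# perturb_lights(env.unwrapped.model, np.random.default_rng(0))
```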
Tabletop Texture
Assembly
Coffee Pull
Button Press
Pick and Place
Bin Picking
Drawer Open
Peg Insert
Shelf Place
Disassemble
Box Close
Light Brightness, Position, Orientation
Assembly
Coffee Pull
Button Press
Pick and Place
Bin Picking
Drawer Open
Peg Insert
Shelf Place
Disassemble
Box Close
PVR vs PVR+AFA Under Visual Perturbations
Below, we visualise out-of-domain evaluations of the trained policies with and without AFA. When using AFA, the policy becomes markedly more robust to visual perturbations.
Evaluation Under Tabletop Texture Changes
PVR
PVR + AFA (Ours)
Evaluation Under Light Brightness, Position, Orientation Changes
PVR
PVR + AFA (Ours)
Interpreting AFA from Attention Heatmaps
The following videos illustrate how the CLS token attends to different patches compared with the trained query token of AFA, offering insight into what each model prioritises. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (e.g., note that MAE+AFA still allocates some attention to patches containing the robot's cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on task-relevant regions. A sketch of how such per-patch attention maps can be rendered as heatmaps follows the videos.
DINO: objects
MAE: light
VC-1: texture
DINO + AFA: objects
MAE + AFA: light
VC-1 + AFA: texture
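For reference, the sketch below shows how per-patch attention weights, whether taken from a ViT's CLS token or from AFA's learned query, can be upsampled into the kind of heatmap overlays shown above. It assumes a 224x224 input with a 14x14 patch grid; adjust the grid size to the PVR's actual patching.

```python
# Hedged sketch of turning per-patch attention weights into a heatmap overlay
# (assumes a 224x224 input with 16x16 patches, i.e. a 14x14 patch grid).
import torch
import torch.nn.functional as F


def attention_heatmap(attn_weights: torch.Tensor, image_hw=(224, 224), grid=14):
    # attn_weights: (N_patches,) attention from the query/CLS token to the patches.
    heat = attn_weights.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=image_hw, mode='bilinear', align_corners=False)
    heat = heat.squeeze().detach().cpu().numpy()
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalise to [0, 1]
    return heat  # overlay on the frame, e.g. with matplotlib's imshow(alpha=0.5)
```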
Citation
@article{tsagkas2025pretrainedvisualrepresentationsfall,
  title   = {When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning},
  author  = {Nikolaos Tsagkas and Andreas Sochopoulos and Duolikun Danier and Chris Xiaoxuan Lu and Oisin Mac Aodha},
  journal = {arXiv preprint arXiv:2502.03270},
  year    = {2025},
}
Updated April 2025