The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise under even minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, improving robustness on out-of-distribution scenes. Our experiments demonstrate significant performance improvements, particularly for PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.
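To make the temporal augmentation concrete, below is a minimal sketch of one plausible instantiation: appending a normalised progress signal to frozen PVR features so that observations from different phases of an episode become distinguishable. This is an illustration under our own assumptions, not necessarily the exact TE module used in the paper; the function name and dimensions are hypothetical.

```python
import numpy as np

def temporally_augment(pvr_features: np.ndarray, horizon: int) -> np.ndarray:
    """pvr_features: (T, D) frozen PVR features for one trajectory of length T."""
    t = np.arange(len(pvr_features), dtype=np.float32)
    # Normalised progress: 0 at the first frame, approaching 1 near task completion.
    progress = (t / max(horizon - 1, 1))[:, None]
    return np.concatenate([pvr_features, progress], axis=1)  # (T, D + 1)
```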
Below we present the 10 tasks selected from Metaworld that we train visuo-motor policies to solve in our experiments. For each task, we collect 25 expert demonstrations with randomised initial conditions (e.g., object positions).
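A minimal sketch of such a demonstration-collection loop is shown below, assuming the Meta-World ML1 API and a classic Gym-style step interface. The task name is an assumed example, and `expert_policy` is a hypothetical placeholder for a scripted expert.

```python
import random
import metaworld

TASK_NAME = "button-press-v2"   # assumed task name; stands in for any of the 10 tasks
NUM_DEMOS = 25
MAX_STEPS = 500

ml1 = metaworld.ML1(TASK_NAME)
env = ml1.train_classes[TASK_NAME]()

def expert_policy(obs):
    # Placeholder for a scripted expert (e.g., a hand-designed controller);
    # sampling random actions here purely to keep the sketch self-contained.
    return env.action_space.sample()

demos = []
for _ in range(NUM_DEMOS):
    env.set_task(random.choice(ml1.train_tasks))  # randomises initial object positions
    obs = env.reset()
    trajectory = []
    for _ in range(MAX_STEPS):
        action = expert_policy(obs)
        obs_next, reward, done, info = env.step(action)  # older Gym-style API assumed
        trajectory.append((obs, action))
        obs = obs_next
        if info.get("success", 0):
            break
    demos.append(trajectory)
```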
Below, we present the same 10 tasks but with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that can act as semantic visual distractors (e.g., flowers or numbers). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows).
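As a rough illustration of the lighting perturbations, the sketch below shifts and re-scales the lights of the underlying MuJoCo model. The field names assume mujoco-py-style access through `env.sim.model`, and the perturbation ranges are placeholder values rather than the ones used in our evaluation; tabletop texture swaps would analogously edit the table's material/texture assets and are omitted here.

```python
import numpy as np

def perturb_lighting(env, rng: np.random.Generator):
    """Randomly shift light position/orientation and brightness in the MuJoCo model."""
    model = env.sim.model
    for i in range(model.nlight):
        model.light_pos[i] += rng.uniform(-0.3, 0.3, size=3)   # move the light source
        model.light_dir[i] += rng.uniform(-0.2, 0.2, size=3)   # re-orient it (casts new shadows)
        model.light_diffuse[i] *= rng.uniform(0.5, 1.5)        # brighten or dim the scene
```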
The following videos show how policies trained on raw PVR features fail to perform well once visual perturbations are introduced into the scene. In contrast, policies trained with our PVR+AFA are more robust to these changes, as they learn to focus on the parts of the scene that matter for the task.
Robustness evaluation of policies trained on PVR features temporally augmented with TE, and on features processed with AFA and then temporally augmented with TE. The evaluation is conducted under perturbed lighting conditions and tabletop texture changes. Subplot (a) shows the average performance per task, while subplot (b) shows the average performance per model.
The following videos illustrate how the CLS token attends to different patches compared to the trained query token of AFA, offering a general sense of what the models prioritize. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (e.g., note that MAE+AFA still allocates some attention to patches containing the robot’s cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on regions relevant to the task.
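For readers who want a concrete picture of how such a trained query token can be pooled over patch features and visualised, here is a minimal PyTorch sketch of learned-query attention pooling. It is not the exact AFA architecture; the class name, dimensionality, and head count are assumptions.

```python
import torch
import torch.nn as nn

class LearnedQueryPooling(nn.Module):
    """Pools frozen PVR patch tokens with a single trained query and exposes its attention map."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # trained query token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, dim) frozen PVR patch features (CLS token excluded)
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, attn_weights = self.attn(q, patch_tokens, patch_tokens, need_weights=True)
        # pooled: (B, 1, dim) task-conditioned feature; attn_weights: (B, 1, N)
        return pooled.squeeze(1), attn_weights.squeeze(1)
```

For a 224x224 image with 16x16 patches, N = 196, so the returned weights can be reshaped to a 14x14 grid and upsampled over the input image to produce heatmaps like the ones shown in the videos.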
@article{tsagkas2025pretrainedvisualrepresentationsfall,
  title={When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning},
  author={Nikolaos Tsagkas and Andreas Sochopoulos and Duolikun Danier and Chris Xiaoxuan Lu and Oisin Mac Aodha},
  journal={arXiv preprint arXiv:2502.03270},
  year={2025}
}