Attentive Feature Aggregation, or:
How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues
Supplementary Material Guide
This page contains the supplementary material for our IROS 2026 submission and is divided into two main sections: real-world experiments and simulation experiments. In the real-world section, we present videos from our pick-and-place experiments on the SO-101 and from our smaller-scale planar-pushing experiments on the KUKA IIWA. You can use the TOC on the right side of the screen to navigate the page more quickly (the TOC appears only on desktop, in full-screen mode).
Real-World Experiments
Pick-and-Place with SO-101
We validated the efficacy of AFA in the real world across different robot platforms and tasks. First, we conducted experiments on the increasingly popular LeRobot SO-101 platform, using the provided wrist camera and an additional external ZED2i mounted for a top-down view. The task was to pick up a small blue paper box and place it inside a cylindrical tin can.
Experimental Setup: The initial position of the box varied, while the container was fixed at a single location for simplicity. In total, we collected 60 demonstrations without any scene distractors. To emphasize the lasting value of our findings, we trained the policy (with and without AFA) on features extracted from the recently released DINOv3. During evaluation, we awarded 0.5 points for successfully picking the object and another 0.5 for placing it in the can. Each video below is marked with an emoji indicating the outcome of the rollout: ✅ for full success, ⚠️ for partial success, and ❌ for failure.
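The scoring scheme above can be made precise with a short sketch. The function names below are illustrative, not part of our released code; each rollout simply earns 0.5 for a successful pick and 0.5 for a successful place, and the success rate is the mean score over all rollouts.

```python
def rollout_score(picked: bool, placed: bool) -> float:
    """Per-rollout score: 0.5 for picking the box, 0.5 for placing it."""
    return 0.5 * picked + 0.5 * placed


def success_rate(rollouts) -> float:
    """Average score over a list of (picked, placed) rollout outcomes."""
    scores = [rollout_score(picked, placed) for picked, placed in rollouts]
    return sum(scores) / len(scores)
```

Under this metric, a policy that always picks but never places would score 50%, so the reported rates reflect partial task progress rather than binary success.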
In-Domain Evaluation
We evaluated the trained policies In-Domain (ID) over a total of 20 rollouts. As shown below, performance in this setting was comparable between the baseline (PVR) and our method (PVR+AFA), with success rates of 87.5% and 85.0%, respectively.
1. w/o AFA In-Domain
2. with AFA In-Domain
Out-of-Domain (OOD) Evaluation
To evaluate robustness, we introduced visual perturbations by randomly placing 20 everyday items as distractors in the scene at the beginning of each rollout. Although ID performance was comparable, omitting AFA in the OOD setting led to a catastrophic drop: the success rate of the policy trained without AFA plummeted to 17.5%, whereas the policy with AFA suffered only a minor drop, achieving a success rate of 75%.
It is apparent that AFA shifts the attention mostly toward task-relevant regions, whereas the PVR's attention spreads over every semantically rich object in the scene.
3. w/o AFA Out-of-Domain
4. with AFA Out-of-Domain
Planar Pushing with KUKA IIWA
In-Domain Evaluation
Directly leveraging features from Pre-trained Visual Representations (PVRs) proves effective when deploying the trained policy in in-domain scenes (i.e., those that closely match the expert demonstrations used during behavior cloning).
PVR (Baseline)
PVR + AFA (Ours)
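For readers unfamiliar with the baseline pipeline, behavior cloning on frozen PVR features reduces to a supervised regression problem: the backbone is never updated, and only a small policy head is fit to the expert actions. The sketch below uses a linear head and toy random data purely for illustration; our actual policies are not linear, and all array shapes and names here are assumptions.

```python
import numpy as np


def bc_fit_linear(feats: np.ndarray, actions: np.ndarray,
                  lr: float = 0.1, steps: int = 200) -> np.ndarray:
    """Fit a linear policy head on frozen features by MSE behavior
    cloning, i.e. gradient descent on ||feats @ W - actions||^2."""
    W = np.zeros((feats.shape[1], actions.shape[1]))
    for _ in range(steps):
        grad = feats.T @ (feats @ W - actions) / len(feats)
        W -= lr * grad
    return W


# Toy stand-ins for pre-extracted (frozen) PVR features and expert actions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 16))    # 32 frames, 16-dim features
actions = rng.normal(size=(32, 4))   # 4-dim actions
W = bc_fit_linear(feats, actions)
```

Because the features are fixed, any nuisance information they encode (distractors, lighting) is passed straight to the head, which is why the in-domain success of this recipe does not carry over to perturbed scenes.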
Out-of-Domain Evaluation
We tested the policies against Scene Distractors, Lighting Perturbations, and Object Modifications. Below we group the results into rollouts with and without AFA.
w/o AFA Out-of-Domain
with AFA (Ours)
Simulation Experiments
Expert Demonstrations
Below we present the 10 Metaworld tasks on which we train visuo-motor policies in our experiments. For each task, we extract 25 expert demonstrations with randomized initial conditions (e.g., object positions).
Assembly
Coffee Pull
Button Press
Pick and Place
Bin Picking
Drawer Open
Peg Insert
Shelf Place
Disassemble
Box Close
Scenes Under Visual Perturbations
Below, we present the same 10 tasks but with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that may serve as semantic visual distractors (e.g., flowers, numbers, etc.). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows).
Tabletop Texture
Assembly
Coffee Pull
Button Press
Pick and Place
Bin Picking
Drawer Open
Peg Insert
Shelf Place
Disassemble
Box Close
Light Brightness, Position, Orientation
Assembly
Coffee Pull
Button Press
Pick and Place
Bin Picking
Drawer Open
Peg Insert
Shelf Place
Disassemble
Box Close
PVR vs PVR+AFA Under Visual Perturbations
Below, we visualize out-of-domain evaluations of policies trained with and without AFA. As is evident, AFA makes the policy considerably more robust to visual perturbations.
Evaluation Under Tabletop Texture Changes
PVR
PVR + AFA (Ours)
Evaluation Under Light Brightness, Position, Orientation Changes
PVR
PVR + AFA (Ours)
Interpreting AFA from Attention Heatmaps
The following videos illustrate how the CLS token attends to different patches compared to AFA's trained query token, offering insight into what each model prioritizes. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (note, for example, that MAE+AFA still allocates some attention to patches containing the robot's cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on task-relevant regions.
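The heatmaps in these videos are, in essence, the per-patch attention weights of a single token reshaped onto the ViT patch grid. A minimal sketch of that rendering step is shown below; the function name and the min-max normalization are our illustrative choices, not the exact plotting code used for the videos.

```python
import numpy as np


def attention_heatmap(weights, grid_hw):
    """Reshape per-patch attention weights (N,) of the CLS or query
    token onto the (H, W) patch grid, normalized to [0, 1] so it can
    be overlaid on the input frame."""
    h, w = grid_hw
    heat = np.asarray(weights, dtype=float).reshape(h, w)
    heat -= heat.min()
    peak = heat.max()
    return heat / peak if peak > 0 else heat
```

Upsampling this grid to the image resolution and alpha-blending it over the frame yields overlays like those above; a more focused token produces a heatmap with a few bright patches instead of a diffuse glow over every object.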