Attentive Feature Aggregation, or:
How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues

Under Submission to IROS 2026


This page contains the supplementary material for our IROS 2026 submission. It is divided into two main sections: real-world experiments and simulation experiments. In the real-world section, we present videos from our pick-and-place experiments on the SO-101 and from our smaller-scale planar-pushing experiments on the KUKA IIWA. You can use the TOC on the right side of the screen for faster navigation (the TOC appears only on desktop, in full-screen mode).

Real-World Experiments


Pick-and-Place with SO-101

We validated the efficacy of AFA in the real world across different robot platforms and tasks. First, we conducted experiments on the increasingly popular LeRobot SO-101 platform, using the provided wrist camera and an additional external ZED2i in a top-down view. The task was to pick up a small blue paper box and place it inside a cylindrical tin can.

Experimental Setup: The initial position of the box varied, while the container remained fixed at a specific location for simplicity. In total, we collected 60 demonstrations without any scene distractors. To emphasize the lasting value of our findings, we trained the policy (with and without AFA) on features extracted from the recently released DINOv3. During evaluation, we awarded 0.5 points for successfully picking the object and another 0.5 for placing it in the can. We mark the videos below with emojis indicating the outcome of each rollout: ✅ for full success, 😐 for partial success, and ❌ for failure.
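The scoring protocol above can be sketched as a small helper. This is illustrative code only (not part of our evaluation pipeline), and the example outcomes are hypothetical:

```python
# Hypothetical scoring helper mirroring the protocol described above:
# each rollout earns 0.5 for a successful pick and 0.5 for a successful place.
def rollout_score(picked: bool, placed: bool) -> float:
    return 0.5 * picked + 0.5 * placed

def success_rate(outcomes) -> float:
    """Mean rollout score over (picked, placed) outcomes, as a percentage."""
    return 100.0 * sum(rollout_score(p, q) for p, q in outcomes) / len(outcomes)

# Illustrative example: 20 rollouts with 16 full successes,
# 3 partial successes (pick only), and 1 failure.
outcomes = [(True, True)] * 16 + [(True, False)] * 3 + [(False, False)]
print(success_rate(outcomes))  # 87.5
```

Under this scheme, a policy that always picks but never places would still score 50%, so the metric credits partial progress.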

In-Domain Evaluation

We evaluated the trained policy In-Domain (ID), conducting a total of 20 rollouts. As shown below, the performance between the baseline (PVR) and our method (PVR+AFA) was comparable in this setting, achieving success rates of 87.5% and 85.0%, respectively.

1. w/o AFA In-Domain
2. with AFA In-Domain

Out-of-Distribution (OOD) Evaluation

To evaluate robustness, we introduced visual perturbations by randomly placing 20 everyday items as distractors in the scene at the beginning of each rollout. Even though ID performance was comparable, omitting AFA led to a catastrophic drop in the OOD setting: the success rate of the policy trained without AFA plummeted to 17.5%, whereas the policy with AFA suffered only a minor performance drop, achieving a success rate of 75.0%.

It is apparent that AFA shifts attention mostly to task-relevant regions, whereas the PVR's attention spreads over every semantically rich object in the scene.

3. w/o AFA Out-of-Domain
4. with AFA Out-of-Domain

Planar Pushing with KUKA IIWA


In-Domain Evaluation

Directly leveraging features from Pre-trained Visual Representations (PVRs) proves effective when deploying the trained policy in in-domain scenes (i.e., those that closely match the expert demonstrations used during behavior cloning).

PVR (Baseline)
PVR + AFA (Ours)

Out-of-Domain Evaluation

We tested the policies against Scene Distractors, Lighting Perturbations, and Object Modifications. Below we group the results into rollouts with and without AFA.

w/o AFA Out-of-Domain

with AFA (Ours)

Simulation Experiments


Expert Demonstrations

Below we present the 10 tasks selected from Metaworld that we train visuo-motor policies to solve in our experiments. From each task, we extract 25 expert demonstrations with random initial conditions (e.g., object positions).

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close


Scenes Under Visual Perturbations

Below, we present the same 10 tasks but with visually perturbed scenes. First, we introduce random variations in the tabletop texture, often incorporating patterns that may serve as semantic visual distractors (e.g., flowers, numbers, etc.). Second, we modify the lighting conditions by adjusting brightness, position, and orientation, which affects most pixels in the scene (e.g., by casting new shadows).

Tabletop Texture

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close


Light Brightness, Position, Orientation

Assembly

Coffee Pull

Button Press

Pick and Place

Bin Picking

Drawer Open

Peg Insert

Shelf Place

Disassemble

Box Close


PVR vs PVR+AFA Under Visual Perturbations

Below, we visualize out-of-domain evaluations of policies trained with and without AFA. As is evident, the policy becomes much more robust to visual perturbations when using AFA.

Evaluation Under Tabletop Texture Changes

PVR

PVR + AFA (Ours)

Evaluation Under Light Brightness, Position, Orientation Changes

PVR

PVR + AFA (Ours)


Interpreting AFA from Attention Heatmaps

The following videos illustrate how the CLS token attends to different patches compared to the trained query token of AFA, offering insight into what each model prioritizes. While this does not imply that trained AFAs are entirely robust to visual changes in the scene (e.g., note that MAE+AFA still allocates some attention to patches containing the robot’s cast shadow), we observe a consistent trend: the attention heatmaps become more focused, particularly on regions relevant to the task.
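To make the mechanism behind these heatmaps concrete, here is a minimal numerical sketch of attentive feature aggregation: a single learned query token cross-attends over frozen PVR patch features, and the resulting attention weights over patches are what heatmaps like those below visualize. All names, dimensions, and the single-head formulation are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def afa_pool(patch_feats, query, w_q, w_k, w_v):
    """Single-head cross-attention pooling (illustrative sketch):
    one learned query attends over frozen PVR patch features."""
    d = w_k.shape[1]
    q = query @ w_q                          # (1, d) projected query
    k = patch_feats @ w_k                    # (n, d) projected keys
    v = patch_feats @ w_v                    # (n, d) projected values
    scores = (q @ k.T) / np.sqrt(d)          # (1, n) similarity per patch
    scores -= scores.max()                   # shift for numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()
    return attn @ v, attn                    # pooled feature and heatmap weights

rng = np.random.default_rng(0)
n, d_in, d = 196, 384, 64                    # e.g. 14x14 ViT patches (assumed sizes)
patch_feats = rng.standard_normal((n, d_in))
query = rng.standard_normal((1, d_in))       # the trainable AFA query token
w_q, w_k, w_v = (rng.standard_normal((d_in, d)) * 0.02 for _ in range(3))
pooled, attn = afa_pool(patch_feats, query, w_q, w_k, w_v)
print(pooled.shape, attn.shape)              # (1, 64) (1, 196)
```

Reshaping `attn` to the 14x14 patch grid and upsampling it over the input image yields the kind of heatmaps shown here; during training, only the query and the projections would receive gradients, while the PVR backbone stays frozen.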

DINO: objects

MAE: light

VC-1: texture

DINO + AFA: objects

MAE + AFA: light

VC-1 + AFA: texture