Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors

1University of Edinburgh, 2Edinburgh Centre for Robotics, 3UCL

Grasping a stuffed toy by its left or right arm.

Grasping the left or the right shoe by its opening.

Click to Grasp leverages web-trained text-to-image diffusion-based generative models to extract fine-grained part descriptors,
enabling zero-shot synthesis of gripper poses for precise manipulation tasks in diverse tabletop scenarios.


Precise manipulation that generalizes across scenes and objects remains a persistent challenge in robotics. Current approaches to this task depend heavily on a large number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click on a source image of a visually different instance of the same object. We require no manual grasping demonstrations, as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantics-aware robotic manipulation.
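To make the idea of dense semantic part correspondence concrete, the following is a minimal sketch and not the authors' implementation. It assumes that per-pixel descriptors have already been extracted for both the source and the target image (here as plain NumPy arrays), and it uses cosine-similarity nearest-neighbour matching as a stand-in for whatever matching procedure the paper actually uses. The function name `match_click` and the array layout are hypothetical.

```python
import numpy as np


def match_click(source_feats, target_feats, click_rc):
    """Find the target location whose descriptor best matches a clicked source location.

    Assumptions (illustrative only):
      source_feats, target_feats: arrays of shape (H, W, C) with one descriptor per pixel.
      click_rc: (row, col) of the user click, in source feature-map coordinates.
    Returns the best-matching (row, col) in the target and the full similarity map.
    """
    eps = 1e-8  # avoids division by zero for all-zero descriptors

    # Descriptor under the user's click, normalised to unit length.
    query = source_feats[click_rc[0], click_rc[1]]
    query = query / (np.linalg.norm(query) + eps)

    # Normalise every target descriptor so the dot product is a cosine similarity.
    norms = np.linalg.norm(target_feats, axis=-1, keepdims=True)
    target_unit = target_feats / (norms + eps)

    similarity = target_unit @ query  # shape (H, W)
    best = np.unravel_index(np.argmax(similarity), similarity.shape)
    return (int(best[0]), int(best[1])), similarity
```

In this sketch the similarity map could also be thresholded to obtain a region rather than a single point, which may be more robust when several neighbouring pixels belong to the same part.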

Click to Grasp takes calibrated RGB-D images of a tabletop and user-defined part instances in diverse source images as input, and produces gripper poses for interaction, effectively disambiguating between visually similar but semantically different concepts (e.g., left vs right arms).
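Because the input consists of calibrated RGB-D images, a matched pixel can in principle be lifted to a 3D point before a gripper pose is computed. The sketch below illustrates that step under a standard pinhole camera model; this is an assumption made for illustration, not a description of the authors' pipeline. The function names and the parameters `fx`, `fy`, `cx`, `cy`, and `T_world_cam` are hypothetical.

```python
import numpy as np


def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels; depth_m is the depth in metres.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m], dtype=float)


def camera_to_world(point_cam, T_world_cam):
    """Map a camera-frame point into the world frame.

    T_world_cam is assumed to be a 4x4 homogeneous camera-to-world
    transform obtained from extrinsic calibration.
    """
    homogeneous = np.append(point_cam, 1.0)
    return (T_world_cam @ homogeneous)[:3]
```

A grasp position obtained this way would still need an orientation, for example from the local surface geometry, before it constitutes a full gripper pose.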


      title={Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors},
      author={Tsagkas, Nikolaos and Rome, Jack and Ramamoorthy, Subramanian and Mac Aodha, Oisin and Lu, Chris Xiaoxuan},
      journal={arXiv preprint arXiv:2403.14526},