Click to Grasp

Abstract

Precise manipulation that generalizes across scenes and objects remains a persistent challenge in robotics. Current approaches to this task depend heavily on large numbers of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We frame the problem as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click on a source image of a visually different instance of the same object. We require no manual grasping demonstrations, as we leverage the intrinsic object geometry and features. Experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantics-aware robotic manipulation.
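
To make the correspondence framing concrete, here is a minimal sketch and not the authors' implementation. It assumes per-pixel descriptors have already been extracted for both images (tiny hand-written vectors stand in for diffusion features here) and transfers a click by nearest-neighbour cosine similarity. The names `cosine` and `transfer_click` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_click(src_feats, click, tgt_feats):
    """Find the target pixel whose descriptor best matches the clicked source pixel.

    src_feats / tgt_feats: nested lists indexed as [row][col] -> descriptor vector.
    click: (row, col) of the user click in the source image.
    Returns ((row, col), similarity) for the best match in the target.
    """
    row, col = click
    query = src_feats[row][col]
    best_pixel, best_sim = None, -2.0  # cosine similarity is always >= -1
    for i, feat_row in enumerate(tgt_feats):
        for j, descriptor in enumerate(feat_row):
            sim = cosine(query, descriptor)
            if sim > best_sim:
                best_pixel, best_sim = (i, j), sim
    return best_pixel, best_sim


# Toy 2x2 descriptor maps (assumed values, for illustration only).
src = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
       [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
tgt = [[[0.0, 0.9, 0.1], [0.1, 0.0, 1.0]],
       [[0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]]

match, sim = transfer_click(src, (0, 0), tgt)
print(match)  # (1, 0)
```

In this toy example the click at source pixel (0, 0) lands on target pixel (1, 0), the only location whose descriptor points in nearly the same direction. A real system would need to go further than a single arg-max, for instance to separate left from right parts that look alike.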

BibTeX

@inproceedings{tsagkas2024click,
  title={Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors},
  author={Tsagkas, Nikolaos and Rome, Jack and Ramamoorthy, Subramanian and Mac Aodha, Oisin and Lu, Chris Xiaoxuan},
  booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2024}
}

Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors

Grasping from the left or the right arm of a stuffed toy.

Grasping from the opening of the left or the right shoe.

Click to Grasp leverages web-trained text-to-image diffusion-based generative models to extract fine-grained part descriptors, enabling zero-shot synthesis of gripper poses for precise manipulation tasks in diverse tabletop scenarios.

Click to Grasp takes calibrated RGB-D images of a tabletop and user-defined part instances in diverse source images as input, and produces gripper poses for interaction, effectively disambiguating between visually similar but semantically different concepts (e.g., left vs right arms).
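
As a hedged illustration of why calibration matters here, the sketch below lifts a matched pixel and its depth reading to a 3D point using the standard pinhole camera model, then moves it into a world frame with a rigid transform. This is generic geometry, not the authors' gripper pose synthesis; the intrinsics, extrinsics, and the function names `backproject` and `to_world` are assumptions made for the example.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) with metric depth -> camera-frame (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam, rotation, translation):
    """Apply a camera-to-world rigid transform: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[i][k] * p_cam[k] for k in range(3)) + translation[i]
        for i in range(3)
    )


# Assumed intrinsics (pixels) and extrinsics; real values come from calibration.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
rotation = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
translation = (0.0, 0.0, 1.0)

# A matched pixel 60 px right of the principal point, observed at 0.5 m depth.
p_cam = backproject(380.0, 240.0, 0.5, fx, fy, cx, cy)
p_world = to_world(p_cam, rotation, translation)
print(p_cam)    # (0.05, 0.0, 0.5)
print(p_world)  # (0.05, 0.0, 1.5)
```

A full grasp would also need an orientation for the gripper, which this sketch does not attempt to derive.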
