Learning Efficient and Robust Language-conditioned Manipulation using Textual-Visual Relevancy and Equivariant Language Mapping

1Brown University
2Northeastern University
*Equal contribution

Grounded Equivariant Manipulation (GEM), a robust yet efficient approach that leverages pre-trained vision-language models with equivariant language mapping for language-conditioned manipulation tasks

Abstract

Controlling robots through natural language is pivotal for enhancing human-robot collaboration and synthesizing complex robot behaviors. Recent works trained on large robot datasets show impressive generalization abilities. However, such pretrained methods are (1) often fragile to unseen scenarios and (2) expensive to adapt to new tasks. This paper introduces Grounded Equivariant Manipulation (GEM), a robust yet efficient approach that leverages pre-trained vision-language models with equivariant language mapping for language-conditioned manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and generalization ability across diverse tasks in both simulation and the real world. GEM achieves similar or higher performance with orders of magnitude less robot data than major data-efficient baselines such as CLIPort and VIMA. Finally, our approach is more robust than large VLA models, e.g., OpenVLA, at correctly interpreting natural language commands on unseen objects and poses. Code, data, and training details are available.

Textual-Visual Relevancy

To enable generalization to novel objects, colors, and shapes, we propose textual-visual relevancy maps that distill information from large vision and language models. We extract textual relevancy for open-vocabulary recognition and construct visual relevancy via data retrieval to improve robustness.
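To make the textual relevancy idea concrete, the sketch below computes a coarse text-to-patch similarity map with an off-the-shelf CLIP backbone from Hugging Face Transformers. The model name, the direct projection of patch tokens, and the map resolution are illustrative assumptions for this sketch, not the exact pipeline used in GEM (which additionally builds visual relevancy via data retrieval).

```python
# Minimal sketch of a textual relevancy map from CLIP patch features.
# Assumes a CLIP-style backbone; details are illustrative, not GEM's exact pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def textual_relevancy(image: Image.Image, text: str) -> torch.Tensor:
    """Cosine similarity between the text embedding and each image patch."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patch_tokens = vision_out.last_hidden_state[:, 1:, :]  # drop the CLS token
        # Project patch tokens into the joint space (simplification: skips post-layernorm).
        patch_emb = model.visual_projection(patch_tokens)
    patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sim = (patch_emb @ text_emb.T).squeeze(-1)      # (1, num_patches)
    side = int(sim.shape[-1] ** 0.5)                # 7x7 grid for ViT-B/32 at 224 px
    return sim.reshape(side, side)                  # coarse per-patch relevancy map
```

The returned map can be upsampled to image resolution and used as a language-grounded attention prior over candidate pick locations.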

Equivariant Language Mapping

Grounded Equivariant Manipulation (GEM) exploits domain symmetries inherent in the language-conditioned manipulation problem, specifically SE(2) equivariance. For example, consider "grasp the coffee mug by its handle." If the mug is rotated or translated, the desired pick action should transform accordingly, i.e., equivariantly. Our learned policy incorporates such language-conditioned symmetries and generalizes to unseen scenarios related by these transformations, allowing it to handle unseen object poses in a few-shot manner.
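To illustrate what this equivariance constraint means in practice, the following sketch checks that a dense pick-score policy commutes with 90-degree planar rotations, i.e., f(g·x) = g·f(x). The `policy` callable is a hypothetical stand-in for an equivariant score-map network, not GEM's actual architecture.

```python
# Illustrative rotation-equivariance check for a planar pick policy.
# `policy` is a hypothetical stand-in: any callable mapping an observation
# tensor of shape (C, H, W) to a dense pick-score map of shape (H, W).
import torch

def check_rot90_equivariance(policy, obs: torch.Tensor, atol: float = 1e-5) -> bool:
    """Return True if policy(rot90(obs)) matches rot90(policy(obs)) within tolerance."""
    score = policy(obs)                                      # score map for the original scene
    rotated_obs = torch.rot90(obs, k=1, dims=(-2, -1))       # rotate the scene by 90 degrees
    score_of_rotated = policy(rotated_obs)                   # policy applied to the rotated scene
    rotated_score = torch.rot90(score, k=1, dims=(-2, -1))   # rotate the original prediction
    return torch.allclose(score_of_rotated, rotated_score, atol=atol)

if __name__ == "__main__":
    # Toy policy: per-pixel channel norm, which is trivially rotation-equivariant.
    toy_policy = lambda x: x.norm(dim=0)
    obs = torch.rand(3, 64, 64)
    print(check_rot90_equivariance(toy_policy, obs))  # True
```

A policy satisfying this property predicts consistently transformed pick actions when objects appear at new positions and orientations, which is the source of the few-shot pose generalization described above.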

Mobile Manipulation

We also evaluate GEM on a Spot robot for language-conditioned pick and place tasks.

BibTeX

@article{jia2025gem,
  title={Learning Efficient and Robust Language-conditioned Manipulation using Textual-Visual Relevancy and Equivariant Language Mapping},
  author={Jia, Mingxi and Huang, Haojie and Zhang, Zhewen and Wang, Chenghao and Zhao, Linfeng and Wang, Dian and Liu, Jason Xinyu and Walters, Robin and Platt, Robert and Tellex, Stefanie},
  journal={IEEE Robotics and Automation Letters},
  year={2025},
  publisher={IEEE}
}