Textual-Visual Relevancy
To generalize to novel objects, colors, and shapes, we propose textual-visual relevancy maps that distill information from large vision and language models. The textual relevancy map supports open-vocabulary recognition, while the visual relevancy map, constructed via data retrieval, improves robustness.
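As a rough illustration of the idea, the sketch below computes a relevancy map as the cosine similarity between per-patch image features and a set of query embeddings: a single text embedding for the textual map, and embeddings of retrieved example images for the visual map. Everything here is an assumption made for illustration and is not taken from the method itself: the function names (`relevancy_map`, `textual_visual_relevancy`), the max-over-queries scoring, the min-max normalization, and the weighted-sum fusion with `alpha` are all hypothetical, and the features are toy arrays standing in for the outputs of a pretrained vision-language encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def relevancy_map(patch_feats, query_feats):
    """Cosine-similarity relevancy of each patch to its best-matching query.

    patch_feats: (H, W, D) per-patch image features.
    query_feats: (K, D) query embeddings (text or retrieved images).
    Returns an (H, W) map min-max normalized to [0, 1].
    """
    p = l2_normalize(patch_feats)
    q = l2_normalize(query_feats)
    sim = p @ q.T                 # (H, W, K) cosine similarities
    score = sim.max(axis=-1)      # best-matching query per patch
    lo, hi = score.min(), score.max()
    return (score - lo) / (hi - lo + 1e-8)


def textual_visual_relevancy(patch_feats, text_feat, retrieved_feats, alpha=0.5):
    """Hypothetical fusion: weighted sum of textual and visual maps."""
    textual = relevancy_map(patch_feats, text_feat[None, :])
    visual = relevancy_map(patch_feats, retrieved_feats)
    return alpha * textual + (1.0 - alpha) * visual


# Toy 2x2 feature grid with 3-dimensional features (placeholder values).
patch_feats = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
])
text_feat = np.array([1.0, 0.0, 0.0])          # stand-in for a text embedding
retrieved_feats = np.array([[1.0, 0.1, 0.0]])  # stand-in for retrieved images

fused = textual_visual_relevancy(patch_feats, text_feat, retrieved_feats)
peak = np.unravel_index(np.argmax(fused), fused.shape)
```

In this toy setup the top-left patch aligns with both the text query and the retrieved example, so `peak` points at it. In practice the fusion rule and normalization would be chosen to match the actual model and task.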