Multimodal query-guided object localization
ISSN
1380-7501
Date Issued
2024-02-01
Author(s)
Tripathi, Aditay
Dani, Rajath R.
Mishra, Anand
Chakraborty, Anirban
DOI
10.1007/s11042-023-15779-y
Abstract
Recent studies have demonstrated the effectiveness of using hand-drawn sketches of objects as queries for one-shot object localization. However, crude hand-drawn sketches alone can be ambiguous for object localization, which could result in misidentification, e.g., a sketch of a laptop could be confused for a sofa. To overcome this, we propose a novel multimodal approach to object localization that combines sketch queries with linguistic category definitions, allowing for a better representation of visual and semantic cues. Our approach employs a cross-modal attention scheme that guides the region proposal network to obtain relevant proposals. Further, we propose an orthogonal projection-based proposal scoring technique that effectively ranks proposals with respect to the query. We evaluate our method using hand-drawn sketches from the ‘Quick, Draw!’ dataset and glosses from ‘WordNet’ as queries on the widely used MS-COCO dataset, and achieve superior performance compared to related baselines in both open- and closed-set settings.
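To make the projection-based ranking idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes proposals are scored by projecting each proposal feature onto the direction of a fused sketch-plus-gloss query embedding and ranking by the projection length. The function name `score_proposals` and all tensor shapes are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): rank region proposals by the
# length of their projection onto a multimodal query embedding.
import torch


def score_proposals(proposal_feats: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
    """Score proposals against a query embedding.

    proposal_feats: (N, d) features of N region proposals.
    query_emb:      (d,)   fused sketch + category-definition embedding.
    Returns:        (N,)   scores; higher means more query-relevant.
    """
    # Unit vector along the query direction (guard against a zero norm).
    q = query_emb / query_emb.norm().clamp_min(1e-8)
    # Signed length of each proposal feature's projection onto the query
    # direction serves as the relevance score.
    return proposal_feats @ q


# Usage: the top-scoring proposal is taken as the localization result.
feats = torch.randn(100, 256)   # e.g. 100 proposals with 256-d features
query = torch.randn(256)        # fused sketch + gloss query embedding
best_proposal = feats[score_proposals(feats, query).argmax()]
```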