Mishra, Anand
Preferred name: Mishra, Anand
Alternative name: Mishra, A.
Scopus Author ID: 35475490000
Researcher ID: ABA-3300-2021
Publications (9)
- Publication: Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch (2024)
  Authors: Aditay Tripathi; Anirban Chakraborty
  In this study, we explore sketch-based object localization on natural images. Given a crude hand-drawn object sketch, the task is to locate all instances of that object in the target image. This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap between the sketches and the natural images. Existing solutions address this using attention-based frameworks to merge query information into image features. Yet, these methods often integrate query features after independently learning image features, causing inadequate alignment and, as a result, incorrect localization. In contrast, we propose a novel sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features, leading to stronger alignment with the query sketch. Further, at the decoder's output, object and sketch features are further refined to better align the representation of objects with the sketch query, thereby improving localization. The proposed model also generalizes to object categories not seen during training, as the target image features learned by the proposed model are query-aware. Our framework can utilize multiple sketch queries via a novel trainable sketch fusion strategy. The model is evaluated on images from the public benchmark MS-COCO, using sketch queries from the QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach gives a 6.6% and 8.0% improvement in mAP for seen objects using sketch queries from the QuickDraw! and Sketchy datasets, respectively, and a 12.2% improvement in AP@50 for large objects that are 'unseen' during training. The code is available at https://vcl-iisc.github.io/locformer/.

- Publication: Sketch-guided Image Inpainting with Partial Discrete Diffusion Process (2024)
  Authors: Nakul Sharma; Aditay Tripathi; Anirban Chakraborty
  In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image, and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs - the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available here: https://github.com/vl2g/Sketch-Inpainting.

- Publication: Semantic Labels-Aware Transformer Model for Searching over a Large Collection of Lecture-Slides (2024)
  Authors: K. V. Jobin; C. V. Jawahar
  Massive Open Online Courses (MOOCs) enable easy access to many educational materials, particularly lecture slides, on the web. Searching through them based on user queries becomes an essential problem due to the availability of such vast information. To address this, we present the Lecture Slide Deck Search Engine - a model that supports natural language queries and hand-drawn sketches and performs searches on a large collection of slide images on computer science topics. This search engine is trained using a novel semantic label-aware transformer model that extracts the semantic labels in the slide images and seamlessly encodes them with the visual cues from the slide images and textual cues from the natural language query. Further, to study the problem in a challenging setting, we introduce a novel dataset, namely the Lecture Slide Deck (LecSD) dataset, containing 54K slide images from Data Structures, Computer Networks, and Optimization courses, and provide associated manual annotation for the query in the form of natural language or hand-drawn sketch. The proposed Lecture Slide Deck Search Engine outperforms the competitive baselines and achieves nearly 4% higher Recall@1 in absolute terms than the state-of-the-art approach. We firmly believe that this work will open up promising directions for improving the accessibility and usability of educational resources, enabling students and educators to find and utilize lecture materials more effectively.

- Publication: From strings to things: Knowledge-enabled VQA model that can read and reason (2019-10-01)
  Authors: Singh, Ajeet Kumar; Shekhar, Shashank; Chakraborty, Anirban
  Texts present in images are not merely strings; they provide useful cues about the image. Despite their utility in better image understanding, scene texts are not used in traditional visual question answering (VQA) models. In this work, we present a VQA model which can read scene texts and perform reasoning on a knowledge graph to arrive at an accurate answer. Our proposed model has three mutually interacting modules: (i) a proposal module to get word and visual content proposals from the image, (ii) a fusion module to fuse these proposals, the question, and the knowledge base to mine relevant facts, and represent these facts as a multi-relational graph, and (iii) a reasoning module to perform novel gated graph neural network-based reasoning on this graph. The performance of our knowledge-enabled VQA model is evaluated on our newly introduced dataset, viz. text-KVQA. To the best of our knowledge, this is the first dataset that identifies the need for bridging text recognition with knowledge graph-based reasoning. Through extensive experiments, we show that our proposed method outperforms traditional VQA methods as well as methods for question answering over knowledge bases on text-KVQA.

- Publication: Bridging language to visuals: towards natural language query-to-chart image retrieval (2024)
  Authors: Neelu Verma; Anik De
  Given a natural language query, mining a relevant chart image, i.e., the one that contains the answer to the query, is an overlooked problem in the literature. Our study explores this novel problem. Consider an example of retrieving relevant chart images for a query: Which Indian city has the highest annual rainfall over the past decade?. Retrieving relevant chart images for such natural language queries necessitates a deep semantic understanding of chart images. Towards addressing this problem, in this work, we make two key contributions: (a) we present a dataset, namely WebCIRD (or Web Chart Image Retrieval), for studying this problem, and (b) we propose a solution, viz. ChartSemBERT, that offers a deeper semantic understanding of chart images for effective natural language-to-chart image retrieval. Our proposed approach yields remarkable performance improvements compared to the existing baselines, achieving an R@10 of 86.9%.

- Publication: Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions (2024)
  Authors: Prajwal Gatti; Kshitij Parikh; Dhriti Prasanna Paul; Manish Gupta
  Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for 'numbats.' Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., "numbat digging in the ground." In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of "difficult-to-name but easy-to-draw" objects and text describing "difficult-to-sketch but easy-to-verbalize" object's attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ∼2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at: https://vl2g.github.io/projects/cstbir. Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org).

- Publication: Multimodal query-guided object localization (2023)
  Authors: Aditay Tripathi; Rajath R Dani; Anirban Chakraborty
  Recent studies have demonstrated the effectiveness of using hand-drawn sketches of objects as queries for one-shot object localization. However, hand-drawn crude sketches alone can be ambiguous for object localization, which could result in misidentification, e.g., a sketch of a laptop could be confused for a sofa. To overcome this, we propose a novel multimodal approach to object localization that combines sketch queries with linguistic category definitions, allowing for a better representation of visual and semantic cues. Our approach employs a cross-modal attention scheme that guides the region proposal network to obtain relevant proposals. Further, we propose an orthogonal projection-based proposal scoring technique that effectively ranks proposals with respect to the query. We evaluated our method using hand-drawn sketches from the 'Quick, Draw!' dataset and glosses from 'WordNet' as queries on the widely-used MS-COCO dataset, and achieve superior performance compared to related baselines in both open- and closed-set settings.

- Publication: PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures (2025-04)
  Authors: Shreya Shukla; Nakul Sharma; Manish Gupta
  Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial to effective knowledge sharing and enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To this end, we introduce PATENTDESC-355K, a novel large-scale dataset containing ∼355K patent figures along with their brief and detailed textual descriptions extracted from more than 60K US patent documents. In addition, we propose PATENTLMM - a novel large multimodal model specifically tailored to generate high-quality descriptions of patent figures. Our proposed PATENTLMM comprises two key components: (i) PATENTMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PATENTLLAMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Our extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts performance, generating more coherent descriptions than fine-tuning similar-sized off-the-shelf multimodal models. PATENTDESC-355K and PATENTLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

- Publication: QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos (2024)
  Authors: Yogesh Kumar; Saswat Mallick; Sowmya Rasipuram; Anutosh Maitra; Roshni Ramnani
  In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in the target video using a single query image of the object. Toward addressing this challenging problem, we extend a popular and successful object detection method, namely DETR (Detection Transformer), and introduce a novel approach - query-guided detection transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit information from the query image and the spatio-temporal context of the target video, which significantly aids in precisely pinpointing the desired object in the video. We incorporate cross-attention mechanisms that capture temporal relationships across adjacent frames to handle the dynamic context in videos effectively. Further, to ensure strong initialization for QDETRv, we also introduce a novel unsupervised pretraining technique tailored to videos. This involves training our model on synthetic object trajectories with an objective analogous to the query-guided localization task. During this pretraining phase, we incorporate recurrent object queries and loss functions that encourage accurate patch feature reconstruction. These additions enable better temporal understanding and robust representation learning. Our experiments show that the proposed model significantly outperforms the competitive baselines on two public benchmarks, VidOR and ImageNet-VidVRD, extended for one-shot open-set localization tasks.
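
Several of the entries above (e.g., the query-guided attention, QDETRv, and multimodal localization works) revolve around the same core mechanism: cross-attention that conditions image features on a query embedding. As an illustrative aside, a minimal single-head version of that mechanism can be sketched as follows. The function names and the random matrices standing in for learned projection weights are hypothetical; this is not code from any of the papers.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, sketch_feats, d_k=64):
    """Condition image patch features on sketch query features.

    image_feats:  (num_patches, d) - acts as the attention queries
    sketch_feats: (num_tokens, d)  - acts as keys and values
    Returns query-conditioned image features of shape (num_patches, d).
    """
    rng = np.random.default_rng(0)
    d = image_feats.shape[1]
    # random projections stand in for learned weight matrices
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)

    Q = image_feats @ W_q                     # (num_patches, d_k)
    K = sketch_feats @ W_k                    # (num_tokens, d_k)
    V = sketch_feats @ W_v                    # (num_tokens, d)

    attn = softmax(Q @ K.T / np.sqrt(d_k))    # (num_patches, num_tokens)
    conditioned = attn @ V                    # (num_patches, d)
    # residual connection keeps the original image content
    return image_feats + conditioned

patches = np.random.default_rng(1).standard_normal((196, 256))  # 14x14 patch grid
sketch = np.random.default_rng(2).standard_normal((49, 256))    # sketch tokens
out = cross_attention(patches, sketch)
print(out.shape)  # (196, 256)
```

In a transformer encoder, a block like this would be interleaved after each self-attention block so that the image representation becomes query-aware at every depth, rather than fusing the query only at the end.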
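
The multimodal localization entry above mentions an orthogonal projection-based proposal scoring technique. As a rough, hypothetical illustration of the general idea (not the paper's actual formulation), one can score each region-proposal feature by its component along the query embedding's direction, penalized by the residual orthogonal to it:

```python
import numpy as np

def projection_scores(proposals, query):
    """Score proposals by decomposing each feature relative to the query.

    proposals: (num_proposals, d) feature vectors (hypothetical)
    query:     (d,) multimodal query embedding (hypothetical)
    """
    q = query / np.linalg.norm(query)       # unit query direction
    along = proposals @ q                   # component along the query
    residual = proposals - np.outer(along, q)
    ortho = np.linalg.norm(residual, axis=1)  # component orthogonal to it
    # proposals aligned with the query score high; orthogonal ones score low
    return along - ortho

props = np.array([[3.0, 0.1],   # strongly aligned with the query
                  [0.2, 2.0],   # mostly orthogonal
                  [1.0, 1.0]])  # in between
q = np.array([1.0, 0.0])
scores = projection_scores(props, q)
print(scores.argmax())  # 0
```

The decomposition makes the ranking sensitive not just to similarity magnitude but to how much of each proposal's feature lies outside the query subspace.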