Beyond visual semantics: Exploring the role of scene text in image understanding

ISSN
0167-8655
Date Issued
2021-09-01
Author(s)
Dey, Arka Ujjal
Ghosh, Suman K.
Valveny, Ernest
Harit, Gaurav
DOI
10.1016/j.patrec.2021.06.011
Abstract
Images with visual and scene text content are ubiquitous in everyday life. However, current image interpretation systems are mostly limited to using only the visual features, neglecting to leverage the scene text content. In this paper, we propose to jointly use scene text and visual channels for robust semantic interpretation of images. We not only extract and encode visual and scene text cues but also model their interplay to generate a contextual joint embedding with richer semantics. The contextual embedding thus generated is applied to retrieval and classification tasks on multimedia images with scene text content to demonstrate its effectiveness. In the retrieval framework, we augment the contextual semantic representation with scene text cues to mitigate vocabulary misses that may have occurred during the semantic embedding. To deal with irrelevant or erroneous scene text recognition, we also apply query-based attention to the text channel. We show that our multi-channel approach, involving contextual semantics and scene text, improves upon the absolute accuracy of the current state-of-the-art methods on the Advertisement Images Dataset by 8.9% in the relevant statement retrieval task and by 5% in the topic classification task.
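The abstract mentions query-based attention over the text channel as a way to down-weight irrelevant or misrecognized scene text. The sketch below illustrates that general idea only, using plain scaled dot-product attention of a single query vector over scene-text token vectors. It is not the authors' implementation: the function names, the two-dimensional toy vectors, and the choice of scaled dot-product scoring are all assumptions made for illustration, and the paper's actual formulation may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_attention(query, token_vectors):
    """Hypothetical helper: attend from one query vector over token vectors.

    Returns the attention weights and the weighted sum of the tokens.
    Tokens that align poorly with the query receive small weights, which
    is how noisy scene-text tokens could be suppressed.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in token_vectors])
    pooled = [
        sum(w * t[i] for w, t in zip(weights, token_vectors))
        for i in range(len(query))
    ]
    return weights, pooled


# Toy data (assumed): one query and three scene-text token embeddings.
query = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, pooled = query_attention(query, tokens)
```

In this toy case the first token points in the same direction as the query and so receives the largest weight, while the third token, which points the opposite way, receives the smallest.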
Subjects
  • Graph attention
  • Scene text
  • Text and vision
  • Visual semantics
