Please use this identifier to cite or link to this item: https://doi.org/10.1109/TMM.2022.3204444
DC Field | Value
---|---
dc.title | Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision
dc.contributor.author | Wang, X
dc.contributor.author | Zhu, L
dc.contributor.author | Zheng, Z
dc.contributor.author | Xu, M
dc.contributor.author | Yang, Y
dc.date.accessioned | 2023-11-14T02:49:31Z
dc.date.available | 2023-11-14T02:49:31Z
dc.date.issued | 2022-01-01
dc.identifier.citation | Wang, X, Zhu, L, Zheng, Z, Xu, M, Yang, Y (2022-01-01). Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision. IEEE Transactions on Multimedia : 1-11. ScholarBank@NUS Repository. https://doi.org/10.1109/TMM.2022.3204444
dc.identifier.issn | 1520-9210
dc.identifier.issn | 1941-0077
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/245914
dc.description.abstract | Text-video retrieval is one of the basic tasks for multimodal research and has been widely harnessed in many real-world systems. Most existing approaches directly compare the global representation between videos and text descriptions and utilize the global contrastive loss to train the model. These designs overlook the local alignment and the word-level supervision signal. In this paper, we propose a new framework, called Align and Tell, for text-video retrieval. Compared to the previous work, our framework contains additional modules, i.e., two transformer decoders for local alignment and one captioning head to enhance the representation learning. First, we introduce a set of learnable queries to interact with both textual representations and video representations and project them to a fixed number of local features. After that, local contrastive learning is performed to complement the global comparison. Moreover, we design a video captioning head to provide additional supervision signals during training. This word-level supervision can enhance the video representation and alleviate the cross-modal gap. The captioning head can be removed during inference and does not introduce extra computational costs. Extensive empirical results demonstrate that our Align and Tell model can achieve state-of-the-art performance on four text-video retrieval datasets, including MSR-VTT, MSVD, LSMDC, and ActivityNet-Captions.
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE)
dc.source | Elements
dc.type | Article
dc.date.updated | 2023-11-11T05:14:26Z
dc.contributor.department | DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi | 10.1109/TMM.2022.3204444
dc.description.sourcetitle | IEEE Transactions on Multimedia
dc.description.page | 1-11
dc.published.state | Published
Appears in Collections: Staff Publications Elements
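For readers who want a concrete picture of the two additions described in the abstract, the following is a minimal, illustrative PyTorch sketch: learnable queries fed to a small transformer decoder to produce a fixed number of local features per modality, a symmetric local contrastive loss over those features, and a training-only captioning head that supplies word-level supervision. Every module name, dimension, pooling choice, and hyper-parameter here is an assumption made for illustration; none of it is taken from the authors' released code.

```python
# Illustrative sketch only; all names, sizes, and losses are assumptions,
# not the authors' implementation of Align and Tell.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalAligner(nn.Module):
    """Projects a variable-length token sequence to a fixed number of local features."""

    def __init__(self, dim=512, num_queries=8, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable queries that attend to the encoder tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, tokens):                        # tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.decoder(q, tokens)                # (B, K, dim)


def local_contrastive_loss(text_local, video_local, temperature=0.05):
    """Symmetric InfoNCE over mean-pooled local features (one plausible choice)."""
    t = F.normalize(text_local.mean(dim=1), dim=-1)   # (B, dim)
    v = F.normalize(video_local.mean(dim=1), dim=-1)  # (B, dim)
    logits = t @ v.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


class CaptioningHead(nn.Module):
    """Training-only decoder that predicts caption tokens from video features."""

    def __init__(self, vocab_size=10000, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, caption_ids, video_feats):      # (B, L), (B, N, dim)
        length = caption_ids.size(1)
        x = self.embed(caption_ids)
        # Causal mask so each position only sees earlier caption tokens.
        mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        x = self.decoder(x, video_feats, tgt_mask=mask.to(caption_ids.device))
        return self.lm_head(x)                        # (B, L, vocab)


if __name__ == "__main__":
    B, N_t, N_v, L, dim = 4, 12, 16, 10, 512
    text_tokens = torch.randn(B, N_t, dim)            # stand-in text encoder output
    video_tokens = torch.randn(B, N_v, dim)           # stand-in video encoder output
    caption_ids = torch.randint(0, 10000, (B, L))     # stand-in tokenized captions

    aligner_t, aligner_v = LocalAligner(dim), LocalAligner(dim)
    cap_head = CaptioningHead(dim=dim)

    loss_local = local_contrastive_loss(aligner_t(text_tokens),
                                        aligner_v(video_tokens))
    logits = cap_head(caption_ids[:, :-1], video_tokens)
    loss_cap = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))
    loss = loss_local + loss_cap                      # global contrastive term omitted
    print(float(loss))
```

As the abstract notes, only the alignment and encoding modules are needed at retrieval time; the captioning head exists solely to shape the video representation during training and can be dropped at inference.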
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
TMM22-Xiaohan.pdf | Accepted version | 4.14 MB | Adobe PDF | OPEN | Post-print