Please use this identifier to cite or link to this item: https://doi.org/10.1109/ACCESS.2018.2879642
DC Field | Value
dc.title | A Fine-Grained Spatial-Temporal Attention Model for Video Captioning
dc.contributor.author | Liu, A.-A.
dc.contributor.author | Qiu, Y.
dc.contributor.author | Wong, Y.
dc.contributor.author | Su, Y.-T.
dc.contributor.author | Kankanhalli, M.
dc.date.accessioned | 2021-12-29T04:42:46Z
dc.date.available | 2021-12-29T04:42:46Z
dc.date.issued | 2018
dc.identifier.citation | Liu, A.-A., Qiu, Y., Wong, Y., Su, Y.-T., Kankanhalli, M. (2018). A Fine-Grained Spatial-Temporal Attention Model for Video Captioning. IEEE Access 6 : 68463-68471. ScholarBank@NUS Repository. https://doi.org/10.1109/ACCESS.2018.2879642
dc.identifier.issn | 21693536
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/212406
dc.description.abstract | The attention mechanism has been extensively used in video captioning tasks, enabling further development of deeper visual understanding. However, most existing video captioning methods apply the attention mechanism at the frame level, which models only the temporal structure and generated words but ignores the region-level spatial information that provides accurate visual features corresponding to the semantic content. In this paper, we propose a fine-grained spatial-temporal attention model (FSTA) whose main concern is the spatial information of objects appearing in the video. In the proposed FSTA, we achieve spatial hard attention at a fine-grained region level of objects through the mask pooling module and compute temporal soft attention by using a two-layer LSTM network with an attention mechanism to generate sentences. We test the proposed model on two benchmark datasets, namely, MSVD and MSR-VTT. The results indicate that our proposed FSTA model can achieve competitive performance against the state of the art on both datasets. © 2013 IEEE.
dc.publisher | Institute of Electrical and Electronics Engineers Inc.
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source | Scopus OA2018
dc.subject | Fine-grained
dc.subject | mask pooling
dc.subject | spatial-temporal
dc.subject | video captioning
dc.type | Article
dc.contributor.department | SMART SYSTEMS INSTITUTE
dc.contributor.department | DEPT OF COMPUTER SCIENCE
dc.description.doi | 10.1109/ACCESS.2018.2879642
dc.description.sourcetitle | IEEE Access
dc.description.volume | 6
dc.description.page | 68463-68471
Appears in Collections: Staff Publications

Files in This Item:
File | Description | Size | Format | Access Settings | Version
10_1109_ACCESS_2018_2879642.pdf | | 13.29 MB | Adobe PDF | OPEN | None


This item is licensed under a Creative Commons License.