A Fine-Grained Spatial-Temporal Attention Model for Video Captioning

Please use this identifier to cite or link to this item: https://doi.org/10.1109/ACCESS.2018.2879642

DC Field	Value
dc.title	A Fine-Grained Spatial-Temporal Attention Model for Video Captioning
dc.contributor.author	Liu, A.-A.
dc.contributor.author	Qiu, Y.
dc.contributor.author	Wong, Y.
dc.contributor.author	Su, Y.-T.
dc.contributor.author	Kankanhalli, M.
dc.date.accessioned	2021-12-29T04:42:46Z
dc.date.available	2021-12-29T04:42:46Z
dc.date.issued	2018
dc.identifier.citation	Liu, A.-A., Qiu, Y., Wong, Y., Su, Y.-T., Kankanhalli, M. (2018). A Fine-Grained Spatial-Temporal Attention Model for Video Captioning. IEEE Access 6 : 68463-68471. ScholarBank@NUS Repository. https://doi.org/10.1109/ACCESS.2018.2879642
dc.identifier.issn	2169-3536
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/212406
dc.description.abstract	The attention mechanism has been extensively used in video captioning tasks, enabling deeper visual understanding. However, most existing video captioning methods apply the attention mechanism at the frame level, which models only the temporal structure and the generated words but ignores the region-level spatial information that provides accurate visual features corresponding to the semantic content. In this paper, we propose a fine-grained spatial-temporal attention model (FSTA) that focuses on the spatial information of objects appearing in the video. In the proposed FSTA, we achieve spatial hard attention at a fine-grained region level of objects through the mask pooling module, and we compute temporal soft attention using a two-layer LSTM network with an attention mechanism to generate sentences. We test the proposed model on two benchmark datasets, namely MSVD and MSR-VTT. The results indicate that the proposed FSTA model achieves competitive performance against the state of the art on both datasets. © 2013 IEEE.
dc.publisher	Institute of Electrical and Electronics Engineers Inc.
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source	Scopus OA2018
dc.subject	Fine-grained
dc.subject	mask pooling
dc.subject	spatial-temporal
dc.subject	video captioning
dc.type	Article
dc.contributor.department	SMART SYSTEMS INSTITUTE
dc.contributor.department	DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi	10.1109/ACCESS.2018.2879642
dc.description.sourcetitle	IEEE Access
dc.description.volume	6
dc.description.page	68463-68471
Appears in Collections:	Staff Publications Elements

Files in This Item:

File	Description	Size	Format	Access Settings	Version
10_1109_ACCESS_2018_2879642.pdf		13.29 MB	Adobe PDF	OPEN	None	View/Download

This item is licensed under a Creative Commons License
