Please use this identifier to cite or link to this item: https://doi.org/10.3390/s20113126
dc.title: Spatiotemporal interaction residual networks with Pseudo3D for video action recognition
dc.contributor.author: Chen, J.
dc.contributor.author: Kong, J.
dc.contributor.author: Sun, H.
dc.contributor.author: Xu, H.
dc.contributor.author: Liu, X.
dc.contributor.author: Lu, Y.
dc.contributor.author: Zheng, C.
dc.date.accessioned: 2021-08-10T03:01:32Z
dc.date.available: 2021-08-10T03:01:32Z
dc.date.issued: 2020
dc.identifier.citation: Chen, J., Kong, J., Sun, H., Xu, H., Liu, X., Lu, Y., & Zheng, C. (2020). Spatiotemporal interaction residual networks with Pseudo3D for video action recognition. Sensors (Switzerland), 20(11), 3126. ScholarBank@NUS Repository. https://doi.org/10.3390/s20113126
dc.identifier.issn: 1424-8220
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/196139
dc.description.abstract: Action recognition is a significant and challenging topic in the field of sensor and computer vision. Two-stream convolutional neural networks (CNNs) and 3D CNNs are two mainstream deep learning architectures for video action recognition. To combine them into one framework to further improve performance, we proposed a novel deep network, named the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP possesses three advantages. First, the STINP consists of two branches constructed based on residual networks (ResNets) to simultaneously learn the spatial and temporal information of the video. Second, the STINP integrates the pseudo3D block into residual units for building the spatial branch, which ensures that the spatial branch can not only learn the appearance feature of the objects and scene in the video, but also capture the potential interaction information among the consecutive frames. Finally, the STINP adopts a simple but effective multiplication operation to fuse the spatial branch and temporal branch, which guarantees that the learned spatial and temporal representation can interact with each other during the entire process of training the STINP. Experiments were implemented on two classic action recognition datasets, UCF101 and HMDB51. The experimental results show that our proposed STINP can provide better performance for video recognition than other state-of-the-art algorithms. © 2020 by the authors.
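As a rough illustration of the multiplication-based fusion the abstract describes (this is not the authors' implementation; the shapes, variable names, and the `multiplicative_fusion` helper are hypothetical), the element-wise interaction between spatial- and temporal-branch feature maps can be sketched with NumPy:

```python
import numpy as np

# Hypothetical feature maps from the two branches of a two-stream network:
# (channels, height, width) activations for one video clip.
rng = np.random.default_rng(0)
spatial_features = rng.standard_normal((64, 14, 14))   # appearance branch (RGB frames)
temporal_features = rng.standard_normal((64, 14, 14))  # motion branch (optical flow)

def multiplicative_fusion(spatial, temporal):
    """Fuse two same-shaped feature maps by element-wise product,
    so each stream gates the other (unlike addition or concatenation)."""
    assert spatial.shape == temporal.shape
    return spatial * temporal

fused = multiplicative_fusion(spatial_features, temporal_features)
print(fused.shape)  # (64, 14, 14)
```

Because the product is computed element-wise, gradients flow through both branches during training, which is consistent with the abstract's claim that the two representations interact throughout training.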
dc.publisher: MDPI AG
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.source: Scopus OA2020
dc.subject: Pseudo3D architecture
dc.subject: Spatiotemporal representation learning
dc.subject: Two-branches network
dc.subject: Video action recognition
dc.type: Article
dc.contributor.department: CHEMICAL & BIOMOLECULAR ENGINEERING
dc.description.doi: 10.3390/s20113126
dc.description.sourcetitle: Sensors (Switzerland)
dc.description.volume: 20
dc.description.issue: 11
dc.description.page: 3126
Appears in Collections: Elements; Staff Publications

Files in This Item:
File: 10_3390_s20113126.pdf | Size: 2.5 MB | Format: Adobe PDF | Access Settings: Open


This item is licensed under a Creative Commons Attribution 4.0 International License.