Please use this identifier to cite or link to this item:
https://doi.org/10.1145/3240508.3240631
DC Field | Value
---|---
dc.title | Learning and fusing multimodal deep features for acoustic scene categorization
dc.contributor.author | Yin, Y
dc.contributor.author | Shah, RR
dc.contributor.author | Zimmermann, R
dc.date.accessioned | 2021-09-20T07:47:30Z
dc.date.available | 2021-09-20T07:47:30Z
dc.date.issued | 2018-10-15
dc.identifier.citation | Yin, Y, Shah, RR, Zimmermann, R (2018-10-15). Learning and fusing multimodal deep features for acoustic scene categorization. MM '18: ACM Multimedia Conference : 1892-1900. ScholarBank@NUS Repository. https://doi.org/10.1145/3240508.3240631
dc.identifier.isbn | 9781450356657
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/200727
dc.description.abstract | Convolutional Neural Networks (CNNs) have recently been widely applied to audio classification with promising results. Previous CNN-based systems mostly learn from two-dimensional time-frequency representations such as MFCCs and spectrograms, which may tend to emphasize the background noise of a scene. To capture the key acoustic events, we introduce a three-dimensional CNN that emphasizes the differing spectral characteristics of neighboring regions in the spatial-temporal domain. This paper proposes a novel acoustic scene classification system based on multimodal deep feature fusion, in which three CNNs perform 1D raw waveform modeling, 2D time-frequency image modeling, and 3D spatial-temporal dynamics modeling, respectively. The learnt features are shown to be highly complementary and are combined in a feature fusion network to obtain significantly improved classification predictions. Comprehensive experiments have been conducted on two large-scale acoustic scene datasets, the DCASE16 dataset and the LITIS Rouen dataset. The results demonstrate the effectiveness of the proposed approach: it achieves state-of-the-art classification rates and improves the average classification accuracy by 1.5% ∼ 8.2% over the top-ranked systems in the DCASE16 challenge.
dc.publisher | ACM
dc.source | Elements
dc.type | Conference Paper
dc.date.updated | 2021-09-19T15:22:28Z
dc.contributor.department | CHEMICAL & BIOMOLECULAR ENGINEERING
dc.description.doi | 10.1145/3240508.3240631
dc.description.sourcetitle | MM '18: ACM Multimedia Conference
dc.description.page | 1892-1900
dc.published.state | Published
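
The abstract above describes fusing deep features from three CNN branches: 1D raw waveform modeling, 2D time-frequency image modeling, and 3D spatial-temporal dynamics modeling. As a rough illustration only, the minimal PyTorch sketch below shows one way such a three-branch feature-fusion classifier could be wired up; the `MultimodalFusionNet` class, layer sizes, and input shapes are hypothetical assumptions for illustration, not the authors' actual architecture or released code.

```python
# Hypothetical sketch (not the paper's implementation): three small CNN branches
# for 1D raw-waveform, 2D time-frequency, and 3D spatial-temporal inputs, whose
# pooled features are concatenated and passed to a fusion classifier.
import torch
import torch.nn as nn


class MultimodalFusionNet(nn.Module):
    def __init__(self, num_classes: int = 15, feat_dim: int = 64):
        super().__init__()
        # 1D branch: raw waveform, shape (batch, 1, samples)
        self.branch1d = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(16, feat_dim, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # 2D branch: time-frequency image, e.g. log-mel spectrogram, shape (batch, 1, freq, time)
        self.branch2d = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # 3D branch: stacked spectral slices over time, shape (batch, 1, depth, freq, time)
        self.branch3d = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Fusion network over the concatenated branch features.
        self.fusion = nn.Sequential(
            nn.Linear(3 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, wave, spec, cube):
        f1 = self.branch1d(wave).flatten(1)   # (batch, feat_dim)
        f2 = self.branch2d(spec).flatten(1)   # (batch, feat_dim)
        f3 = self.branch3d(cube).flatten(1)   # (batch, feat_dim)
        return self.fusion(torch.cat([f1, f2, f3], dim=1))


if __name__ == "__main__":
    model = MultimodalFusionNet()
    wave = torch.randn(2, 1, 16000)        # 1 s of 16 kHz audio (illustrative)
    spec = torch.randn(2, 1, 64, 100)      # 64 mel bands x 100 frames (illustrative)
    cube = torch.randn(2, 1, 8, 64, 100)   # 8 stacked spectral slices (illustrative)
    print(model(wave, spec, cube).shape)   # torch.Size([2, 15])
```

Each branch ends in adaptive average pooling so the concatenated feature vector has a fixed size regardless of clip length, mirroring the general idea of learning modality-specific features before combining them in a joint fusion network.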
Appears in Collections: Staff Publications Elements
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
main.pdf | | 1.48 MB | Adobe PDF | CLOSED | None
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.