NnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks

Please use this identifier to cite or link to this item: https://doi.org/10.1109/ACCESS.2020.3019084

DC Field	Value
dc.title	NnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks
dc.contributor.author	Cheuk, K.W.
dc.contributor.author	Anderson, H.
dc.contributor.author	Agres, K.
dc.contributor.author	Herremans, D.
dc.date.accessioned	2021-08-24T02:38:26Z
dc.date.available	2021-08-24T02:38:26Z
dc.date.issued	2020
dc.identifier.citation	Cheuk, K.W., Anderson, H., Agres, K., Herremans, D. (2020). NnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks. IEEE Access 8 : 161981-162003. ScholarBank@NUS Repository. https://doi.org/10.1109/ACCESS.2020.3019084
dc.identifier.issn	2169-3536
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/198955
dc.description.abstract	In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time domain to frequency domain conversion. Because it is fast, it allows on-the-fly spectrogram extraction without the need to store any spectrograms on disk. Moreover, this approach allows back-propagation through the waveform-to-spectrogram transformation layer; hence, the transformation can be made trainable and further optimized for the specific task that the neural network is trained on. All spectrogram implementations scale as Big-O of linear time with respect to the input length. nnAudio, however, leverages the compute unified device architecture (CUDA) of 1D convolutional neural networks from PyTorch, so its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than other implementations using only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and our framework significantly reduces the spectrogram extraction time from the order of seconds (using a popular Python library, librosa) to the order of milliseconds, given that the audio recordings are of the same length. With variable input audio lengths, an average of 11.5 hours is required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa. An average of 2.8 hours is required for nnAudio, which is still four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed. © 2013 IEEE.
dc.publisher	Institute of Electrical and Electronics Engineers Inc.
dc.source	Scopus OA2020
dc.subject	constant Q transform
dc.subject	Convolution
dc.subject	CQT
dc.subject	discrete Fourier transform
dc.subject	GPU
dc.subject	library
dc.subject	Mel Spectrogram
dc.subject	PyTorch
dc.subject	short time Fourier transform
dc.subject	signal processing
dc.subject	spectrogram
dc.type	Article
dc.contributor.department	YONG SIEW TOH CONSERVATORY OF MUSIC
dc.description.doi	10.1109/ACCESS.2020.3019084
dc.description.sourcetitle	IEEE Access
dc.description.volume	8
dc.description.page	161981-162003
Appears in Collections:	Staff Publications Elements
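
The abstract describes computing spectrograms with 1D convolutions. As a rough illustration of that idea only, the sketch below slides fixed cosine and sine kernels over a waveform with a stride equal to the hop size, which yields the real and imaginary parts of an STFT. It is written in plain Python rather than PyTorch, it is not nnAudio's API, and the function names (`fourier_kernels`, `conv1d`, `stft_magnitude`) are hypothetical. Windowing and padding are omitted for brevity.

```python
import math


def fourier_kernels(n_fft):
    """Build cosine and sine kernels, one pair per frequency bin 0..n_fft//2."""
    bins = n_fft // 2 + 1
    cos_k = [[math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(bins)]
    sin_k = [[math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(bins)]
    return cos_k, sin_k


def conv1d(signal, kernel, stride):
    """Strided 1D cross-correlation, the operation neural network libraries call convolution."""
    width = len(kernel)
    return [sum(signal[start + i] * kernel[i] for i in range(width))
            for start in range(0, len(signal) - width + 1, stride)]


def stft_magnitude(signal, n_fft, hop):
    """Return a magnitude spectrogram indexed as spec[bin][frame]."""
    cos_k, sin_k = fourier_kernels(n_fft)
    spec = []
    for ck, sk in zip(cos_k, sin_k):
        real = conv1d(signal, ck, hop)
        # The true DFT imaginary part has the opposite sign; the magnitude is unaffected.
        imag = conv1d(signal, sk, hop)
        spec.append([math.hypot(r, i) for r, i in zip(real, imag)])
    return spec


# A 64-sample tone that completes exactly 4 cycles in every 16-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft_magnitude(tone, n_fft=16, hop=8)
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
```

Here the energy of the test tone lands in frequency bin 4 of every frame. Because the kernels are ordinary convolution weights, a framework such as PyTorch could run the same operation on a GPU and, as the abstract notes, treat the kernels as trainable parameters.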

Files in This Item:

File	Description	Size	Format	Access Settings	Version
10_1109_ACCESS_2020_3019084.pdf		10.45 MB	Adobe PDF	OPEN	None