Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://doi.org/10.1186/s13634-015-0300-4

Title:	Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation
Authors:	Xiao, X; Zhao, S; Ha Nguyen, D.H; Zhong, X; Jones, D.L; Chng, E.S; Li, H
Keywords:	Beamforming; Cost functions; Mapping; Mathematical transformations; Reverberation; Signal to noise ratio; Speech; Speech enhancement; Automatic speech recognition; Deep neural networks; Dynamic features; Feature adaptation; Least square estimation; Log likelihood ratio; Robust speech recognition; Speech dereverberation; Speech recognition
Issue Date:	2016
Citation:	Xiao, X, Zhao, S, Ha Nguyen, D.H, Zhong, X, Jones, D.L, Chng, E.S, Li, H (2016). Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. Eurasip Journal on Advances in Signal Processing 2016 (1) : 1-18. ScholarBank@NUS Repository. https://doi.org/10.1186/s13634-015-0300-4
Rights:	Attribution 4.0 International
Abstract:	This paper investigates deep neural network (DNN)-based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of the predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients by least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the dynamic feature constraint directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption about distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve the cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics, while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation significantly improves ASR performance for clean-condition trained acoustic models.
© 2016, Xiao et al.
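
The least squares step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single first-order delta computed as a central difference, a scalar weight `delta_weight`, and hypothetical function names (`delta_matrix`, `least_squares_trajectory`), none of which are specified in this record. Given static coefficients and deltas predicted for each frame, it finds the trajectory that best agrees with both in the least squares sense.

```python
import numpy as np


def delta_matrix(num_frames: int) -> np.ndarray:
    """Build a (T x T) matrix D so that D @ x gives first-order deltas.

    Assumption: a simple central difference, 0.5 * (x[t+1] - x[t-1]),
    with the first and last frames repeated at the edges.
    """
    D = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        prev_t = max(t - 1, 0)
        next_t = min(t + 1, num_frames - 1)
        D[t, next_t] += 0.5
        D[t, prev_t] -= 0.5
    return D


def least_squares_trajectory(static_pred, delta_pred, delta_weight=1.0):
    """Solve min_x ||x - static_pred||^2 + delta_weight * ||D x - delta_pred||^2.

    Inputs have shape (T,) or (T, num_coefficients); each coefficient
    dimension is smoothed independently along the time axis.
    """
    static_pred = np.asarray(static_pred, dtype=float)
    delta_pred = np.asarray(delta_pred, dtype=float)
    num_frames = static_pred.shape[0]
    D = delta_matrix(num_frames)
    # Normal equations of the quadratic objective above.
    lhs = np.eye(num_frames) + delta_weight * D.T @ D
    rhs = static_pred + delta_weight * D.T @ delta_pred
    return np.linalg.solve(lhs, rhs)
```

With `delta_weight=0.0` the function returns the static predictions unchanged; larger weights pull the trajectory toward one whose frame-to-frame differences match the predicted deltas, which is the smoothing effect the abstract describes.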
Source Title:	Eurasip Journal on Advances in Signal Processing
URI:	https://scholarbank.nus.edu.sg/handle/10635/183307
ISSN:	1687-6172
DOI:	10.1186/s13634-015-0300-4
Appears in Collections:	Staff Publications Elements

Files in This Item:

File	Description	Size	Format	Access Settings	Version
10_1186_s13634-015-0300-4.pdf		1.23 MB	Adobe PDF	OPEN	None

This item is licensed under a Creative Commons License