Title: Context-Dependent Acoustic Modelling for Speech Recognition
Keywords: logistic regression, context-dependent modelling, deep neural networks, large vocabulary continuous speech recognition, articulatory features
Issue Date: 6-Jan-2014
Source: WANG GUANGSEN (2014-01-06). Context-Dependent Acoustic Modelling for Speech Recognition. ScholarBank@NUS Repository.
Abstract: Context-dependent (CD) acoustic modelling is widely used in state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems to address the co-articulation effect in continuous speech. Typically, a CD phone is defined using the neighbouring contexts. The number of CD phone units grows exponentially with the length of the context. In addition, a considerable number of CD phone units have limited numbers of occurrences, or are even unseen in the training corpus. To address this data sparsity problem, parameter sharing/tying is widely adopted. However, this solution introduces another problem: all the CD states in the same cluster share the same set of parameters, making them indistinguishable during decoding. This problem is referred to as the "clustering" problem. Deep neural networks (DNNs) have been found to outperform the conventional discriminatively trained Gaussian mixture models (GMMs) on a variety of speech recognition benchmarks, which has led to a resurgence of interest in acoustic modelling with neural networks (NNs), especially DNNs. This thesis is devoted to CD modelling in hybrid (D)NN/HMM systems. The first part of the thesis focuses on hybrid NN/HMM systems with a shallow NN structure, in which only one or two hidden layers are used. The CD state probabilities are obtained from a product-of-expert (PoE) based probability factorisation scheme within the canonical state modelling (CSM) framework. The PoE framework comprises a context-independent (CI) NN followed by a set of two-layer CD-NNs. The canonical states are produced by the CI-NN, and the CD-NNs are regarded as transformations of the canonical state posteriors. The CD state probabilities are computed as the product of the canonical state posteriors and the CD-NN posteriors. Based on the insights obtained from the shallow NN, the major part of the thesis emphasises hybrid CD-DNN/HMM systems by proposing a novel logistic regression framework.
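The product-of-expert factorisation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the CI-NN produces a softmax posterior over canonical states and that each canonical state has its own CD-NN producing a softmax posterior over contexts; the function names, the toy scores, and the two-context layout are hypothetical and are not taken from the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def poe_cd_posteriors(ci_scores, cd_scores):
    """Combine CI and CD posteriors by multiplication (illustrative only).

    ci_scores: raw CI-NN outputs, one per canonical state (assumed layout).
    cd_scores: one list of raw CD-NN outputs per canonical state.
    Returns joint[c][k] = P(c | x) * P(k | c, x).
    """
    ci_post = softmax(ci_scores)
    joint = []
    for p_c, row in zip(ci_post, cd_scores):
        cd_post = softmax(row)  # each CD-NN posterior sums to 1
        joint.append([p_c * p_k for p_k in cd_post])
    return joint


# Toy example: 3 canonical states, 2 contexts each (made-up numbers).
ci_scores = [2.0, 0.5, -1.0]
cd_scores = [[0.1, 0.3], [1.0, -1.0], [0.0, 0.0]]
joint = poe_cd_posteriors(ci_scores, cd_scores)
```

Under these assumptions the products form a valid distribution over all (canonical state, context) pairs, and summing over the contexts of one canonical state recovers the CI posterior for that state.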
The data sparsity problem is addressed by using the decision tree state clusters as the training targets in the standard CD-DNN/HMM systems. However, the clustering problem is not explicitly addressed in the current literature. In this work, we formulate the CD-DNN as an instance of the CSM technique based on a set of broad phone classes to address both the data sparsity and the clustering problems. The triphone is clustered into multiple sets of shorter biphones using broad phone contexts to address the data sparsity issue. A DNN is trained to discriminate the biphones within each set. The canonical states are represented by the concatenated log posterior probabilities of all the broad phone DNNs. Logistic regression is used to transform the canonical states into the triphone state output probability. Clustering of the regression parameters is used to reduce model complexity while still achieving unique acoustic scores for all possible triphones. Based on some approximations, the regression model can be regarded as a sparse two-layer NN with dynamically connected weights, and its parameters can be learned by optimising the cross-entropy criterion. The experimental results from a broadcast news transcription task reveal that the proposed regression-based CD-DNN significantly outperforms the standard CD-DNN. The best system provides a 1.3% absolute word error rate reduction compared with the best standard CD-DNN system.
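As a rough sketch of the regression step, the snippet below concatenates the log posteriors of several broad-phone DNNs into one canonical-state vector and passes it through a multinomial (softmax) logistic regression. The abstract does not say whether the regression is binary or multinomial, how many broad phone classes are used, or how the weights are tied, so the softmax form, the two toy DNNs, and the mostly-zero weight matrix are all assumptions made for illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def canonical_state(dnn_outputs):
    """Concatenate the log posteriors of every broad-phone DNN."""
    features = []
    for scores in dnn_outputs:
        features.extend(log_softmax(scores))
    return features


def triphone_state_posteriors(features, weights, biases):
    """Softmax regression on the canonical-state vector (assumed form)."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return [math.exp(v) for v in log_softmax(logits)]


# Toy setup: two hypothetical broad-phone DNNs with 3 and 2 biphone outputs.
dnn_outputs = [[1.2, -0.3, 0.4], [0.0, 2.0]]
features = canonical_state(dnn_outputs)

# Hypothetical sparse weights: each triphone state uses only a few inputs.
weights = [
    [0.5, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0, 0.0, 0.4],
    [0.0, 0.0, 0.1, 0.1, 0.0],
]
biases = [0.0, 0.1, -0.1]
posteriors = triphone_state_posteriors(features, weights, biases)
```

The zero entries in `weights` only hint at the "sparse two-layer NN" view mentioned in the abstract; in a trained system the parameters would be learned by optimising the cross-entropy criterion rather than set by hand.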
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: Thesis_final_Guangsen.pdf
Size: 4.33 MB
Format: Adobe PDF



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.