|Title:||Lip geometric features for human-computer interaction using bimodal speech recognition: Comparison and analysis|
|Authors:||Kaynak, M.N.|
|Issue Date:||Jun-2004|
|Citation:||Kaynak, M.N., Zhi, Q., Cheok, A.D., Sengupta, K., Jian, Z., Chung, K.C. (2004-06). Lip geometric features for human-computer interaction using bimodal speech recognition: Comparison and analysis. Speech Communication 43 (1-2) : 1-16. ScholarBank@NUS Repository. https://doi.org/10.1016/j.specom.2004.01.003|
|Abstract:||Bimodal speech recognition is a novel extension of acoustic speech recognition for which both acoustic and visual speech information are used to improve the recognition accuracy in noisy environments. Although various bimodal speech systems have been developed, a rigorous and detailed comparison of the possible geometric visual features from speakers' faces has not been given yet in the previous papers. Thus, in this paper, the geometric visual features are compared and analyzed rigorously for their importance in bimodal speech recognition. The relevant information of each possible single visual feature is used to determine the best combination of geometric visual features for both visual-only and bimodal speech recognition. From the geometric visual features analyzed, lip vertical aperture is the most relevant; and the set formed by the vertical and horizontal lip apertures and the first order derivative of the lip corner angle gives the best results among the possibilities of reduced set of geometric features that were analyzed. Also, in this paper, the effect of the modelling parameters of hidden Markov models (HMM) on each single geometric lip feature's recognition accuracy is analyzed. Finally, the accuracy of acoustic-only, visual-only, and bimodal speech recognition methods are experimentally determined and compared using the optimized HMMs and geometric visual features. Compared to acoustic and visual-only speech recognition, the bimodal speech recognition scheme has a much improved recognition accuracy using the geometric visual features, especially in the presence of noise. The results obtained showed that a set of as few as three labial geometric features are sufficient to improve the recognition rate by as much as 20% (from 62%, with acoustic-only information, to 82%, with audio-visual information at a signal to noise ratio (SNR) of 0 dB). © 2004 Elsevier B.V. All rights reserved.|
|Source Title:||Speech Communication|
|URI:||http://scholarbank.nus.edu.sg/handle/10635/56502|
|ISSN:||01676393|
|DOI:||10.1016/j.specom.2004.01.003|
|Appears in Collections:||Staff Publications|
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.