Please use this identifier to cite or link to this item: https://doi.org/10.1109/TMM.2005.846777
DC Field: Value
dc.title: Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition
dc.contributor.author: Lucey S.
dc.contributor.author: Chen T.
dc.contributor.author: Sridharan S.
dc.contributor.author: Chandran V.
dc.date.accessioned: 2018-08-21T05:09:47Z
dc.date.available: 2018-08-21T05:09:47Z
dc.date.issued: 2005
dc.identifier.citation: Lucey S., Chen T., Sridharan S., Chandran V. (2005). Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition. IEEE Transactions on Multimedia 7 (3): 495-506. ScholarBank@NUS Repository. https://doi.org/10.1109/TMM.2005.846777
dc.identifier.issn: 1520-9210
dc.identifier.uri: http://scholarbank.nus.edu.sg/handle/10635/146310
dc.description.abstract: In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio and visual speech modalities, with respect to two major questions: first, at what level should integration occur, and second, given a level of integration, how should it be implemented? Our work is based on the well-known hidden Markov model (HMM) classifier framework for modeling speech. A novel framework for modeling the mismatch between train and test observation sets is proposed, so as to provide effective classifier-combination performance between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, emerge naturally depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speech processing applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than to model any bimodal speech dependencies. To this end, a strategy is recommended, based on theory and empirical evidence, that uses a hybrid between the weighted product and weighted sum rules in the presence of varying acoustic noise, for the task of text-dependent speaker recognition.
dc.source: Scopus
dc.subject: Audio-visual speech processing (AVSP)
dc.subject: Classifier combination
dc.subject: Integration strategies
dc.subject: Multistream hidden Markov model (HMM)
dc.subject: Speaker recognition
dc.type: Article
dc.contributor.department: OFFICE OF THE PROVOST
dc.contributor.department: DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi: 10.1109/TMM.2005.846777
dc.description.sourcetitle: IEEE Transactions on Multimedia
dc.description.volume: 7
dc.description.issue: 3
dc.description.page: 495-506
dc.description.coden: ITMUF
dc.published.state: published
dc.grant.fundingagency: Scopus
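
The abstract above contrasts two rules for combining the scores of independent acoustic and visual HMM classifiers: a weighted product (a weighted sum of log-likelihoods) and a weighted sum of the likelihoods themselves. As a minimal sketch of what score-level fusion with those two rules can look like, the Python below fuses hypothetical per-speaker log-likelihoods; the function names, the scores, and the stream weight `alpha` are illustrative assumptions, not values or an implementation taken from the paper.

```python
import numpy as np

def weighted_product(log_pa, log_pv, alpha):
    # Weighted product rule, computed in the log domain:
    # alpha * log P_audio + (1 - alpha) * log P_video
    return alpha * log_pa + (1.0 - alpha) * log_pv

def weighted_sum(log_pa, log_pv, alpha):
    # Weighted sum rule, alpha * P_audio + (1 - alpha) * P_video,
    # returned as a log score; the max-shift avoids underflow when
    # exponentiating large negative log-likelihoods.
    m = np.maximum(log_pa, log_pv)
    return m + np.log(alpha * np.exp(log_pa - m)
                      + (1.0 - alpha) * np.exp(log_pv - m))

# Hypothetical per-speaker log-likelihoods from independent acoustic
# and visual HMM classifiers (three enrolled speakers; values invented).
log_pa = np.array([-120.4, -118.9, -131.2])  # acoustic HMM scores
log_pv = np.array([-95.1, -97.8, -94.0])     # visual HMM scores

# alpha weights the acoustic stream; in practice it would be lowered
# as acoustic noise increases.
alpha = 0.7
print("product rule pick:", int(np.argmax(weighted_product(log_pa, log_pv, alpha))))
print("sum rule pick:", int(np.argmax(weighted_sum(log_pa, log_pv, alpha))))
```

The hybrid strategy the abstract recommends would move between these two rules as the acoustic noise (and hence the train/test mismatch) varies; the fixed `alpha` here is only a placeholder for that adaptation.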
Appears in Collections: Staff Publications

Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.