Please use this identifier to cite or link to this item: https://doi.org/10.1016/j.specom.2018.10.006
DC Field: Value
dc.title: Semi-supervised acoustic model training for speech with code-switching
dc.contributor.author: Yilmaz, E
dc.contributor.author: McLaren, M
dc.contributor.author: Heuvel, HVD
dc.contributor.author: Leeuwen, DAV
dc.date.accessioned: 2019-06-04T03:46:00Z
dc.date.available: 2019-06-04T03:46:00Z
dc.date.issued: 2018-12-01
dc.identifier.citation: Yilmaz, E, McLaren, M, Heuvel, HVD, Leeuwen, DAV (2018-12-01). Semi-supervised acoustic model training for speech with code-switching. Speech Communication 105: 12-22. ScholarBank@NUS Repository. https://doi.org/10.1016/j.specom.2018.10.006
dc.identifier.issn: 0167-6393
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/155142
dc.description.abstract: © 2018 Elsevier B.V. In the FAME! project, we aim to develop an automatic speech recognition (ASR) system for Frisian-Dutch code-switching (CS) speech extracted from the archives of a local broadcaster, with the ultimate goal of building a spoken document retrieval system. Unlike Dutch, Frisian is a low-resourced language with a very limited amount of manually annotated speech data. In this paper, we describe several automatic annotation approaches that enable the use of a large amount of raw bilingual broadcast data for acoustic model training in a semi-supervised setting. Previously, it has been shown that the best-performing ASR system is obtained by two-stage multilingual deep neural network (DNN) training using 11 hours of manually annotated CS speech (reference) data together with speech data from other high-resourced languages. We compare the quality of transcriptions provided by this bilingual ASR system with several other approaches that use a language recognition system to assign language labels to raw speech segments at the front-end and then transcribe them using monolingual ASR resources. We further investigate automatic annotation of the speakers appearing in the raw broadcast data by first labeling segments with (pseudo-)speaker tags using a speaker diarization system and then linking these tags to the known speakers appearing in the reference data using a speaker recognition system. These speaker labels are essential for speaker-adaptive training in the proposed setting. We train acoustic models using the manually and automatically annotated data and run recognition experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic annotations. The ASR and CS detection results demonstrate the potential of using automatic language and speaker tagging in semi-supervised bilingual acoustic model training.
dc.publisher: Elsevier BV
dc.source: Elements
dc.type: Article
dc.date.updated: 2019-06-03T06:51:02Z
dc.contributor.department: ELECTRICAL AND COMPUTER ENGINEERING
dc.description.doi: 10.1016/j.specom.2018.10.006
dc.description.sourcetitle: Speech Communication
dc.description.volume: 105
dc.description.page: 12-22
dc.published.state: Published
Appears in Collections: Staff Publications; Elements

Files in This Item:
File: SC2018.pdf
Description: Accepted version
Size: 2.48 MB
Format: Adobe PDF
Access Settings: CLOSED
Version: Post-print

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.