Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/182357
Title: AUTOMATIC SEGMENTATION FOR MANDARIN SPEECH RECOGNITION
Authors: OU YONGZHEN
Issue Date: 1996
Citation: OU YONGZHEN (1996). AUTOMATIC SEGMENTATION FOR MANDARIN SPEECH RECOGNITION. ScholarBank@NUS Repository.
Abstract: Research on speech recognition aims to fulfill the dream of making machines understand human speech. This is particularly significant for Mandarin speech, because entering Chinese characters into a computer has traditionally been difficult. Mandarin speech has some unique characteristics, such as its monosyllabic nature, its Initial-Final structure and its tonality. These features make it possible to recognize all Mandarin syllables by first identifying the Initials, Finals and tones separately and then combining them. The recognized syllables must then pass through a language processor to be translated into Chinese characters. The Initial-Final recognition approach has many advantages, but it relies heavily on the accuracy of segmentation. Current speech segmentation is done manually, and it contains many inconsistencies due to the subjective nature of manual segmentation. Moreover, the manual segmentation process is laborious and tedious, and it is not applicable to real-time speech recognition. This thesis investigates the properties of Mandarin speech extensively and proposes a method for automatic endpoint detection and Initial-Final partition of isolated Mandarin syllables. The developed segmentation algorithm uses a combination of speech features to determine the syllable boundaries, namely the beginning point (BP), the isolating point (IP) and the ending point (EP) of the syllable. The features used include the spectral features (mel-scale coefficients) and their variations; the speech production parameter, the cepstrum; and the waveform parameter, the zero-crossing rate. These features capture various aspects of the speech signal and provide useful information for decision making in automatic segmentation. The reliability of the algorithm is evaluated by applying it to three large speech databases produced by three different speakers, two females (ZM and FM) and one male (MM). The three databases contain 19,553 Mandarin syllables in total.
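As an illustration of the waveform parameter named above, the zero-crossing rate can be computed per frame as the fraction of adjacent sample pairs whose sign changes. The following is a minimal sketch only; the function name `zero_crossing_rate`, the frame length, the hop size and the sample values are assumptions made for this example and are not taken from the thesis, whose actual feature extraction is not described in this record.

```python
def zero_crossing_rate(samples, frame_len, hop):
    """Return one zero-crossing rate per frame.

    The rate is the number of sign changes between adjacent samples
    divided by the number of adjacent pairs in the frame.
    Frame length and hop are illustrative parameters (assumptions).
    """
    if frame_len < 2 or hop < 1:
        raise ValueError("frame_len must be >= 2 and hop must be >= 1")
    rates = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Count adjacent pairs where one sample is non-negative and the other is negative.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        rates.append(crossings / (frame_len - 1))
    return rates


# Made-up signal: a rapidly alternating frame followed by a constant frame.
signal = [1, -1, 1, -1, 1, 1, 1, 1]
rates = zero_crossing_rate(signal, frame_len=4, hop=4)
```

A high rate is typical of noise-like sounds and a low rate of voiced sounds, which is why this kind of parameter is commonly combined with spectral features when locating boundaries.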
The accuracy of automatic segmentation is obtained by comparing the auto-segmented boundaries with the manually segmented ones. Owing to the open structure of Mandarin speech and the efficiency of the algorithm, EP detection achieved very high accuracy, on the order of 96% under 15 ms deviation and about 100% under 25 ms deviation. The average accuracy of BP detection over the three databases is slightly lower than that of EP detection, with over 93% under 15 ms deviation and over 99% under 25 ms deviation. The IP segmentation accuracy on the ZM database is over 97% under 15 ms deviation and over 98% under 25 ms deviation. Averaged over all three databases, the IP segmentation accuracy is over 88% under 15 ms deviation and over 97% under 25 ms deviation. These results are better than those reported in the relevant literature. To further assess the performance of the auto-segmentation algorithm, experiments on syllable group classification and Initial recognition have been conducted using Time-Delayed Neural Networks (TDNN). The TDNN architecture is used both as a group discriminator and as a phoneme classifier. Two architectures (GTDNN-I and GTDNN-II) for syllable group classification and two architectures (ITDNN-I and ITDNN-II) for Initial recognition have been implemented. The experiments are performed using both manually and automatically segmented data from the three databases. Both inside-test and outside-test performance are reported as recognition rates. On the female (ZM) speech database, the recognition rates for syllable group classification (using GTDNN-I) are 94.90% on manual-segmented data and 96.07% on auto-segmented data. The gain is 1.17% (= 96.07% - 94.90%). With GTDNN-II, the recognition rates for syllable group classification are 95.91% on manual-segmented data and 95.66% on auto-segmented data. The difference is only 0.26% (= 95.92% - 95.66%).
With ITDNN-I, the overall Initial recognition rates are 88.61% on manual-segmented data and 91.01% on auto-segmented data. The gain is 2.40% (= 91.01% - 88.61%). With ITDNN-II, the overall Initial recognition rates are 92.70% on manual-segmented data and 94.38% on auto-segmented data. The gain is 1.68% (= 94.38% - 92.70%). These results confirm that the automatic segmentation algorithm works very well and that it has a positive effect on consonant recognition. The good performance can be attributed to the high consistency and high accuracy of the automatic segmentation algorithm.
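The tolerance-based accuracy figures quoted in the abstract (for example, "under 15 ms deviation") can be read as the share of automatically placed boundaries that fall within a given distance of the manual ones. The sketch below shows one way to compute such a figure. The function name `accuracy_within`, the inclusive comparison (`<=`) and the boundary values are assumptions for illustration; the record does not state how the thesis treats a deviation exactly equal to the tolerance, and the numbers here are not thesis data.

```python
def accuracy_within(auto_ms, manual_ms, tolerance_ms):
    """Fraction of automatic boundaries within tolerance_ms of the manual ones.

    Both lists hold boundary times in milliseconds, paired by syllable.
    The inclusive comparison is an assumption made for this sketch.
    """
    if len(auto_ms) != len(manual_ms) or not auto_ms:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tolerance_ms
    )
    return hits / len(auto_ms)


# Made-up boundary times for four syllables (absolute deviations: 5, 10, 30, 20 ms).
auto_boundaries = [100, 210, 330, 480]
manual_boundaries = [105, 200, 300, 460]

acc_15 = accuracy_within(auto_boundaries, manual_boundaries, 15)
acc_25 = accuracy_within(auto_boundaries, manual_boundaries, 25)
```

Widening the tolerance can only keep or raise the score, which matches the pattern in the abstract where each 25 ms figure is at least as high as the corresponding 15 ms figure.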
URI: https://scholarbank.nus.edu.sg/handle/10635/182357
Appears in Collections: Master's Theses (Restricted)

Files in This Item:
File: B20098388.PDF; Size: 6.33 MB; Format: Adobe PDF; Access Settings: RESTRICTED

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.