Title: Splitting-Merging Model of Chinese Word Tokenization and Segmentation
Authors: Yao, Y.; Lua, K.T.
Issue Date: 1998
Citation: Yao, Y., Lua, K.T. (1998). Splitting-Merging Model of Chinese Word Tokenization and Segmentation. Natural Language Engineering 4(4): 309-324. ScholarBank@NUS Repository.
Abstract: Word tokenization and segmentation in natural language processing of languages like Chinese, which have no blank space for word delimitation, are considered. Three major problems are faced: (1) tokenizing direction and efficiency, (2) insufficient tokenization dictionary and nonwords, and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with the above problems together. A novel dictionary-based method called the splitting-merging model for Chinese word tokenization and segmentation is presented. It uses the mutual information of Chinese characters to find the boundaries and the non-boundaries of Chinese words, and finally leads to word segmentation by resolving ambiguities and detecting new words.
Source Title: Natural Language Engineering
ISSN: 1351-3249
Appears in Collections: Staff Publications
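
Illustrative note: the abstract says the model uses the mutual information of Chinese characters to locate word boundaries and non-boundaries. The sketch below is not the authors' splitting-merging algorithm, which this record does not describe in detail. It only shows, under stated assumptions, how pointwise mutual information between adjacent characters could suggest where to cut. The function names, the toy corpus, and the threshold of 0.0 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Map each adjacent character pair seen in `corpus` to its pointwise MI.

    Assumption: probabilities are plain relative frequencies, no smoothing.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    mi = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs
        p_a = char_counts[a] / total_chars
        p_b = char_counts[b] / total_chars
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi


def split_by_mutual_information(sentence, mi, threshold=0.0):
    """Cut between adjacent characters whose MI is below `threshold`.

    Pairs never seen in the corpus are treated as having MI of -infinity.
    """
    if not sentence:
        return []
    pieces = [sentence[0]]
    for a, b in zip(sentence, sentence[1:]):
        if mi.get((a, b), float("-inf")) < threshold:
            pieces.append(b)      # weak association: start a new piece
        else:
            pieces[-1] += b       # strong association: keep characters together
    return pieces


# Toy corpus (hypothetical): two two-character words, each seen twice.
toy_corpus = ["中国", "中国", "人民", "人民"]
toy_mi = adjacent_mutual_information(toy_corpus)
print(split_by_mutual_information("中国人民", toy_mi))  # ['中国', '人民']
```

In this toy case the pair 国/人 never occurs in the corpus, so the sketch places a boundary there, while 中/国 and 人/民 have high mutual information and stay joined. Ambiguity resolution and new-word detection, which the abstract attributes to the full model, are outside the scope of this sketch.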
