Title: Splitting-Merging Model of Chinese Word Tokenization and Segmentation
Authors: Yao, Y.; Lua, K.T.
Issue Date: 1998
Source: Yao, Y., Lua, K.T. (1998). Splitting-Merging Model of Chinese Word Tokenization and Segmentation. Natural Language Engineering 4 (4) : 309-324. ScholarBank@NUS Repository.
Abstract: This paper considers word tokenization and segmentation in natural language processing for languages such as Chinese, which do not use blank spaces to delimit words. Three major problems arise: (1) tokenizing direction and efficiency, (2) an insufficient tokenization dictionary and nonwords, and (3) ambiguity in tokenization and segmentation. Most existing tokenization and segmentation methods do not address these problems together. The paper presents a novel dictionary-based method, the splitting-merging model, for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and non-boundaries of Chinese words, and then arrives at a word segmentation by resolving ambiguities and detecting new words.
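The abstract's central idea, using character-level mutual information to decide where word boundaries fall, can be illustrated with a minimal sketch. This is not the splitting-merging model itself, whose details are not given in this record. It only computes pointwise mutual information for adjacent character pairs and splits where the score is low. The function names, the toy corpus, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Return pointwise mutual information (base 2) for each adjacent
    character pair observed in an iterable of strings.

    Hypothetical helper for illustration, not the paper's estimator.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)                      # count single characters
        pair_counts.update(zip(sentence, sentence[1:]))   # count adjacent pairs
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())
    mi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = char_counts[a] / n_chars
        p_b = char_counts[b] / n_chars
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi


def split_on_low_mi(sentence, mi, threshold):
    """Place a boundary between two adjacent characters when their score
    is below `threshold`; pairs never seen in the corpus are treated as
    having a score of negative infinity.
    """
    if not sentence:
        return []
    words = [sentence[0]]
    for a, b in zip(sentence, sentence[1:]):
        if mi.get((a, b), float("-inf")) < threshold:
            words.append(b)      # low association: start a new word
        else:
            words[-1] += b       # high association: extend the current word
    return words


# Toy corpus and threshold, both assumed purely for demonstration.
toy_corpus = ["我喜欢中文", "中文分词", "我喜欢分词"]
scores = adjacent_mutual_information(toy_corpus)
print(split_on_low_mi("我喜欢中文分词", scores, threshold=2.0))
```

A real system would estimate the scores from a large corpus and, as the abstract indicates, combine them with a dictionary and further steps for resolving ambiguities and detecting new words.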
Source Title: Natural Language Engineering
ISSN: 1351-3249
Appears in Collections: Staff Publications
