CHINESE SENTENCE TOKENIZATION | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/181985

Title:	CHINESE SENTENCE TOKENIZATION
Authors:	JIN GUO
Issue Date:	1997
Citation:	JIN GUO (1997). CHINESE SENTENCE TOKENIZATION. ScholarBank@NUS Repository.
Abstract:	This thesis seeks to answer what a token is and what sentence tokenization is all about. Our vision is that what is required for understanding lexical tokens and sentence tokenization is not a descriptive theory full of tiny details induced from facts, but an explanatory theory made of principles and parameters from which all facts can be deduced. Our mission is thus to discover such principles, articulate them with great precision, and validate them on authentic data. Guided by insights gained from the rich literature on sentence tokenization research, we have explored two unpopular routes. One is human experimentation under abnormal conditions, such as tokenizing random token strings and reading unfamiliar text aloud at a fast pace. The other is mathematical investigation focused on information representation and transformation, such as the notions of critical fragments and critical tokenization. The best understanding that we have reached to date can be summarized as follows. Sentence tokenization is an autonomous information-processing module in the overall human language-processing system. It takes a character string as input and produces streams of critical tokenization sets as output, working solely from three principles: dictionary completeness, critical tokenization, and one-tokenization-per-source. For a given input, it works by establishing and transforming descriptions under different information representation schemes, including critical points and fragments, sets of critical tokenizations, and critical and hidden ambiguities. Critical fragmentation, critical tokenization, and tokenization recollection are the three major information processes for establishing and transforming representational descriptions from the given input to the desired output. Moreover, it is believed that the sentence tokenization module is innate and common to all human beings.
In contrast, the tokenization lexicon is personal and acquired through an individual's language exposure. A language expression is added to the lexicon as a token only if it has been validated that the addition will not violate the principle of one-tokenization-per-source. The interaction between sentence tokenization and other language-processing modules is extremely limited, to ensure the efficiency required by real-time execution. The interaction involved in lexicon acquisition, by contrast, is immense, because an improper addition would significantly hinder the person's communication effectiveness. This understanding has been informed by and validated on a large Chinese corpus (the PH corpus), developed specifically as a test-bed for extensive linguistic investigation into the sentence tokenization problem. Information extracted from this manually tokenized corpus has been used in solving various sentence tokenization sub-problems. In parallel, we have devised a two-stage, five-step iterative problem-solving strategy for Chinese sentence tokenization and developed a comprehensive set of strategies and algorithms, including:
- The optimal algorithm for unambiguous token boundary identification;
- The optimal algorithm for critical tokenization generation;
- The strategy of tokenization by memorization, and the strategy of tokenization by random selection;
- The strategy of hidden ambiguity detection by critical tokenization;
- The context-centred methodology for both hidden ambiguity resolution and unregistered token determination.
All the algorithms developed have been shown to perform significantly better than comparable work in the literature. The theoretical understanding, the unique tokenization corpus, and the set of algorithms and strategies together constitute a new paradigm for Chinese sentence tokenization research, which we regard as the major contribution of the thesis.
URI:	https://scholarbank.nus.edu.sg/handle/10635/181985
Appears in Collections:	Ph.D Theses (Restricted)

Files in This Item:

File	Description	Size	Format	Access Settings	Version
B20838074.PDF		12.46 MB	Adobe PDF	RESTRICTED	None

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.