Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/181985
DC Field | Value | |
---|---|---|
dc.title | CHINESE SENTENCE TOKENIZATION | |
dc.contributor.author | JIN GUO | |
dc.date.accessioned | 2020-10-29T06:34:34Z | |
dc.date.available | 2020-10-29T06:34:34Z | |
dc.date.issued | 1997 | |
dc.identifier.citation | JIN GUO (1997). CHINESE SENTENCE TOKENIZATION. ScholarBank@NUS Repository. | |
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/181985 | |
dc.description.abstract | This thesis answers what a token is and what sentence tokenization is all about. Our vision is that understanding lexical tokens and sentence tokenization requires not a descriptive theory full of tiny details induced from facts, but an explanatory theory built from principles and parameters from which all facts can be deduced. Our mission is thus to discover such principles, to articulate them with great precision, and to validate them on authentic data. Guided by insights gained from the rich literature on sentence tokenization research, we have explored two unpopular routes. One is human experimentation under abnormal conditions, such as tokenizing random token strings and reading unfamiliar text aloud at a fast pace. The other is mathematical investigation focused on information representation and transformation, such as the notions of critical fragments and critical tokenization. The best understanding we have reached can be summarized as follows. Sentence tokenization is an autonomous information-processing module in the overall human language-processing system. It takes a character string as input and produces streams of critical tokenization sets as output, working solely on three principles: dictionary completeness, critical tokenization, and one-tokenization-per-source. For a given input, it works by establishing and transforming descriptions under different information representation schemes, including critical points and fragments, sets of critical tokenizations, and critical and hidden ambiguities. Critical fragmentation, critical tokenization, and tokenization recollection are the three major information processes that establish and transform representational descriptions from the given input to the desired output. Moreover, it is believed that the sentence tokenization module is innate and common to all human beings. In contrast, the tokenization lexicon is personal and acquired from the individual's language exposure. A language expression is added to the lexicon as a token only if it has been validated that the addition will not violate the principle of one-tokenization-per-source. While the interaction between sentence tokenization and other language-processing modules is extremely limited, to ensure the efficiency required by real-time execution, the interaction for lexicon acquisition is immense, because an improper addition would significantly hinder the person's communication effectiveness. This understanding has been informed by, and validated on, a large Chinese corpus (the PH corpus) developed specifically as a test-bed for extensive linguistic investigation of the sentence tokenization problem. Information extracted from this manually tokenized corpus has been used in solving various sentence tokenization sub-problems.
In parallel, we have also devised a two-stage, five-step iterative problem-solving strategy for Chinese sentence tokenization and developed a comprehensive set of strategies and algorithms, including:
- the optimal algorithm for unambiguous token boundary identification;
- the optimal algorithm for critical tokenization generation;
- the strategies of tokenization by memorization and of tokenization by random selection;
- the strategy of hidden ambiguity detection by critical tokenization;
- the context-centred methodology for both hidden ambiguity resolution and unregistered token determination.
All the algorithms developed have demonstrated significantly better performance than comparable work in the literature. The theoretical understanding, the unique tokenization corpus, and the set of algorithms and strategies together constitute a new paradigm for Chinese sentence tokenization research, which we regard as the major contribution of the thesis. | |
dc.source | CCK BATCHLOAD 20201023 | |
dc.type | Thesis | |
dc.contributor.department | INFORMATION SYSTEMS & COMPUTER SCIENCE | |
dc.contributor.supervisor | LUI HO CHUNG | |
dc.description.degree | Ph.D | |
dc.description.degreeconferred | DOCTOR OF PHILOSOPHY | |
Appears in Collections: | Ph.D Theses (Restricted) |
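The abstract above describes critical points, critical fragments, and critical tokenization only at the conceptual level. The sketch below is not taken from the thesis; it is a minimal Python illustration of one plausible reading of those notions. It assumes dictionary completeness in the simple sense that every single character is itself a valid token, and it treats a critical tokenization as a dictionary tokenization whose boundary set is minimal, i.e. one that is not a refinement of any other tokenization. The function names, the toy lexicon, and the `max_len` parameter are all illustrative.

```python
def dictionary_matches(chars, lexicon, max_len):
    """All spans (i, j) such that chars[i:j] is a multi-character lexicon token."""
    spans = []
    n = len(chars)
    for i in range(n):
        for j in range(i + 2, min(n, i + max_len) + 1):
            if chars[i:j] in lexicon:
                spans.append((i, j))
    return spans

def critical_points(chars, lexicon, max_len):
    """Positions that are token boundaries in every dictionary tokenization.

    Under the dictionary-completeness assumption (every single character is a
    token), a position is critical iff no multi-character lexicon match
    straddles it.
    """
    n = len(chars)
    covered = set()
    for i, j in dictionary_matches(chars, lexicon, max_len):
        covered.update(range(i + 1, j))  # interior positions of the match
    return [p for p in range(n + 1) if p not in covered]

def critical_fragments(chars, lexicon, max_len):
    """Split the string at its critical points."""
    points = critical_points(chars, lexicon, max_len)
    return [chars[a:b] for a, b in zip(points, points[1:])]

def all_tokenizations(chars, lexicon, max_len):
    """Every exhaustive dictionary tokenization of chars, as tuples of tokens.

    max_len must be at least the length of the longest lexicon token.
    """
    n = len(chars)
    table = [[] for _ in range(n + 1)]
    table[0] = [()]
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            piece = chars[i:j]
            if len(piece) == 1 or piece in lexicon:  # single chars always tokens
                table[j].extend(prefix + (piece,) for prefix in table[i])
    return table[n]

def critical_tokenizations(chars, lexicon, max_len):
    """Tokenizations whose boundary sets are minimal under set inclusion,
    i.e. tokenizations not covered by (not refinements of) any other."""
    def boundaries(tok):
        cuts, pos = set(), 0
        for t in tok:
            pos += len(t)
            cuts.add(pos)
        return frozenset(cuts)

    toks = all_tokenizations(chars, lexicon, max_len)
    cuts = {t: boundaries(t) for t in toks}
    return [t for t in toks
            if not any(cuts[u] < cuts[t] for u in toks if u != t)]

if __name__ == "__main__":
    # Toy lexicon; Latin letters stand in for Chinese characters.
    lexicon = {"ab", "bc", "cd", "abc"}
    print(critical_fragments("abcde", lexicon, max_len=3))     # ['abcd', 'e']
    print(critical_tokenizations("abcd", lexicon, max_len=3))  # [('ab', 'cd'), ('abc', 'd')]
```

On this toy input, "abcd" has exactly two critical tokenizations, ('ab', 'cd') and ('abc', 'd'), reflecting a crossing ambiguity. Under this reading, a critical fragment with more than one critical tokenization is exactly where the ambiguity-detection and resolution strategies listed in the abstract would apply.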
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
B20838074.PDF | | 12.46 MB | Adobe PDF | RESTRICTED | None
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.