Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/129712
DC Field | Value | |
---|---|---|
dc.title | Critical Tokenization and its Properties | |
dc.contributor.author | Guo, J. | |
dc.date.accessioned | 2016-11-08T08:25:39Z | |
dc.date.available | 2016-11-08T08:25:39Z | |
dc.date.issued | 1997-12 | |
dc.identifier.citation | Guo, J. (1997-12). Critical Tokenization and its Properties. Computational Linguistics 23(4): 569-596. ScholarBank@NUS Repository. | |
dc.identifier.issn | 08912017 | |
dc.identifier.uri | http://scholarbank.nus.edu.sg/handle/10635/129712 | |
dc.description.abstract | Tokenization is the process of mapping sentences from character strings into strings of words. This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding. The main results are as follows: (1) Critical points are all and only unambiguous token boundaries for any character string on a complete dictionary; (2) Any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings with respect to the word string cover relation; (3) Any tokenized string can be reproduced from a critically tokenized word string but not vice versa; (4) Critical tokenization forms the sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, a precise mathematical understanding of conventional concepts like combinational and overlapping ambiguities; (5) Many important maximum tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization. It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined. | |
dc.source | Scopus | |
dc.type | Article | |
dc.contributor.department | INSTITUTE OF SYSTEMS SCIENCE | |
dc.description.sourcetitle | Computational Linguistics | |
dc.description.volume | 23 | |
dc.description.issue | 4 | |
dc.description.page | 569-596 | |
dc.identifier.isiut | NOT_IN_WOS | |
Appears in Collections: | Staff Publications |
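
The notions named in the abstract can be illustrated with a small sketch. The Python below is not taken from the paper: the function names, the toy dictionaries, and the reading of a critical point as a position that no dictionary word straddles are all assumptions made for illustration. It enumerates every tokenization of a short string, runs forward and backward maximum matching, and checks result (1) on a complete dictionary, that is, one containing every single character as a word.

```python
def all_tokenizations(s, dictionary):
    """Enumerate every way to split s into dictionary words (exponential; toy use only)."""
    if not s:
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        word = s[:end]
        if word in dictionary:
            for rest in all_tokenizations(s[end:], dictionary):
                results.append([word] + rest)
    return results


def forward_maximum_matching(s, dictionary):
    """Greedy left-to-right: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            # Fall back to a single character if no longer word matches.
            if s[i:j] in dictionary or j == i + 1:
                tokens.append(s[i:j])
                i = j
                break
    return tokens


def backward_maximum_matching(s, dictionary):
    """Greedy right-to-left: take the longest dictionary word ending at each position."""
    tokens, j = [], len(s)
    while j > 0:
        for i in range(0, j):
            if s[i:j] in dictionary or i == j - 1:
                tokens.append(s[i:j])
                j = i
                break
    tokens.reverse()
    return tokens


def boundaries(tokens):
    """Set of character positions at which a token starts or ends."""
    positions, p = {0}, 0
    for t in tokens:
        p += len(t)
        positions.add(p)
    return positions


def unambiguous_boundaries(s, dictionary):
    """Positions that are token boundaries in every possible tokenization."""
    sets = [boundaries(t) for t in all_tokenizations(s, dictionary)]
    return set.intersection(*sets) if sets else set()


def critical_points(s, dictionary):
    """Assumed definition: positions p such that no dictionary word s[i:j] has i < p < j."""
    n = len(s)
    return {
        p for p in range(n + 1)
        if not any(s[i:j] in dictionary
                   for i in range(p) for j in range(p + 1, n + 1))
    }


# Hypothetical toy dictionaries, made complete by adding every single character.
d1 = {"fund", "funds", "and", "sand"} | set("fundsand")
d2 = {"ab", "cd"} | set("abcd")

print(forward_maximum_matching("fundsand", d1))   # ['funds', 'and']
print(backward_maximum_matching("fundsand", d1))  # ['fund', 'sand']
print(sorted(critical_points("abcd", d2)))        # [0, 2, 4]
```

In the first example the two greedy strategies disagree, which is the overlapping kind of ambiguity the abstract mentions, and the only critical points are the two ends of the string. In the second example position 2 is a critical point, so the string can be cut there and each piece tokenized independently.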