Critical Tokenization and its Properties | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/129712

DC Field	Value
dc.title	Critical Tokenization and its Properties
dc.contributor.author	Guo, J.
dc.date.accessioned	2016-11-08T08:25:39Z
dc.date.available	2016-11-08T08:25:39Z
dc.date.issued	1997-12
dc.identifier.citation	Guo, J. (1997-12). Critical Tokenization and its Properties. Computational Linguistics 23 (4) : 569-596. ScholarBank@NUS Repository.
dc.identifier.issn	08912017
dc.identifier.uri	http://scholarbank.nus.edu.sg/handle/10635/129712
dc.description.abstract	Tokenization is the process of mapping sentences from character strings into strings of words. This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding. The main results are as follows: (1) Critical points are all and only unambiguous token boundaries for any character string on a complete dictionary; (2) Any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings with respect to the word string cover relation; (3) Any tokenized string can be reproduced from a critically tokenized word string but not vice versa; (4) Critical tokenization forms the sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, a precise mathematical understanding of conventional concepts like combinational and overlapping ambiguities; (5) Many important maximum tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization. It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined.
dc.source	Scopus
dc.type	Article
dc.contributor.department	INSTITUTE OF SYSTEMS SCIENCE
dc.description.sourcetitle	Computational Linguistics
dc.description.volume	23
dc.description.issue	4
dc.description.page	569-596
dc.identifier.isiut	NOT_IN_WOS
Appears in Collections:	Staff Publications
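
The notions named in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, only a brute-force illustration under stated assumptions: the toy dictionary and the string "fundsand" are chosen for illustration, the function names are hypothetical, a "complete dictionary" is taken to mean one that contains every single character as a word, and the cover relation is assumed to correspond to inclusion of token-boundary sets (so a critical tokenization is one whose boundary set is minimal).

```python
from typing import FrozenSet, List, Set


def all_tokenizations(text: str, dictionary: Set[str]) -> List[List[str]]:
    """Enumerate every way to split `text` into dictionary words (exponential)."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in all_tokenizations(text[end:], dictionary):
                results.append([word] + rest)
    return results


def boundaries(tokens: List[str]) -> FrozenSet[int]:
    """Character offsets at which tokens start or end, including 0 and len(text)."""
    cuts, pos = {0}, 0
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    return frozenset(cuts)


def unambiguous_boundaries(text: str, dictionary: Set[str]) -> Set[int]:
    """Offsets that are token boundaries in every possible tokenization."""
    sets = [set(boundaries(t)) for t in all_tokenizations(text, dictionary)]
    return set.intersection(*sets) if sets else set()


def minimal_tokenizations(text: str, dictionary: Set[str]) -> List[List[str]]:
    """Tokenizations whose boundary set has no proper subset among the others.

    Assumption: this boundary-set minimality stands in for the paper's
    'minimal element with respect to the cover relation'.
    """
    tokenizations = all_tokenizations(text, dictionary)
    cut_sets = [boundaries(t) for t in tokenizations]
    return [
        t for t, cuts in zip(tokenizations, cut_sets)
        if not any(other < cuts for other in cut_sets)  # `<` is proper subset
    ]


def forward_maximum_matching(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy left-to-right tokenization: always take the longest matching word."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no dictionary word starts at offset {pos}")
    return tokens


# Toy complete dictionary: every single character plus a few longer words.
TEXT = "fundsand"
DICTIONARY = set(TEXT) | {"fund", "funds", "and", "sand"}

print(forward_maximum_matching(TEXT, DICTIONARY))  # ['funds', 'and']
print(minimal_tokenizations(TEXT, DICTIONARY))     # [['fund', 'sand'], ['funds', 'and']]
print(sorted(unambiguous_boundaries(TEXT, DICTIONARY)))  # [0, 8]
```

In this toy case the forward maximum matching result is one of the two minimal tokenizations, in line with result (5) of the abstract, and the only boundaries shared by all tokenizations are the two ends of the string. The exhaustive enumeration is exponential in the string length and is meant only to make the definitions concrete, not to suggest an efficient implementation.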



