Title: Word segmentation and recognition for Web document framework
Authors: Chi, Chi-Hung; Ding, Chen; Lim, Andrew
Issue Date: 1999
Citation: Chi, Chi-Hung; Ding, Chen; Lim, Andrew (1999). Word segmentation and recognition for Web document framework. International Conference on Information and Knowledge Management, Proceedings: 458-465. ScholarBank@NUS Repository.
Abstract: It is observed that a better approach to Web information understanding is to base it on the document framework, which consists mainly of (i) the title and the URL name of the page, (ii) the titles and the URL names of the Web pages that it points to, (iii) the alternative information source for the embedded Web objects, and (iv) its linkage to other Web pages of the same document. Investigation reveals that a high percentage of the words inside the document framework are "compound words" that cannot be resolved with ordinary dictionaries. They may be abbreviations, acronyms, or concatenations of several (partial) words. To recover the content hierarchy of Web documents, we propose a new word segmentation and recognition mechanism to understand the information derived from the Web document framework. A maximal bi-directional matching algorithm with heuristic rules is used to resolve ambiguous segmentation and meaning in compound words. An adaptive training process is further employed to build a dictionary of recognizable abbreviations and acronyms. Empirical results show that over 75% of the compound words found in the Web document framework can be understood by our mechanism. With the training process, the success rate of recognizing compound words increases to about 90%.
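Illustrative note (not part of the paper): the abstract names a maximal bi-directional matching algorithm but does not give its heuristic rules. The sketch below shows only the generic idea of bidirectional maximum matching, applied to a URL-style compound word. The function names, the toy dictionary, and the tie-breaking rule (fewest unknown pieces, then fewest pieces) are assumptions made for illustration and should not be read as the authors' method.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy longest-match segmentation, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Greedy longest-match segmentation, scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        # Try the longest candidate ending at j first.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in dictionary or i == j - 1:
                tokens.insert(0, text[i:j])
                j = i
                break
    return tokens


def segment(text, dictionary):
    """Run both scans and keep the better one.

    Assumed heuristic (hypothetical, not from the paper): prefer the
    result with fewer out-of-dictionary pieces, then fewer pieces overall.
    """
    max_len = max(map(len, dictionary))
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)

    def cost(tokens):
        unknown = sum(1 for t in tokens if t not in dictionary)
        return (unknown, len(tokens))

    return min(fwd, bwd, key=cost)


if __name__ == "__main__":
    toy_dictionary = {"new", "news", "sport", "sports"}  # hypothetical word list
    print(segment("newsports", toy_dictionary))
```

With this toy dictionary, the forward scan greedily takes "news" and is left with unrecognized single letters, while the backward scan finds "new" + "sports"; the assumed heuristic therefore keeps the backward result.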
Source Title: International Conference on Information and Knowledge Management, Proceedings
ISBN: 1581131461
Appears in Collections: Staff Publications
