Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/27950
Title: FROM USER-GENERATED-CONTENT TO STRUCTURED KNOWLEDGE EXPLORING MULTI-ASPECT SENTENCE REPRESENTATION AND PROTOTYPE HIERARCHY BASED CATEGORIZATION FOR ORGANIZATION OF TEXT COLLECTIONS
Authors: MING ZHAOYAN
Keywords: Hierarchical Categorization, User-Generated-Content, Clustering Criteria, Sentence Representation, Natural Language Processing
Issue Date: 17-Mar-2011
Source: MING ZHAOYAN (2011-03-17). FROM USER-GENERATED-CONTENT TO STRUCTURED KNOWLEDGE EXPLORING MULTI-ASPECT SENTENCE REPRESENTATION AND PROTOTYPE HIERARCHY BASED CATEGORIZATION FOR ORGANIZATION OF TEXT COLLECTIONS. ScholarBank@NUS Repository.
Abstract: With user-contributed services flourishing, social media has become a popular venue for users to interact with one another and meet their information and social needs. As a result, large amounts of data are produced in the form of user-generated content. The expectation is that more data means more available information and more knowledge shared among users. However, information overload, uneven quality, and the evolving nature of the content make acquiring information and knowledge extremely difficult for average users. To make content more accessible, search functions are provided so that users can find content relevant to their queries. However, search may be insufficient when users do not know what to ask or how to issue a proper query; in such cases, an overview of a topic is preferable to a few isolated retrieval results. To overcome these problems, we propose to automatically organize and present the unstructured data in a form that facilitates information access and storage. This thesis aims to organize unstructured data into meaningful information and knowledge by exploring the representation of each data point and the relations among data points through knowledge-assisted hierarchical clustering. The contributions of the thesis are twofold. First, the sentence, as the basic unit, is represented from the aspects of lexical importance, lexical semantic gap reduction, and syntactic relation. This multi-aspect representation captures the similarity between a pair of sentences, as well as between passages and articles. Second, a novel prototype hierarchy-based categorization (PHC) framework is proposed for organizing a data collection on a given topic. The framework simultaneously solves the problems of categorizing the data collection and interpreting the clustering results for navigation.
Using prototype hierarchies and the underlying topic structures of the collections, PHC is modeled as a multi-criterion optimization problem that minimizes hierarchy evolution while maximizing category cohesiveness and inter-hierarchy structural and semantic resemblance. The extensible design of the metrics makes PHC a general framework for applications in various domains. Experiments conducted on two community question answering archives and two Open Directory Project collections demonstrate that the proposed multi-aspect sentence similarity metric and prototype hierarchy-based categorization produce promising results and significantly outperform the current state-of-the-art unsupervised data organization models. The proposed organization model is also applied to a real-world application, AutoFAQ, which compiles hierarchically organized FAQs for a given topic from community question answering archives.
URI: http://scholarbank.nus.edu.sg/handle/10635/27950
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: MingZhaoyan.pdf
Size: 6.24 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.