Domain adaptation and training data acquisition in wide-coverage word sense disambiguation and its application to information retrieval | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/38828

Title:	Domain adaptation and training data acquisition in wide-coverage word sense disambiguation and its application to information retrieval
Authors:	ZHONG ZHI
Keywords:	Word Sense Disambiguation, WSD, Information Retrieval
Issue Date:	14-Aug-2012
Citation:	ZHONG ZHI (2012-08-14). Domain adaptation and training data acquisition in wide-coverage word sense disambiguation and its application to information retrieval. ScholarBank@NUS Repository.
Abstract:	Word Sense Disambiguation (WSD) is the process of identifying the meaning of an ambiguous word in context. It is considered a fundamental task in Natural Language Processing (NLP). Previous research shows that supervised approaches achieve state-of-the-art accuracy for WSD. However, the performance of supervised approaches is affected by several factors, such as domain mismatch and the lack of sense-annotated training examples. As an intermediate component, WSD has the potential to benefit many other NLP tasks, such as machine translation and information retrieval (IR). Yet few WSD systems are integrated as components of other applications. We release an open-source supervised WSD system, IMS (It Makes Sense). In the evaluation on lexical-sample tasks of several languages and English all-words tasks of SensEval workshops, IMS achieves state-of-the-art results. It provides a flexible platform to integrate various feature types and different machine learning methods, and can be used as an all-words WSD component with good performance for other applications. To address the domain adaptation problem in WSD, we apply the feature augmentation technique to WSD. By further combining the feature augmentation technique with active learning, we greatly reduce the annotation effort required when adapting a WSD system to a new domain. One bottleneck of supervised WSD systems is the lack of sense-annotated training examples. We propose an approach to extract sense-annotated examples from parallel corpora without extra human effort. Our evaluation shows that incorporating the extracted examples achieves better results than using only the manually annotated examples. Previous research arrives at conflicting conclusions on whether WSD systems can improve information retrieval performance. We propose a novel method to estimate the sense distribution of words in short queries. Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations. Our experimental results on standard TREC collections show that using the word senses tagged by our supervised WSD system, we obtain statistically significant improvements over a state-of-the-art IR system.
URI:	http://scholarbank.nus.edu.sg/handle/10635/38828
Appears in Collections:	Ph.D Theses (Open)

Files in This Item:

File	Description	Size	Format	Access Settings	Version
thesis.pdf		899.81 kB	Adobe PDF	OPEN	None

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.