Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/43759
Title: Unsupervised Structure Induction for Natural Language Processing
Authors: HUANG YUN
Keywords: natural language processing, grammar induction, unsupervised learning, Bayesian
Issue Date: 28-Mar-2013
Citation: HUANG YUN (2013-03-28). Unsupervised Structure Induction for Natural Language Processing. ScholarBank@NUS Repository.
Abstract: Many Natural Language Processing (NLP) tasks involve some kind of structure analysis, such as word alignment for machine translation, syntactic parsing for coreference resolution, and semantic parsing for question answering. Traditional supervised learning methods rely on manually labeled structures for training. Unfortunately, manual annotation is often expensive and time-consuming for large amounts of rich text, so inducing structures automatically from unannotated sentences is of great value to NLP research. In this thesis, I first introduce and analyze existing methods for structure induction, then present our explorations of three unsupervised structure induction tasks: transliteration equivalence learning, constituency grammar induction, and dependency grammar induction (illustrative sketches of the key techniques follow this abstract).

In transliteration equivalence learning, transliterated bilingual word pairs are given without internal syllable alignments, and the task is to infer the mapping between syllables in the source and target languages automatically. This dissertation addresses shortcomings of the state-of-the-art grapheme-based joint source-channel model and proposes the Synchronous Adaptor Grammar (SAG), a novel nonparametric Bayesian learning approach for machine transliteration. The model provides a general framework for learning syllable equivalents automatically, without heuristics or restrictions.

Constituency grammar induction is useful because annotated treebanks are available for only a few languages. This dissertation focuses on the effective Constituent-Context Model (CCM) and proposes to enrich it with linguistic features. The features are defined in log-linear form with local normalization, so the efficient Expectation-Maximization (EM) algorithm remains applicable. Moreover, we advocate using a separate development set (a.k.a. the validation set) for model selection and measuring the trained model on an additional test set; under this framework, suitable models and parameters can be selected automatically rather than set by hand. Empirical results demonstrate that the feature-based model overcomes the data sparsity problem of the original CCM and achieves better performance with a more compact representation.

Dependency grammars model word-to-word dependencies, which are useful for higher-level tasks such as relation extraction and coreference resolution. This dissertation investigates Combinatory Categorial Grammar (CCG), an expressive lexicalized grammar formalism able to capture long-range dependencies. We introduce boundary part-of-speech (POS) tags into the baseline model (Bisk and Hockenmaier, 2012, AAAI) to capture lexical information. For learning, we propose a Bayesian model for inducing CCG grammars, and the full EM and k-best EM algorithms are also implemented and compared. Experiments show that the boundary model improves dependency accuracy under all three learning algorithms, and that the proposed Bayesian model outperforms full EM but underperforms k-best EM.

In summary, this dissertation investigates unsupervised learning methods, including Bayesian models and feature-based models, and provides novel ideas for unsupervised structure induction in natural language processing. The automatically induced structures may benefit downstream NLP applications.
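Illustrative sketch 1. The abstract describes SAG as a nonparametric Bayesian approach that learns reusable syllable equivalents without heuristics. The thesis text itself is not reproduced on this page, so the following is only a toy Python illustration of the Chinese Restaurant Process "caching" idea commonly used in adaptor grammars; the function names, the alpha parameter, and the example syllable pairs are hypothetical and not taken from the thesis.

# Toy sketch (illustrative only, not the thesis implementation): the
# Chinese Restaurant Process caching that underlies adaptor grammars.
# Previously generated items (here, whole source/target syllable pairs)
# are reused with probability proportional to how often they were
# generated before, which is how reusable syllable equivalents emerge.

import random

def crp_draw(counts, alpha, base_sample):
    """Draw one item from a CRP with concentration alpha.
    counts: dict mapping previously generated items to their counts.
    base_sample: function returning a fresh draw from the base distribution."""
    total = sum(counts.values())
    if random.random() < alpha / (alpha + total):
        item = base_sample()            # "new table": draw from the base grammar
    else:
        r = random.uniform(0, total)    # reuse an existing item, proportional to its count
        for item, c in counts.items():
            r -= c
            if r <= 0:
                break
    counts[item] = counts.get(item, 0) + 1
    return item

# Hypothetical base distribution over source/target syllable pairs.
pairs = [("na", "娜"), ("o", "奥"), ("mi", "米")]
counts = {}
for _ in range(20):
    crp_draw(counts, alpha=1.0, base_sample=lambda: random.choice(pairs))
print(counts)   # frequently drawn pairs accumulate counts: the rich-get-richer effect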
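Illustrative sketch 2. The abstract states that the CCM features are defined in log-linear form with local normalization so that EM remains applicable. As a rough sketch of what such a locally normalized parameterization typically looks like (the symbols w, f, c, and x are assumptions for illustration, not notation taken from the thesis):

% Sketch: a locally normalized log-linear parameterization of a
% CCM-style conditional distribution over constituent yields c given a
% context x, with feature vector f(c, x) and weight vector w.
\[
  P_{w}(c \mid x) \;=\;
  \frac{\exp\bigl(w^{\top} f(c, x)\bigr)}
       {\sum_{c'} \exp\bigl(w^{\top} f(c', x)\bigr)}
\]
% Because each conditional distribution still sums to one, the E-step of
% EM is unchanged; the M-step optimizes w (e.g., by gradient ascent on
% the expected complete-data log-likelihood) instead of re-normalizing counts.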
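Illustrative sketch 3. The abstract compares full EM against k-best EM for learning CCG grammars. The Python sketch below is a generic illustration of the difference, namely expected rule counts computed over all parses of a sentence versus over only the k highest-scoring parses; the data structures, scores, and toy rules are invented for illustration and are not from the thesis.

# Hypothetical sketch: full-EM expected rule counts sum over every parse
# weighted by its posterior probability, while a k-best EM approximation
# keeps only the k highest-scoring parses.

from collections import Counter

def expected_rule_counts(parses, k=None):
    """parses: list of (rule_list, score) pairs for one sentence,
    where score is proportional to the parse probability.
    If k is given, only the k best parses are used (k-best EM);
    otherwise all parses are used (full EM)."""
    if k is not None:
        parses = sorted(parses, key=lambda p: p[1], reverse=True)[:k]
    total = sum(score for _, score in parses)
    counts = Counter()
    for rules, score in parses:
        weight = score / total          # posterior responsibility of this parse
        for rule in rules:
            counts[rule] += weight      # fractional (expected) count
    return counts

# Toy example: three parses of one sentence with unnormalized scores.
parses = [
    (["S -> NP VP", "NP -> DT NN", "VP -> VBZ"], 0.6),
    (["S -> NP VP", "NP -> NN",    "VP -> VBZ"], 0.3),
    (["S -> X Y",   "X -> DT",     "Y -> NN VBZ"], 0.1),
]

print(expected_rule_counts(parses))        # full EM: all parses contribute
print(expected_rule_counts(parses, k=1))   # k-best EM with k=1: Viterbi-style counts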
URI: http://scholarbank.nus.edu.sg/handle/10635/43759
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: HuangYun.pdf | Description: None | Size: 780.1 kB | Format: Adobe PDF | Access Settings: Open


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.