Title: Cost-sensitive web-based information acquisition for record matching
Authors: TAN YEE FAN
Keywords: cost-sensitive acquisition, web resource, record matching, hierarchical resource acquisition framework, resource dependency graph, benefit function
Issue Date: 25-Mar-2011
Citation: TAN YEE FAN (2011-03-25). Cost-sensitive web-based information acquisition for record matching. ScholarBank@NUS Repository.
Abstract: In many record matching problems, the input data is ambiguous or incomplete, which makes the matching task difficult. For some domains, however, evidence for record matching decisions is readily available in large quantities on the Web, where it can be retrieved by querying a search engine. On the other hand, Web resources are slow to acquire compared with data already available in the input, and some Web resources must be acquired before others. Hence, it is necessary to acquire Web resources selectively and judiciously, while satisfying the acquisition dependencies between them. This thesis has two major goals: (1) to establish that acquiring Web-based resources can improve performance on record matching tasks, and (2) to propose an algorithm for selective acquisition of Web-based resources for record matching tasks that balances acquisition costs against acquisition benefits while taking acquisition dependencies between resources into account. The thesis has two major parts corresponding to these two goals. In the first part, I propose methods for using information from the Web for three different record matching problems, namely author name disambiguation, linkage of short forms to long forms, and web people search. Thus, I establish that acquiring Web-based resources can improve performance on record matching tasks. In the second and larger part, I propose approaches for selective acquisition of Web-based resources for record matching tasks, with the aim of balancing acquisition costs and acquisition benefits. These approaches progress from the more task-specific to the more general and principled. I first propose a way of adaptively combining two methods for record matching, followed by a cost-sensitive attribute value acquisition algorithm for support vector machines.
This work culminates in a framework for solving cost-sensitive resource acquisition problems with hierarchical dependencies, which is the main contribution of this thesis. This graphical framework is versatile and can be applied to a large variety of problems. In the context of this framework, I propose an effective resource acquisition algorithm for record matching problems that takes the particular characteristics of such problems into account. Finally, I propose two benefit functions for use in the framework, corresponding to two different evaluation measures.
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: TanYF.pdf
Size: 2.05 MB
Format: Adobe PDF



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.