Title: Effective Interpretation, Integration and Querying of Web Tables
Authors: LU MEIYU
Keywords: Web table processing, header identification, schema extraction, schema matching, Web table integration, probabilistic tagging
Issue Date: 2-Aug-2013
Citation: LU MEIYU (2013-08-02). Effective Interpretation, Integration and Querying of Web Tables. ScholarBank@NUS Repository.
Abstract: The World Wide Web contains a vast amount of structured and semi-structured information in the form of HTML tables (a.k.a. Web tables). The rich information embedded in these Web tables gives us an opportunity to build a valuable knowledge base and make it usable and queryable for ordinary users. In this work, we propose and implement a holistic Web table processing framework to explore such knowledge. Our framework consists of three main components: Web table interpretation, integration and querying. Our first work presents a generic solution for extracting the schema (i.e., attribute names and data types) of Web tables. The main challenge arises from the diversity of Web tables, especially those with complex structure. For instance, the ways of organizing tables and presenting table headers vary widely across tables. In view of this, we propose a series of machine learning approaches, together with a rich set of header-relevant features, to identify the headers of Web tables. We further transform the Web tables into relational form with several hand-crafted heuristics. Our second work discovers high-quality schema matches between Web table columns, which is a fundamental problem in data integration. Conventional schema matching techniques are not always effective due to the incompleteness of values and the semantic heterogeneity in Web tables. To address this, we propose a concept-based hybrid machine-crowdsourcing framework to discover the matches effectively. To reduce the crowdsourcing cost, matches that are difficult for machine algorithms and that have greater influence on other matches are preferred for publication to the crowd. Our third work develops a convenient query interface for ordinary users to issue queries on the integrated Web tables. To this end, we introduce the idea of probabilistic tagging, where each value in the database is associated with multiple semantically relevant tags in a probabilistic way. With the enriched tags, users can issue structured queries using any tag they like, rather than over a predefined mediated schema. An efficient and effective dynamic instantiation scheme is designed to process user-issued queries, where the semantics of queried tags are determined on the fly. We validate our proposed approaches via extensive experiments on real-world Web table datasets.
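The probabilistic tagging idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the data layout, the `query_by_tag` helper, the `min_prob` threshold and all probabilities below are hypothetical, chosen only to show what it might mean for a value to carry several tags with probabilities and for a user to query by an arbitrary tag.

```python
# Hypothetical toy data: each value carries several candidate tags,
# each with a probability. Names and numbers are illustrative only.
TAGGED_VALUES = [
    {"value": "Singapore", "tags": {"city": 0.6, "country": 0.4}},
    {"value": "Paris", "tags": {"city": 0.9, "person": 0.1}},
    {"value": "France", "tags": {"country": 0.95, "city": 0.05}},
]


def query_by_tag(rows, tag, min_prob=0.3):
    """Return (value, probability) pairs whose probability for `tag`
    is at least `min_prob`, sorted by descending probability."""
    hits = [
        (row["value"], row["tags"][tag])
        for row in rows
        if row["tags"].get(tag, 0.0) >= min_prob
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # A user queries with the tag "city" instead of a fixed schema attribute.
    for value, prob in query_by_tag(TAGGED_VALUES, "city"):
        print(value, prob)
```

In this sketch the threshold simply filters out unlikely interpretations. The dynamic instantiation scheme described in the abstract resolves tag semantics at query time and is likely more involved than a fixed cutoff.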
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: LuMeiyu.pdf; Size: 3.09 MB; Format: Adobe PDF




