Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/107392
Title: On Repairing Structural Issues in Semi-Structured Documents
Authors: YING SHANSHAN
Keywords: XML, data cleaning
Issue Date: 3-Jun-2014
Source: YING SHANSHAN (2014-06-03). On Repairing Structural Issues in Semi-Structured Documents. ScholarBank@NUS Repository.
Abstract: Poor data quality can have a substantial social and economic impact. Although data quality management is a well-established research area, the vast majority of prior work focuses on relational data. Increasingly, semi-structured data formats such as XML and JSON are becoming the de facto standard for a huge variety of data formats and applications. Their flexibility and easy customization contribute to the soaring popularity of semi-structured data, but also serve as significant sources of major data quality errors. Well-formedness of structure, a prerequisite for many research works on semi-structured data, is an assumption that often does not hold. Many XML documents suffer from erroneous structures, such as improper nesting where open- and close-tags are unmatched. Apart from this, tags may be organized in an incorrect hierarchy or sequence, leading to an unexpected number of occurrences. To enforce the balance of open- and close-tags, we propose in this thesis two algorithms targeting different structural constraints. The first algorithm focuses on tags only, while the second limits the occurrence of text in the document. Thorough proofs are presented on the completeness and approximation ratio of these algorithms. In addition, we concentrate on detecting unexpected element errors, which arise when there are missing or spurious elements. We propose novel techniques to detect unexpected element errors, provide plausible reasoning for every reported error, and offer a summarization technique based on variations of set cover for concise reporting. We demonstrate the effectiveness of these algorithms on real datasets through an extensive experimental study.
URI: http://scholarbank.nus.edu.sg/handle/10635/107392
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: ON REPAIRING STRUCTURAL ISSUES IN SEMI-STRUCTURED DOCUMENTS.pdf
Size: 1.73 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.