Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/107392
Title: | On Repairing Structural Issues in Semi-Structured Documents |
Authors: | YING SHANSHAN |
Keywords: | XML, data cleaning |
Issue Date: | 3-Jun-2014 |
Citation: | YING SHANSHAN (2014-06-03). On Repairing Structural Issues in Semi-Structured Documents. ScholarBank@NUS Repository. |
Abstract: | Poor quality of data can have a substantial social and economic impact. Although data quality management is a well-established research area, the vast majority of prior works focus on relational data. Increasingly, semi-structured data, such as XML and JSON, are becoming the de facto standard for a huge variety of data formats and applications. Their flexibility and easy customization contribute to the soaring popularity of semi-structured data, but also serve as significant sources of major data quality errors. Well-formedness of structure, a prerequisite for many research works on semi-structured data, is an assumption that often does not hold. Many XML documents suffer from erroneous structures, such as improper nesting where open- and close-tags are unmatched. Apart from this, tags may be organized in an incorrect hierarchy or sequence, leading to an unexpected number of occurrences. To enforce the balance of open- and close-tags, we propose in this thesis two algorithms targeting different structural constraints. The first algorithm focuses on tags only, while the second limits the occurrence of text in the document. Thorough proofs are presented on the completeness and approximation ratio of these algorithms. Besides, we concentrate on detecting unexpected element errors, which arise when there are missing or spurious elements. We propose novel techniques to detect unexpected element errors and provide plausible reasoning for every reported error, together with a summarization technique based on variations of set cover for concise reporting. We demonstrate the effectiveness of these algorithms on real datasets through an extensive experimental study. |
URI: | http://scholarbank.nus.edu.sg/handle/10635/107392 |
Appears in Collections: | Ph.D Theses (Open) |
Files in This Item:
File | Description | Size | Format | Access Settings | Version | |
---|---|---|---|---|---|---|
ON REPAIRING STRUCTURAL ISSUES IN SEMI-STRUCTURED DOCUMENTS.pdf | | 1.73 MB | Adobe PDF | OPEN | None | View/Download |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.