Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/107392
DC Field | Value | |
---|---|---|
dc.title | On Repairing Structural Issues in Semi-Structured Documents | |
dc.contributor.author | YING SHANSHAN | |
dc.date.accessioned | 2014-10-31T18:00:47Z | |
dc.date.available | 2014-10-31T18:00:47Z | |
dc.date.issued | 2014-06-03 | |
dc.identifier.citation | YING SHANSHAN (2014-06-03). On Repairing Structural Issues in Semi-Structured Documents. ScholarBank@NUS Repository. | |
dc.identifier.uri | http://scholarbank.nus.edu.sg/handle/10635/107392 | |
dc.description.abstract | Poor quality of data can have a substantial social and economic impact. Although data quality management is a well-established research area, the vast majority of prior works focus on relational data. Increasingly, semi-structured data, such as XML and JSON, are becoming the de facto standard for a huge variety of data formats and applications. Their flexibility and easy customization contribute to the soaring popularity of semi-structured data, but also serve as significant sources of major data quality errors. Well-formedness of structure, a prerequisite for many research works on semi-structured data, is an assumption that often does not hold. Many XML documents suffer from erroneous structures, such as improper nesting where open- and close-tags are unmatched. Apart from this, tags are possibly organized in an incorrect hierarchy or sequence, leading to an unexpected number of occurrences. To enforce the balance of open- and close-tags, we propose in this thesis two algorithms targeting different structural constraints. The first algorithm focuses on tags only while the second limits the occurrence of text in the document. Thorough proofs are presented on the completeness and approximation ratio of these algorithms. Besides, we concentrate on detecting unexpected element errors, when there are missing or spurious elements. We propose novel techniques to detect unexpected element errors and provide plausible reasoning for every reported error and a summarization technique based on variations of set cover for concise reporting. We demonstrate the effectiveness of these algorithms on real datasets through extensive experimental study. | |
dc.language.iso | en | |
dc.subject | XML, data cleaning | |
dc.type | Thesis | |
dc.contributor.department | COMPUTER SCIENCE | |
dc.contributor.supervisor | TUNG KUM HOE, ANTHONY | |
dc.description.degree | Ph.D | |
dc.description.degreeconferred | DOCTOR OF PHILOSOPHY | |
dc.identifier.isiut | NOT_IN_WOS | |
Appears in Collections: | Ph.D Theses (Open) |
Files in This Item:
File | Description | Size | Format | Access Settings | Version | |
---|---|---|---|---|---|---|
ON REPAIRING STRUCTURAL ISSUES IN SEMI-STRUCTURED DOCUMENTS.pdf | | 1.73 MB | Adobe PDF | OPEN | None | View/Download |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.