Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/107392
DC Field: Value
dc.title: On Repairing Structural Issues in Semi-Structured Documents
dc.contributor.author: YING SHANSHAN
dc.date.accessioned: 2014-10-31T18:00:47Z
dc.date.available: 2014-10-31T18:00:47Z
dc.date.issued: 2014-06-03
dc.identifier.citation: YING SHANSHAN (2014-06-03). On Repairing Structural Issues in Semi-Structured Documents. ScholarBank@NUS Repository.
dc.identifier.uri: http://scholarbank.nus.edu.sg/handle/10635/107392
dc.description.abstract: Poor quality of data can have a substantial social and economic impact. Although data quality management is a well-established research area, the vast majority of prior works focus on relational data. Increasingly, semi-structured data, such as XML and JSON, are becoming the de facto standard for a huge variety of data formats and applications. Their flexibility and easy customization contribute to the soaring popularity of semi-structured data, but also serve as significant sources of major data quality errors. Well-formedness of structure, a prerequisite for many research works on semi-structured data, is an assumption that often does not hold. Many XML documents suffer from erroneous structures, such as improper nesting where open- and close-tags are unmatched. Apart from this, tags may be organized in an incorrect hierarchy or sequence, leading to an unexpected number of occurrences. To enforce the balance of open- and close-tags, we propose in this thesis two algorithms targeting different structural constraints. The first algorithm focuses on tags only, while the second limits the occurrence of text in the document. Thorough proofs are presented on the completeness and approximation ratio of these algorithms. Besides, we concentrate on detecting unexpected element errors, which arise when there are missing or spurious elements. We propose novel techniques to detect unexpected element errors and provide plausible reasoning for every reported error, together with a summarization technique based on variations of set cover for concise reporting. We demonstrate the effectiveness of these algorithms on real datasets through an extensive experimental study.
dc.language.iso: en
dc.subject: XML, data cleaning
dc.type: Thesis
dc.contributor.department: COMPUTER SCIENCE
dc.contributor.supervisor: TUNG KUM HOE, ANTHONY
dc.description.degree: Ph.D
dc.description.degreeconferred: DOCTOR OF PHILOSOPHY
dc.identifier.isiut: NOT_IN_WOS
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: ON REPAIRING STRUCTURAL ISSUES IN SEMI-STRUCTURED DOCUMENTS.pdf
Size: 1.73 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.