On Repairing Structural Issues in Semi-Structured Documents | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/107392

DC Field	Value
dc.title	On Repairing Structural Issues in Semi-Structured Documents
dc.contributor.author	YING SHANSHAN
dc.date.accessioned	2014-10-31T18:00:47Z
dc.date.available	2014-10-31T18:00:47Z
dc.date.issued	2014-06-03
dc.identifier.citation	YING SHANSHAN (2014-06-03). On Repairing Structural Issues in Semi-Structured Documents. ScholarBank@NUS Repository.
dc.identifier.uri	http://scholarbank.nus.edu.sg/handle/10635/107392
dc.description.abstract	Poor quality of data can have a substantial social and economic impact. Although data quality management is a well-established research area, the vast majority of prior work focuses on relational data. Increasingly, semi-structured data formats, such as XML and JSON, are becoming the de facto standard for a huge variety of data formats and applications. Their flexibility and easy customization contribute to the soaring popularity of semi-structured data, but also serve as significant sources of major data quality errors. Well-formedness of structure, a prerequisite for many research works on semi-structured data, is an assumption that often does not hold. Many XML documents suffer from erroneous structures, such as improper nesting where open- and close-tags are unmatched. Apart from this, tags may be organized in an incorrect hierarchy or sequence, leading to an unexpected number of occurrences. To enforce the balance of open- and close-tags, we propose in this thesis two algorithms targeting different structural constraints. The first algorithm focuses on tags only, while the second limits the occurrence of text in the document. Thorough proofs are presented on the completeness and approximation ratio of these algorithms. In addition, we concentrate on detecting unexpected element errors, which arise when there are missing or spurious elements. We propose novel techniques to detect unexpected element errors and provide plausible reasoning for every reported error, together with a summarization technique based on variations of set cover for concise reporting. We demonstrate the effectiveness of these algorithms on real datasets through an extensive experimental study.
dc.language.iso	en
dc.subject	XML, data cleaning
dc.type	Thesis
dc.contributor.department	COMPUTER SCIENCE
dc.contributor.supervisor	TUNG KUM HOE, ANTHONY
dc.description.degree	Ph.D
dc.description.degreeconferred	DOCTOR OF PHILOSOPHY
dc.identifier.isiut	NOT_IN_WOS
Appears in Collections:	Ph.D Theses (Open)

Files in This Item:

File	Description	Size	Format	Access Settings	Version
ON REPAIRING STRUCTURAL ISSUES IN SEMI-STRUCTURED DOCUMENTS.pdf		1.73 MB	Adobe PDF	OPEN	None	View/Download

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.