Title: Correlation-based methods for data cleaning, with application to biological databases
Keywords: data cleaning, correlation mining, biological data, data artifacts, duplicate detection, outlier detection
Issue Date: 25-Sep-2007
Citation: KOH LIE YONG (2007-09-25). Correlation-based methods for data cleaning, with application to biological databases. ScholarBank@NUS Repository.
Abstract: Data cleaning aims to improve data quality by detecting and eliminating data artifacts that hamper the efficacy of analysis or data mining. Despite its importance, data cleaning remains neglected in certain knowledge-driven domains such as Bioinformatics. An in-depth study of real-world biological databases indicates that the biological data quality problem is multi-factorial and requires a number of different data cleaning approaches. Current data cleaning approaches that derive observations of data artifacts from the attribute values are inadequate. This thesis exploits the correlation patterns between attributes to provide additional information about the relationships embedded within a data set for data cleaning. We propose three novel correlation-based data cleaning methods to detect outliers and duplicates, and apply them to biological databases as proofs of concept. Experimental results show the effectiveness of these correlation-based data cleaning methods in detecting data artifacts that existing approaches fall short of addressing.
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: KOHJLY.pdf
Size: 1.92 MB
Format: Adobe PDF




Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.