Please use this identifier to cite or link to this item: https://doi.org/10.1016/S0306-4379(01)00041-2
DC Field: Value
dc.title: A knowledge-based approach for duplicate elimination in data cleaning
dc.contributor.author: Lup Low, W.
dc.contributor.author: Li Lee, M.
dc.contributor.author: Wang Ling, T.
dc.date.accessioned: 2013-07-04T07:33:23Z
dc.date.available: 2013-07-04T07:33:23Z
dc.date.issued: 2001
dc.identifier.citation: Lup Low, W., Li Lee, M., Wang Ling, T. (2001). A knowledge-based approach for duplicate elimination in data cleaning. Information Systems 26 (8) : 585-606. ScholarBank@NUS Repository. https://doi.org/10.1016/S0306-4379(01)00041-2
dc.identifier.issn: 03064379
dc.identifier.uri: http://scholarbank.nus.edu.sg/handle/10635/39075
dc.description.abstract: Existing duplicate elimination methods for data cleaning work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision can be achieved analogously at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategies and more. We propose a new method for computing transitive closure under uncertainty for dealing with the merging of groups of inexact duplicate records and explain why small changes to window sizes have little effect on the results of the sorted neighborhood method. An experimental study with two real-world datasets shows that this approach can accurately identify duplicates and anomalies with high recall and precision, thus effectively resolving the recall-precision dilemma.
dc.description.uri: http://libproxy1.nus.edu.sg/login?url=http://dx.doi.org/10.1016/S0306-4379(01)00041-2
dc.source: Scopus
dc.subject: Data cleaning
dc.subject: Duplicate elimination
dc.subject: Knowledge-based system
dc.type: Article
dc.contributor.department: COMPUTER SCIENCE
dc.description.doi: 10.1016/S0306-4379(01)00041-2
dc.description.sourcetitle: Information Systems
dc.description.volume: 26
dc.description.issue: 8
dc.description.page: 585-606
dc.description.coden: INSYD
dc.identifier.isiut: NOT_IN_WOS
Appears in Collections: Staff Publications
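
The recall-precision dilemma named in the abstract can be illustrated with a minimal sketch of a windowed comparison over sorted records. This is not the authors' knowledge-based framework, and it does not reproduce their transitive closure under uncertainty. The function names, the `window` and `threshold` parameters, and the use of `difflib.SequenceMatcher` as a similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (assumption, not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Compare each record with the next (window - 1) records in sorted order.

    Returns the set of index pairs (i, j), i < j, whose similarity
    meets or exceeds the threshold.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def precision_recall(found, truth):
    """Precision and recall of the found duplicate pairs against known pairs."""
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical toy data: records 0 and 1 are the only true duplicates.
records = ["alpha", "alpha", "zulu"]
truth = {(0, 1)}

strict = sorted_neighborhood(records, key=str.lower, window=3, threshold=1.0)
loose = sorted_neighborhood(records, key=str.lower, window=3, threshold=0.0)
```

With the strict threshold only the exact match is reported, so precision and recall are both 1.0 on this toy data. With the loose threshold every pair inside the window is accepted, which keeps recall at 1.0 but lowers precision to 1/3.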
