Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/41558
Title: Dynamic similarity for fields with NULL values
Authors: Zhao, L.
Yuan, S.S. 
Yang, Q.X.
Peng, S.
Issue Date: 2002
Source: Zhao, L., Yuan, S.S., Yang, Q.X., Peng, S. (2002). Dynamic similarity for fields with NULL values. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2454 LNCS: 161-169. ScholarBank@NUS Repository.
Abstract: One of the most important tasks in data cleansing is de-duplicating records, which requires comparing records to determine their equivalence. However, existing comparison methods, such as Record Similarity and Equational Theory, implicitly assume that the values in all fields are known and treat NULL values as empty strings, which causes correct duplicate records to be missed. In this paper, we solve this problem by proposing a simple yet efficient method, Dynamic Similarity, which dynamically adjusts the similarity for fields with NULL values. Performance results on real and synthetic datasets show that Dynamic Similarity finds more correct duplicate records than Record Similarity without introducing more false positives. Furthermore, the percentage of correct duplicate records found by Dynamic Similarity but not by Record Similarity increases as the number of fields with NULL values increases. © 2002 Springer-Verlag Berlin Heidelberg.
Source Title: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
URI: http://scholarbank.nus.edu.sg/handle/10635/41558
ISBN: 3540441239
ISSN: 03029743
Appears in Collections: Staff Publications
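
The abstract above contrasts treating NULL values as empty strings with adjusting the similarity when a field is NULL. The abstract does not give the paper's formula, so the following is only an illustrative sketch under stated assumptions, not the authors' method: record similarity is assumed to be a weighted average of per-field string similarities, and the "dynamic" variant is assumed to skip fields where either value is NULL and renormalize the remaining weights. The function names, the weights, and the use of `difflib.SequenceMatcher` as the field comparator are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

Record = Sequence[Optional[str]]


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Baseline: a NULL (None) value is compared as an empty string."""
    total = 0.0
    for a, b, w in zip(r1, r2, weights):
        total += w * field_similarity(a or "", b or "")
    return total / sum(weights)


def dynamic_similarity(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Assumed variant: ignore fields where either value is NULL and
    renormalize over the weights of the fields that can be compared."""
    total = 0.0
    used_weight = 0.0
    for a, b, w in zip(r1, r2, weights):
        if a is None or b is None:
            continue  # unknown value: contributes neither evidence nor weight
        total += w * field_similarity(a, b)
        used_weight += w
    # With no comparable fields there is no evidence of a match.
    return total / used_weight if used_weight else 0.0


# Two records that agree on every known field; one has a NULL city.
rec_a = ("Li Zhao", "12 Kent Ridge Road", None)
rec_b = ("Li Zhao", "12 Kent Ridge Road", "Singapore")
weights = (1.0, 1.0, 1.0)

baseline = record_similarity(rec_a, rec_b, weights)
dynamic = dynamic_similarity(rec_a, rec_b, weights)
```

In this toy case the baseline penalizes the pair for the missing city, while the renormalized score depends only on the two fields that are actually known. With a fixed duplicate threshold, that difference is the kind of effect that could cause a true duplicate to be missed when NULLs are treated as empty strings.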

Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.