Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/40563
Title: A fast filtering scheme for large database cleansing
Authors: Sung, S.Y. 
Li, Z.
Sun, P.
Keywords: Data cleansing
Duplicate elimination
Filtering scheme
Similarity
Issue Date: 2002
Citation: Sung, S.Y., Li, Z., Sun, P. (2002). A fast filtering scheme for large database cleansing. International Conference on Information and Knowledge Management, Proceedings: 76-83. ScholarBank@NUS Repository.
Abstract: Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are now common, it is necessary to reduce the cleansing time. Data cleansing consists of two main components: a detection method and a comparison method. In this paper, we first propose a simple and fast comparison method, TI-Similarity, which reduces the time for each comparison. Based on TI-Similarity, we propose a new detection method, RAR, to further reduce the number of comparisons. With RAR and TI-Similarity, our new approach for cleansing large databases is composed of two processes: a filtering process and a pruning process. In the filtering process, a fast scan of the database is carried out with RAR and TI-Similarity. This process guarantees the detection of potential duplicate records but may introduce false positives. In the pruning process, the duplicate result from the filtering process is pruned to eliminate the false positives using more trustworthy comparison methods. The performance study shows that our approach is efficient and scalable for cleansing large databases, and is about an order of magnitude faster than existing cleansing methods.
Source Title: International Conference on Information and Knowledge Management, Proceedings
URI: http://scholarbank.nus.edu.sg/handle/10635/40563
Appears in Collections: Staff Publications
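
The abstract describes a filter-then-prune pipeline: a cheap comparison first collects every potential duplicate pair, and a more trustworthy comparison then discards the false positives. The sketch below illustrates that general idea only. It does not reproduce TI-Similarity or RAR, which this record does not define. As stand-ins it assumes a character-multiset ("bag") distance for the cheap filter and Levenshtein edit distance for the pruning step, and it compares all record pairs rather than using any particular detection method. All function names, the sample records, and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bag_distance(a: str, b: str) -> int:
    """Cheap lower bound on edit distance, computed from character counts."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def similarity(a: str, b: str, distance) -> float:
    """Normalise a distance into a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - distance(a, b) / longest


def filter_candidates(records, threshold):
    """Filtering step (assumed stand-in): keep pairs the cheap bound cannot rule out.

    Because bag distance never exceeds edit distance, the cheap similarity
    never falls below the exact one, so no true duplicate pair is dropped here.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b, bag_distance) >= threshold
    ]


def prune(records, candidates, threshold):
    """Pruning step (assumed stand-in): re-check candidates with the exact distance."""
    return [
        (i, j)
        for i, j in candidates
        if similarity(records[i], records[j], edit_distance) >= threshold
    ]


if __name__ == "__main__":
    records = ["john smith", "jon smith", "smith john", "mary jones"]
    candidates = filter_candidates(records, 0.8)
    print(candidates)                       # [(0, 1), (0, 2), (1, 2)]
    print(prune(records, candidates, 0.8))  # [(0, 1)]
```

In this toy run the filter admits "smith john" as a candidate because it has the same characters as "john smith", and the pruning step then rejects it. This mirrors the abstract's point that the fast pass may introduce false positives that a slower, more reliable comparison removes.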

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.