Efficient and scalable processing of string similarity join | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://doi.org/10.1109/TKDE.2012.195

Title:	Efficient and scalable processing of string similarity join
Authors:	Rong, C. Lu, W. Wang, X. Du, X. Chen, Y. Tung, A.K.H.
Keywords:	MapReduce multiple filtering Similarity join
Issue Date:	2013
Citation:	Rong, C., Lu, W., Wang, X., Du, X., Chen, Y., Tung, A.K.H. (2013). Efficient and scalable processing of string similarity join. IEEE Transactions on Knowledge and Data Engineering 25 (10) : 2217-2230. ScholarBank@NUS Repository. https://doi.org/10.1109/TKDE.2012.195
Abstract:	The string similarity join is a basic operation of many applications that need to find all string pairs from a collection given a similarity function and a user-specified threshold. Recently, there has been considerable interest in designing new algorithms with the assistant of an inverted index to support efficient string similarity joins. These algorithms typically adopt a two-step filter-and-refine approach in identifying similar string pairs: 1) generating candidate pairs by traversing the inverted index; and 2) verifying the candidate pairs by computing the similarity. However, these algorithms either suffer from poor filtering power (which results in high verification cost), or incur too much computational cost to guarantee the filtering power. In this paper, we propose a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly. We also propose a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework. We conduct extensive experiments on both centralized and Hadoop systems using both real and synthetic data sets, and the results show that our proposed approach outperforms existing approaches in both efficiency and scalability. © 1989-2012 IEEE.
Source Title:	IEEE Transactions on Knowledge and Data Engineering
URI:	http://scholarbank.nus.edu.sg/handle/10635/77846
ISSN:	10414347
DOI:	10.1109/TKDE.2012.195
Appears in Collections:	Staff Publications

Show full item record

Files in This Item:

There are no files associated with this item.

Google Scholar^TM

Check

Altmetric

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.