Publication

Bed-tree: An all-purpose index structure for string similarity search based on edit distance

Zhang, Z.
Hadjieleftheriou, M.
Ooi, B.C.
Srivastava, D.
Citations
Altmetric:
Alternative Title
Abstract
Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time. © 2010 ACM.
Keywords
approximate join, edit distance, range query, similarity search, string, top-k query
Source Title
Proceedings of the ACM SIGMOD International Conference on Management of Data
Publisher
Series/Report No.
Organizational Units
Organizational Unit
COMPUTER SCIENCE
dept
Rights
Date
2010
DOI
10.1145/1807167.1807266
Type
Conference Paper
Related Datasets
Related Publications