Constructing compressed suffix arrays with large alphabets

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/39713

DC Field	Value
dc.title	Constructing compressed suffix arrays with large alphabets
dc.contributor.author	Hon, W.-K.
dc.contributor.author	Lam, T.-W.
dc.contributor.author	Sadakane, K.
dc.contributor.author	Sung, W.-K.
dc.date.accessioned	2013-07-04T07:47:52Z
dc.date.available	2013-07-04T07:47:52Z
dc.date.issued	2003
dc.identifier.citation	Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K. (2003). Constructing compressed suffix arrays with large alphabets. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2906 : 240-249. ScholarBank@NUS Repository.
dc.identifier.issn	03029743
dc.identifier.uri	http://scholarbank.nus.edu.sg/handle/10635/39713
dc.description.abstract	Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H₀ + 1 + ε)n bits of working space, where H₀ is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |Σ| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (H₀ + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that H₀ > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
dc.source	Scopus
dc.type	Article
dc.contributor.department	COMPUTER SCIENCE
dc.description.sourcetitle	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.description.volume	2906
dc.description.page	240-249
dc.identifier.isiut	NOT_IN_WOS
Appears in Collections:	Staff Publications
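
The abstract's space comparison can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it computes the 0-th order empirical entropy H₀ of a string and evaluates the two working-space expressions quoted in the abstract, taking their coefficients at face value. The function names `zero_order_entropy` and `working_space_bits` and the default value of `eps` are assumptions made for this example.

```python
import math
from collections import Counter


def zero_order_entropy(text: str) -> float:
    """0-th order empirical entropy H_0 in bits per character.

    H_0 = sum over distinct symbols c of (n_c / n) * log2(n / n_c),
    where n_c is the number of occurrences of c and n = len(text).
    """
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def working_space_bits(text: str, eps: float = 0.1) -> tuple[float, float]:
    """Evaluate both space expressions from the abstract, in bits.

    Returns (earlier, new) where
      earlier = (2*H_0 + 1 + eps) * n   # algorithm cited as [15]
      new     = (H_0 + 2 + eps) * n     # algorithm of this paper
    eps is a hypothetical choice for the constant in the abstract.
    """
    n = len(text)
    h0 = zero_order_entropy(text)
    earlier = (2 * h0 + 1 + eps) * n
    new = (h0 + 2 + eps) * n
    return earlier, new


if __name__ == "__main__":
    for sample in ("ACGT" * 4, "aaaa"):
        h0 = zero_order_entropy(sample)
        earlier, new = working_space_bits(sample)
        print(f"{sample!r}: H0={h0:.3f} earlier={earlier:.1f} new={new:.1f}")
```

The difference between the two expressions is (H₀ − 1)n bits, so the new expression is smaller exactly when H₀ > 1, which matches the abstract's remark. For a text with a single repeated symbol, H₀ = 0 and the earlier expression is the smaller one.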

Files in This Item:

There are no files associated with this item.