View/hide extracted text (may have errors)
Inducing Suffix and LCP Arrays in External Memory
Timo Bingmann*, Johannes Fischer†, and Vitaly Osipov‡
KIT, Institute of Theoretical Informatics, 76131 Karlsruhe, Germany
{timo.bingmann,johannes.fischer,osipov}@kit.edu
Abstract rithm (DC3 for short) by K¨arkk¨ainen et al. [21], which
We consider text index construction in external memory has quickly become a showcase string algorithm and
(EM). Our first contribution is an inducing algorithm for is now being taught in many computer science classes
suffix arrays in external memory. Practical tests show around the world, is reported to be 3–4 times slower
that this outperforms the previous best EM suffix sorter than the best superlinear solutions, even with very care-
[Dementiev et al., ALENEX 2005] by a factor of about two ful implementations [29].
in time and I/O-volume. Our second contribution is to This situation changed when in 2009 Nong et al.
augment the first algorithm to also construct the array of [27, we cite more recent journal versions whenever
longest common prefixes (LCPs). This yields the first EM possible] presented another extremely elegant linear
construction algorithm for LCP arrays. The overhead in time algorithm called SAIS that was also fast in practice
time and I/O volume for this extended algorithm over plain (based on the induced sorting principle [17])! Despite
suffix array construction is roughly two. Our algorithms being almost in-place and faster than (or almost as
scale far beyond problem sizes previously considered in the fast as) all previous algorithms on all practical inputs,
literature (text size of 80 GiB using only 4 GiB of RAM in its worst-case guarantees also imply that it has a
our experiments). similar behavior on all inputs, while for all engineered
superlinear algorithms [25,26,31, etc.] there exist “bad”
1 Introduction inputs where the running time shoots up by several
order of magnitudes.
Suffix arrays [16,24] are among the most popular data
Nonetheless, the simplicity of the DC3 algorithm
structures for full text indexing. They list all suffixes of
(mostly sorting and scanning) enables straightforward
a static text in lexicographically increasing order. This
adaptation to more advanced models of computation
not only allows to efficiently locate arbitrary patterns
(PRAM, EM, distributed, etc.), and usually leads to
in unstructured texts (like DNA, East Asian languages,
optimal algorithms in those models. In fact, there is a
etc.) in time proportional to the pattern length (as
fast EM implementation of DC3 [7] that outperformed
opposed to text length), but also fast phrase searches
all other external suffix sorters in practice at the time
(e.g., “to be or not to be”) if the suffix array is built
of its writing. Other external implementations of DC3
over the phrase beginnings only [11].
(or its variant DC7) confirmed those results [9].
The first and most important step in using suffix
In many applications (e.g., for fast string matching),
arrays is the efficient construction of the index (“suffix
the suffix array needs to be augmented with the longest
sorting”), the term “efficient” encompassing both time
common prefix array (LCP array for short), which holds
and space. Until recently, the text indexing commu-
the lengths of longest common prefixes of lexicographi-
nity was confronted with the dilemma that there were
cally consecutive suffixes. In internal memory, the LCP
theoretically fast algorithms for constructing suffix ar-
array can be constructed sufficiently fast. Indeed, the
rays (linear-time for integer alphabets) that were rather
currently fastest algorithm [13] also uses the induced
slow in practice [1], while other superlinear algorithms
sorting framework on which SAIS is based. In the
existed that outperformed the linear ones on all realistic
EM model, the DC3 suffix sorter can be augmented to
instances, in terms of both time and space [25,26,31]. In
also construct the LCP array within sorting complex-
particular, the extremely elegant difference cover algo-
ity. However, we are not aware of any previous imple-
mentation of this approach. Another purely theoreti-
*Supported by DFG SPP 1307. cal solution is to use the EM suffix tree algorithm [10]
†Supported by the German Research Foundation (DFG).
for constructing LCP arrays and derive the LCP ar-
‡Partially supported by EU Project No. 248481 (PEPPHER)
ray by an EM Euler tour over the tree. This approach
ICT-2009.3.6
88 Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.
