Suffix array: find the needle in the hay
Suffix array
By definition, a suffix array (denoted as sa) contains the starting indices of all suffixes. It’s simply an int array where sa[i] represents the starting index of the corresponding suffix.
Taking fizzbuzz as an example, its suffix array is 4 0 1 5 7 3 6 2. The details are shown in the following table.
Suffix array sa |
Corresponding suffix |
|---|---|
| 4 | buzz |
| 0 | fizzbuzz |
| 1 | izzbuzz |
| 5 | uzz |
| 7 | z |
| 3 | zbuzz |
| 6 | zz |
| 2 | zzbuzz |
How to generate a suffix array
Naive way
Let $N$ represent the length of the string. The naive algorithm is straightforward: generate all possible suffixes and then sort them using an efficient sorting algorithm. This requires one sort iteration, which may involve $O(N log\ N)$ comparisons. In the worst case, each string comparison takes $O(N)$ time. Therefore, the overall time complexity would be
$$ 1\times O(Nlog\ N)\times O(N)=\boxed{O(N^2log\ N)} $$
It’s trivial to implement this, and I would leave it as an exercise for you.
Doubling algorithm
The doubling algorithm can be described as follows
- In the 1st iteration, we compare all suffixes using the first $2^1$ characters.
- In the 2nd iteration, we compare all suffixes using the first $2^2$ characters.
- …
- In the k-th iteration, we compare all suffixes using the first $2^k$ characters.
From these steps, it can be seen that in each iteration, the comparison length doubles from $2^{k-1}$ to $2^k$. This explains why it’s called the doubling algorithm.
The idea behind the doubling algorithm is: Leveraging sorted short strings to compare longer ones. For example, we can compare two strings of length $2^k$ by first comparing their first $2^{k-1}$ characters and then comparing their last $2^{k-1}$ characters.
To cache previous comparison result, the doubling algorithm introduces a ranking array (denoted as ra), which is also an int array. Before the k-th iteration begins, ra[sa[i]] represents the rank of the corresponding suffix when considering only its first $2^{k-1}$ characters.
Now let’s take a closer look at the k-th iteration. By leveraging the ra array, we can achieve
- first compare their first $2^{k-1}$ characters => just compare $ra\big[sa[i]\big]$.
- then compare their last $2^{k-1}$ characters => just compare $ra\big[sa[i]+2^{k-1}\big]$.
In summary, we can just use the ranking pair when we compare strings, considering the first $2^k$ characters.
$$ \Big( ra\big[sa[i]],ra\big[sa[i]+2^{k-1}\big] \Big) $$
Finally, let’s analyze the time complexity of the doubling algorithm: Due to the doubling effect, only $O(log\ N)$ sorting steps are required. Each sorting step has an overhead of $O(Nlog\ N)\times O(1)=O(Nlog\ N)$, where the $O(1)$ factor comes from comparing ranking pairs. Thus, the overall time complexity is
$$ O(log\ N)\times O(Nlog\ N)\times O(1)=\boxed{O(Nlog^2\ N)} $$
Better doubling algorithm
We can further optimize the doubling algorithm by using the radix sort. This reduces the time complexity of each sorting step to $O(N)$.
For example, we can perform the following steps if we want to compare these ranking pairs - (1,3), (2,2), (3,2)
- Sort ranking pairs using second item:
(2,2), (3,2), (1,3) - Sort ranking pairs using first item:
(1,3), (2,2), (3,2)
The overall time complexity is
$$ O(log\ N)\times O(N)\times O(1)=\boxed{O(Nlog\ N)} $$
The C++ implementation
Doubling algorithm
One thing to note is that the index of the ra array may overflow. In such a situation, we should assign a dummy rank that preserves dictionary order. In my implementation, since valid rank values are greater than $1$, I use $0$ as the dummy rank.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using std::cin;
using std::cout;
using std::fill;
using std::string;
using std::vector;
int main() {
string s;
cin >> s;
// Init
vector<int> sa(s.size());
vector<int> ra(s.size());
for (int i = 0; i < s.size(); i++) {
sa[i] = i;
ra[i] = s[i] - 'a' + 1; // valid rank starts from 1
}
vector<int> new_sa(s.size());
for (int w = 1; w < s.size(); w <<= 1) {
// Sort suffix array
sort(sa.begin(), sa.end(), [&](int x, int y) {
if (ra[x] != ra[y]) {
return ra[x] < ra[y];
}
int rx{x + w < s.size() ? ra[x + w] : 0};
int ry{y + w < s.size() ? ra[y + w] : 0};
return rx < ry;
});
// Reranking
vector<int> new_ra(s.size(), 1);
int current_rank{1};
new_ra[sa[0]] = 1;
for (int i = 1; i < s.size(); i++) {
int cur{sa[i]};
int prev{sa[i - 1]};
new_ra[sa[i]] =
(ra[cur] == ra[prev] && (cur + w < s.size() ? ra[cur + w] : 0) ==
(prev + w < s.size() ? ra[prev + w] : 0)
? current_rank
: ++current_rank);
}
ra = new_ra;
}
for (int i = 0; i < s.size(); i++) {
cout << sa[i] << (i == s.size() - 1 ? "" : " ");
}
}
Better doubling algorithm
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using std::cin;
using std::cout;
using std::fill;
using std::max;
using std::string;
using std::vector;
// Sort `sa` using `ra[i + delta]`
void radix_sort(vector<int> &sa, vector<int> &new_sa, const vector<int> &ra,
int n, int delta) {
// The valid rank for lowercase letters is 1 ~ 26
// or the length of the input string + 1
int max_rank{max(27, n + 1)};
vector<int> count(max_rank, 0);
// counting
for (int i = 0; i < n; i++) {
int key{sa[i] + delta < n ? ra[sa[i] + delta] : 0};
count[key]++;
}
for (int i = 0, prefix_sum = 0; i < max_rank; i++) {
int temp{count[i]};
count[i] = prefix_sum;
prefix_sum += temp;
}
// invariant: all ranking pair in the same count[i] share the same rank
for (int i = 0; i < n; i++) {
int key{sa[i] + delta < n ? ra[sa[i] + delta] : 0};
new_sa[count[key]++] = sa[i];
}
sa = new_sa;
}
int main() {
string s;
cin >> s;
// Init
vector<int> sa(s.size());
vector<int> ra(s.size());
for (int i = 0; i < s.size(); i++) {
sa[i] = i;
ra[i] = s[i] - 'a' + 1; // valid rank starts from 1
}
vector<int> new_sa(s.size());
for (int w = 1; w < s.size(); w <<= 1) {
// Sort suffix array
radix_sort(sa, new_sa, ra, s.size(), w);
radix_sort(sa, new_sa, ra, s.size(), 0);
// Reranking
vector<int> new_ra(s.size(), 1);
int current_rank{1};
new_ra[sa[0]] = 1;
for (int i = 1; i < s.size(); i++) {
int cur{sa[i]};
int prev{sa[i - 1]};
new_ra[sa[i]] =
(ra[cur] == ra[prev] && (cur + w < s.size() ? ra[cur + w] : 0) ==
(prev + w < s.size() ? ra[prev + w] : 0)
? current_rank
: ++current_rank);
}
ra = new_ra;
}
for (int i = 0; i < s.size(); i++) {
cout << sa[i] << (i == s.size() - 1 ? "" : " ");
}
}
The walking through example
Take fizzbuzz as an example. Before the 1st iteration, we have
Suffix array sa[i] |
ranking array ra[sa[i]] |
ranking array ra[sa[i]+1] |
suffix |
|---|---|---|---|
| 0 | 6 | 9 | fizzbuzz |
| 1 | 9 | 26 | izzbuzz |
| 2 | 26 | 26 | zzbuzz |
| 3 | 26 | 2 | zbuzz |
| 4 | 2 | 21 | buzz |
| 5 | 21 | 26 | uzz |
| 6 | 26 | 26 | zz |
| 7 | 26 | 0 | z |
In 1st iteration, we need to compare strings considering the first $2^1=2$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+1]) to sort suffixes.
Suffix array sa[i] |
ranking array ra[sa[i]] |
ranking array ra[sa[i]+1] |
suffix |
|---|---|---|---|
| 4 | 2 | 21 | buzz |
| 0 | 6 | 9 | fizzbuzz |
| 1 | 9 | 26 | izzbuzz |
| 5 | 21 | 26 | uzz |
| 7 | 26 | 0 | z |
| 3 | 26 | 2 | zbuzz |
| 2 | 26 | 26 | zzbuzz |
| 6 | 26 | 26 | zz |
Now, let’s reassign the ranking (starting from 1) and prepare for the 2nd iteration.
Suffix array sa[i] |
ranking array ra[sa[i]] |
ranking array ra[sa[i]+2] |
suffix |
|---|---|---|---|
| 4 | 1 | 7 | buzz |
| 0 | 2 | 7 | fizzbuzz |
| 1 | 3 | 6 | izzbuzz |
| 5 | 4 | 5 | uzz |
| 7 | 5 | 0 | z |
| 3 | 6 | 4 | zbuzz |
| 2 | 7 | 1 | zzbuzz |
| 6 | 7 | 0 | zz |
In the 2nd iteration, we need to compare strings considering the first $2^2=4$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+2]) to sort suffixes.
Suffix array sa[i] |
ranking array ra[sa[i]] |
ranking array ra[sa[i]+2] |
suffix |
|---|---|---|---|
| 4 | 1 | 7 | buzz |
| 0 | 2 | 7 | fizzbuzz |
| 1 | 3 | 6 | izzbuzz |
| 5 | 4 | 5 | uzz |
| 7 | 5 | 0 | z |
| 3 | 6 | 4 | zbuzz |
| 6 | 7 | 0 | zz |
| 2 | 7 | 1 | zzbuzz |
Now, let’s reassign the ranking (starting from 1) and prepare for the 3st iteration.
Suffix array sa[i] |
ranking array ra[sa[i]] |
ranking array ra[sa[i]+4] |
suffix |
|---|---|---|---|
| 4 | 1 | 0 | buzz |
| 0 | 2 | 1 | fizzbuzz |
| 1 | 3 | 4 | izzbuzz |
| 5 | 4 | 0 | uzz |
| 7 | 5 | 0 | z |
| 3 | 6 | 5 | zbuzz |
| 6 | 7 | 0 | zz |
| 2 | 8 | 7 | zzbuzz |
In 3rd iteration, we need to compare strings considering the first $2^3=8$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+4]) to sort suffixes. It turns out that it’s the final answer.
Application: Find the needle in the hay
The suffix array can speed up finding the needle (denoted as P) in the hay (denoted as S) by leveraging a key property of substrings - If a substring exists in S, it must be the prefix of some suffixes of S.
Algorithm:
- Use binary search to find the lower bound (denoted as $L$) that has
Pas a prefix. - Use binary search to find the upper bound (denoted as $R$) that has
Pas a prefix.
All suffixes within [L, R] have string P as a prefix. And you can get the indices using the suffix array.
The overall time complexity:
$$ O(Llog\ N) $$
Where
- $L$ is the length of string
P. - $N$ is the length of string
S.