Contents

Suffix array: find the needle in the hay

By definition, a suffix array (denoted as sa) contains the starting indices of all suffixes. It’s simply an int array where sa[i] represents the starting index of the corresponding suffix.

Taking fizzbuzz as an example, its suffix array is 4 0 1 5 7 3 6 2. The details are shown in the following table.

Suffix array sa Corresponding suffix
4 buzz
0 fizzbuzz
1 izzbuzz
5 uzz
7 z
3 zbuzz
6 zz
2 zzbuzz

Let $N$ represent the length of the string. The naive algorithm is straightforward: generate all possible suffixes and then sort them using an efficient sorting algorithm. This requires one sort iteration, which may involve $O(N log\ N)$ comparisons. In the worst case, each string comparison takes $O(N)$ time. Therefore, the overall time complexity would be

$$ 1\times O(Nlog\ N)\times O(N)=\boxed{O(N^2log\ N)} $$

It’s trivial to implement this, and I would leave it as an exercise for you.

The doubling algorithm can be described as follows

  • In the 1st iteration, we compare all suffixes using the first $2^1$ characters.
  • In the 2nd iteration, we compare all suffixes using the first $2^2$ characters.
  • In the k-th iteration, we compare all suffixes using the first $2^k$ characters.

From these steps, it can be seen that in each iteration, the comparison length doubles from $2^{k-1}$ to $2^k$. This explains why it’s called the doubling algorithm.

The idea behind the doubling algorithm is: Leveraging sorted short strings to compare longer ones. For example, we can compare two strings of length $2^k$ by first comparing their first $2^{k-1}$ characters and then comparing their last $2^{k-1}$ characters.

To cache previous comparison result, the doubling algorithm introduces a ranking array (denoted as ra), which is also an int array. Before the k-th iteration begins, ra[sa[i]] represents the rank of the corresponding suffix when considering only its first $2^{k-1}$ characters.

Now let’s take a closer look at the k-th iteration. By leveraging the ra array, we can achieve

  • first compare their first $2^{k-1}$ characters => just compare $ra\big[sa[i]\big]$.
  • then compare their last $2^{k-1}$ characters => just compare $ra\big[sa[i]+2^{k-1}\big]$.

In summary, we can just use the ranking pair when we compare strings, considering the first $2^k$ characters.

$$ \Big( ra\big[sa[i]],ra\big[sa[i]+2^{k-1}\big] \Big) $$

Finally, let’s analyze the time complexity of the doubling algorithm: Due to the doubling effect, only $O(log\ N)$ sorting steps are required. Each sorting step has an overhead of $O(Nlog\ N)\times O(1)=O(Nlog\ N)$, where the $O(1)$ factor comes from comparing ranking pairs. Thus, the overall time complexity is

$$ O(log\ N)\times O(Nlog\ N)\times O(1)=\boxed{O(Nlog^2\ N)} $$

We can further optimize the doubling algorithm by using the radix sort. This reduces the time complexity of each sorting step to $O(N)$.

For example, we can perform the following steps if we want to compare these ranking pairs - (1,3), (2,2), (3,2)

  • Sort ranking pairs using second item: (2,2), (3,2), (1,3)
  • Sort ranking pairs using first item: (1,3), (2,2), (3,2)

The overall time complexity is

$$ O(log\ N)\times O(N)\times O(1)=\boxed{O(Nlog\ N)} $$

One thing to note is that the index of the ra array may overflow. In such a situation, we should assign a dummy rank that preserves dictionary order. In my implementation, since valid rank values are greater than $1$, I use $0$ as the dummy rank.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using std::cin;
using std::cout;
using std::fill;
using std::string;
using std::vector;

int main() {
  string s;
  cin >> s;

  // Init
  vector<int> sa(s.size());
  vector<int> ra(s.size());
  for (int i = 0; i < s.size(); i++) {
    sa[i] = i;
    ra[i] = s[i] - 'a' + 1; // valid rank starts from 1
  }

  vector<int> new_sa(s.size());
  for (int w = 1; w < s.size(); w <<= 1) {
    // Sort suffix array
    sort(sa.begin(), sa.end(), [&](int x, int y) {
      if (ra[x] != ra[y]) {
        return ra[x] < ra[y];
      }
      int rx{x + w < s.size() ? ra[x + w] : 0};
      int ry{y + w < s.size() ? ra[y + w] : 0};
      return rx < ry;
    });

    // Reranking
    vector<int> new_ra(s.size(), 1);
    int current_rank{1};
    new_ra[sa[0]] = 1;
    for (int i = 1; i < s.size(); i++) {
      int cur{sa[i]};
      int prev{sa[i - 1]};
      new_ra[sa[i]] =
          (ra[cur] == ra[prev] && (cur + w < s.size() ? ra[cur + w] : 0) ==
                                      (prev + w < s.size() ? ra[prev + w] : 0)
               ? current_rank
               : ++current_rank);
    }
    ra = new_ra;
  }
  for (int i = 0; i < s.size(); i++) {
    cout << sa[i] << (i == s.size() - 1 ? "" : " ");
  }
}
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using std::cin;
using std::cout;
using std::fill;
using std::max;
using std::string;
using std::vector;

// Sort `sa` using `ra[i + delta]`
void radix_sort(vector<int> &sa, vector<int> &new_sa, const vector<int> &ra,
                int n, int delta) {
  // The valid rank for lowercase letters is 1 ~ 26
  // or the length of the input string + 1
  int max_rank{max(27, n + 1)};
  vector<int> count(max_rank, 0);

  // counting
  for (int i = 0; i < n; i++) {
    int key{sa[i] + delta < n ? ra[sa[i] + delta] : 0};
    count[key]++;
  }

  for (int i = 0, prefix_sum = 0; i < max_rank; i++) {
    int temp{count[i]};
    count[i] = prefix_sum;
    prefix_sum += temp;
  }

  // invariant: all ranking pair in the same count[i] share the same rank
  for (int i = 0; i < n; i++) {
    int key{sa[i] + delta < n ? ra[sa[i] + delta] : 0};
    new_sa[count[key]++] = sa[i];
  }
  sa = new_sa;
}

int main() {
  string s;
  cin >> s;

  // Init
  vector<int> sa(s.size());
  vector<int> ra(s.size());
  for (int i = 0; i < s.size(); i++) {
    sa[i] = i;
    ra[i] = s[i] - 'a' + 1; // valid rank starts from 1
  }

  vector<int> new_sa(s.size());
  for (int w = 1; w < s.size(); w <<= 1) {
    // Sort suffix array
    radix_sort(sa, new_sa, ra, s.size(), w);
    radix_sort(sa, new_sa, ra, s.size(), 0);

    // Reranking
    vector<int> new_ra(s.size(), 1);
    int current_rank{1};
    new_ra[sa[0]] = 1;
    for (int i = 1; i < s.size(); i++) {
      int cur{sa[i]};
      int prev{sa[i - 1]};
      new_ra[sa[i]] =
          (ra[cur] == ra[prev] && (cur + w < s.size() ? ra[cur + w] : 0) ==
                                      (prev + w < s.size() ? ra[prev + w] : 0)
               ? current_rank
               : ++current_rank);
    }
    ra = new_ra;
  }
  for (int i = 0; i < s.size(); i++) {
    cout << sa[i] << (i == s.size() - 1 ? "" : " ");
  }
}

Take fizzbuzz as an example. Before the 1st iteration, we have

Suffix array sa[i] ranking array ra[sa[i]] ranking array ra[sa[i]+1] suffix
0 6 9 fizzbuzz
1 9 26 izzbuzz
2 26 26 zzbuzz
3 26 2 zbuzz
4 2 21 buzz
5 21 26 uzz
6 26 26 zz
7 26 0 z

In 1st iteration, we need to compare strings considering the first $2^1=2$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+1]) to sort suffixes.

Suffix array sa[i] ranking array ra[sa[i]] ranking array ra[sa[i]+1] suffix
4 2 21 buzz
0 6 9 fizzbuzz
1 9 26 izzbuzz
5 21 26 uzz
7 26 0 z
3 26 2 zbuzz
2 26 26 zzbuzz
6 26 26 zz

Now, let’s reassign the ranking (starting from 1) and prepare for the 2nd iteration.

Suffix array sa[i] ranking array ra[sa[i]] ranking array ra[sa[i]+2] suffix
4 1 7 buzz
0 2 7 fizzbuzz
1 3 6 izzbuzz
5 4 5 uzz
7 5 0 z
3 6 4 zbuzz
2 7 1 zzbuzz
6 7 0 zz

In the 2nd iteration, we need to compare strings considering the first $2^2=4$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+2]) to sort suffixes.

Suffix array sa[i] ranking array ra[sa[i]] ranking array ra[sa[i]+2] suffix
4 1 7 buzz
0 2 7 fizzbuzz
1 3 6 izzbuzz
5 4 5 uzz
7 5 0 z
3 6 4 zbuzz
6 7 0 zz
2 7 1 zzbuzz

Now, let’s reassign the ranking (starting from 1) and prepare for the 3st iteration.

Suffix array sa[i] ranking array ra[sa[i]] ranking array ra[sa[i]+4] suffix
4 1 0 buzz
0 2 1 fizzbuzz
1 3 4 izzbuzz
5 4 0 uzz
7 5 0 z
3 6 5 zbuzz
6 7 0 zz
2 8 7 zzbuzz

In 3rd iteration, we need to compare strings considering the first $2^3=8$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+4]) to sort suffixes. It turns out that it’s the final answer.

The suffix array can speed up finding the needle (denoted as P) in the hay (denoted as S) by leveraging a key property of substrings - If a substring exists in S, it must be the prefix of some suffixes of S.

Algorithm:

  • Use binary search to find the lower bound (denoted as $L$) that has P as a prefix.
  • Use binary search to find the upper bound (denoted as $R$) that has P as a prefix.

All suffixes within [L, R] have string P as a prefix. And you can get the indices using the suffix array.

The overall time complexity

$$ O(Llog\ N) $$

Where

  • $L$ is the length of string P.
  • $N$ is the length of string S.