Suffix array: find the needle in the hay

MartinLwx included in category Algorithm

2026-01-11 2026-01-11 1562 words 7 minutes

Contents

Suffix array

By definition, a suffix array (denoted as sa) contains the starting indices of all suffixes. It’s simply an int array where sa[i] represents the starting index of the corresponding suffix.

Taking fizzbuzz as an example, its suffix array is 4 0 1 5 7 3 6 2. The details are shown in the following table.

Suffix array `sa`	Corresponding suffix
4	`buzz`
0	`fizzbuzz`
1	`izzbuzz`
5	`uzz`
7	`z`
3	`zbuzz`
6	`zz`
2	`zzbuzz`

How to generate a suffix array

Naive way

Let $N$ represent the length of the string. The naive algorithm is straightforward: generate all possible suffixes and then sort them using an efficient sorting algorithm. This requires one sort iteration, which may involve $O(N log\ N)$ comparisons. In the worst case, each string comparison takes $O(N)$ time. Therefore, the overall time complexity would be

$$ 1\times O(Nlog\ N)\times O(N)=\boxed{O(N^2log\ N)} $$

It’s trivial to implement this, and I would leave it as an exercise for you.

Doubling algorithm

The doubling algorithm can be described as follows

In the 1st iteration, we compare all suffixes using the first $2^1$ characters.
In the 2nd iteration, we compare all suffixes using the first $2^2$ characters.
…
In the k-th iteration, we compare all suffixes using the first $2^k$ characters.

From these steps, it can be seen that in each iteration, the comparison length doubles from $2^{k-1}$ to $2^k$. This explains why it’s called the doubling algorithm.

The idea behind the doubling algorithm is: Leveraging sorted short strings to compare longer ones. For example, we can compare two strings of length $2^k$ by first comparing their first $2^{k-1}$ characters and then comparing their last $2^{k-1}$ characters.

To cache previous comparison result, the doubling algorithm introduces a ranking array (denoted as ra), which is also an int array. Before the k-th iteration begins, ra[sa[i]] represents the rank of the corresponding suffix when considering only its first $2^{k-1}$ characters.

Now let’s take a closer look at the k-th iteration. By leveraging the ra array, we can achieve

first compare their first $2^{k-1}$ characters => just compare $ra\big[sa[i]\big]$.
then compare their last $2^{k-1}$ characters => just compare $ra\big[sa[i]+2^{k-1}\big]$.

In summary, we can just use the ranking pair when we compare strings, considering the first $2^k$ characters.

$$ \Big( ra\big[sa[i]],ra\big[sa[i]+2^{k-1}\big] \Big) $$

Finally, let’s analyze the time complexity of the doubling algorithm: Due to the doubling effect, only $O(log\ N)$ sorting steps are required. Each sorting step has an overhead of $O(Nlog\ N)\times O(1)=O(Nlog\ N)$, where the $O(1)$ factor comes from comparing ranking pairs. Thus, the overall time complexity is

$$ O(log\ N)\times O(Nlog\ N)\times O(1)=\boxed{O(Nlog^2\ N)} $$

Better doubling algorithm

We can further optimize the doubling algorithm by using the radix sort. This reduces the time complexity of each sorting step to $O(N)$.

For example, we can perform the following steps if we want to compare these ranking pairs - (1,3), (2,2), (3,2)

Sort ranking pairs using second item: (2,2), (3,2), (1,3)
Sort ranking pairs using first item: (1,3), (2,2), (3,2)

The overall time complexity is

$$ O(log\ N)\times O(N)\times O(1)=\boxed{O(Nlog\ N)} $$

The C++ implementation

Doubling algorithm

One thing to note is that the index of the ra array may overflow. In such a situation, we should assign a dummy rank that preserves dictionary order. In my implementation, since valid rank values are greater than $1$, I use $0$ as the dummy rank.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using std::cin;
using std::cout;
using std::fill;
using std::string;
using std::vector;

int main() {
  string s;
  cin >> s;

  // Init
  vector<int> sa(s.size());
  vector<int> ra(s.size());
  for (int i = 0; i < s.size(); i++) {
    sa[i] = i;
    ra[i] = s[i] - 'a' + 1; // valid rank starts from 1
  }

  vector<int> new_sa(s.size());
  for (int w = 1; w < s.size(); w <<= 1) {
    // Sort suffix array
    sort(sa.begin(), sa.end(), [&](int x, int y) {
      if (ra[x] != ra[y]) {
        return ra[x] < ra[y];
      }
      int rx{x + w < s.size() ? ra[x + w] : 0};
      int ry{y + w < s.size() ? ra[y + w] : 0};
      return rx < ry;
    });

    // Reranking
    vector<int> new_ra(s.size(), 1);
    int current_rank{1};
    new_ra[sa[0]] = 1;
    for (int i = 1; i < s.size(); i++) {
      int cur{sa[i]};
      int prev{sa[i - 1]};
      new_ra[sa[i]] =
          (ra[cur] == ra[prev] && (cur + w < s.size() ? ra[cur + w] : 0) ==
                                      (prev + w < s.size() ? ra[prev + w] : 0)
               ? current_rank
               : ++current_rank);
    }
    ra = new_ra;
  }
  for (int i = 0; i < s.size(); i++) {
    cout << sa[i] << (i == s.size() - 1 ? "" : " ");
  }
}

Better doubling algorithm

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using std::cin;
using std::cout;
using std::fill;
using std::max;
using std::string;
using std::vector;

// Sort `sa` using `ra[i + delta]`
void radix_sort(vector<int> &sa, vector<int> &new_sa, const vector<int> &ra,
                int n, int delta) {
  // The valid rank for lowercase letters is 1 ~ 26
  // or the length of the input string + 1
  int max_rank{max(27, n + 1)};
  vector<int> count(max_rank, 0);

  // counting
  for (int i = 0; i < n; i++) {
    int key{sa[i] + delta < n ? ra[sa[i] + delta] : 0};
    count[key]++;
  }

  for (int i = 0, prefix_sum = 0; i < max_rank; i++) {
    int temp{count[i]};
    count[i] = prefix_sum;
    prefix_sum += temp;
  }

  // invariant: all ranking pair in the same count[i] share the same rank
  for (int i = 0; i < n; i++) {
    int key{sa[i] + delta < n ? ra[sa[i] + delta] : 0};
    new_sa[count[key]++] = sa[i];
  }
  sa = new_sa;
}

int main() {
  string s;
  cin >> s;

  // Init
  vector<int> sa(s.size());
  vector<int> ra(s.size());
  for (int i = 0; i < s.size(); i++) {
    sa[i] = i;
    ra[i] = s[i] - 'a' + 1; // valid rank starts from 1
  }

  vector<int> new_sa(s.size());
  for (int w = 1; w < s.size(); w <<= 1) {
    // Sort suffix array
    radix_sort(sa, new_sa, ra, s.size(), w);
    radix_sort(sa, new_sa, ra, s.size(), 0);

    // Reranking
    vector<int> new_ra(s.size(), 1);
    int current_rank{1};
    new_ra[sa[0]] = 1;
    for (int i = 1; i < s.size(); i++) {
      int cur{sa[i]};
      int prev{sa[i - 1]};
      new_ra[sa[i]] =
          (ra[cur] == ra[prev] && (cur + w < s.size() ? ra[cur + w] : 0) ==
                                      (prev + w < s.size() ? ra[prev + w] : 0)
               ? current_rank
               : ++current_rank);
    }
    ra = new_ra;
  }
  for (int i = 0; i < s.size(); i++) {
    cout << sa[i] << (i == s.size() - 1 ? "" : " ");
  }
}

The walking through example

Take fizzbuzz as an example. Before the 1st iteration, we have

Suffix array `sa[i]`	ranking array `ra[sa[i]]`	ranking array `ra[sa[i]+1]`	suffix
0	6	9	`fizzbuzz`
1	9	26	`izzbuzz`
2	26	26	`zzbuzz`
3	26	2	`zbuzz`
4	2	21	`buzz`
5	21	26	`uzz`
6	26	26	`zz`
7	26	0	`z`

In 1st iteration, we need to compare strings considering the first $2^1=2$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+1]) to sort suffixes.

Suffix array `sa[i]`	ranking array `ra[sa[i]]`	ranking array `ra[sa[i]+1]`	suffix
4	2	21	`buzz`
0	6	9	`fizzbuzz`
1	9	26	`izzbuzz`
5	21	26	`uzz`
7	26	0	`z`
3	26	2	`zbuzz`
2	26	26	`zzbuzz`
6	26	26	`zz`

Now, let’s reassign the ranking (starting from 1) and prepare for the 2nd iteration.

Suffix array `sa[i]`	ranking array `ra[sa[i]]`	ranking array `ra[sa[i]+2]`	suffix
4	1	7	`buzz`
0	2	7	`fizzbuzz`
1	3	6	`izzbuzz`
5	4	5	`uzz`
7	5	0	`z`
3	6	4	`zbuzz`
2	7	1	`zzbuzz`
6	7	0	`zz`

In the 2nd iteration, we need to compare strings considering the first $2^2=4$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+2]) to sort suffixes.

Suffix array `sa[i]`	ranking array `ra[sa[i]]`	ranking array `ra[sa[i]+2]`	suffix
4	1	7	`buzz`
0	2	7	`fizzbuzz`
1	3	6	`izzbuzz`
5	4	5	`uzz`
7	5	0	`z`
3	6	4	`zbuzz`
6	7	0	`zz`
2	7	1	`zzbuzz`

Now, let’s reassign the ranking (starting from 1) and prepare for the 3st iteration.

Suffix array `sa[i]`	ranking array `ra[sa[i]]`	ranking array `ra[sa[i]+4]`	suffix
4	1	0	`buzz`
0	2	1	`fizzbuzz`
1	3	4	`izzbuzz`
5	4	0	`uzz`
7	5	0	`z`
3	6	5	`zbuzz`
6	7	0	`zz`
2	8	7	`zzbuzz`

In 3rd iteration, we need to compare strings considering the first $2^3=8$ characters, that is, we should use the ranking pair (ra[sa[i]], ra[sa[i]+4]) to sort suffixes. It turns out that it’s the final answer.

Application: Find the needle in the hay

The suffix array can speed up finding the needle (denoted as P) in the hay (denoted as S) by leveraging a key property of substrings - If a substring exists in S, it must be the prefix of some suffixes of S.

Algorithm:

Use binary search to find the lower bound (denoted as $L$) that has P as a prefix.
Use binary search to find the upper bound (denoted as $R$) that has P as a prefix.

All suffixes within [L, R] have string P as a prefix. And you can get the indices using the suffix array.

The overall time complexity：

$$ O(Llog\ N) $$

Where

$L$ is the length of string P.
$N$ is the length of string S.