Introduction to suffix array and Golang implementation

Posted by luv2climb on Mon, 03 Jan 2022 14:42:12 +0100

Suffix array is used to solve a series of problems related to string pattern matching. Because I compare dishes and am not ACMer, this article is mainly used to literacy, which is very simple

First introduce some concepts:

sa: suffix array. Rank all n suffixes of the original string (length n) in dictionary order. sa[i]=k means that the starting subscript of the suffix ranked I in the original string is K. abbreviation: sa is the subscript of the ranking

Rk: rank array. rk[i]=k indicates that the suffix with the initial subscript i of the original string ranks K in the field order of all suffixes. Abbreviation: rk is the ranking of subscripts. Obviously, sa[rk[i]]=i. The ranking of this article starts from 1 and the subscript starts from 0, so sa[rk[i]-1]=i in this article

Height: height array. height[i]=LCP(sa[i], sa[i-1]), that is, the longest common prefix of two adjacent suffixes

An example is given to intuitively show these three concepts.

Original string: banana.

All suffixes are ranked in dictionary order: a, ana, anana, banana, na, nana

The rank of the 0th suffix (banana) is 4, rk[0]=4, sa[4-1]=0

The ranking of the first suffix (anana) is 3, rk[1]=3, sa[3-1]=1

The rank of the second suffix (nana) is 6, rk[2]=6, sa[6-1]=2

The third suffix (ana) is ranked as 2, rk[3]=2, sa[2-1]=3

The fourth suffix (na) is ranked as 5, rk[4]=5, sa[5-1]=4

The ranking of the fifth suffix (a) is 1, rk[5]=1, sa[1-1]=5

So sa=[5, 3, 1, 0, 4, 2], rk=[4, 3, 6, 2, 5, 1]

Then calculate the height array according to the definition.

The LCP length of '' and 'a' is 0, height[0]=0

The LCP length of "a" and "ana" is 1, height[1]=1

The LCP length of "ana" and "anana" is 3, height[2]=3

The LCP length of "anana" and "banana" is 0, height[3]=0

The LCP length of "banana" and "na" is 0, height[4]=0

The LCP length of "na" and "nana" is 2, height[5]=2

Therefore, height=[0, 1, 3, 0, 0, 2]

Next, we introduce how to find sa

The plain approach is obviously to list all suffixes and sort them (if it's fast)

Write the recursive equation: T(n)=2T(n/2)+O(n) ²)， Therefore, the complexity is O(n) ²)

Note: n ² How did you get here? In the original fast platoon, n numbers and pivot are compared, which is O(n), but here is not a number comparison, but a string comparison. The string is n long, compared n times, and the complexity is O(n) ²). According to the master theorem, the total complexity is O(n) ²). I see someone say, "fast row is O (n logn), string comparison is O(n), and the total complexity is O(n) ² Logn is obviously wrong.

One of the simplest and most common optimization methods: binary lifting to find sa, and the complexity is O(nlogn)

The idea is divide and conquer.

First, find the ranking of all substrings with length of 1

Assuming that the ranking of all substrings with length k has been calculated, the two adjacent substrings can form a new substring with length k*2, and the ranking of all substrings with length k*2 can be obtained from the ranking of the two substrings that make up it. That is, first compare the ranking of the first substring. If they are equal, then compare the ranking of the second substring.

Substrings whose length is not a power of 2 can be zeroed.

Let's take the example of banana. First calculate the ranking of all substrings with length of 1

b(2) a(1) n(3) a(1) n(3) a(1)

Then merge adjacent substrings into substrings with length of 2 (indicates zero filling):

ba(21) an(13) na(31) an(13) na(31) a_(10)

Then calculate the ranking of substrings with length of 2. Because 10 < 13 < 21 < 31, we get:

ba(3) an(2) na(4) an(2) na(4) a_(1)

Then merge the adjacent substrings (note that they are adjacent in the original string) into substrings with a length of 4:

bana(34) anan(22) nana(44) ana_(21) na__(40) a___(10)

Then calculate the ranking of substrings with length of 4. Because 10 < 21 < 22 < 34 < 40 < 44, we get:

bana(4) anan(3) nana(6) ana_(2) na__(5) a___(1)

Then merge the adjacent substrings into substrings with a length of 8:

banana__(45) anana___(31) nana____(60) ana_____(20) na______(50) a_______(10)

Then calculate the ranking of substrings with a length of 8. Because 10 < 20 < 31 < 45 < 50 < 60, we get:

banana__(4) anana___(3) nana____(6) ana_____(2) na______(5) a_______(1)

Ignoring zero padding, we find that this is the dictionary order ranking of all suffixes, that is, rk.

So rk=[4, 3, 6, 2, 5, 1]. sa can be obtained from sa[rk[i]-1].

Next, how to calculate height

The simple method is to calculate directly according to the definition. It is necessary to make n string comparisons, and the complexity is O(n) ²)， Not considered.

Lemma: height[rk[i]] ≥ height[rk[i-1]]-1

Loosely prove that:

Let k=rk[i-1]-1

height[rk[i]] = LCP(sa[rk[i]], sa[rk[i]-1]) = LCP(i, sa[rk[i]-1])

height[rk[i-1]] = LCP(sa[rk[i-1]], sa[rk[i-1]-1]) = LCP(i-1, k)

Let height[rk[i-1]) = LCP(suffix(i-1), suffix(k))

Then there is suffix(k+1), so that LCP(suffix(i), suffix(k+1)) = LCP(suffix(i-1), suffix(k)) - 1, that is, the lower bound of height[rk[i]] is height[rk[i-1]]-1

With this lemma, we can get the height array in O(n) time. We only need to calculate the height in the process of traversing rk, and the height of each iteration can be reduced by 1 at most

Complete code (Golang):

package main

import (
	"fmt"
	"sort"
)

type suffix struct {
	c   rune
	idx int // a tuple (c, idx) indicates a suffix in source string
}

type SA struct {
	src    string
	sa     []int
	rk     []int
	height []int
}

func NewSA(src string) *SA {
	n := len(src)
	sa := make([]suffix, n)
	for i, c := range src {
		sa[i] = suffix{c, i}
	}
	// init sa: sort by ascii
	sort.Slice(sa, func(i, j int) bool {
		return sa[i].c < sa[j].c
	})

	rk := make([]int, n)
	// init rk: incr rk each time sa[i].c changes
	rk[sa[0].idx] = 1
	for i := 1; i < n; i++ {
		rk[sa[i].idx] = rk[sa[i-1].idx]
		if sa[i].c != sa[i-1].c {
			rk[sa[i].idx]++
		}
	}

	// binary lifting
	// k is length of substr needed comparing
	// when the last of rk is n, terminate
	for k := 2; rk[sa[n-1].idx] < n; k <<= 1 {
		// sort sa by rank tuple (first, second)
		sort.Slice(sa, func(i, j int) bool {
			ii, jj := sa[i].idx, sa[j].idx
			// compare first part
			if rk[ii] != rk[jj] {
				return rk[ii] < rk[jj]
			}
			// compare second part
			if ii+k/2 >= n {
				return true // special case: second part not exists
			}
			if jj+k/2 >= n {
				return false
			}
			return rk[ii+k/2] < rk[jj+k/2]
		})
		// update rk
		rk[sa[0].idx] = 1
		for i := 1; i < n; i++ {
			cur, pre := sa[i].idx, sa[i-1].idx
			rk[cur] = rk[pre]
			// incr rk each time the substring changes. notice that an out-of-bound case indicates a change
			if cur+k > n || pre+k > n || src[cur:cur+k] != src[pre:pre+k] {
				rk[cur]++
			}
		}
	}
	realSA := make([]int, n)
	for i, v := range sa {
		realSA[i] = v.idx
	}
	// compute height
	height := make([]int, n)
	k := 0
	for i := 0; i < n; i++ {
		if rk[i] == 1 {
			continue
		}
		j := realSA[rk[i]-2]
		if k > 0 {
			k-- // invariant condition: height[rank[i]]>=height[rank[i-1]]-1
		}
		for i+k < n && j+k < n && src[i+k] == src[j+k] {
			k++
		}
		height[rk[i]-1] = k
	}

	return &SA{src, realSA, rk, height}
}

Note that in the above code, when calculating sa, the outer loop logs n times, and each round is fast arranged, with complexity O(nlogn) and total complexity O(nlog) ² n). It can also continue to optimize with cardinality sorting, and the total complexity is O(nlogn)

Example 1:

The repeated substring must be the LCP of two suffixes of the original string, and the dictionary order of the two suffixes is adjacent (if they are not adjacent, a suffix between them may produce a longer LCP with the first suffix)

So traverse the height array to find the maximum value.

func longestDupSubstring(s string) string {
	SA := NewSA(s)
	sa, height := SA.sa, SA.height
	max := 0
	idx := -1
	for i := 1; i < len(s); i++ {
		if max < height[i] {
			max = height[i]
			idx = sa[i]
		}
	}
	if idx == -1 {
		return ""
	}
	return s[idx : idx+max]
}

Example 2: pattern string matching

For string s, find all positions where pattern string p appears in S.

Consider the suffix array sa of s

Let I be the first subscript of sa, so that s[sa[i] > = P (because the subscript of sa array is the ranking, if the subscript is greater than I, the dictionary order ranking is also greater than s[sa[i]])

Let J be the first subscript of sa after i, so that p is not the prefix of s[sa[j]:] (therefore, any subscript greater than j in the sa array cannot be the prefix of p, and all subscripts less than j (and greater than i) are prefixed with p)

Then the subscript greater than j in the sa array cannot be the prefix of p

By definition, j-1 satisfies that s[sa[j-1]:] is prefixed with p, and the dictionary order from s[sa[i]:] to s[sa[j-2]:] is not greater than s[sa[j-1]:], so they are prefixed with p.

Therefore, sa[i:j] is required.

Complexity: two binary searches O(logn), in which each round of the second search needs to compare strings. The complexity is O(len(p)), so the total complexity is O(len(p)*logn)

func (sa *SA) Match(p string) []int {
	i := sort.Search(len(sa.sa), func(i int) bool {
		return sa.src[sa.sa[i]:] >= p
	})
	j := i + sort.Search(len(sa.sa)-i, func(j int) bool {
		return !strings.HasPrefix(sa.src[sa.sa[i+j]:], p)
	})
	return sa.sa[i:j]
}

Topics: Algorithm

Programmer Think

Introduction to suffix array and Golang implementation

Hot Topics