KMP algorithm (string matching problem) acm winter vacation training diary 22 / 1 / 19

Posted by ceanth on Wed, 19 Jan 2022 20:18:48 +0100

First, let's look at an example:

If we don't consider overtime, we can use the most simple method (violence) to solve it

//Violence algorithm (n*m) 
int ViolentMatch(char *s,char *p)
{
	int sLen = strlen(s);
	int pLen = strlen(p);
	
	int i = 0;
	int j = 0;
	while(i<sLen&&j<pLen)
	{
		if(s[i]==p[j])
		{
			i++;
			j++;
		}
		else
		{
			i = i-j+1;
			j = 0;
		}
	}
	if(j==pLen)
	return i-j;
	else
	return -1;
}

But the time complexity of n*m is obviously not good (1e7~1e8 - "1s")

This introduces a question: how to optimize?

It is not difficult for us to find that the most time-consuming of simple algorithms is a backtracking process (finding mismatches, j = 0,i = i-j+1). In short, it is i constant backtracking, which is meaningless in many cases (the human brain can know at a glance that the next one still doesn't match, but the naive algorithm is still a fool's constant backtracking)

Let's look at the following example: (the human brain knows at a glance that b does not match a, hopes to go to a to match a, and then compare whether the next one matches)

ababbba
abb
 find b And a Mismatch
 next step:

Simplicity:
ababbba
 abb
 We hope that:
ababbba
  abb

Of course, it's not necessarily such a coincidence. The next step is the answer directly.

Knowing the optimization scheme, the next step is to turn the ideal into reality!!!

KMP thinking:

In the naive algorithm, the i of the main string is constantly backtracked, but it can be seen from the above example that this backtracking is unnecessary. Good horses don't eat backtracking. KMP algorithm is to avoid this unnecessary backtracking.

Since i doesn't have to go back, we have to consider the change of j. Through the above example, we can know that if there are equal characters between the hand character of P string and the following characters, the change of j will be different. That is, the change of j has nothing to do with the main string. The key depends on whether the P string structure is repeated. The length of the repetition before and after this character becomes the Prefix suffix sum. As long as the Longest Prefix suffix sum is calculated, we can know how j changes.

This also leads to a problem: how to find the prefix and suffix???

Finding the longest prefix and suffix (a big difficulty)

A	B	C	A	B	X
0	0	0	1	2	0

Each character of the pattern string	prefix	suffix	Longest Prefix suffix
A	NULL	NULL	0
AB	A	B	0
ABC	A,AB	C,BC	0
ABCA	A,AB,ABC	A,CA,BCA	1
ABCAB	A,AB,ABC,ABCA	B,AB,CAB,BCAB	2
ABCABX	A,AB,ABC,ABCA,ABCAB	X,BX,ABX,CABX,BCABX	0

In case of mismatch, the number of bits the pattern string moves to the right is: number of matched characters - the longest prefix and suffix corresponding to the last character of the mismatched character.

We find that when a character mismatch is matched, it is not necessary to consider the current mismatched character. Moreover, every time we mismatch, we look at the maximum length corresponding to the last character of the mismatched character. Thus, the next array is introduced.

Given the string "ABCDABD", its next array can be obtained as follows:

Pattern string	A	B	C	D	A	B	D
Maximum length value	0	0	0	0	1	2	0
next array	-1 (agreed)	0	0	0	0	1	2

s[i] != p[j]
i unchanged
j = 0~j-1 Longest Prefix suffix
 That is: j = next[j]

Rule summary:

In case of mismatch:

*If next [J]= 0, then j = next[j];

*Otherwise, the next match j starts from 0, which means that the p string matches from scratch

*What does this mean? In essence, next[j] = k represents the pattern string before p[j]. In the substring, there are the same prefix and suffix with length K. With this next array, in KMP matching, when the character at j in the matching string is mismatched, the next step is to continue to match the text string with the character at next[j], which is equivalent to moving the j - next[j] bit of the pattern string to the right.

This is the backbone of KMP. The remaining difficulty is how to find the next array

KMP algorithm flow

Suppose that the text string s matches the i position and the pattern string p matches the j position;

*If s[i] == p[j], make I + +, j + +, and continue to match the next character;

*Otherwise s [i]= P [J], let I remain unchanged, j = next[j];

*Special circumstances:

If the character j returns to subscript 0 is still mismatched, it means that the s string starting with the current i cannot be paired with the p string, so i+1 should be used to try pairing from the next. In order to unify and simplify the code, we usually set next[0] = -1 and change the judgment conditions at the same time. The complete KMP process is as follows:

technological process:
initialization: i = 0,j = 0

Cycle conditions: i < slen && j < plen

Inside the cycle:
If j == -1 || s[i] == p[j] ,be i++ , j++
otherwise j = next[j]

Cycle external:
j == m Description substring found
j != m Description matching failed

Time complexity of KMP: o(n+m)

Code to get the next array:

void getnext()
{
	int n = strlen(p);
	Next[0] = -1;
	int k = -1,j = 0;
	while(j<n)
	{
		if(k==-1||p[j]==p[k])
		Next[++j] = ++k;
		//next[j+1] = k+1;
		else
		k = Next[k];
		//k is less than j, so it is always fallback 
	}
}

KMP Code:

int kmp()
{
	int n = strlen(s),m = strlen(p);
	int i = 0,j = 0;
	while(i<n&&j<m)
	{
		if(j==-1||s[i]==p[j])
		{
			i++;
			j++;
		}
		else
		j = Next[j];
	}
	if(j==m) return i-j;//The following table starts at 0 
	return -1;
}

Full code:

#include<iostream>
#include<cstring>
#include<cstdio>
using namespace std;
const int N = 1e6+9;
char s[N],p[N]; 
int Next[N],ans[N];
void getnext()
{
	int n = strlen(p);
	Next[0] = -1;
	int k = -1,j = 0;
	while(j<n)
	{
		if(k==-1||p[j]==p[k])
		Next[++j] = ++k;
		//next[j+1] = k+1;
		else
		k = Next[k];
		//k is less than j, so it is always fallback 
	}
}
int kmp()
{
	int n = strlen(s),m = strlen(p);
	int i = 0,j = 0;
	while(i<n&&j<m)
	{
		if(j==-1||s[i]==p[j])
		{
			i++;
			j++;
		}
		else
		j = Next[j];
	}
	if(j==m) return i-j;//The following table starts at 0 
	return -1;
}
int main()
{
	int round = 0;
	int n;
	while(~scanf("%d",&n)&&n)
	{
		scanf("%s",p);
		round++;
		getnext();
		printf("Test case #%d\n",round);
		for(int i = 1;i<=n;i++)
		{
			if(Next[i]&&i%(i-Next[i])==0)
			printf("%d %d\n",i,i/(i-Next[i]));
		}
		printf("\n");
	 } 
}

KMP application - minimum cycle section

kmp application:

Theorem: suppose the length of S is len, which s has a minimum cyclic section, the length of cyclic section L is len - next[len], and the substring is s[0...len-next[len]-1].

If len can be divided by len - next[len], it indicates that the string s can be completely formed by the cycle section, and the cycle period T = len/L.
If not, you need to add a few more letters to complete the description. The number to be supplemented is the number of cycles ＾ L - len%L ＾ L = len - next[len].

understand:

For a string, such as abcd abcd abcd, it is obtained by repeating the string abcd of length 4 three times, then the first eight bits of the original string must be equal to the last eight bits.

That is, for a string s, the length is len, which is obtained by repeating the string s with length L R times. When r > = 2, there must be s[0...len-L-1] = s[L...len-1], and the string subscript starts from 0

Then for KMP algorithm, there is next[len] = len-L. At this time, L must be the smallest (because the value of next is the maximum length equal to the prefix and suffix, that is, len-L is the largest, then L is the smallest when len has been determined).

Example: Power Strings

Given two strings a and b we define a*b to be their concatenation. For example, if a = "abc" and b = "def" then a*b = "abcdef". If we think of concatenation as multiplication, exponentiation by a non-negative integer is defined in the normal way: a^0 = "" (the empty string) and a^(n+1) = a*(a^n).

Input

Each test case is a line of input representing s, a string of printable characters. The length of s will be at least 1 and will not exceed 1 million characters. A line containing a period follows the last test case.

Output

For each s you should print the largest n such that s = a^n for some string a.

Sample Input

abcd
aaaa
ababab
.

Sample Output

1
4
3

Hint

This problem has huge input, use scanf instead of cin to avoid time limit exceed.

The AC code is as follows:

#include<iostream>
#include<cstring>
#include<cstdio>
using namespace std;
const int N = 1e6+9;
char s[N],p[N]; 
int Next[N];
void getnext()
{
	int n = strlen(p);
	Next[0] = -1;
	int k = -1,j = 0;
	while(j<n)
	{
		if(k==-1||p[j]==p[k])
		Next[++j] = ++k;
		//next[j+1] = k+1;
		else
		k = Next[k];
		//k is less than j, so it is always fallback 
	}
}

int main()
{
	while(scanf("%s",p),p[0]!='.')
	{
		getnext();
		//Minimum cycle section 
		int len = strlen(p);
		int L = len - Next[len];
		if(len%L==0)
		printf("%d\n",len/L);
		else
		printf("1\n");
	}
}

Finally, thank you for reading!!!

ps: the content of the article comes from KMP algorithm detailed, read will_ Beep beep beep_ bilibili video

If there is infringement, please contact me and I will actively cooperate to delete the article

This video is of great help to me. At the same time, I also recommend it to my friends who are beginning to learn KMP algorithm!!!

As long as you have one breath, keep fighting, keep breathing, keep breathing—— Wilderness Hunter

Programmer Think