1, Brute-force solution
The brute-force solution can be written in different ways, and the improved versions that follow build on this brute-force code.
1. Implementation
This solution is easy to understand. We use two pointers to track the current positions in the text string and the pattern string. If the characters match, both pointers advance. Otherwise, the text pointer moves back to the character after the one where this round of matching started, and the pattern pointer is reset to the start of the pattern. Finally, the function returns the difference between the two pointers, which is the text position where the last round of matching began. The return value alone therefore does not say whether the match succeeded: the caller still has to check that the whole pattern fits in the text from that position.
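#pragma once
#include <cstddef>  // std::size_t
#include <cstring>  // std::strlen

// Returns the text index where the last round of matching started.
// A full match was found only if that index + patternLength <= textLength.
inline int bruteForce1(char* pattern, char* text) {
    std::size_t patternLength = std::strlen(pattern);
    std::size_t textLength = std::strlen(text);
    std::size_t indexOfPattern = 0;
    std::size_t indexOfText = 0;
    while (indexOfPattern < patternLength && indexOfText < textLength) {
        if (text[indexOfText] == pattern[indexOfPattern]) {
            ++indexOfPattern;
            ++indexOfText;
        } else {
            // Restart one character after the start of this round.
            // Written this way so it does not rely on unsigned wraparound.
            indexOfText = indexOfText - indexOfPattern + 1;
            indexOfPattern = 0;
        }
    }
    return static_cast<int>(indexOfText - indexOfPattern);
}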
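Since the return value is only a position, the caller needs one more check to tell a full match from a failure. A minimal sketch of such a check follows. The helper name isFullMatch is hypothetical and not part of the original code, and it assumes that pos was returned by bruteForce1 for the same pattern and text.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the original source).
// Assumes `pos` is the value the matcher returned for this pattern/text pair.
// A full match leaves room for the whole pattern starting at `pos`.
bool isFullMatch(int pos, const char* pattern, const char* text) {
    if (pos < 0) {
        return false;
    }
    std::size_t patternLength = std::strlen(pattern);
    std::size_t textLength = std::strlen(text);
    return static_cast<std::size_t>(pos) + patternLength <= textLength;
}
```

For example, with the text "textString for exclude" and the pattern "cludede", the matcher returns 17, and the check rejects it because 17 + 7 exceeds the text length of 22.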
2. Test code
#include "CppUnitTest.h"
#include "../../Algorithm/stringMatch/bruteForce.h"
#include <cstring>  // strlen, strncmp
#include <vector>
#include <random>
#include <cmath>
#include <numeric>

using namespace Microsoft::VisualStudio::CppUnitTestFramework;
using namespace std;

namespace BruteForce
{
    TEST_CLASS(BruteForce)
    {
    public:
        // The pattern occurs in the text: the result is its start index.
        TEST_METHOD(TestInclude)
        {
            char text[] = "textString for include";
            char pattern[] = "or i";
            auto pos = bruteForce1(pattern, text);
            int expected = 12;
            Assert::IsTrue(expected == pos);
            Assert::IsTrue(!strncmp(text + pos, pattern, strlen(pattern)));
        }

        // Only a prefix of the pattern matches at the end of the text.
        TEST_METHOD(TestPartial)
        {
            char text[] = "textString for exclude";
            char pattern[] = "cludede";
            auto pos = bruteForce1(pattern, text);
            int expected = 17;
            Assert::IsTrue(expected == pos);
        }

        // No match at all: the result equals the text length.
        TEST_METHOD(TestExclude)
        {
            char text[] = "textString for exclude";
            char pattern[] = "or i";
            auto pos = bruteForce1(pattern, text);
            int expected = static_cast<int>(strlen(text));
            Assert::IsTrue(expected == pos);
        }
    };
}
2, KMP algorithm
1. Algorithm principle
The KMP algorithm mainly exploits the structure of the pattern string itself. Recall the brute-force solution above. Its inefficiency comes from moving the text pointer back after every failed comparison, so that the same text characters are compared against the pattern again and again. However, if we analyze the whole comparison process, we find that the run of successful comparisons before a failure has already told us enough about those text characters. Because they are equal to a prefix of the pattern, this information can be obtained by analyzing the pattern string alone.
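To make this cost concrete, the following sketch counts the character comparisons made by a brute-force matcher. The function name countBruteForceComparisons and the sample strings are assumptions made for illustration; the loop itself mirrors the brute-force solution from the first section.

```cpp
#include <cstddef>
#include <string>

// Hypothetical instrumented matcher (illustration only).
// Returns how many character comparisons the brute-force loop performs.
std::size_t countBruteForceComparisons(const std::string& pattern,
                                       const std::string& text) {
    std::size_t comparisons = 0;
    std::size_t indexOfPattern = 0;
    std::size_t indexOfText = 0;
    while (indexOfPattern < pattern.size() && indexOfText < text.size()) {
        ++comparisons;
        if (text[indexOfText] == pattern[indexOfPattern]) {
            ++indexOfPattern;
            ++indexOfText;
        } else {
            // The text pointer falls back, so earlier characters are read again.
            indexOfText = indexOfText - indexOfPattern + 1;
            indexOfPattern = 0;
        }
    }
    return comparisons;
}
```

With the pattern "aab" and the text "aaaaab", this loop performs 12 comparisons on a text of only 6 characters, because each failed round re-reads characters that were already matched.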
Imagine the following pattern string: abcdeab. If the comparison fails at the seventh character, the six consecutive text characters starting from the beginning of this round are abcdea. There is then no need to move the text pointer back as the brute-force solution does. We already know that the second to fifth of these characters (bcde) cannot match the first character of the pattern string (i.e. "a"), and that the sixth character (a) does match it. Therefore, instead of moving the text pointer, we directly move the pattern pointer to the second character.
Such information (the next table) can be obtained from the pattern string alone.
2. next table
When we observe how far the pattern string moves in the figure, we find that on a mismatch the shift is determined by the already matched part of the pattern string: we look for its longest proper prefix that is also its suffix, and realign the pattern so that this prefix takes the place of that suffix.
According to its principle, we can construct the recurrence formula of the next table as follows:
$$next[j+1] \le next[j] + 1$$
Equality holds if and only if pattern[j] == pattern[next[j]]. When the inequality is strict, we fall back to the next table entry at index next[j], that is next[next[j]], and repeat the comparison there until it succeeds or no candidate is left.
Note that each entry next[j] is the length of the longest proper prefix of the characters in front of position j that is also their suffix; the character at position j itself is not included.
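As a cross-check of this definition, the table can also be computed directly, without the recurrence. The sketch below is a deliberately naive reference version, not the book's method: the name naiveNext is hypothetical, and it follows the convention used here that the first entry is -1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reference implementation written straight from the definition.
// next[j] = length of the longest proper prefix of pattern[0..j) that is also
// a suffix of pattern[0..j); next[0] is the sentinel -1.
std::vector<int> naiveNext(const std::string& pattern) {
    std::vector<int> next(pattern.size(), 0);
    if (pattern.empty()) {
        return next;
    }
    next[0] = -1;
    for (std::size_t j = 1; j < pattern.size(); ++j) {
        // Try candidate lengths from the longest (j - 1) down to 1.
        for (std::size_t len = j - 1; len > 0; --len) {
            if (pattern.compare(0, len, pattern, j - len, len) == 0) {
                next[j] = static_cast<int>(len);
                break;
            }
        }
    }
    return next;
}
```

For the pattern abcdeab this yields -1, 0, 0, 0, 0, 0, 1: only the last position has a non-empty matching prefix and suffix (the single character a), which is why a mismatch at the seventh character moves the pattern pointer to the second character.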
3. next table implementation
The following code uses -1 as a sentinel:
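#include <cstddef>  // std::size_t
#include <cstring>  // std::strlen

// Builds the next table for the pattern. The caller owns the returned array
// and must release it with delete[].
int* buildNext(char* pattern) {
    std::size_t patternLength = std::strlen(pattern);
    // Allocate at least one entry so next[0] is valid for an empty pattern.
    int* next = new int[patternLength > 0 ? patternLength : 1];
    next[0] = -1;  // sentinel
    std::size_t indexOfPattern = 0;
    int match = -1;
    // indexOfPattern + 1 avoids unsigned underflow when patternLength is 0.
    while (indexOfPattern + 1 < patternLength) {
        if (match < 0 || pattern[indexOfPattern] == pattern[match]) {
            next[++indexOfPattern] = ++match;
        } else {
            // Fall back to the next shorter candidate prefix.
            match = next[match];
        }
    }
    return next;
}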
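4. KMP implementation
// Returns the text index where the last round of matching started.
// A full match was found only if that index + patternLength <= textLength.
int kmp(char* pattern, char* text) {
    int* next = buildNext(pattern);
    int patternLength = static_cast<int>(strlen(pattern));
    int textLength = static_cast<int>(strlen(text));
    int indexOfPattern = 0;
    int indexOfText = 0;
    while (indexOfPattern < patternLength && indexOfText < textLength) {
        // indexOfPattern < 0 means the sentinel was hit: advance both pointers.
        if (indexOfPattern < 0 || text[indexOfText] == pattern[indexOfPattern]) {
            ++indexOfPattern;
            ++indexOfText;
        } else {
            // Keep the text pointer and realign the pattern via the next table.
            indexOfPattern = next[indexOfPattern];
        }
    }
    delete[] next;
    return indexOfText - indexOfPattern;
}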
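The version above manages the next table with new and delete[]. As an alternative sketch, the same two steps can be written with std::string and std::vector so that the table is released automatically. The names buildNextVec and kmpVec are hypothetical, and this variant is an assumption about how one might restructure the code rather than the book's implementation; the return value keeps the same meaning.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant of buildNext that returns the table by value.
std::vector<int> buildNextVec(const std::string& pattern) {
    // At least one entry, so the sentinel slot exists for an empty pattern.
    std::vector<int> next(pattern.empty() ? 1 : pattern.size(), 0);
    next[0] = -1;
    int match = -1;
    std::size_t j = 0;
    while (j + 1 < pattern.size()) {
        if (match < 0 || pattern[j] == pattern[match]) {
            next[++j] = ++match;
        } else {
            match = next[match];
        }
    }
    return next;
}

// Hypothetical variant of kmp: returns the start of the last matching round,
// which is a full match only if result + pattern.size() <= text.size().
int kmpVec(const std::string& pattern, const std::string& text) {
    const std::vector<int> next = buildNextVec(pattern);
    const int patternLength = static_cast<int>(pattern.size());
    const int textLength = static_cast<int>(text.size());
    int indexOfPattern = 0;
    int indexOfText = 0;
    while (indexOfPattern < patternLength && indexOfText < textLength) {
        if (indexOfPattern < 0 || text[indexOfText] == pattern[indexOfPattern]) {
            ++indexOfPattern;
            ++indexOfText;
        } else {
            indexOfPattern = next[indexOfPattern];
        }
    }
    return indexOfText - indexOfPattern;
}
```

On the three cases from the test code in the first section, this variant returns the same positions as the brute-force solution: 12 for the full match, 17 for the partial match, and 22 (the text length) when nothing matches.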
3, Reference
The above implementations and figures are from Data Structures (C++ Language Version).