1. Definition of string
- String: a finite sequence of zero or more characters, written s = "a_1 a_2 ··· a_n" (n ≥ 0)
  - s: the name of the string; the character sequence enclosed in double quotation marks is its value
  - a_i (1 ≤ i ≤ n): a letter, a digit, or any other character
  - n: the length of the string
  - Empty string (null string): a string of zero characters, whose length is zero
- Substring: any sequence of consecutive characters within a string
- Main string: a string that contains the substring
- Position: the ordinal number of a character in the sequence, counted from 1
- Position of a substring in the main string: the position in the main string of the substring's first character
a = "abcd";
b = "ABCD";
c = "abcdABCD";
d = "abcd ABCD";
// a and b are substrings of both c and d
// The position of a in c and in d is 1
// The position of b in c is 5
// The position of b in d is 6 (the space occupies position 5)
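The positions listed above can be checked with a short C sketch. It assumes plain NUL-terminated C strings and relies on the standard strstr(); the helper name position_of is hypothetical and not part of the original notes.

```c
#include <assert.h>
#include <string.h>

/* Return the 1-based position of the first occurrence of sub in main_str,
   or 0 if sub does not occur. Hypothetical helper for illustration only. */
int position_of(const char *main_str, const char *sub) {
    assert(main_str != NULL && sub != NULL);
    const char *hit = strstr(main_str, sub);
    /* strstr returns a pointer to the first match; convert it to a 1-based index */
    return hit != NULL ? (int)(hit - main_str) + 1 : 0;
}
```

For example, position_of("abcd ABCD", "ABCD") evaluates to 6, matching the comment for string d. Note that strstr treats an empty sub as matching at the start, so this sketch reports position 1 in that case.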
- Equal: two strings are equal when their values are equal, that is:
  - The two strings have the same length
  - The characters at each corresponding position are equal
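The two conditions for equality translate almost literally into code. The following is a minimal sketch for plain C strings; the function name strings_equal is an assumption made for illustration (in practice strcmp() gives the same answer).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two strings are equal when they have the same length and
   the characters at every corresponding position are equal. */
bool strings_equal(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    if (len_a != len_b) {
        return false; /* condition 1: equal length */
    }
    for (size_t i = 0; i < len_a; i++) {
        if (a[i] != b[i]) {
            return false; /* condition 2: equal characters position by position */
        }
    }
    return true;
}
```

Under this definition "abcd" and "abcd " are not equal, because the trailing space changes the length.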
- Blank string: a string consisting of one or more space characters
  - It is not the empty string
  - Its length is the number of space characters it contains
  - The symbol "∅" denotes the empty string, not a blank string
2. Storage structure and operation of string
2.1 String storage structure
- Most strings use a sequential storage structure
2.1.1 Sequential storage of strings
- Fixed-length sequential storage structure
#define MAXLEN 10
typedef struct {
    char ch[MAXLEN + 1]; // One-dimensional array that stores the characters
    int length;          // The current length of the string
} SString;
- Heap sequential storage structure
typedef struct {
    char *ch;   // For a non-empty string, storage is allocated to fit the string length; otherwise ch is NULL
    int length; // The current length of the string
} HString;
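To make the comment on ch concrete, the sketch below shows one way an assignment could allocate the storage area on the heap. The function names hstring_assign and hstring_clear are assumptions for illustration; the sketch also assumes that ch is initialized to NULL before the first assignment, and that no terminating '\0' is stored because length marks the end.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *ch;   /* NULL for an empty string, otherwise a heap block of `length` bytes */
    int length; /* current length of the string */
} HString;

/* Copy chars into s, sizing the storage to the string length.
   Returns 1 on success, 0 if allocation fails (s is left unchanged). */
int hstring_assign(HString *s, const char *chars) {
    assert(s != NULL && chars != NULL);
    size_t len = strlen(chars);
    char *buf = NULL;
    if (len > 0) {
        buf = malloc(len);
        if (buf == NULL) {
            return 0;
        }
        memcpy(buf, chars, len);
    }
    free(s->ch); /* free(NULL) is a no-op, so this is safe for an empty string */
    s->ch = buf;
    s->length = (int)len;
    return 1;
}

/* Release the storage and reset s to the empty string. */
void hstring_clear(HString *s) {
    assert(s != NULL);
    free(s->ch);
    s->ch = NULL;
    s->length = 0;
}
```

A caller would start from `HString s = { NULL, 0 };`, call hstring_assign(&s, "abcd"), and finish with hstring_clear(&s). Unlike SString, the capacity is not limited by MAXLEN.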
2.1.2 Chain storage of strings
#define CHUNKSIZE 5
typedef struct Chunk {
    char ch[CHUNKSIZE];
    struct Chunk *next;
} Chunk;
typedef struct {
    Chunk *head; // Pointer to the first chunk of the string
    Chunk *tail; // Pointer to the last chunk of the string
    int length;  // The current length of the string
} LString;
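As a rough illustration of how the chunked structure is filled, the sketch below splits a C string into nodes of CHUNKSIZE characters. The function names lstring_assign and lstring_free are hypothetical, and padding the unused slots of the last chunk with '#' is an assumption of this sketch, not something the notes specify.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNKSIZE 5
#define PAD_CHAR '#' /* assumption: filler for unused slots in the last chunk */

typedef struct Chunk {
    char ch[CHUNKSIZE];
    struct Chunk *next;
} Chunk;

typedef struct {
    Chunk *head; /* first chunk */
    Chunk *tail; /* last chunk */
    int length;  /* number of characters actually stored */
} LString;

/* Release every chunk and reset s to the empty string. */
void lstring_free(LString *s) {
    assert(s != NULL);
    Chunk *c = s->head;
    while (c != NULL) {
        Chunk *next = c->next;
        free(c);
        c = next;
    }
    s->head = s->tail = NULL;
    s->length = 0;
}

/* Build s from chars, CHUNKSIZE characters per node.
   s must not already own chunks. Returns 1 on success, 0 on allocation failure. */
int lstring_assign(LString *s, const char *chars) {
    assert(s != NULL && chars != NULL);
    size_t len = strlen(chars);
    s->head = s->tail = NULL;
    s->length = 0;
    for (size_t off = 0; off < len; off += CHUNKSIZE) {
        Chunk *c = malloc(sizeof *c);
        if (c == NULL) {
            lstring_free(s);
            return 0;
        }
        size_t take = len - off;
        if (take > (size_t)CHUNKSIZE) {
            take = CHUNKSIZE;
        }
        memset(c->ch, PAD_CHAR, CHUNKSIZE); /* pad first, then overwrite the used part */
        memcpy(c->ch, chars + off, take);
        c->next = NULL;
        if (s->tail != NULL) {
            s->tail->next = c;
        } else {
            s->head = c;
        }
        s->tail = c;
    }
    s->length = (int)len;
    return 1;
}
```

Assigning "abcdefgh" produces two chunks, "abcde" and "fgh##". The trade-off of this layout is that a larger CHUNKSIZE wastes less space on pointers but more on padding.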
2.2 Pattern matching algorithms for strings
- Pattern matching (string matching): the operation of locating a substring (the pattern) within a main string
- Supplement: the code below uses the fixed-length sequential storage structure
  - This is not recommended in practice, because it is too cumbersome
  - It is simpler to define the string directly as a char array and use strlen() to obtain its length
2.2.1 BF algorithm
- BF: Brute-Force
- Steps
  - Two counters i and j indicate the character positions currently being compared in the main string S and the pattern (substring) T. The initial value of i is pos and the initial value of j is 1 (the code below uses 0-based array indices, so j starts at 0 there)
  - While i and j are no greater than the lengths of S and T respectively, repeat the following:
    - Compare S.ch[i] with T.ch[j]
      - Equal: advance both i and j to the next position and continue comparing the subsequent characters
      - Unequal: back up and start matching again, comparing the first character of the pattern with the character after the previous starting position in the main string
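void Index_BF(SString S, SString T, int index) {
    int i = index; // Current position in the main string (0-based)
    int j = 0;     // Current position in the pattern (0-based)
    while (i <= S.length - 1 && j <= T.length - 1) {
        printf("i = %d, j = %d\n", i, j);
        if (S.ch[i] == T.ch[j]) {
            // Characters match: advance both positions
            i++;
            j++;
        } else {
            // Mismatch: restart from the next starting position in the main string
            i = ++index;
            j = 0;
        }
    }
    if (j > T.length - 1) {
        printf("Success\n");
    } else {
        printf("Error\n");
    }
}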
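The supplement earlier suggests working on plain C strings with strlen() instead of SString. A minimal sketch of the same brute-force loop under that assumption follows; the name index_bf and the choice to return a 1-based position (0 for "not found", and also 0 for an empty pattern) are decisions of this sketch, not of the original code, which only prints Success or Error.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search for t in s, starting at 1-based position pos.
   Returns the 1-based position of the first match, or 0 if there is none. */
int index_bf(const char *s, const char *t, int pos) {
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    if (pos < 1 || m == 0) {
        return 0;
    }
    int i = pos - 1; /* 0-based index into s */
    int j = 0;       /* 0-based index into t */
    while (i < n && j < m) {
        if (s[i] == t[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1; /* back up to the character after the current start */
            j = 0;
        }
    }
    return (j == m) ? i - m + 1 : 0;
}
```

For instance, index_bf("abcd ABCD", "ABCD", 1) yields 6. Here the back-up step is written as i - j + 1 instead of keeping a separate start counter; both forms restart at the same place.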
- Best case: every unsuccessful attempt fails on the very first comparison, that is, when the first character of the pattern is compared with the corresponding character of the main string
S = "aaaaaba";
T = "ba";
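To see why this example is a best case, one can count the character comparisons the brute-force method performs. The sketch below is an instrumented variant for plain C strings; the function name bf_count_comparisons and its out-parameter are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Run a brute-force search for t in s and return how many character
   comparisons were made. *position receives the 1-based match position,
   or 0 if t was not found. */
int bf_count_comparisons(const char *s, const char *t, int *position) {
    assert(s != NULL && t != NULL && position != NULL);
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    int i = 0, j = 0, count = 0;
    while (i < n && j < m) {
        count++; /* one comparison of s[i] with t[j] */
        if (s[i] == t[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;
            j = 0;
        }
    }
    *position = (m > 0 && j == m) ? i - m + 1 : 0;
    return count;
}
```

For S = "aaaaaba" and T = "ba", the five failed attempts cost one comparison each and the successful attempt at position 6 costs two, for a total of 7, which is (6 - 1) + 2.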
- Let the main string have length n and the pattern have length m, and assume a match is equally likely to succeed at each of the n − m + 1 possible starting positions, so p_i = 1/(n − m + 1). A best-case match at position i costs i − 1 + m comparisons, so the average number of comparisons is
$$\sum_{i=1}^{n-m+1} p_i\,(i-1+m) = \frac{1}{n-m+1}\sum_{i=1}^{n-m+1}(i-1+m) = \frac{1}{2}(n+m)$$
Therefore, the average time complexity in the best case is O(n + m)
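The closed form (n + m)/2 can be sanity-checked numerically by evaluating the sum directly. This is only a small verification sketch; the function name average_comparisons is hypothetical.

```c
#include <assert.h>

/* Average best-case comparison count: each of the n - m + 1 starting
   positions i is equally likely, and a match at position i costs
   (i - 1) + m comparisons. Requires 1 <= m <= n. */
double average_comparisons(int n, int m) {
    assert(m >= 1 && m <= n);
    int positions = n - m + 1;
    double total = 0.0;
    for (int i = 1; i <= positions; i++) {
        total += (i - 1) + m;
    }
    return total / positions;
}
```

For n = 7 and m = 2 the terms are 2, 3, 4, 5, 6, 7, whose mean is 4.5, the same as (7 + 2)/2.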
2.2.2 KMP algorithm (can't understand)
- Whenever a mismatch occurs during a matching attempt, the pointer i is not moved back. Instead, the "partial match" already obtained is used to slide the pattern as far to the right as possible, and the comparison then continues
void Next(SString T, int* next); // Forward declaration: Next is defined after Index_KMP

void Index_KMP(SString S, SString T) {
    int next[MAXLEN + 1]; // next[1..T.length] is used, so MAXLEN + 1 slots are needed
    // Initialize the next array according to the pattern string T
    Next(T, next);
    int i = 1; // 1-based position in the main string
    int j = 1; // 1-based position in the pattern
    while (i <= S.length && j <= T.length) {
        // j == 0: the first character of the pattern differs from the current character of the main string
        // S.ch[i-1] == T.ch[j-1]: the characters at the current positions are equal
        // In both cases, i and j move forward
        if (j == 0 || S.ch[i-1] == T.ch[j-1]) {
            i++;
            j++;
        } else {
            // Mismatch: i does not move, and j falls back to next[j]
            j = next[j];
        }
    }
    if (j > T.length) {
        printf("Success\n");
    } else {
        printf("Error\n");
    }
}

void Next(SString T, int* next) {
    int i = 1;
    next[1] = 0;
    int j = 0;
    while (i < T.length) {
        if (j == 0 || T.ch[i-1] == T.ch[j-1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}
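Since the next array is the hard part of KMP, it may help to compute it for a concrete pattern. The sketch below restates the same two loops for plain C strings, keeping the 1-based convention of the code above (next[0] is unused). The names build_next and kmp_index, and returning a 1-based position with 0 meaning "not found" (or empty pattern, or allocation failure), are assumptions of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill next[1..m] for the pattern t of length m (m >= 1).
   next must have room for at least m + 1 ints. */
void build_next(const char *t, int m, int *next) {
    assert(t != NULL && next != NULL && m >= 1);
    int i = 1, j = 0;
    next[1] = 0;
    while (i < m) {
        if (j == 0 || t[i - 1] == t[j - 1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j]; /* fall back within the pattern */
        }
    }
}

/* KMP search: 1-based position of the first occurrence of t in s, or 0. */
int kmp_index(const char *s, const char *t) {
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    if (m == 0 || m > n) {
        return 0;
    }
    int *next = malloc((size_t)(m + 1) * sizeof *next);
    if (next == NULL) {
        return 0;
    }
    build_next(t, m, next);
    int i = 1, j = 1;
    while (i <= n && j <= m) {
        if (j == 0 || s[i - 1] == t[j - 1]) {
            i++;
            j++;
        } else {
            j = next[j]; /* i never moves backwards */
        }
    }
    free(next);
    return (j > m) ? i - m : 0;
}
```

For the pattern "abaabcac", build_next produces next[1..8] = 0 1 1 2 2 3 1 2. Reading one entry: after a mismatch at j = 6 (the 'c'), the pattern resumes at j = 3, because the two characters "ab" just before position 6 equal the first two characters of the pattern.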
Test code
#include <stdio.h>
#include <string.h>

#define MAXLEN 10

// The type must be defined before the prototypes that use it
typedef struct {
    char ch[MAXLEN + 1];
    int length;
} SString;

void InitString(SString* S);
void AssignString(SString* S, const char* C);
void Index_BF(SString S, SString T, int index);
void Index_KMP(SString S, SString T);
void Next(SString T, int* next);

int main(void) {
    SString S1;
    SString S2;
    char c1[10] = "abcABC";
    char c2[5] = "abc";
    InitString(&S1);
    InitString(&S2);
    printf("****************\n");
    AssignString(&S1, c1);
    AssignString(&S2, c2);
    printf("****************\n");
    Index_BF(S1, S2, 0);
    printf("****************\n");
    Index_KMP(S1, S2);
    printf("****************\n");
    return 0;
}

void InitString(SString* S) {
    S->ch[0] = '\0';
    S->length = 0;
    printf("Init Success\n");
}

void AssignString(SString* S, const char* C) {
    int len = (int)strlen(C);
    if (len > MAXLEN) {
        len = MAXLEN; // Truncate so the fixed-length array is never overrun
    }
    S->length = len;
    for (int i = 0; i < len; i++) {
        S->ch[i] = C[i];
    }
    S->ch[len] = '\0';
    printf("Assign Success\n");
}

void Index_BF(SString S, SString T, int index) {
    int i = index; // Current position in the main string (0-based)
    int j = 0;     // Current position in the pattern (0-based)
    while (i <= S.length - 1 && j <= T.length - 1) {
        printf("i = %d, j = %d\n", i, j);
        if (S.ch[i] == T.ch[j]) {
            i++;
            j++;
        } else {
            // Mismatch: restart from the next starting position in the main string
            i = ++index;
            j = 0;
        }
    }
    if (j > T.length - 1) {
        printf("Success\n");
    } else {
        printf("Error\n");
    }
}

void Index_KMP(SString S, SString T) {
    int next[MAXLEN + 1]; // next[1..T.length] is used, so MAXLEN + 1 slots are needed
    // Initialize the next array according to the pattern string T
    Next(T, next);
    int i = 1;
    int j = 1;
    while (i <= S.length && j <= T.length) {
        // j == 0: the first character of the pattern differs from the current character of the main string
        // S.ch[i-1] == T.ch[j-1]: the characters at the current positions are equal
        // In both cases, i and j move forward
        if (j == 0 || S.ch[i-1] == T.ch[j-1]) {
            i++;
            j++;
        } else {
            // Mismatch: i does not move, and j falls back to next[j]
            j = next[j];
        }
    }
    if (j > T.length) {
        printf("Success\n");
    } else {
        printf("Error\n");
    }
}

void Next(SString T, int* next) {
    int i = 1;
    next[1] = 0;
    int j = 0;
    while (i < T.length) {
        if (j == 0 || T.ch[i-1] == T.ch[j-1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}