Java string matching algorithm

Posted by trinitywave on Mon, 08 Nov 2021 10:12:51 +0100

Definition

A string is a finite sequence of zero or more characters.

  • Generally, a string composed of n characters is written as S = "a0a1... an-1" (n ≥ 0), where a_i (0 ≤ i ≤ n-1) is the character at position i
  • n is a finite number and is called the length of the string; when n = 0 the string is empty
  • S is the name of the string, and the character sequence enclosed in quotation marks is the value of the string
  • Each a_i can be a letter, a digit or any other character
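
The notation above maps directly onto Java's String type. The following is only a small sketch; the class name StringBasics and the helper describe are made up for illustration. It lists each position next to its character, so the zero-based positions 0..n-1 and the length n are visible.

```java
class StringBasics {
    // Lists every position i (0-based) together with the character at that position.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(i).append(':').append(s.charAt(i));
            if (i < s.length() - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "abc";
        System.out.println(s.length());   // length n = 3
        System.out.println(describe(s));  // 0:a 1:b 2:c
        System.out.println("".length());  // the empty string has length 0
    }
}
```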

Substring

When processing a string S, it is often necessary to take out a contiguous fragment of it, which is called a substring of S.

  • Specifically, the substring consisting of k consecutive characters starting at position i in the string S is written as substr(S,i,k) = "aiai+1... ai+k-1", where 0 ≤ i < n and 0 ≤ k ≤ n-i
  • Prefix: prefix(S,k) = substr(S,0,k), the first k characters of S
  • Suffix: suffix(S,k) = substr(S,n-k,k), the last k characters of S
  • Space string: a string containing only spaces.
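
These three operations can be sketched on top of Java's built-in String.substring(beginIndex, endIndex). The class name SubstringSketch and the method names below simply mirror the notation above; they are illustrative assumptions, not library functions.

```java
class SubstringSketch {
    // substr(S, i, k): the k consecutive characters of s starting at position i.
    static String substr(String s, int i, int k) {
        // String.substring takes a begin index (inclusive) and an end index (exclusive).
        return s.substring(i, i + k);
    }

    // prefix(S, k) = substr(S, 0, k): the first k characters.
    static String prefix(String s, int k) {
        return substr(s, 0, k);
    }

    // suffix(S, k) = substr(S, n - k, k): the last k characters.
    static String suffix(String s, int k) {
        return substr(s, s.length() - k, k);
    }

    public static void main(String[] args) {
        String s = "ABCDABD";
        System.out.println(substr(s, 1, 3)); // BCD
        System.out.println(prefix(s, 2));    // AB
        System.out.println(suffix(s, 2));    // BD
    }
}
```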

BF (brute-force) method

Basic idea

  1. Start from the first character of the main string and compare it with the first character of the substring. If they are equal, continue comparing the following characters of both strings.
  2. If they are not equal, restart the comparison from the second character of the main string and the first character of the substring, and so on, until every character of the substring equals a run of consecutive characters in the main string. This is a successful match.
  3. If no character sequence equal to the substring can be found in the main string, the match fails.

Backtracking in the main string is required; otherwise a match that begins inside the partially matched region would be missed. For example, when searching for "aab" in "aaab", the match starts at position 1, inside the "aa" that was already compared.

    /**
     * Brute-force (BF) matching: prints the index of the first occurrence of sub in parent.
     *
     * @param parent Main string
     * @param sub Substring
     */
    public static void bruteForce(String parent,String sub){
        //Successfully matched location
        int index = -1;
        //Length of main string
        int pLen = parent.length();
        //Length of substring
        int sLen = sub.length();

        if (pLen<sLen){
            System.out.println("Error. The main string is shorter than the substring.");
            return;
        }

        int i = 0;
        int j = 0;
        while (i<pLen&&j<sLen){
            // Judge whether the characters in the corresponding position are equal
            if (parent.charAt(i)==sub.charAt(j)){
               //If equal, the main string and substring continue to compare
               i++;
               j++;
            }else{
                //Backtrack: the main string restarts one character after the previous starting position
                i = i- j+1;
                j = 0;
            }
        }
        //Match successful
        if (j >= sLen) {
            index = i - j;
            System.out.println("Successful match,index is:" + index);
        } else {// Matching failed
            System.out.println("Match failed.");
        }
    }
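
For reuse and testing it can be handier to return the position instead of printing it. The following is a minimal sketch of the same loop under that assumption; the class name BruteForceSketch and the method name indexOf are hypothetical, and -1 is used here to signal a failed match.

```java
class BruteForceSketch {
    // Returns the index of the first occurrence of sub in parent, or -1 if there is none.
    static int indexOf(String parent, String sub) {
        int pLen = parent.length();
        int sLen = sub.length();
        int i = 0; // position in the main string
        int j = 0; // position in the substring
        while (i < pLen && j < sLen) {
            if (parent.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                // Backtrack: restart one character after the previous starting position.
                i = i - j + 1;
                j = 0;
            }
        }
        return j >= sLen ? i - j : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("BBC ABCDAB ABCDABCDABDE", "ABCDABD")); // 15
        System.out.println(indexOf("aaab", "aab")); // 1, found only because i backtracks
        System.out.println(indexOf("abc", "abd"));  // -1
    }
}
```

For these inputs the results agree with Java's built-in String.indexOf, which gives a quick sanity check.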

KMP algorithm

Its core idea is that the main string never backtracks; after a mismatch, the pattern string is shifted to the right as far as possible.

First, construct the next table (next[j] stores the length of the longest proper prefix of prefix(P, j) that is also a proper suffix of it):

Take ABCDABD as an example to illustrate how to build the next table

P = ABCDABD
j = 0, prefix(P, 0) = φ (the empty string)
next[0] = -1; //By convention

P = ABCDABD
j = 1, prefix(P, 1) = A
Proper prefix: φ
Proper suffix: φ
next[1] = 0;

P = ABCDABD
j = 2, prefix(P, 2) = AB
Proper prefix: A
Proper suffix: B
next[2] = 0;

P = ABCDABD
j = 3, prefix(P, 3) = ABC
Proper prefix: A,AB
Proper suffix: BC,C
next[3] = 0;

P = ABCDABD
j = 4, prefix(P, 4) = ABCD
Proper prefix: A,AB,ABC
Proper suffix: BCD,CD,D
next[4] = 0;

P = ABCDABD
j = 5, prefix(P, 5) = ABCDA
Proper prefix: A,AB,ABC,ABCD
Proper suffix: BCDA,CDA,DA,A
next[5] = 1;

P = ABCDABD
j = 6, prefix(P, 6) = ABCDAB
Proper prefix: A,AB,ABC,ABCD,ABCDA
Proper suffix: BCDAB,CDAB,DAB,AB,B
next[6] = 2;

The resulting next table is:
[-1, 0, 0, 0, 0, 1, 2]
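
The table can be checked by computing it straight from the definition: for each j, try every proper prefix length of prefix(P, j) and keep the longest one that also appears as a suffix. This is a deliberately naive sketch for verification only, not the efficient construction; the class and method names are made up for illustration.

```java
class NextTableByDefinition {
    // next[0] = -1 by convention; for j > 0, next[j] is the length of the longest
    // proper prefix of p[0..j) that is also a proper suffix of p[0..j).
    static int[] nextByDefinition(String p) {
        int[] next = new int[p.length()];
        for (int j = 0; j < p.length(); j++) {
            if (j == 0) {
                next[0] = -1;
                continue;
            }
            String head = p.substring(0, j); // prefix(P, j)
            int best = 0;
            // len < j keeps both the prefix and the suffix proper.
            for (int len = 1; len < j; len++) {
                String suffix = head.substring(j - len);
                if (head.startsWith(suffix)) {
                    best = len;
                }
            }
            next[j] = best;
        }
        return next;
    }

    public static void main(String[] args) {
        // Prints [-1, 0, 0, 0, 0, 1, 2]
        System.out.println(java.util.Arrays.toString(nextByDefinition("ABCDABD")));
    }
}
```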

Code implementation

package string;

public class KMP {
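    //Build the next table
    public static int[] buildNext(String sub){
        //next[j] is the length of the longest proper prefix of sub[0..j) that is also its proper suffix; it tells how far the pattern can shift to the right
        int[] next = new int[sub.length()];
        //An empty pattern has no entries (avoids writing next[0] out of bounds)
        if (sub.isEmpty()){
            return next;
        }
        //Position in the pattern up to which next has been filled in
        int j = 0;
        //Length of the currently matched prefix (-1 is the sentinel)
        int t = next[0] = -1;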

        while (j<sub.length()-1){
            if (t<0||sub.charAt(j)==sub.charAt(t)){
                j++;
                t++;
                next[j] = t;
            }else {
                t = next[t];
            }
        }
        return next;
    }

    public static void kmp(String parent,String sub){
        int[] next = buildNext(sub);
        //Successfully matched location
        int index = -1;
        //Length of main string
        int pLen = parent.length();
        //Length of substring
        int sLen = sub.length();

        if (pLen<sLen){
            System.out.println("Error. The main string is shorter than the substring.");
            return;
        }

        int i = 0;
        int j = 0;
        while (i<pLen&&j<sLen){
            // Judge whether the characters in the corresponding position are equal
            if (j==-1||parent.charAt(i)==sub.charAt(j)){
                //If equal, or the pattern was reset (j == -1), advance both the main string and the substring
                i++;
                j++;
            }else{
                //i unchanged, j=next[j]
                j = next[j];
            }
        }
        //Match successful
        if (j >= sLen) {
            index = i - j;
            System.out.println("Successful match,index is:" + index);
        } else {// Matching failed
            System.out.println("Match failed.");
        }
    }
}
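
As with the brute-force method, a variant that returns the index is easier to reuse and to test. The sketch below follows the same two steps (build the next table, then scan without moving i backwards). The class name KmpSketch, the method name indexOf, the -1 return value for a failed match and the guard for an empty pattern are assumptions added for illustration.

```java
class KmpSketch {
    // Same construction as the next table described above.
    static int[] buildNext(String sub) {
        int[] next = new int[sub.length()];
        if (sub.isEmpty()) {
            return next; // assumption: an empty pattern yields an empty table
        }
        int j = 0;
        int t = next[0] = -1;
        while (j < sub.length() - 1) {
            if (t < 0 || sub.charAt(j) == sub.charAt(t)) {
                j++;
                t++;
                next[j] = t;
            } else {
                t = next[t];
            }
        }
        return next;
    }

    // Returns the index of the first occurrence of sub in parent, or -1 if there is none.
    static int indexOf(String parent, String sub) {
        int[] next = buildNext(sub);
        int i = 0; // position in the main string, never decreases
        int j = 0; // position in the pattern
        while (i < parent.length() && j < sub.length()) {
            if (j == -1 || parent.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j]; // shift the pattern to the right, keep i
            }
        }
        return j >= sub.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(buildNext("ABCDABD")));     // [-1, 0, 0, 0, 0, 1, 2]
        System.out.println(indexOf("BBC ABCDAB ABCDABCDABDE", "ABCDABD"));       // 15
        System.out.println(indexOf("aaab", "aab"));                              // 1
        System.out.println(indexOf("abc", "abd"));                               // -1
    }
}
```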

Topics: Java Algorithm