How Java splits strings

Posted by mcl on Wed, 22 Dec 2021 05:19:16 +0100

Suppose we have the string "Silent King II, an interesting programmer" and need to split it at the comma ",", so that the first part is "Silent King II" (before the comma) and the second part is "an interesting programmer" (after the comma).
You can simply use the split() method of the String class.
Before splitting, however, first check whether the string actually contains a comma; if it does not, throw an exception.

public static void main(String[] args) {
    String str = "Tossing Xiaofei, an interesting programmer";
    if (str.contains(",")) {
        String[] parts = str.split(",");
        System.out.println("Part I:" + parts[0] + " Part II:" + parts[1]);
    } else {
        throw new IllegalArgumentException("The current string does not contain a comma");
    }
}
// Output: Part I:Tossing Xiaofei Part II: an interesting programmer

This works because the string is known in advance and, more importantly, so is the separator. Otherwise, trouble follows. The split() method interprets its argument as a regular expression, and about a dozen symbols have a special meaning there. If you replace the separator (the comma) in the code above with one of these special symbols, the program behaves as follows when it runs.

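1. Backslash \ (PatternSyntaxException)
2. Caret ^ (ArrayIndexOutOfBoundsException)
3. Dollar sign $ (ArrayIndexOutOfBoundsException)
4. Period . (ArrayIndexOutOfBoundsException)
5. Vertical bar | (no exception, but the string is split between every character)
6. Question mark ? (PatternSyntaxException)
7. Asterisk * (PatternSyntaxException)
8. Plus sign + (PatternSyntaxException)
9. Left or right parenthesis ( ) (PatternSyntaxException)
10. Left square bracket [ (PatternSyntaxException; a lone right bracket ] is treated literally)
11. Left brace { (PatternSyntaxException; a lone right brace } is treated literally)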
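
Some of the behaviors in the list above can be checked with a small sketch. The class MetacharSplitDemo and the helper trySplit are hypothetical names introduced only for illustration; the helper reports what split() does with a raw separator instead of letting the exception escape. The results noted in the comments assume Java 8 or later.

```java
import java.util.Arrays;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the original article.
class MetacharSplitDemo {

    // Splits text on the raw (unescaped) separator and describes the outcome.
    static String trySplit(String text, String separator) {
        try {
            return Arrays.toString(text.split(separator));
        } catch (PatternSyntaxException e) {
            return "PatternSyntaxException";
        }
    }

    public static void main(String[] args) {
        // "." matches every character, so only empty strings remain and the array is empty.
        System.out.println(trySplit("a.b", "."));   // []
        // "|" matches the empty string, so the text is cut between every character.
        System.out.println(trySplit("a|b", "|"));   // [a, |, b]
        // "?" on its own is not a valid regular expression.
        System.out.println(trySplit("a?b", "?"));   // PatternSyntaxException
        // The escaped period gives the intended result.
        System.out.println(trySplit("a.b", "\\.")); // [a, b]
    }
}
```

An empty array is why accessing parts[0] or parts[1] fails with ArrayIndexOutOfBoundsException in the period case.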

What should we do? The answer is to write the separator as a proper regular expression.

I found an open-source regular expression tutorial on GitHub that is very detailed. Regular expressions inevitably feel unfamiliar when you first start writing them, so keep this document handy. It doesn't matter if you can't remember everything; just look things up when you need them.

https://github.com/cdoco/learn-regex-zh

In addition to this document, there is also one:

https://github.com/cdoco/common-regex

Now let's use the English period "." as the separator.

String str = "Tossing Xiaofei.An interesting programmer";
if (str.contains(".")) {
    String[] parts = str.split("\\.");
    System.out.println("Part I:" + parts[0] + " Part II:" + parts[1]);
} else {
    throw new IllegalArgumentException("The current string does not contain a period");
}

Because the period is a special symbol in regular expressions, it can't be passed to split() directly; you need to use the regular expression \\. instead.

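Why use two backslashes?
Because the backslash is itself an escape character in Java string literals, it has to be escaped with another backslash. The literal "\\." in source code produces the two-character regular expression \. which matches a literal period.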
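
The two layers of escaping can be seen in a minimal sketch. The class name EscapeLayersDemo and the constant REGEX are hypothetical, introduced only for this illustration.

```java
// Hypothetical demo class; not part of the original article.
class EscapeLayersDemo {

    // The Java literal "\\." holds two characters: a backslash and a period.
    static final String REGEX = "\\.";

    public static void main(String[] args) {
        System.out.println(REGEX.length());            // 2
        System.out.println(REGEX);                     // \.
        System.out.println("a.b".split(REGEX).length); // 2
    }
}
```

The Java compiler consumes the first backslash, and the regular expression engine consumes the second.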

Of course, you can also wrap the period "." in []. Square brackets are also regular expression syntax: they match any single character listed between them.

str.split("[.]");

In addition, you can use the quote() method of the Pattern class to wrap the period ".". This method returns the string wrapped in \Q and \E, which tells the regular expression engine to treat everything in between literally.

String [] parts = str.split(Pattern.quote("."));
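
Pattern.quote() is convenient when the separator is not known in advance. The following is a minimal sketch; the class QuoteSplitDemo and the helper splitLiteral are hypothetical names, not part of any library.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
class QuoteSplitDemo {

    // Splits text on a separator that is taken literally, whatever characters it contains.
    static String[] splitLiteral(String text, String separator) {
        return text.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote(".")); // \Q.\E
        for (String sep : new String[] {".", "|", "?", "*", "+", "^", "$", "\\"}) {
            String[] parts = splitLiteral("left" + sep + "right", sep);
            System.out.println(sep + " -> " + parts[0] + " / " + parts[1]);
        }
    }
}
```

Every separator in the loop yields the two parts "left" and "right", with no special handling per symbol.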

When the parameter of the split() method is a regular expression, the method will eventually execute the following line of code:

return Pattern.compile(regex).split(this, limit);

This means that you have a new choice to split strings. Instead of using the split() method of the String class, you can directly use the following method.

private static Pattern twopart = Pattern.compile("\\.");

public static void main(String[] args) {
    String[] parts = twopart.split("Silent King II.An interesting programmer");
    System.out.println("Part I:" + parts[0] + " Part II:" + parts[1]);
}

Why declare the Pattern as static?
Because the pattern is fixed, compiling it once into a static field means it is not recompiled on every call, which improves efficiency.
In addition, you can use the Pattern and Matcher classes together to split strings. The advantage is that you can impose stricter constraints on the string being split.
Look at this sample code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestPatternMatch {
    /**
     * Compile the pattern once and reuse it to improve efficiency
     */
    private static final Pattern twopart = Pattern.compile("(.+)\\.(.+)");

    public static void main(String[] args) {
        checkString("Silent King II.An interesting programmer");
        checkString("Silent King II.");
        checkString(".An interesting programmer");
    }

    private static void checkString(String str) {
        Matcher m = twopart.matcher(str);
        // matches() succeeds only if the whole string fits the pattern
        if (m.matches()) {
            System.out.println("Part I:" + m.group(1) + " Part II:" + m.group(2));
        } else {
            System.out.println("Mismatch");
        }
    }
}

The regular expression (.+)\\.(.+) means that the string must not only be divided into two parts at the period, but must also have content both before and after the period.

Program output:

Part I:Silent King II Part II:An interesting programmer
Mismatch
Mismatch

However, using Matcher for such simple strings is rather heavyweight. The split() method of the String class is still the first choice, and it has some other useful tricks. For example, if you want to keep the separator at the end of the first part, you can do this:

String cmower = "Silent King II, an interesting programmer";
if (cmower.contains(",")) {
    String [] parts = cmower.split("(?<=,)");
    System.out.println("Part I:" + parts[0] +" Part II:" + parts[1]);
}

The results of the program output are as follows:

Part I:Silent King II, Part II: an interesting programmer

You can see that the separator "," is kept at the end of the first part. If you want it at the start of the second part instead, you can do this:

String [] parts = cmower.split("(?=,)");

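What are ?<= and ?=?
They are assertions in regular expressions: (?<=,) is a lookbehind and (?=,) is a lookahead. Both match a position rather than a character, so the separator itself is not consumed.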
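
To make the difference between the two assertions concrete, here is a minimal sketch. The class and method names (LookaroundSplitDemo, keepSeparatorLeft, keepSeparatorRight) are hypothetical; the patterns are the same two used above. The lookbehind (?<=,) matches the empty position just after a comma, and the lookahead (?=,) matches the empty position just before one.

```java
// Hypothetical demo class; not part of the original article.
class LookaroundSplitDemo {

    // Lookbehind: cut right after each comma, so the comma stays on the left part.
    static String[] keepSeparatorLeft(String text) {
        return text.split("(?<=,)");
    }

    // Lookahead: cut right before each comma, so the comma stays on the right part.
    static String[] keepSeparatorRight(String text) {
        return text.split("(?=,)");
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", keepSeparatorLeft("a,b,c")));  // a, | b, | c
        System.out.println(String.join(" | ", keepSeparatorRight("a,b,c"))); // a | ,b | ,c
    }
}
```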

The split() method also accepts two parameters: the first is the separator (a regular expression), and the second is a limit on the number of substrings returned.

String cmower = "Silent King II, an interesting programmer, dotes on him";
if (cmower.contains(",")) {
    String [] parts = cmower.split(",", 2);
    System.out.println("Part I:" + parts[0] +" Part II:" + parts[1]);
}

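Stepping through split() in a debugger shows what happens: when two parameters are passed, substring() is called directly to cut out each part, and once the limit is reached the rest of the string is left unsplit. Here that means everything after the first comma, including the second comma, ends up in the second part.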

Results of program output:

Part I:Silent King II Part II: an interesting programmer, dotes on him
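
The limit parameter also accepts zero and negative values, which the example above does not cover. According to the String.split documentation, a limit of 0 (the default for the one-argument form) drops trailing empty strings, while a negative limit keeps them. The sketch below uses a hypothetical class SplitLimitDemo and helper splitWithLimit to compare the three cases.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the original article.
class SplitLimitDemo {

    // Splits on a comma with the given limit.
    static String[] splitWithLimit(String text, int limit) {
        return text.split(",", limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,";
        // Positive limit: at most that many parts; the last part holds the unsplit rest.
        System.out.println(Arrays.toString(splitWithLimit(text, 2)));  // [a, b,c,,]
        // Zero: split as often as possible and drop trailing empty strings.
        System.out.println(Arrays.toString(splitWithLimit(text, 0)));  // [a, b, c]
        // Negative: split as often as possible and keep trailing empty strings.
        System.out.println(splitWithLimit(text, -1).length);           // 5
    }
}
```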

Topics: Java regex