Pitfalls of Java String split

Posted by Ravi Kumar on Mon, 14 Oct 2019 15:34:12 +0200

1.1 Pitfalls of split

A few days ago at work I had to parse data files uploaded over FTP, which follow a prescribed format, and store the results in a database. The implementation idea is simple: read the file as a stream, parse each line into an object, and then persist it. I expected this to be straightforward, but when I tested the code I found problems in how each line was being split. This post is a short record of what I learned. In Java, the split method of String is the usual way to break a string apart on a separator, and text-processing code relies on it heavily. It is also easy to fall into a pit with it if you are not careful.

(1) The parameter of split is a regular expression
A common mistake is to forget that the parameter of String's split method is a regular expression, not a plain string. For example, neither of the following calls produces what we expect:

   /**
    * @author mghio
    * @date: 2019-10-13
    * @version: 1.0
    * @description: Java String split
    * @since JDK 1.8
    */
    public class JavaStringSplitTests {

        @Test
        public void testStringSplitRegexArg() {
            System.out.println(Arrays.toString("m.g.h.i.o".split(".")));
            System.out.println(Arrays.toString("m|g|h|i|o".split("|")));
        }

    }


The output of the above code is as follows:

[]
[m, |, g, |, h, |, i, |, o]

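The reason is that . and | are regular expression metacharacters: . matches any character, so every character is treated as a separator and only empty strings remain, and | is alternation between two empty patterns, so it matches at every position. To be treated literally they must be escaped: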

"m.g.h.i.o".split("\\.")
"m|g|h|i|o".split("\\|")

Other methods of the String class take a regular expression in the same way, such as replaceAll, replaceFirst and matches.
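
When the separator comes from configuration or user input, escaping it by hand is easy to get wrong. A minimal sketch of an alternative, assuming a hypothetical helper named splitLiteral (not part of the JDK), is to wrap the separator with Pattern.quote so that it is always matched literally:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LiteralSplitDemo {

    // Hypothetical helper: split on a separator taken literally, not as a regex.
    static String[] splitLiteral(String input, String separator) {
        return input.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("m.g.h.i.o", "."))); // [m, g, h, i, o]
        System.out.println(Arrays.toString(splitLiteral("m|g|h|i|o", "|"))); // [m, g, h, i, o]

        // replaceAll takes a regex, while replace(CharSequence, CharSequence) is literal.
        System.out.println("m.g.h.i.o".replaceAll(".", "-")); // ---------
        System.out.println("m.g.h.i.o".replace(".", "-"));    // m-g-h-i-o
    }
}
```

The last two lines show the same pit in replaceAll: the unescaped dot replaces every character, whereas replace treats its argument as plain text.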

(2) split drops trailing empty strings
Most of the time we use the single-parameter split method, but it has a pit: trailing empty strings are not included in the result, so the array ends at the last non-empty field. For example:

   /**
    * @author mghio
    * @date: 2019-10-13
    * @version: 1.0
    * @description: Java String split
    * @since JDK 1.8
    */
    public class JavaStringSplitTests {
            
        @Test
        public void testStringSplitSingleArg() {
            System.out.println(Arrays.toString("m_g_h_i_o".split("_")));
            System.out.println(Arrays.toString("m_g_h_i_o__".split("_")));
            System.out.println(Arrays.toString("m__g_h_i_o_".split("_")));
        }

    }

The output of the above code is as follows:

[m, g, h, i, o]
[m, g, h, i, o]
[m, , g, h, i, o]

The second and third results are not what we expect: fields in an uploaded file can legitimately be empty, so parsing such lines with the single-parameter split silently loses trailing fields. The API documentation shows that String has another split method with two parameters. The second parameter, limit, is an int that controls how many times the pattern is applied and therefore the length of the result array. A positive limit applies the pattern at most limit - 1 times; a limit of 0 applies it as many times as possible and discards trailing empty strings; a negative limit applies it as many times as possible and keeps them. The single-parameter split is equivalent to passing a limit of 0. To keep trailing empty strings, pass a negative limit (usually -1), and the output is consistent with our expectations:

    "m_g_h_i_o".split("_", -1)      // [m, g, h, i, o]
    "m_g_h_i_o__".split("_", -1)    // [m, g, h, i, o, , ]
    "m__g_h_i_o_".split("_", -1)    // [m, , g, h, i, o, ]

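The three kinds of limit can be compared side by side. The sketch below is only an illustration; the class name SplitLimitDemo and the helper parseFields are made up for this example:

```java
import java.util.Arrays;

public class SplitLimitDemo {

    // Hypothetical helper: keep every field of a record, including empty trailing ones.
    static String[] parseFields(String line) {
        return line.split("_", -1);
    }

    public static void main(String[] args) {
        String line = "m_g__";
        System.out.println(Arrays.toString(line.split("_")));    // limit 0:  [m, g]
        System.out.println(Arrays.toString(line.split("_", 2))); // limit 2:  [m, g__]
        System.out.println(Arrays.toString(parseFields(line)));  // limit -1: [m, g, , ]
        System.out.println(parseFields(line).length);            // 4
    }
}
```

For fixed-format records, checking the length of the array returned with a negative limit is a cheap way to detect malformed lines.
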
(3) Other APIs for string splitting in the JDK
The JDK also has a class called StringTokenizer that can split strings. The usage is as follows:

   /**
    * @author mghio
    * @date: 2019-10-13
    * @version: 1.0
    * @description: Java String split
    * @since JDK 1.8
    */
    public class JavaStringSplitTests {

        @Test
        public void testStringTokenizer() {
            StringTokenizer st = new StringTokenizer("This|is|a|mghio's|blog", "|");
            while (st.hasMoreElements()) {
                System.out.println(st.nextElement());
            }
        }

    }

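However, StringTokenizer has existed since JDK 1.0, and its javadoc describes it as a legacy class that is retained for compatibility and discouraged in new code. The split method of String (or the java.util.regex package) is recommended instead.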
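
Besides being a legacy class, StringTokenizer also behaves differently on empty fields: it never returns empty tokens. A small comparison, where the class TokenizerVsSplitDemo and its tokenize helper are assumptions made for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerVsSplitDemo {

    // Hypothetical helper: collect all tokens produced by StringTokenizer.
    static List<String> tokenize(String input, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "a||b|";
        System.out.println(tokenize(input, "|"));                  // [a, b]
        System.out.println(Arrays.asList(input.split("\\|", -1))); // [a, , b, ]
    }
}
```

So for data where empty fields matter, split with a negative limit is the safer choice of the two.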

1.2 JDK Source Exploration

Looking at the source code of the String class in the JDK, the single-parameter split(String regex) method simply calls the two-parameter split(String regex, int limit) method with limit set to 0. The two-parameter method splits the string around matches of the regular expression regex, and limit bounds the number of resulting substrings: when limit is positive and the string would split into more than limit parts, the first limit - 1 substrings are split normally and the last one contains all the remaining characters. The source code of the two-parameter method is as follows:

    public String[] split(String regex, int limit) {
        char ch = 0;
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

Next let's look at how String's split method is implemented.

(1) Fast-path check for special cases

    (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
  • When regex is a single character, it is assigned to ch and must not be one of the regex metacharacters .$|()[{^?*+\
  • When regex is two characters, the first must be a backslash (written as \\ in a Java string literal) and the second, assigned to ch, must not be an ASCII digit or an ASCII letter.
  • In both cases, ch must also lie outside the surrogate range from Character.MIN_HIGH_SURROGATE ('\uD800') to Character.MAX_LOW_SURROGATE ('\uDFFF').

When this check passes, the string is split with plain indexOf calls and no regex engine is involved; otherwise the method falls back to Pattern.compile(regex).split(this, limit).
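
The condition is dense, so it may help to restate it in a more readable form. The following is a sketch that mirrors the check in the listing above; usesFastPath and isAsciiLetterOrDigit are hypothetical names, not JDK methods:

```java
public class FastPathCheckDemo {

    // Regex metacharacters that disqualify a one-character separator.
    private static final String META_CHARS = ".$|()[{^?*+\\";

    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Returns true when the separator qualifies for the indexOf-based fast path.
    static boolean usesFastPath(String regex) {
        char ch;
        if (regex.length() == 1 && META_CHARS.indexOf(regex.charAt(0)) == -1) {
            ch = regex.charAt(0);              // plain single character
        } else if (regex.length() == 2 && regex.charAt(0) == '\\'
                && !isAsciiLetterOrDigit(regex.charAt(1))) {
            ch = regex.charAt(1);              // escaped character such as \. or \|
        } else {
            return false;
        }
        return !Character.isSurrogate(ch);     // surrogates go through the regex engine
    }

    public static void main(String[] args) {
        for (String regex : new String[]{"_", ".", "\\.", "\\d", "__"}) {
            System.out.println(regex + " -> " + usesFastPath(regex));
        }
    }
}
```

With these rules, "_" and "\\." take the fast path, while ".", "\\d" and "__" are handed to Pattern.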

(2) String segmentation
The loop uses two variables, off and next: off points to the start of the current segment and next points to the index of the next separator, and off is updated after each segment is cut. When limit is positive and the size of list reaches limit - 1, the rest of the string is added as the last element and the loop ends.

  • If the string does not contain the separator, the original string is returned as the only element.
  • After the loop, if there is no limit or the number of substrings has not reached limit, the remaining segment after the last separator is added.
  • If limit is equal to 0, trailing empty strings ("") are removed by walking backwards from the last element.
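
To see these steps in isolation, here is a standalone sketch of the same loop for a single separator character. It follows the structure of the JDK listing above but is a simplified illustration (the class SimpleCharSplitDemo is made up), not the JDK code itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SimpleCharSplitDemo {

    static String[] split(String input, char separator, int limit) {
        int off = 0;                       // start of the current segment
        int next;                          // index of the next separator
        boolean limited = limit > 0;
        List<String> list = new ArrayList<>();
        while ((next = input.indexOf(separator, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(input.substring(off, next));
                off = next + 1;
            } else {                       // limit reached: the rest is the last element
                list.add(input.substring(off));
                off = input.length();
                break;
            }
        }
        if (off == 0) {                    // separator never found
            return new String[]{input};
        }
        if (!limited || list.size() < limit) {
            list.add(input.substring(off)); // remaining segment after the last separator
        }
        int resultSize = list.size();
        if (limit == 0) {                  // drop trailing empty strings
            while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return list.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("m_g_h_i_o__", '_', 0)));  // [m, g, h, i, o]
        System.out.println(Arrays.toString(split("m_g_h_i_o__", '_', -1))); // [m, g, h, i, o, , ]
        System.out.println(Arrays.toString(split("m_g_h_i_o__", '_', 2)));  // [m, g_h_i_o__]
    }
}
```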

(3) Regular expression matching
When the fast path does not apply, String's split method relies on two classes, Pattern and Matcher, which back all of the regular expression operations in String.

  • Pattern represents a compiled regular expression. Its constructor is private, so an instance cannot be created directly; it is obtained through the static factory method Pattern.compile(String regex).
  • Matcher is the engine that interprets a Pattern and performs match operations on a character sequence. Its constructors are not public either, so an instance is obtained through the Pattern.matcher(CharSequence input) method.

The two-parameter split method of the String class finally uses the compile and split methods of the Pattern class, as follows:
    return Pattern.compile(regex).split(this, limit);

First, the static method compile of the Pattern class is called to obtain a Pattern object:

    public static Pattern compile(String regex) {
        return new Pattern(regex, 0);
    }

Next, the split(CharSequence input, int limit) method of Pattern is called. In this method, matcher(CharSequence input) returns a Matcher instance m, and the rest of the logic is similar to the fast path in String's split method:

  • It uses the m.find(), m.start() and m.end() methods to locate each separator
  • It updates the start and end positions whenever a separator is found
  • It then handles the cases where no separator is found, where the number of substrings is less than limit, and where limit = 0
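
The Matcher-based loop can be sketched in the same way. The code below is a simplified re-implementation written for this post under the assumptions just listed (the class MatcherSplitDemo is hypothetical); it is not the body of Pattern.split:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherSplitDemo {

    static String[] split(Pattern pattern, CharSequence input, int limit) {
        int index = 0;                     // start of the current segment
        boolean limited = limit > 0;
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            if (limited && parts.size() == limit - 1) {
                break;                     // the remainder becomes the last element
            }
            if (index == 0 && m.start() == 0 && m.end() == 0) {
                continue;                  // skip a zero-width match at the very beginning
            }
            parts.add(input.subSequence(index, m.start()).toString());
            index = m.end();
        }
        if (parts.isEmpty()) {             // no usable separator found
            return new String[]{input.toString()};
        }
        parts.add(input.subSequence(index, input.length()).toString());
        int size = parts.size();
        if (limit == 0) {                  // drop trailing empty strings
            while (size > 0 && parts.get(size - 1).isEmpty()) {
                size--;
            }
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        Pattern comma = Pattern.compile("\\s*,\\s*");
        System.out.println(Arrays.toString(split(comma, "a , b,c,,", 0)));  // [a, b, c]
        System.out.println(Arrays.toString(split(comma, "a , b,c,,", -1))); // [a, b, c, , ]
        System.out.println(Arrays.toString(split(comma, "a , b,c,,", 2)));  // [a, b,c,,]
    }
}
```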

1.3 Other String Segmentation Methods

  • Way 1: Use org.apache.commons.lang3.StringUtils#split, which takes the separator as plain characters rather than a regular expression (each character of the separator string is treated as a delimiter). Internally it calls the splitWorker method. Note: this method treats adjacent separators as one, so empty strings are dropped from the result.
   /**
    * @author mghio
    * @date: 2019-10-13
    * @version: 1.0
    * @description: Java String split
    * @since JDK 1.8
    */
    public class JavaStringSplitTests {

        @Test
        public void testApacheCommonsLangStringUtils() {
            System.out.println(Arrays.toString(StringUtils.split("m.g.h.i.o", ".")));
            System.out.println(Arrays.toString(StringUtils.split("m__g_h_i_o_", "_")));
        }

    }

Output results:

[m, g, h, i, o]
[m, g, h, i, o]
  • Way 2: Use com.google.common.base.Splitter from the Google Guava package, which provides richer options for processing the split results, such as trimming whitespace around each result and omitting empty strings.
   /**
    * @author mghio
    * @date: 2019-10-13
    * @version: 1.0
    * @description: Java String split
    * @since JDK 1.8
    */
    public class JavaStringSplitTests {

        @Test
        public void testGuavaSplitter() {
            Iterable<String> result = Splitter.on("_").split("m__g_h_i_o_");
            List<String> resultList = Lists.newArrayList();
            result.forEach(resultList::add);
            System.out.println("stringList's size: " + resultList.size());
            result.forEach(System.out::println);
        }

    }

Output results:

stringList's size: 7
m

g
h
i
o

1.4 Summary

In the String class, split is not the only method that takes a regular expression: the others, such as matches, replaceAll and replaceFirst, are also implemented by calling Pattern and Matcher. Every keyword in the JDK source code, such as final and private, is chosen carefully. Reading the javadoc of classes and methods more often and paying attention to these details is very helpful for reading and writing your own code.
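
One practical consequence: when the separator does not qualify for the fast path and the same regex is applied to many lines, it may be worth compiling the Pattern once and reusing it instead of going through the String methods each time. A minimal sketch, with the class PrecompiledPatternDemo and the separator format chosen purely for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PrecompiledPatternDemo {

    // Compiled once and reused for every line (semicolon with optional surrounding spaces).
    private static final Pattern SEPARATOR = Pattern.compile("\\s*;\\s*");

    // Same result as line.split("\\s*;\\s*", -1), without recompiling the regex.
    static String[] splitRecord(String line) {
        return SEPARATOR.split(line, -1);
    }

    // Same result as line.replaceAll("\\s*;\\s*", ";").
    static String normalize(String line) {
        return SEPARATOR.matcher(line).replaceAll(";");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRecord("a ; b;;"))); // [a, b, , ]
        System.out.println(normalize("a ; b;;"));                    // a;b;;
    }
}
```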

Topics: Java JDK Google ftp