A Java self-learner's fourth study note

Posted by Scud on Thu, 20 Jan 2022 11:20:37 +0100

This note is about regular expressions, which are very useful in programming.

A grammar defines a set of strings. For example, a grammar for uniform resource locators (URLs) represents the set of legal URLs in the HTTP protocol.

A literal string in a grammar is called a terminal. It is so called because it cannot be expanded any further. We usually write terminals in single quotes.

A grammar is described by a set of productions, each of which defines a nonterminal. A production has the following form:

nonterminal ::= expression of terminals, nonterminals, and operators

A grammar has one special nonterminal called the root, and every string in the grammar's set is generated from the root.

Grammar operators

1. Repetition (closure), represented by *:

x ::= y*        // x matches zero or more y

2. Concatenation, represented by a single space:

x ::= y z       // x matches y followed by z

3. Union, represented by |:

x ::= y | z     // x matches either y or z

4. Optional (0 or 1 occurrence), represented by ?:

x ::= y?       // an x is a y or is the empty string

5. One or more occurrences, represented by +:

x ::= y+       // an x is one or more y
               //    equivalent to  x ::= y y*

6. An exact number of occurrences, or a range of occurrences:

x ::= y{3}     // an x is three y
               // equivalent to x ::= y y y 

x ::= y{1,3}   // an x is between one and three y
               // equivalent to x ::= y | y y | y y y

x ::= y{,4}    // an x is at most four y
               // equivalent to x ::=   | y | y y | y y y | y y y y
               //                     ^--- note the empty string here, so this can match zero y's

x ::= y{2,}    // an x is two or more y
               // equivalent to x ::= y y y*

7. Character classes:

① Character class

x ::= [aeiou]  // equivalent to  x ::= 'a' | 'e' | 'i' | 'o' | 'u'
x ::= [a-ckx-z]    // equivalent to  x ::= 'a' | 'b' | 'c' | 'k' | 'x' | 'y' | 'z'

② Inverted character class

x ::= [^a-c]  // equivalent to  x ::= 'd' | 'e' | 'f' | ... | '0' | '1' | '2' | ... | '!' | '@'
              //                          | ... (all other possible characters)
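To check my understanding of the operators above, here is a minimal sketch that tries each one with java.util.regex. The class name GrammarOperatorsDemo and the helper fullMatch are my own, not from any library. Two differences from the grammar notation are worth keeping in mind: in a Java regex, concatenation is written by simply putting the pieces next to each other (a space would be matched as a literal space), and I write "at most four" as {0,4}, because as far as I know java.util.regex wants the lower bound spelled out.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are illustrative only.
class GrammarOperatorsDemo {
    // Returns true when the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a*", ""));          // closure: zero or more 'a'
        System.out.println(fullMatch("ab", "ab"));        // concatenation: no space in a regex
        System.out.println(fullMatch("a|b", "b"));        // union
        System.out.println(fullMatch("ab?", "a"));        // optional 'b'
        System.out.println(fullMatch("a+", ""));          // one or more: the empty string fails
        System.out.println(fullMatch("a{3}", "aaa"));     // exactly three
        System.out.println(fullMatch("a{1,3}", "aaaa"));  // one to three: four 'a' fails
        System.out.println(fullMatch("a{0,4}", ""));      // at most four, explicit lower bound
        System.out.println(fullMatch("a{2,}", "aaaaa"));  // two or more
        System.out.println(fullMatch("[a-ckx-z]", "k"));  // character class
        System.out.println(fullMatch("[^a-c]", "b"));     // inverted class: 'b' is excluded
    }
}
```

Only the "a+", "a{1,3}" and "[^a-c]" lines should report a failed match; the rest are meant to succeed.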

Recursion in grammars and parse trees work exactly as they do in formal language theory, so they are not repeated here.

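When a grammar can be reduced to a single production for the root, with only terminals and operators on the right-hand side, that reduced expression is called a regular expression, or regex for short. Because everything is packed into one expression, a regex is less readable than the original grammar, for example: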

// original grammar
url ::= 'http://' hostname (':' port)? '/' 
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+

// regex
url ::= 'http://' ([a-z]+ '.')+ [a-z]+ (':' [0-9]+)? '/'
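One way to get some of the grammar's readability back in Java is to build the regex string from named pieces. The following is only a sketch under my own naming (UrlRegexBuilder, WORD, PORT, HOSTNAME, URL and isUrl are all made up for this note); it mirrors the productions above one constant at a time.

```java
import java.util.regex.Pattern;

// Hypothetical helper: each constant mirrors one production of the grammar.
class UrlRegexBuilder {
    // word ::= [a-z]+
    static final String WORD = "[a-z]+";
    // port ::= [0-9]+
    static final String PORT = "[0-9]+";
    // hostname ::= (word '.')+ word   (the dot is escaped so it is matched literally)
    static final String HOSTNAME = "(" + WORD + "\\.)+" + WORD;
    // url ::= 'http://' hostname (':' port)? '/'
    static final String URL = "http://" + HOSTNAME + "(:" + PORT + ")?/";

    static final Pattern URL_PATTERN = Pattern.compile(URL);

    // True when the whole string is a URL according to this small grammar.
    static boolean isUrl(String s) {
        return URL_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(URL);                         // the assembled regex
        System.out.println(isUrl("http://mit.edu/"));    // hostname with one dot
        System.out.println(isUrl("http://localhost/"));  // no dot, so the grammar rejects it
    }
}
```

The assembled string is the same single-line regex as before, so nothing changes for the regex engine; the names only help the human reader.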

Regexes also have some extra special symbols:

.   // matches any single character (but sometimes excluding newline, depending on the regex library)

\d  // matches any digit, same as [0-9]
\s  // matches any whitespace character, including space, tab, newline
\w  // matches any word character including underscore, same as [a-zA-Z_0-9]
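A small sketch of these symbols in Java follows. The class and method names are hypothetical. It also shows the "sometimes excluding newline" remark for the dot: in java.util.regex the dot skips line terminators unless the Pattern.DOTALL flag is set.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the shorthand classes \d, \s, \w and the dot.
class ShorthandClassesDemo {
    // \d+ : the whole string consists of one or more digits.
    static boolean isDigits(String s) {
        return s.matches("\\d+");
    }

    // \s+ : split on runs of spaces, tabs, and newlines.
    static String[] splitOnWhitespace(String s) {
        return s.trim().split("\\s+");
    }

    // \w+ : letters, digits, and underscore only.
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    // Does '.' match a single newline, with or without the DOTALL flag?
    static boolean dotMatchesNewline(boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(".", flags).matcher("\n").matches();
    }

    public static void main(String[] args) {
        System.out.println(isDigits("2022"));
        System.out.println(String.join(",", splitOnWhitespace(" a \t b\nc ")));
        System.out.println(isWord("snake_case1"));
        System.out.println(dotMatchesNewline(false));
        System.out.println(dotMatchesNewline(true));
    }
}
```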

The backslash (\) is an escape: the character that follows it loses its special meaning as an operator and is matched literally against the text.

\.  \(  \)  \*  \+  \|  \[  \]  \\

Note that when you paste one of the backslash sequences above into a string literal in IDEA, an extra backslash is added in front of it. Don't delete it! The reason is explained in the MIT reading (taking \. as an example):

We want to match a literal period ., so we have to first escape it as \. to protect it from being interpreted as the regex match-any-character operator, and then we have to further escape it as \\. to protect the backslash from being interpreted as a Java string escape character. The frequent necessity for double-backslash escapes makes regexes still less readable.
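The double escaping can be checked with a short sketch. The class EscapeDemo and its methods are names I made up. The last method uses Pattern.quote, which turns a whole string into a literal pattern and can save some backslashes when no operators are needed at all.

```java
import java.util.regex.Pattern;

// Hypothetical demo of escaping a literal period in a Java regex.
class EscapeDemo {
    // Unescaped: '.' is the match-any-character operator.
    static boolean unescapedDot(String s) {
        return s.matches("a.b");
    }

    // Escaped: the Java literal "a\\.b" holds the four characters a \ . b,
    // which the regex engine reads as 'a', a literal period, 'b'.
    static boolean escapedDot(String s) {
        return s.matches("a\\.b");
    }

    // Pattern.quote treats every character of its argument as a literal.
    static boolean quotedDot(String s) {
        return s.matches(Pattern.quote("a.b"));
    }

    public static void main(String[] args) {
        System.out.println("a\\.b".length());     // length of the regex as the engine sees it
        System.out.println(unescapedDot("axb"));  // any character is accepted in the middle
        System.out.println(escapedDot("axb"));    // only a real period is accepted
        System.out.println(escapedDot("a.b"));
        System.out.println(quotedDot("a.b"));
    }
}
```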

In Java, you can use regular expressions to manipulate strings.

Replace every run of consecutive spaces in string s with a single space:

String singleSpacedString = s.replaceAll(" +", " ");
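The pattern " +" only touches space characters. If tabs and newlines should be collapsed too, \s+ is one option. The sketch below compares the two; WhitespaceNormalizer and both method names are hypothetical, and the trim() at the end is my own addition rather than part of the original one-liner.

```java
// Hypothetical helper comparing " +" with "\\s+".
class WhitespaceNormalizer {
    // Collapses runs of the space character only, as in the line above.
    static String collapseSpaces(String s) {
        return s.replaceAll(" +", " ");
    }

    // Collapses runs of any whitespace (space, tab, newline) and trims the ends.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("a   b  c"));
        System.out.println(collapseWhitespace("  a \t b\n"));
    }
}
```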

Matching URL:

if (s.matches("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/")) {
    // then s is a url
}
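String.matches only succeeds when the entire string fits the regex. To look for a URL somewhere inside a longer text, Matcher.find can be used instead. A minimal sketch of the difference, with a class name and method names of my own choosing:

```java
import java.util.regex.Pattern;

// Hypothetical demo of whole-string matching versus searching.
class MatchVersusFind {
    static final Pattern URL =
            Pattern.compile("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/");

    // True only when the entire string is a URL.
    static boolean isWholeUrl(String s) {
        return URL.matcher(s).matches();
    }

    // True when a URL occurs anywhere inside the string.
    static boolean containsUrl(String s) {
        return URL.matcher(s).find();
    }

    public static void main(String[] args) {
        String text = "see http://mit.edu/ for details";
        System.out.println(isWholeUrl(text));
        System.out.println(containsUrl(text));
    }
}
```

Compiling the Pattern once and reusing it also avoids recompiling the regex on every call, which s.matches(...) does each time.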

Extract the parts of a date such as "2021-07-09":

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "2021-07-09";
Pattern regex = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
Matcher m = regex.matcher(s);
if (m.matches()) {
    String year = m.group("year");
    String month = m.group("month");
    String day = m.group("day");
    // Matcher.group(name) returns the part of s that matched (?<name>...)
}
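The same named groups also work when several dates are buried in a longer text. The sketch below loops with Matcher.find and collects every date as {year, month, day}. DateExtractor and extractDates are hypothetical names, and the code assumes the date pattern from above; it does not check that the month and day are in a valid range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that pulls every yyyy-MM-dd date out of a text.
class DateExtractor {
    static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Returns one int[] {year, month, day} per date found, in order of appearance.
    static List<int[]> extractDates(String text) {
        List<int[]> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {  // each call moves on to the next occurrence
            dates.add(new int[] {
                Integer.parseInt(m.group("year")),
                Integer.parseInt(m.group("month")),
                Integer.parseInt(m.group("day"))
            });
        }
        return dates;
    }

    public static void main(String[] args) {
        for (int[] d : extractDates("Posted 2022-01-20, updated 2021-07-09.")) {
            System.out.println(d[0] + " / " + d[1] + " / " + d[2]);
        }
    }
}
```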

Topics: Java