The minimum necessary knowledge of regular expressions

Posted by MasterACE14 on Sun, 02 Jan 2022 08:01:27 +0100

As a small language dedicated to one domain, regular expressions hold a firm place in string processing. With a compact syntax they handle tasks such as format validation, splitting, replacement and search, which greatly simplifies complex string handling.

This article summarizes the basic syntax of regular expressions so that you can pick them up quickly. Once you know these basics, the vast majority of everyday string-processing tasks become manageable. Why not give it a try?

The basic syntax of regular expressions falls into the following parts:

1. Metacharacters, which play a role similar to keywords in a programming language

2. Regular expression matching patterns

3. Grouping and referencing

4. Various functional modes

5. Various positioning anchor points

6. Escaping of the various metacharacters

1. Metacharacter

Metacharacters are characters that carry a special meaning in regular expressions. Like keywords in a programming language, they simply have to be memorized.

1. Metacharacters that match a single character

.   Matches any character (except line breaks)
\d  Matches any digit                         \D  Matches any non-digit
\w  Matches any letter, digit or underscore   \W  Matches any character other than a letter, digit or underscore
\s  Matches any whitespace character          \S  Matches any non-whitespace character
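As a small illustration of these single-character metacharacters, here is a sketch using Python's re module; the sample strings are made up for the example.

```python
import re

text = "Order 66: user_1 paid\t$42"

digits = re.findall(r"\d", text)          # every single digit
word_chars = re.findall(r"\w", "a_1 -!")  # letters, digits and the underscore
blanks = re.findall(r"\s", text)          # spaces and the tab
any_char = re.findall(r".", "a\nb")       # '.' skips the line break

print(digits)      # ['6', '6', '1', '4', '2']
print(word_chars)  # ['a', '_', '1']
print(any_char)    # ['a', 'b']
```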

2. Metacharacters that match whitespace

\r  Carriage return
\n  Newline
\f  Form feed
\t  Tab
\v  Vertical tab
\s  Any whitespace character

3. Metacharacters that indicate quantity

*      Matches 0 or more times
+      Matches 1 or more times
?      Matches 0 or 1 time
{m}    Matches exactly m times
{m,}   Matches at least m times
{m,n}  Matches from m to n times
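The quantifiers above can be compared side by side. The sketch below assumes Python's re module; the sample words and the helper name `matching` are made up for the illustration.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ca*t")         # zero or more 'a'
plus = matching(r"ca+t")         # one or more 'a'
optional = matching(r"ca?t")     # zero or one 'a'
exact = matching(r"ca{2}t")      # exactly two 'a'
at_least = matching(r"ca{2,}t")  # two or more 'a'
between = matching(r"ca{1,2}t")  # one or two 'a'

print(optional)  # ['ct', 'cat']
print(at_least)  # ['caat', 'caaat']
```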

4. Metacharacters that express alternatives and ranges

|       Or; for example, ab|bc matches ab or bc
[...]   One of several: any single character listed in the brackets
[a-z]   Any single character from a to z, including a and z
[^...]  Negation: any single character not listed in the brackets
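A short sketch of alternation and character classes in Python follows; the input strings are arbitrary examples.

```python
import re

either = re.findall(r"ab|bc", "ab bc ac")   # ab or bc
vowels = re.findall(r"[aeiou]", "regex")    # any one listed character
lower = re.findall(r"[a-z]", "aZ9z")        # range, both ends included
non_digits = re.findall(r"[^0-9]", "a1b2")  # anything except the listed range

print(either)      # ['ab', 'bc']
print(non_digits)  # ['a', 'b']
```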

2. Regular expression matching pattern

Regular expressions have three main matching modes: greedy mode, non-greedy mode, and exclusive mode (more commonly called possessive matching).

Greedy mode matches as much as possible, while non-greedy mode matches as little as possible. Greedy mode is the default; to switch a quantifier to non-greedy mode, append ? to it.

ceshi*    Greedy matching
ceshi*?   Non-greedy matching
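The difference is easiest to see on a concrete string. This is a minimal sketch in Python; the input strings are assumptions chosen for the demonstration.

```python
import re

# The article's own patterns on a sample string.
greedy_word = re.search(r"ceshi*", "ceshiii").group()  # takes every 'i'
lazy_word = re.search(r"ceshi*?", "ceshiii").group()   # takes no 'i' at all

# A more typical case: pulling tags out of markup.
html = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.+>", html)  # one match from first '<' to last '>'
lazy_tags = re.findall(r"<.+?>", html)   # stops at the nearest '>'

print(greedy_word, lazy_word)  # ceshiii cesh
print(lazy_tags)               # ['<b>', '</b>', '<i>', '</i>']
```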

Both greedy and non-greedy matching rely on backtracking. In some scenarios, however, backtracking is not needed: if the first attempt does not match, the match may simply fail. For this there is a third mode, the exclusive mode, which behaves like greedy matching but never backtracks, so it can perform better in some cases.

Like greedy mode, exclusive mode matches as much as possible, but once matching fails it stops without backtracking, which saves time. To use it, append a plus sign (+) to the quantifier. Go does not support exclusive mode, and Python's re module supports it only from version 3.11 onward.

In other words, when + follows a quantifier, that quantifier takes the longest match it can and never gives characters back; if the rest of the pattern then cannot match, the whole match fails.

^  Marks the beginning of the pattern
$  Marks the end of the pattern

abb{1,3}ab    Greedy mode
abb{1,3}?ab   Non-greedy mode
abb{1,3}+ab   Exclusive mode
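To see exclusive (possessive) matching actually change a result, the pattern has to need backtracking. The sketch below uses a hypothetical pattern, xy{1,3}yz, chosen for that reason. It assumes Python, where the re module accepts the `+` suffix only from version 3.11, so the possessive attempt is guarded by a version check.

```python
import re
import sys

text = "xyyz"

# Greedy: y{1,3} first takes "yy", then gives one 'y' back so "yz" can match.
greedy = re.search(r"xy{1,3}yz", text)

# Possessive: y{1,3}+ keeps "yy" and never gives it back, so "yz" cannot match.
if sys.version_info >= (3, 11):
    possessive = re.search(r"xy{1,3}+yz", text)
else:
    possessive = None  # syntax not available in this interpreter

print(greedy.group())  # xyyz
print(possessive)      # None
```

On interpreters older than 3.11 the guarded branch is skipped, so `possessive` is None there for a different reason: the syntax is unavailable.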

3. Grouping and referencing

In a regular expression, parentheses () turn part of the pattern into a group that is treated as a unit and can be referenced later in the pattern or in a replacement. Grouping is the main use of parentheses.

Groups are either saved (capturing) or not saved (non-capturing). A saved group is kept for reuse, while an unsaved group merely treats its contents as a unit, which can improve performance. Use ?: to define a group that is not saved.

Saved subgroup      (regex)      \d{8}(\d{3})?
Unsaved subgroup    (?:regex)    \d{8}(?:\d{3})?
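The two patterns above match the same text and differ only in what is kept afterwards. A minimal Python sketch, with an arbitrary 11-digit sample:

```python
import re

number = "12345678901"

saved = re.fullmatch(r"\d{8}(\d{3})?", number)      # group is stored
unsaved = re.fullmatch(r"\d{8}(?:\d{3})?", number)  # group is only a unit

print(saved.groups())    # ('901',)
print(unsaved.groups())  # ()
```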

Groups are numbered in the order of their opening parentheses. With nested parentheses, count the left parentheses to find the number of a given group.
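The numbering rule can be checked with a nested pattern. This sketch assumes Python; the date-like pattern is invented for the illustration.

```python
import re

# Left parentheses, in order: 1 = year-month, 2 = year, 3 = month, 4 = day
m = re.search(r"((\d{4})-(\d{2}))-(\d{2})", "2021-06-25")

print(m.group(1))  # 2021-06
print(m.group(2))  # 2021
print(m.group(3))  # 06
print(m.group(4))  # 25
```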

Changing the number of parentheses often shifts the group numbers, so a group can be given a name instead, which is easier to find and less error-prone. A named group is written as follows:

(?P<name>regex)

Note that a named group is still numbered in the same way as an unnamed one; the name is simply an alternative to the number. Not every language supports named groups, however.
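Python supports the (?P<name>...) form directly. The sketch below shows that a named group can be read by name or by number; the group names `year` and `month` are chosen for the example.

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2021-06")

by_name = m.group("year")  # access by name
by_number = m.group(1)     # the same group still has a number
as_dict = m.groupdict()    # all named groups at once

print(by_name, by_number)  # 2021 2021
print(as_dict)             # {'year': '2021', 'month': '06'}
```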

In most contexts a group is referenced with a backslash followed by its number, but some tools use $ followed by the number instead.

\1  References the first group
$1  References the first group; Notepad++ and JavaScript use this form in replacements

How a named group is referenced differs between languages:

        When searching   When replacing
.NET    \k<name>         ${name}
PHP     (?P=name)        Not supported
Python  (?P=name)        \g<name>
Ruby    \k<name>         \k<name>
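The Python forms can be tried on a string with repeated words. This is a sketch only: the sample sentence and the group name `word` are assumptions, and \b is added so that only whole words are compared.

```python
import re

text = "the the cat sat sat down"

# Numbered reference: \1 in the pattern and in the replacement
numbered = re.sub(r"\b(\w+) \1\b", r"\1", text)

# Named reference: (?P=word) in the pattern, \g<word> in the replacement
named = re.sub(r"\b(?P<word>\w+) (?P=word)\b", r"\g<word>", text)

print(numbered)  # the cat sat down
print(named)     # the cat sat down
```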

4. Various functional modes

The functional modes introduced here differ from the matching patterns above and serve a different purpose. They work like feature switches: when a mode is turned on, the regular expression behaves differently. There are four main ones: case-insensitive mode, single-line mode, multiline mode, and annotation mode.

1. Case insensitive mode

Placing the mode modifier (?i) at the front of a regular expression makes the entire expression case-insensitive.

(?i)regular expression

(?i)(dog) \1   Matches a repeated dog, ignoring case (the two words may differ in case)
((?i)dog) \1   Matches dog ignoring case, but the second word must repeat the first in the same case

If the modifier is placed inside parentheses, it applies only to the pattern inside those parentheses, not to the whole regular expression.
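The scope of the modifier can be checked in Python. One caveat: recent Python versions reject a bare (?i) that is not at the start of the pattern, so the sketch writes the scoped form as (?i:...), available since Python 3.6. That is Python's spelling of the same idea, not the notation used above.

```python
import re

whole = re.compile(r"(?i)(dog) \1")     # flag applies to everything, \1 included
scoped = re.compile(r"((?i:dog)) \1")   # flag applies only inside the group

mixed_case_whole = whole.fullmatch("dog DOG") is not None
same_case_scoped = scoped.fullmatch("DOG DOG") is not None
mixed_case_scoped = scoped.fullmatch("DOG dog") is not None

print(mixed_case_whole)   # True
print(same_case_scoped)   # True
print(mixed_case_scoped)  # False
```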

2. Single line matching mode

In single-line mode, the dot (.) matches any character, including line breaks. The mode modifier is (?s).

(?s).+

Note that JavaScript does not support the inline (?s) modifier; since ES2018 the same behavior is available through the s (dotAll) flag instead.
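A quick Python check of what (?s) changes, using a made-up two-line string:

```python
import re

text = "first\nsecond"

default = re.match(r".+", text).group()          # stops at the line break
single_line = re.match(r"(?s).+", text).group()  # runs across it

print(repr(default))      # 'first'
print(repr(single_line))  # 'first\nsecond'
```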

3. Multiline mode

In multiline mode, ^ and $ match at the beginning and end of every line. The mode modifier is (?m).

(?m)^the|cat$

Two further metacharacters, \A and \z (written \Z in Python), are easy to confuse with ^ and $: \A matches only at the beginning of the whole string and \z only at its end, whether or not multiline mode is on.
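The following Python sketch compares the default behavior, multiline mode, and the whole-string anchors on an invented two-line string.

```python
import re

text = "the little cat\nthe small cat"

default = re.findall(r"^the|cat$", text)             # string start and end only
multi = re.findall(r"(?m)^the|cat$", text)           # every line start and end
whole_string = re.findall(r"(?m)\Athe|cat\Z", text)  # unaffected by (?m)

print(default)       # ['the', 'cat']
print(multi)         # ['the', 'cat', 'the', 'cat']
print(whole_string)  # ['the', 'cat']
```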

4. Annotation mode

The modifier for annotation mode is (?#comment). It is used to annotate complex regular expressions so that they are easier to understand.

(\w+)(?#word) \1(?#word show again)
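Python's re module accepts the (?#...) syntax, so the pattern above can be run as written. The sample sentence is an assumption.

```python
import re

# The (?#...) parts are ignored by the engine; they only document the pattern.
m = re.search(r"(\w+)(?#word) \1(?#word show again)", "say hello hello")

print(m.group())   # hello hello
print(m.group(1))  # hello
```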

5. Various positioning anchor points

Anchors in regular expressions match positions, not the text itself.

1. Word boundary

In regular expressions, \b represents a word boundary.

\bdog      Words that begin with dog
dog\b      Words that end with dog
\bdog\b    The word dog on its own
\b\w+\b    Any whole word
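A Python sketch of the boundary patterns follows. The word list is made up, and \w* is added on one side so that each match shows the whole word rather than only the letters dog.

```python
import re

text = "dog dogma bulldog hotdogs"

starts = re.findall(r"\bdog\w*", text)  # words beginning with dog
ends = re.findall(r"\w*dog\b", text)    # words ending with dog
whole = re.findall(r"\bdog\b", text)    # dog as a word of its own
words = re.findall(r"\b\w+\b", text)    # every whole word

print(starts)  # ['dog', 'dogma']
print(ends)    # ['dog', 'bulldog']
print(whole)   # ['dog']
```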

2. Start and end of line

Line positions are marked by ^ and $: ^ matches at the beginning of a line and $ matches at the end.

3. Look around

A lookaround is a condition that the text before or after the match must (or must not) satisfy.

(?<=Y)   The text on the left matches the regular expression Y
(?<!Y)   The text on the left does not match the regular expression Y
(?=Y)    The text on the right matches the regular expression Y
(?!Y)    The text on the right does not match the regular expression Y

An angle bracket means the left side, no angle bracket means the right side, and an exclamation mark in place of the equals sign means negation.

(?<!\d)[1-9]\d{5}(?!\d)   Six digits, the first from 1 to 9, with no digit on either side (postal code)
(?<=\W)\w+(?=\W)          A word with a non-word character on each side

Although a lookaround is written with parentheses, it is not saved as a subgroup. A saved subgroup holds matched text for later operations such as replacement, whereas a lookaround states a requirement on the text to the left or right: it matches a position only and consumes no text.
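Both example patterns can be tried in Python. The input strings below are invented; the second check also shows that the lookarounds leave no saved groups behind.

```python
import re

text = "code 100084, phone 13800138000"

# Six digits with no digit on either side: the 11-digit number is skipped.
postal = re.findall(r"(?<!\d)[1-9]\d{5}(?!\d)", text)

# The brackets must be there, but they are not part of the match.
m = re.search(r"(?<=\W)\w+(?=\W)", "(word)")

print(postal)      # ['100084']
print(m.group())   # word
print(m.groups())  # ()
```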

6. Escape use of various metacharacters

In programming, when an escape character appears in a character sequence, it changes how the characters after it are interpreted. Whether a character acts as an escape character generally depends on context; the escape character is the one that marks the start of an escape sequence.

In regular expressions, too, the backslash is the escape character.

// Escaping metacharacters
\*  \+  \?  \(  \)  \[  \]  \{  \}
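In Python the escapes can be written by hand or produced with re.escape, which is part of the standard re module. The sample text is an assumption.

```python
import re

# Hand-written escapes: every metacharacter is matched literally.
by_hand = re.findall(r"\(\d\+\d\)\?", "what is (1+2)?")

# re.escape adds the backslashes for an arbitrary literal string.
escaped = re.escape("(1+2)?")
by_function = re.findall(escaped, "what is (1+2)?")

print(by_hand)  # ['(1+2)?']
print(escaped)  # \(1\+2\)\?
```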

There are three situations in which a character inside a character group needs to be escaped.

1. A caret in first position inside the square brackets

[^ab]   Unescaped: negation
[\^ab]  Escaped: an ordinary character

2. A hyphen inside the brackets that is neither first nor last

[a-z]   A range
[-ac]   No escape needed at the beginning
[ac-]   No escape needed at the end
[a\-z]  Escape needed in the middle

3. A closing bracket inside the square brackets that is not in first position

[]ab]   Unescaped closing bracket in first position: an ordinary character
[a]b]   Unescaped closing bracket not in first position: it ends the character group
[a\]b]  Escaped: an ordinary character
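These rules can be verified in Python, whose re module follows them as described; other engines may treat some cases, such as a leading closing bracket, differently. The input strings are arbitrary.

```python
import re

negated = re.findall(r"[^ab]", "a^c")        # caret first: negation
caret = re.findall(r"[\^ab]", "a^c")         # escaped caret: literal
hyphen_mid = re.findall(r"[a\-z]", "a-mz")   # escaped hyphen: literal
hyphen_end = re.findall(r"[ac-]", "a-bc")    # trailing hyphen: literal
bracket_first = re.findall(r"[]ab]", "a]c")  # leading ']' is literal
bracket_later = re.findall(r"[a]b]", "ab] a]")  # parsed as [a] followed by b]
bracket_esc = re.findall(r"[a\]b]", "a]c")   # escaped ']' is literal

print(negated)        # ['^', 'c']
print(hyphen_mid)     # ['a', '-', 'z']
print(bracket_later)  # ['ab]']
```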

Generally speaking, a metacharacter must be escaped to be taken literally, but inside a character group most of them need no escape.

Inside a character group, single-character metacharacters such as . * + ? ( ) lose their special meaning and stand for the character itself. Two-character metacharacters such as \d or \w, however, keep their meaning inside the brackets.
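A brief Python check of both statements, with made-up inputs:

```python
import re

# Inside brackets these are plain characters, no backslash needed.
literals = re.findall(r"[.*+?()]", "a.b*c(d)")

# \d keeps its meaning inside brackets; the dot next to it is literal.
digits_and_dot = re.findall(r"[\d.]", "v1.2")

print(literals)        # ['.', '*', '(', ')']
print(digits_and_dot)  # ['1', '.', '2']
```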

Once you have mastered these six points, you have the basics of regular expressions. The rest is practice: as with any programming language, practice makes perfect.

7. Common regular expressions

The following are regular expressions commonly used in string matching, for your reference.

Numbers of various kinds

[-+]?\d+(?:\.\d+)?    Matches positive and negative numbers, with or without a decimal part
[1-9]\d*|0            Matches non-negative integers
-[1-9]\d*|0           Matches non-positive integers
-?\d+(?:\.\d+)?|\+?(?:\d+(?:\.\d+)?|\.\d+)   Matches floating-point numbers

ID card number

[1-9]\d{14}(\d\d[0-9Xx])?

Postal Code

(?<!\d)[1-9]\d{5}(?!\d)

Chinese characters

[\u4E00-\u9FFF]

IPv4 address

(?:1\d\d|2[0-4]\d|25[0-5]|0?[1-9]\d|0{0,2}\d)(?:\.(?:1\d\d|2[0-4]\d|25[0-5]|0?[1-9]\d|0{0,2}\d)){3}
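An IPv4 pattern of this kind is easier to read when the octet is written once and reused. The sketch below assumes the intended form is one in which every octet accepts the same five alternatives. The names `OCTET`, `IPV4` and `is_ipv4` are made up for the example.

```python
import re

# Assumed octet: 100-199, 200-249, 250-255, 10-99 (optional leading 0),
# or a single digit with up to two leading zeros.
OCTET = r"(?:1\d\d|2[0-4]\d|25[0-5]|0?[1-9]\d|0{0,2}\d)"
IPV4 = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")

def is_ipv4(text):
    """Return True when the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.1.1"))  # True
print(is_ipv4("256.1.1.1"))    # False
```

Using fullmatch rather than search matters here: with search, a valid-looking address embedded in a longer run of digits would still be found.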

Date and time

Date, for example 2021-06-25
\d{4}-(?:1[0-2]|0?[1-9])-(?:[12]\d|3[01]|0?[1-9])
Time, for example 23:30
(?:2[0-3]|1\d|0?\d):(?:[1-5]\d|0?\d)

Email address

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
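The date, time and email patterns from this section can be used as validators with re.fullmatch. This is a minimal sketch in Python; the constant names and the helper `is_valid` are assumptions for the example.

```python
import re

DATE = re.compile(r"\d{4}-(?:1[0-2]|0?[1-9])-(?:[12]\d|3[01]|0?[1-9])")
TIME = re.compile(r"(?:2[0-3]|1\d|0?\d):(?:[1-5]\d|0?\d)")
EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

def is_valid(pattern, text):
    """Return True when the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None

print(is_valid(DATE, "2021-06-25"))         # True
print(is_valid(TIME, "24:00"))              # False
print(is_valid(EMAIL, "user@example.com"))  # True
```

These patterns check shape only: the date pattern, for instance, accepts 2021-02-31, so a real calendar check still needs a date library.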
