Python 3 regular expressions

Posted by ravnen on Sun, 06 Mar 2022 14:28:40 +0100

re module

1. re.match function

re.match attempts to match from the starting position of a string. If it cannot match from the starting position, it returns None.

Syntax: re.match(pattern, string, flags=0)

Parameters:

pattern -- regular expression

string -- the string to match

flags -- optional flags that control how the pattern is matched, such as case-insensitive or multi-line matching. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags. Common values are as follows:

Modifier    Description
re.I        Makes matching case insensitive
re.L        Makes matching locale-aware (in Python 3 this flag is only valid with bytes patterns)
re.M        Multi-line matching; affects ^ and $
re.S        Makes . match any character, including line breaks
re.U        Interprets characters according to the Unicode character set; affects \w, \W, \b, \B (already the default for str patterns in Python 3)
re.X        Allows a more flexible, verbose format so that regular expressions are easier to read
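
As a small illustration of combining flags with |, here is a sketch that uses re.findall with re.I and re.M together. The sample text and variable names are made up for this example.

```python
import re

text = "first line\nSECOND line\nthird line"

# Without flags, ^ only matches at the very start of the string
# and matching is case sensitive, so nothing is found.
no_flags = re.findall(r"^second", text)

# re.I ignores case; re.M lets ^ match at the start of every line.
with_flags = re.findall(r"^second", text, flags=re.I | re.M)

print(no_flags)    # []
print(with_flags)  # ['SECOND']
```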

Return value: if the matching is successful, the matching object is returned; otherwise, None is returned.

You can use group(num) or groups() on the match object to get the matched strings:

Match object method    Description
group(num=0)           Returns the string matched by the whole pattern (num=0) or by group num. group() can also take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()               Returns a tuple containing the strings of all groups, from group 1 to the last group.

example:

import re

strlist = re.match('www', 'www.com.cn', flags=0)

print(strlist)
print(type(strlist))
print(strlist.span())
print(strlist.group(0))
print(strlist.groups())

Output:

<re.Match object; span=(0, 3), match='www'>
<class 're.Match'>
(0, 3)
www
() 
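
The pattern above has no capturing groups, which is why groups() returns an empty tuple. The following sketch adds two groups to show what group(num), group with several numbers, and groups() return; the pattern itself is only an illustration.

```python
import re

# Two capturing groups separated by a literal dot.
m = re.match(r"(\w+)\.(\w+)", "www.com.cn")

print(m.group(0))     # www.com
print(m.group(1))     # www
print(m.group(1, 2))  # ('www', 'com')
print(m.groups())     # ('www', 'com')
```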

2. re.search function

Scans the string and returns a match object for the first position where the pattern matches.

Syntax: re.search(pattern, string, flags=0)

Parameters:

pattern -- regular expression

string -- the string to search

flags -- flags, the same as for re.match

Return value: if the matching is successful, the matching object is returned; otherwise, None is returned.

example:

import re

# Match an email address
line = "abc test@baidu.com.cn mm"
pattern = r'\b[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}\b'
matchobj = re.search(pattern, line, flags=0)
print(type(matchobj))
print(matchobj.group(0))

Output:

<class 're.Match'>
test@baidu.com.cn

3. The difference between re.match and re.search

re.match only matches at the beginning of the string. If the pattern does not match there, it returns None;

re.search scans the whole string and returns the first match it finds.
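
A short sketch of the difference, reusing the 'www.com.cn' string from the first example (the variable names are chosen for this illustration):

```python
import re

text = "www.com.cn"

at_start = re.match(r"com", text)    # 'com' is not at position 0
anywhere = re.search(r"com", text)   # search scans forward until it matches

print(at_start)          # None
print(anywhere.span())   # (4, 7)
```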

4. Retrieval and replacement

re.sub replaces matches of the pattern in a string.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern -- regular expression

repl -- the replacement string; it can also be a function

string -- the original string to search and replace in

count -- the maximum number of substitutions to make; 0 (the default) replaces all matches. Optional

flags -- flags, optional

example:

import re

# Replace
phone = "This is a telephone number: 1234-556-778 #The place of ownership is xx"
# Match the phone number
patt1 = r'[0-9]{4}-[0-9]{3}-[0-9]{3}'
phone_no = re.search(patt1, phone, flags=0)
print("Telephone number:", phone_no.group(0))
# Remove every non-digit character (here, the hyphens)
patt2 = r'\D'
phone_num = re.sub(patt2, "", phone_no.group(0), count=0, flags=0)
print(type(phone_num))
print("Tel:", phone_num)

Output:

Telephone number: 1234-556-778
<class 'str'>
Tel: 1234556778
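
To see what the count parameter does, here is a minimal sketch that removes hyphens from the same phone number, once with the default count and once with count=1. The variable names are made up for this example.

```python
import re

number = "1234-556-778"

all_removed = re.sub(r"-", "", number)              # count=0: replace every match
first_removed = re.sub(r"-", "", number, count=1)   # replace only the first match

print(all_removed)    # 1234556778
print(first_removed)  # 1234556-778
```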

If repl is a function, for example:

import re

#Replace with function
# Multiply the matching number by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)


s = 'A23G4HFD567'
# (?P<value>...) defines a group named value; the pattern it matches is the \d+ that follows
print(re.sub(r'(?P<value>\d+)', double, s))

Output:

A46G8HFD1134

5. re.compile

The compile function compiles a regular expression and returns a regular expression (Pattern) object, whose match() and search() methods can then be used.

Syntax: re.compile(pattern[, flags])

Parameters:

pattern -- regular expression

flags -- optional, indicating the matching mode, such as ignoring case or multi-line mode. The options are as follows:

    • re.I ignores case
    • re.L makes the special character sets \w, \W, \b, \B, \s, \S depend on the current locale
    • re.M multi-line mode
    • re.S makes '.' match any character, including the newline character (by default '.' does not match a newline)
    • re.U makes the special character sets \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database
    • re.X ignores whitespace in the pattern and comments after '#', to improve readability
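
The re.X flag is easiest to understand with a pattern written across several lines. The sketch below rewrites a phone number pattern in verbose form; the pattern layout and the name phone_pattern are assumptions for illustration only.

```python
import re

# With re.X, whitespace in the pattern is ignored and '#' starts a comment.
phone_pattern = re.compile(r"""
    (\d{4})   # first block: four digits
    -         # separator
    (\d{3})   # second block: three digits
    -         # separator
    (\d{3})   # third block: three digits
""", re.X)

m = phone_pattern.search("Telephone number: 1234-556-778")
print(m.groups())  # ('1234', '556', '778')
```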

Example:

>>> import re
>>> pattern = re.compile(r'\d+')                    # Matches one or more digits
>>> m = pattern.match('one12twothree34four')        # Match at the start of the string: no match
>>> print( m )
None
>>> m = pattern.match('one12twothree34four', 2, 10) # Start matching at the position of 'e': no match
>>> print( m )
None
>>> m = pattern.match('one12twothree34four', 3, 10) # Start matching at the position of '1': match
>>> print( m )                                        # Returns a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # 0 can be omitted
'12'
>>> m.start(0)   # 0 can be omitted
3
>>> m.end(0)     # 0 can be omitted
5
>>> m.span(0)    # 0 can be omitted
(3, 5)

When the match succeeds, a Match object is returned, where:

  • The group([group1, ...]) method returns the string matched by one or more groups. To get the whole matched substring, use group() or group(0);
  • The start([group]) method returns the starting position, within the whole string, of the substring matched by the group (the index of the first character of the substring). The default value of the parameter is 0;
  • The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the last character of the substring + 1). The default value of the parameter is 0;
  • The span([group]) method returns (start(group), end(group)).
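
Because end() is the index of the last character + 1, start() and end() can be used directly as slice bounds. A minimal sketch, using the same sample string as the session above:

```python
import re

text = "one12twothree34four"
m = re.search(r"\d+", text)

# Slicing the original string with start()/end() gives back group().
sliced = text[m.start():m.end()]

print(m.span())             # (3, 5)
print(sliced)               # 12
print(sliced == m.group())  # True
```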

Here is another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')
>>> print( m )                            # The match succeeds, so a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # Returns the entire substring that matches successfully
'Hello World'
>>> m.span(0)                             # Returns the index of the entire substring that matches successfully
(0, 11)
>>> m.group(1)                            # Returns the substring matched by the first group
'Hello'
>>> m.span(1)                             # Returns the index range of the substring matched by the first group
(0, 5)
>>> m.group(2)                            # Returns the substring matched by the second group
'World'
>>> m.span(2)                             # Returns the index range of the substring matched by the second group
(6, 11)
>>> m.groups()                            # Equivalent to (m.group(1), m.group(2),...)
('Hello', 'World')
>>> m.group(3)                            # There is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

6. re.findall

Finds all substrings in the string that match the regular expression and returns them as a list. If the pattern contains multiple groups, a list of tuples is returned. If no match is found, an empty list is returned.

Syntax: re.findall(pattern, string, flags=0) or pattern.findall(string[, pos[, endpos]])

Parameters:

pattern -- regular expression

string -- the string to search

flags -- flags, optional

pos -- optional parameter of pattern.findall that specifies where in the string the search starts. The default is 0.

endpos -- optional parameter of pattern.findall that specifies where in the string the search ends. The default is the length of the string.

example:

import re
 
result1 = re.findall(r'\d+','runoob 123 google 456')
 
pattern = re.compile(r'\d+')   # Find number
result2 = pattern.findall('runoob 123 google 456')
result3 = pattern.findall('run88oob123google456', 0, 10)
 
print(result1)
print(result2)
print(result3)

Output:

['123', '456']
['123', '456']
['88', '12'] 

When the pattern contains multiple groups, a list of tuples is returned:

import re

result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)

Output:

[('width', '20'), ('height', '10')]
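
The shape of the result depends on how many groups the pattern has. The following sketch runs three variants of the pattern over the same text; the variable names are made up for this illustration.

```python
import re

text = "set width=20 and height=10"

whole = re.findall(r"\w+=\d+", text)           # no groups: whole matches
one_group = re.findall(r"\w+=(\d+)", text)     # one group: that group's strings
two_groups = re.findall(r"(\w+)=(\d+)", text)  # several groups: tuples

print(whole)       # ['width=20', 'height=10']
print(one_group)   # ['20', '10']
print(two_groups)  # [('width', '20'), ('height', '10')]
```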

7. re.finditer

Like findall, it finds all substrings in the string that match the regular expression, but returns them as an iterator of match objects.

Syntax: re.finditer(pattern, string, flags=0)

Parameters:

pattern -- regular expression

string -- string to match

flags -- matching mode, optional

example:

import re

# finditer
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output:

12
32
43
3
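
Since finditer yields match objects rather than plain strings, the position of each match is also available. A small sketch on the same input (the list name matches is an assumption for this example):

```python
import re

# Collect each matched string together with its (start, end) position.
matches = [(m.group(), m.span()) for m in re.finditer(r"\d+", "12a32bc43jf3")]

for text, span in matches:
    print(text, span)
# 12 (0, 2)
# 32 (3, 5)
# 43 (7, 9)
# 3 (11, 12)
```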

8. re.split

re.split splits the string at each substring that matches the pattern and returns the resulting list.

Syntax: re.split(pattern, string[, maxsplit=0, flags=0])

Parameters:

pattern -- regular expression

string -- string to match

maxsplit -- maximum number of splits; maxsplit=1 splits once. The default is 0, which means no limit

flags -- matching mode, optional

example:

import re

# split
print(re.split(r'\W+', 'runoob, runoob, runoob.'))
print(re.split(r'(\W+)', ' runoob, runoob, runoob.'))
print(re.split(r'\W+', ' runoob, runoob, runoob.', 1))
print(re.split(r'a*', 'hello world'))

Output:

['runoob', 'runoob', 'runoob', '']
['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
['', 'runoob, runoob, runoob.']
['', 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '']
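
The empty strings in these results appear whenever the separator matches at the very start or end of the string. If they are not wanted, one simple option is to filter them out afterwards, as in this sketch (variable names chosen for illustration):

```python
import re

text = " runoob, runoob, runoob."

parts = re.split(r"\W+", text)
words = [p for p in parts if p]   # drop the empty strings at the edges

# maxsplit can also be passed by keyword.
once = re.split(r"\W+", text, maxsplit=1)

print(parts)  # ['', 'runoob', 'runoob', 'runoob', '']
print(words)  # ['runoob', 'runoob', 'runoob']
print(once)   # ['', 'runoob, runoob, runoob.']
```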
