Python regular expressions I

Posted by sbourdon on Wed, 09 Feb 2022 08:37:39 +0100

Introduction to regular expressions

A regular expression is a special character sequence, usually used to search for and replace text that matches a given pattern (rule).
In Python, regular expressions are provided by the re module.


Regular expressions have two main strengths:

  1. They are very flexible, logical and powerful.
  2. They can achieve complex control over strings quickly and in a very simple way.

Regular expressions in python

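import re

word = 'h230dsdkf349dsd0)'  # Extract the digits inside
# The r prefix makes a raw string, so backslashes are not treated as escape characters.
# Without r, the backslash has to be doubled: '\\d+'
print(re.findall(r'\d+', word))  # Find every run of digits
print(re.sub(r'\d', "@", word))  # Replace each digit with @
['230', '349', '0']
h@@@dsdkf@@@dsd@)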
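As a minimal sketch of the remark about the r prefix, the snippet below checks that a raw string and a doubled backslash produce the same pattern. The variable names are chosen only for illustration.

```python
import re

word = 'h230dsdkf349dsd0)'

# With a raw string the backslash reaches the regex engine unchanged.
raw_pattern = r'\d+'
# Without the r prefix the backslash must be doubled in the source code.
escaped_pattern = '\\d+'

same_pattern = raw_pattern == escaped_pattern
digits_raw = re.findall(raw_pattern, word)
digits_escaped = re.findall(escaped_pattern, word)
replaced = re.sub(r'\d', '@', word)

print(same_pattern)
print(digits_raw, digits_escaped)
print(replaced)
```

Both spellings describe the same two-character sequence \d, so raw strings are mainly a readability aid.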

Regular search

1. search method

import re

word = 'h2303dsdkf349dsd0)'  # Extract the digits inside
# word.index('3')  # String methods can only find a fixed substring, not a pattern
# search scans the string and returns the first match
result = re.search(r'3\d+', word)
print(result)
# \d matches any digit; + means one or more occurrences
result = re.search(r'3\d', word)
print(result)
<re.Match object; span=(2, 5), match='303'>
<re.Match object; span=(2, 4), match='30'>
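re.search returns None when nothing matches, so calling a method on the result without checking it can fail. The helper below is a small sketch of a guard; first_match is a hypothetical name, not part of the re module.

```python
import re

word = 'h2303dsdkf349dsd0)'


def first_match(pattern, text):
    """Return the first matched text, or None when the pattern is not found."""
    result = re.search(pattern, text)
    return result.group() if result is not None else None


found = first_match(r'3\d+', word)    # the string contains 3 followed by digits
missing = first_match(r'7\d+', word)  # the string contains no 7 at all

print(found)
print(missing)
```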

2. match method

match only matches at the beginning of the string

print(re.match(r'3\d+', word))  # None: the string does not start with 3
print(re.match(r'h\d+', word))
None
<re.Match object; span=(0, 5), match='h2303'>

fullmatch: the whole string must match

print(re.fullmatch(r'h\d+', word))  # None: word has more characters after the digits
print(re.fullmatch(r'h\d+', 'h12312324'))
None
<re.Match object; span=(0, 9), match='h12312324'>
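To put the three functions side by side, the sketch below applies them to the same string. It only uses the word value from the examples above.

```python
import re

word = 'h2303dsdkf349dsd0)'

searched = re.search(r'3\d+', word)    # scans the whole string for the first hit
matched = re.match(r'3\d+', word)      # only tries at index 0
full = re.fullmatch(r'h\d+', word)     # the entire string must fit the pattern
prefix = re.match(r'h\d+', word)       # a prefix is enough for match

print(searched.span())
print(matched)
print(full)
print(prefix.group())
```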

3. findall / finditer: find every match

x = re.finditer(r'3\d+', word)  # The result is an iterator of Match objects
for i in x:
    print(i)
<re.Match object; span=(2, 5), match='303'>
<re.Match object; span=(10, 13), match='349'>
print(re.findall(r'3\d+', word))  # The result is a list of strings
['303', '349']
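findall only returns the matched text, while finditer also gives access to positions. A short sketch that collects both:

```python
import re

word = 'h2303dsdkf349dsd0)'

# Each item yielded by finditer is a Match object, so start() and end() are available.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'3\d+', word)]
texts_only = re.findall(r'3\d+', word)

print(positions)
print(texts_only)
```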

re.Match type

# match / search / fullmatch return an re.Match object (or None); finditer yields re.Match objects
x1 = re.search(r'\d+', 'ab12cd34ef567g8901')
print(dir(x1))  # List the attributes and methods of the Match object
['__class__', '__class_getitem__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
print(x1.end())  # End index of the matched text
print(x1.endpos)  # Index where the search stopped; by default the length of the whole string
print(x1.span())  # The (start, end) indices of the matched text
print(x1.string)  # The full string that was searched
print(x1.group())  # The matched text
4
18
(2, 4)
ab12cd34ef567g8901
12
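The relationship between these attributes can be sketched as follows. The second part assumes a compiled pattern (re.compile), whose search method accepts optional pos and endpos arguments; the slice bounds 4 and 7 are arbitrary illustration values.

```python
import re

text = 'ab12cd34ef567g8901'
m = re.search(r'\d+', text)

# group() is the slice of the searched string between start() and end().
sliced = m.string[m.start():m.end()]

# With a compiled pattern, pos and endpos restrict the searched region.
limited = re.compile(r'\d+').search(text, 4, 7)

print(sliced, m.endpos)
print(limited.group(), limited.span(), limited.pos, limited.endpos)
```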

Grouping of regular expressions

import re

x1 = re.search(r'b\d+cd\d+ef\d+g\d+', 'ab12cd34ef567g8901')
x2 = re.search(r'(b\d+)(cd\d+)(ef\d+)(g\d+)', 'ab12cd34ef567g8901')

# Both patterns match the same text, but x1 only has the whole match (group 0), while x2 also has four subgroups
print(x2.group(0))  # group(0) is equivalent to group()
print(x2.groups())  # A tuple of the strings matched by all subgroups
b12cd34ef567g8901
('b12', 'cd34', 'ef567', 'g8901')
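Individual subgroups can also be read by number. The sketch below reuses the grouped pattern from above:

```python
import re

x2 = re.search(r'(b\d+)(cd\d+)(ef\d+)(g\d+)', 'ab12cd34ef567g8901')

whole = x2.group(0)    # the entire match
first = x2.group(1)    # text of the first subgroup
pair = x2.group(2, 4)  # several groups at once come back as a tuple
span3 = x2.span(3)     # (start, end) of the third subgroup

print(whole, first, pair, span3)
```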

Regular expressions use the (?P<name>...) syntax to name groups

x3 = re.search(r'(?P<name1>ab\d+)(?P<name2>cd\d+)(?P<name3>ef\d+)(?P<name4>g\d+)', 'ab12cd34ef567g8901')

print(x3.lastgroup)  # The name of the last matched group
print(x3.groupdict())  # The named groups as a dictionary of name -> matched text
name4
{'name1': 'ab12', 'name2': 'cd34', 'name3': 'ef567', 'name4': 'g8901'}
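Named groups can be read by name and referenced in a replacement string with \g<name>. The sketch below shows both; the group names letters and digits in the second pattern are made up for illustration.

```python
import re

x3 = re.search(
    r'(?P<name1>ab\d+)(?P<name2>cd\d+)(?P<name3>ef\d+)(?P<name4>g\d+)',
    'ab12cd34ef567g8901',
)

by_name = x3.group('name2')  # look a group up by its name
by_index = x3['name3']       # subscript access also works on Match objects

# Swap letters and digits in every letters+digits chunk.
swapped = re.sub(r'(?P<letters>[a-z]+)(?P<digits>\d+)', r'\g<digits>\g<letters>', 'ab12cd34')

print(by_name, by_index, swapped)
```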


Reusable regular rules

pattern = re.compile(r"ab\d+")  # Compile the rule once so it can be reused
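A compiled rule exposes the same operations as the module-level functions (search, findall, sub and so on) as methods. A minimal sketch with made-up input strings:

```python
import re

pattern = re.compile(r'ab\d+')

first = pattern.search('xxab12cd34')            # first occurrence
all_hits = pattern.findall('ab1 ab22 abc ab333')  # every occurrence
masked = pattern.sub('#', 'ab1 ab22 abc')       # replace every occurrence

print(first.group())
print(all_hits)
print(masked)
```

Compiling is mostly useful when the same rule is applied many times, since the pattern object can be passed around and reused.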

Regular modifier

print(re.search(r'a', 'ABC'))  # None: matching is case sensitive by default
print(re.search(r'a', 'ABC', re.I))  # re.I is the regular modifier
None
<re.Match object; span=(0, 1), match='A'>
Regular modifier      Meaning
re.A, re.ASCII        Make \w, \W, \b, \B, \d, \D, \s and \S match ASCII only
re.I, re.IGNORECASE   Ignore case when matching
re.L, re.LOCALE       Make \w, \W, \b, \B and case-insensitive matching depend on the current locale
re.M, re.MULTILINE    '^' matches at the beginning of the string and at the beginning of each line (right after each newline); '$' matches at the end of the string and at the end of each line (right before each newline). By default, '^' matches only at the beginning of the string and '$' only at the end of the string.
re.S, re.DOTALL       Make the '.' special character match any character, including a newline
re.X, re.VERBOSE      Whitespace in the pattern is ignored, and when a line contains a # that is not in a character set or escaped, all characters after it are a comment
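The effect of some of these modifiers can be sketched on a small two-line string. The sample text and the unit pattern are invented for the example.

```python
import re

text = 'first line\nsecond line'

starts_default = re.findall(r'^\w+', text)          # '^' only at the start of the string
starts_multiline = re.findall(r'^\w+', text, re.M)  # '^' also after each newline

dot_default = re.search(r'line.second', text)       # '.' does not match the newline
dot_all = re.search(r'line.second', text, re.S)     # with re.S it does

# re.X allows whitespace and comments inside the pattern; flags combine with |.
number_with_unit = re.compile(r'''
    \d+      # one or more digits
    \s*      # optional whitespace
    [a-z]+   # a unit made of letters
''', re.X | re.I)
unit_hit = number_with_unit.search('weighs 12 KG')

print(starts_default, starts_multiline)
print(dot_default, dot_all is not None)
print(unit_hit.group())
```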

Regular matching rule

  1. Digits and letters match themselves and have no special meaning
  2. The backslash \ has a special meaning and is used to escape
  3. Many letters take on a special meaning when preceded by \
  4. Most punctuation marks have special meanings
  5. To match a punctuation mark literally, escape it with \
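Rules 4 and 5 can be illustrated with the dot and the parentheses, which are special unless escaped. The sample strings are made up.

```python
import re

samples = '3.50 3x50 3550'

# An unescaped dot matches any character (except a newline by default).
any_char = re.findall(r'3.5', samples)
# An escaped dot matches only a literal dot.
literal_dot = re.findall(r'3\.5', samples)

# Parentheses normally create a group; escaped, they match themselves.
calls = re.findall(r'\(\d+\)', 'f(1) g(22) h()')

print(any_char)
print(literal_dot)
print(calls)
```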

Special meaning of escape letter

\w            Matches a letter, digit or underscore
\W            Matches any character that is not a letter, digit or underscore
\s            Matches any whitespace character, equivalent to [ \t\n\r\f\v] in ASCII mode
\S            Matches any non-whitespace character
\d            Matches any digit, equivalent to [0-9] in ASCII mode
\D            Matches any non-digit
\A            Matches the start of the string
\Z            Matches the end of the string. In some other regex flavors, if the string ends with a newline, \Z matches just before that newline; Python's re matches only at the very end.
\z            Matches the end of the string in flavors such as Perl; in older Python versions use \Z instead
\G            Matches the position where the last match finished. Python's re module does not support it.
\b            Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' can match the 'er' in 'never', but not the 'er' in 'verb'.
\B            Matches a non-word boundary. 'er\B' can match the 'er' in 'verb', but not the 'er' in 'never'.
\n, \t, etc.  Match a newline character, a tab, and so on
\1...\9       Match the content of the nth group
\10           Matches the content of group 10 if that group exists; in some regex flavors it is otherwise read as an octal character code
print(re.findall(r'\s', 'hello world, hello python'))
[' ', ' ', ' ']
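The word-boundary and backreference entries are easier to see in action. The sketch below uses the 'never' / 'verb' example from the table plus an invented sentence with a doubled word.

```python
import re

text = 'never verb'

# 'er' followed by a word boundary: only the one at the end of 'never'.
at_boundary = [m.span() for m in re.finditer(r'er\b', text)]
# 'er' not followed by a word boundary: only the one inside 'verb'.
inside_word = [m.span() for m in re.finditer(r'er\B', text)]

# \1 must repeat exactly what the first group matched.
double_letter = re.search(r'(\w)\1', 'hello')
repeated_word = re.findall(r'\b(\w+) \1\b', 'this is is a test')

print(at_boundary, inside_word)
print(double_letter.group(), repeated_word)
```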

Special meaning of punctuation

When a string is escaped with re.escape, '_' is no longer escaped (since Python 3.3).
Since Python 3.7, only characters that can have a special meaning in a regular expression are escaped. As a result, '!', '"', '%', "'", ',', '/', ':', ';', '<', '=', '>', '@' and "`" are no longer escaped.
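When a piece of text must be matched literally, re.escape adds the needed backslashes. A small sketch, assuming Python 3.7 or later for the exact escaped form; the sample text is invented.

```python
import re

literal = '1+1=2'
sentence = 'so 1+1=2 holds'

escaped = re.escape(literal)  # '+' is special and gets a backslash, '=' does not

# Unescaped, '+' means "one or more of the previous character", so this does not match.
unescaped_hit = re.search(literal, sentence)
escaped_hit = re.search(escaped, sentence)

print(escaped)
print(unescaped_hit)
print(escaped_hit.group())
```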