Python regular expressions I

Posted by sbourdon on Wed, 09 Feb 2022 08:37:39 +0100

Introduction to regular expressions

A regular expression is a special sequence of characters that describes a pattern (rule). It is usually used to find and replace text that conforms to that pattern.
In Python, regular expressions are provided by the re module.

Characteristics

Very flexible, logical and powerful
Complex string processing can be expressed quickly and concisely

Regular expressions in Python

import re

word = 'h230dsdkf349dsd0)'  # Extract the numbers from this string
print(re.findall(r'\d+', word))  # The r prefix makes a raw string, so backslashes are not treated as string escapes. Without r, write '\\d+'
print(re.sub(r'\d', "@", word))

['230', '349', '0']
h@@@dsdkf@@@dsd@)
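To make the effect of the r prefix concrete, the following minimal sketch compares a raw string with an ordinary string. The variable names are chosen only for illustration.

```python
import re

raw_pattern = r'\d+'      # raw string: the backslash is kept as written
plain_pattern = '\\d+'    # ordinary string: the backslash has to be doubled

# Both literals produce the same three-character pattern: \ d +
print(raw_pattern == plain_pattern)
print(len(raw_pattern))

digits = re.findall(plain_pattern, 'h230dsdkf349dsd0)')
print(digits)
```

Both spellings give the same result; the raw string is simply easier to read.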

Searching with regular expressions

1. search method

import re

word = 'h2303dsdkf349dsd0)'  # Extract the numbers from this string
# word.index('3')  # String methods can only find a fixed substring, not a pattern
result = re.search(r'3\d+', word)  # search returns the first match found anywhere in the string
print(result)
# \d matches any digit; + means one or more occurrences of the preceding item
result = re.search(r'3\d', word)
print(result)

<re.Match object; span=(2, 5), match='303'>
<re.Match object; span=(2, 4), match='30'>
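search returns None when nothing matches, so the result should be checked before calling methods on it. A minimal sketch, where first_number is a hypothetical helper written for this example:

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    result = re.search(r'\d+', text)
    if result is None:          # search returns None when nothing matches
        return None
    return result.group()

print(first_number('h2303dsdkf349dsd0)'))
print(first_number('no digits here'))
```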

2. match method

match only looks for a match at the beginning of the string

print(re.match(r'3\d+', word))
print(re.match(r'h\d+', word))

None
<re.Match object; span=(0, 5), match='h2303'>

fullmatch requires the whole string to match the pattern

print(re.fullmatch(r'h\d+', word))
print(re.fullmatch(r'h\d+', 'h12312324'))

None
<re.Match object; span=(0, 9), match='h12312324'>
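The difference between search, match and fullmatch can be seen by running the same pattern against a few inputs. This is only a sketch; which_match is a hypothetical helper that reports, in that order, whether each function found a match.

```python
import re

def which_match(pattern, text):
    """Return (search, match, fullmatch) success flags for one input."""
    return (
        re.search(pattern, text) is not None,     # anywhere in the string
        re.match(pattern, text) is not None,      # only at the beginning
        re.fullmatch(pattern, text) is not None,  # the entire string
    )

print(which_match(r'\d+', 'abc123'))
print(which_match(r'\d+', '123abc'))
print(which_match(r'\d+', '123'))
```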

3. findall / finditer: find every match

x = re.finditer(r'3\d+', word)  # The result is an iterator
for i in x:
    print(i)

<re.Match object; span=(2, 5), match='303'>
<re.Match object; span=(10, 13), match='349'>

print(re.findall(r'3\d+', word)) # The result is a list

['303', '349']
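Which of the two to use depends on what is needed: finditer gives Match objects (and therefore positions), while findall gives plain strings. The sketch below also shows that when the pattern contains one capture group, findall returns only the captured part.

```python
import re

word = 'h2303dsdkf349dsd0)'

# finditer yields Match objects, so the position is available as well as the text
positions = [(m.group(), m.start()) for m in re.finditer(r'3\d+', word)]
print(positions)

# With one capture group, findall returns only what the group captured
after_three = re.findall(r'3(\d+)', word)
print(after_three)
```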

re.Match type

# match / search / fullmatch return a re.Match object (or None), and finditer yields re.Match objects
x1 = re.search(r'\d+', 'ab12cd34ef567g8901')
print(dir(x1))

['__class__', '__class_getitem__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']

print(x1.end())  # End index (exclusive) of the match
print(x1.endpos)  # Index where the search stops; by default the length of the whole string
print(x1.span())  # (start, end) indices of the match
print(x1.string)  # The string that was passed to search
print(x1.group())  # The matched substring

4
18
(2, 4)
ab12cd34ef567g8901
12
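The indices returned by span can be used to slice the original string. A short sketch under the same input as above; the variable names are illustrative only.

```python
import re

text = 'ab12cd34ef567g8901'
m = re.search(r'\d+', text)

start, end = m.span()
print(m.start())                      # index where the match begins
print(text[start:end])                # slicing with the span gives the matched text
print(text[start:end] == m.group())
print(m.pos, m.endpos)                # the search window, by default the whole string
```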

Grouping of regular expressions

import re

x1 = re.search(r'b\d+cd\d+ef\d+g\d+', 'ab12cd34ef567g8901')
print(x1.group())
x2 = re.search(r'(b\d+)(cd\d+)(ef\d+)(g\d+)', 'ab12cd34ef567g8901')
print(x2.group())

# Both calls match the same text, but the first pattern has no capture groups while the second has four
print(x2.group(1))
print(x2.group(2))
print(x2.group(0))  # group(0) is equivalent to group()

b12cd34ef567g8901
b12cd34ef567g8901
b12
cd34
b12cd34ef567g8901

print(x1.groups())
print(x2.groups())  # Get a tuple composed of strings matched by all subgroups

()
('b12', 'cd34', 'ef567', 'g8901')
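Match objects accept group numbers in a few other places too. The following sketch reuses the grouped pattern from above to show group with several indices, span with a group number, and lastindex.

```python
import re

x2 = re.search(r'(b\d+)(cd\d+)(ef\d+)(g\d+)', 'ab12cd34ef567g8901')

print(x2.group(1, 3))   # several groups at once, returned as a tuple
print(x2.span(2))       # (start, end) indices of group 2
print(x2.lastindex)     # number of the last group that took part in the match
```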

Use the (?P<name>...) syntax to give a group a name

x3 = re.search(r'(?P<name1>ab\d+)(?P<name2>cd\d+)(?P<name3>ef\d+)(?P<name4>g\d+)', 'ab12cd34ef567g8901')

print(x3.lastgroup)  # Name of the last matched group
print(x3.groupdict())  # Dictionary mapping each group name to the text it matched

name4
{'name1': 'ab12', 'name2': 'cd34', 'name3': 'ef567', 'name4': 'g8901'}
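A named group can also be read directly by name, and it still keeps its number. A minimal sketch using the same pattern:

```python
import re

x3 = re.search(
    r'(?P<name1>ab\d+)(?P<name2>cd\d+)(?P<name3>ef\d+)(?P<name4>g\d+)',
    'ab12cd34ef567g8901',
)

print(x3.group('name1'))                  # look a group up by name
print(x3['name2'])                        # subscript form, available since Python 3.6
print(x3.group(3) == x3.group('name3'))   # named groups are still numbered
```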

re.compile

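re.compile compiles a pattern into a reusable pattern object; its methods (search, match, findall, sub, ...) can then be called without passing the pattern again.

pattern = re.compile(r"ab\d+")
print(pattern.search('ab1234'))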
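A compiled pattern is most useful when the same rule is applied several times. The sketch below compiles a digit pattern once and reuses it; the name number is chosen only for this example.

```python
import re

number = re.compile(r'\d+')   # compile once

# The compiled object offers the same operations as the module-level functions
first = number.search('ab1234').group()
every = number.findall('h230dsdkf349dsd0)')
masked = number.sub('@', 'a1b22c333')

print(first)
print(every)
print(masked)
print(number.pattern)         # the original pattern text
```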

Regular expression modifiers (flags)

print(re.search(r'a', 'ABC'))
print(re.search(r'a', 'ABC', re.I))  # re.I is a modifier (flag) that makes matching ignore case

None
<re.Match object; span=(0, 1), match='A'>

Modifier	Full name	Meaning
re.A	re.ASCII	Make \w, \W, \b, \B, \d, \D, \s and \S match ASCII only
re.I	re.IGNORECASE	Ignore case when matching
re.L	re.LOCALE	Make \w, \W, \b, \B and case-insensitive matching depend on the current locale
re.M	re.MULTILINE	'^' matches at the beginning of the string and at the beginning of each line (immediately after each newline); '$' matches at the end of the string and at the end of each line (immediately before each newline). By default, '^' matches only at the beginning of the string and '$' only at the end of the string.
re.S	re.DOTALL	Make the '.' special character match any character, including a newline
re.X	re.VERBOSE	Whitespace in the pattern is ignored, and when a line contains a # that is not inside a character set and not escaped, everything after it is a comment
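The effect of the most common modifiers can be checked with a small experiment. This is a sketch with made-up sample text; it compares the default behavior with re.M and re.S, combines two flags with |, and writes one pattern in re.X style.

```python
import re

text = 'first line\nSecond line'

# Default: '^' matches only at the start of the whole string
only_first = re.findall(r'^\w+', text)
# re.M: '^' also matches at the start of every line
line_starts = re.findall(r'^\w+', text, re.M)

# Default: '.' stops at a newline, so this finds nothing
no_dotall = re.search(r'first.+Second', text)
# re.S: '.' matches the newline as well
with_dotall = re.search(r'first.+Second', text, re.S)

# Flags are combined with |
combined = re.findall(r'^s\w+', text, re.I | re.M)

# re.X: whitespace and # comments inside the pattern are ignored
decimal = re.compile(r'''
    \d+   # integer part
    \.    # literal dot
    \d+   # fractional part
''', re.X)

print(only_first, line_starts)
print(no_dotall, with_dotall.group() == 'first line\nSecond')
print(combined)
print(decimal.search('pi is about 3.14').group())
```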

Regular expression matching rules

Digits and letters match themselves and have no special meaning
\ has a special meaning: it is the escape character
Many letters take on a special meaning when preceded by \
Many punctuation marks have special meanings
To match such a punctuation mark literally, escape it with \
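The rules above can be tried out directly. In this sketch the sample strings are invented for illustration; it contrasts a letter with its escaped form and a punctuation mark with its escaped form.

```python
import re

# A plain letter matches itself; with a backslash it gets a special meaning
plain_d = re.findall(r'd', 'd1')      # the letter d
escaped_d = re.findall(r'\d', 'd1')   # any digit

# '.' is special (any character except a newline); '\.' is a literal dot
any_char = re.findall(r'a.b', 'a.b axb')
literal_dot = re.findall(r'a\.b', 'a.b axb')

print(plain_d, escaped_d)
print(any_char, literal_dot)
```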

Special meaning of escaped letters

Pattern	Description
\w	Match a letter, digit or underscore
\W	Match any character that is not a letter, digit or underscore
\s	Match any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text.
\S	Match any non-whitespace character
\d	Match any digit, equivalent to [0-9] for ASCII text.
\D	Match any non-digit
\A	Match the start of the string
\Z	Match the end of the string. In Python's re module this is only the very end; in some other regex engines \Z also matches just before a final newline.
\z	Match the very end of the string in other regex engines; older versions of Python's re module do not accept it, so use \Z there.
\G	Match the position where the previous match ended. Supported by some other regex engines, not by Python's re module.
\b	Match a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, 'er\b' can match 'er' in 'never', but not 'er' in 'verb'.
\B	Match a non-word boundary. For example, 'er\B' can match 'er' in 'verb', but not 'er' in 'never'.
\n, \t, etc.	Match a newline, a tab, and so on
\1...\9	Match the same text that the nth group matched.
\10	Match the same text that group 10 matched. In Python, an octal character code needs a leading 0 or three octal digits, for example \010.

print(re.findall(r'\s', 'hello world, hello python'))

[' ', ' ', ' ']
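A few more rows of the table can be checked the same way. The sketch below uses invented sample strings to exercise \w, \D, \b, \B and a backreference.

```python
import re

words = re.findall(r'\w+', 'hello_world, 42!')      # letters, digits, underscore
non_digits = re.findall(r'\D+', 'ab12cd')           # runs of non-digits

# 'er' followed by a word boundary: only the 'er' at the end of 'never'
boundary = re.search(r'er\b', 'never verb').span()
# 'er' NOT followed by a word boundary: only the 'er' inside 'verb'
no_boundary = re.search(r'er\B', 'never verb').span()

# \1 must repeat whatever group 1 matched, so this finds a doubled character
doubled = re.search(r'(\w)\1', 'hello').group()

print(words, non_digits)
print(boundary, no_boundary)
print(doubled)
```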

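Special meaning of punctuation

The re.escape function puts a backslash in front of characters that have a special meaning in regular expressions. Two changes to its behavior are worth knowing:
'_' is no longer escaped (since Python 3.3).
Only characters that can have a special meaning in regular expressions are escaped (since Python 3.7). So '!', '"', '%', "'", ',', '/', ':', ';', '<', '=', '>', '@' and '`' are no longer escaped.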
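When a string that contains punctuation has to be matched literally, re.escape can add the backslashes automatically. A minimal sketch, where user_text stands for a hypothetical input that happens to contain special characters:

```python
import re

user_text = '1+1=2?'                    # hypothetical input with special characters
haystack = 'is 1+1=2? yes'

# Used as-is, '+' and '?' act as quantifiers, so the literal text is not found
unescaped = re.search(user_text, haystack)

# re.escape backslash-escapes the special characters first
safe_pattern = re.escape(user_text)
escaped = re.search(safe_pattern, haystack)

print(unescaped)
print(escaped.group())
print(re.escape('a_b'))                 # '_' is left alone
```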
