Chapter 10: Regular expressions with the built-in re module

Posted by GooberDLX on Tue, 21 Dec 2021 07:02:12 +0100

Special characters

In the string from which a regular expression is compiled, certain characters and character sequences have special meanings.

To restore a character's literal meaning, escape it with a backslash [\]. Each [\] escapes only the character that immediately follows it.

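Category            Syntax  Meaning
Wildcard            .       Any character except \n
Character classes   \d      Digit
                    \D      Non-digit
                    \s      Whitespace
                    \S      Non-whitespace
                    \w      Word character
                    \W      Non-word character
Literal characters  \t      Tab
                    \n      Newline
                    \r      Carriage return
                    \f      Form feed
                    \a      Alarm (bell)
                    \e      Escape (not accepted by Python's re; write \x1b instead)
Boundary matches    ^       Start of line
                    $       End of line
                    \A      Start of string
                    \Z      End of string
                    \b      Word boundary
                    \B      Non-word boundary
  • Mnemonics: d = digit, s = space, w = word, b = boundary
  • Whitespace characters include: space, tab, carriage return, form feed, line feed, etc.
  • Word characters include: 0-9, letters and the underscore (with re.ASCII, only ASCII letters)
  • In single-line mode (the default), [^] behaves like [\A]. [$] behaves like [\Z], except that [$] also matches just before a newline at the very end of the string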
import re

def test(pattern):
    string = 'App 8.0\n'
    a = re.findall(pattern, string)
    print(a)

# Raw strings (r'...') pass every backslash to the regex engine unchanged
test(r'\d')  #['8', '0']
test(r'\D')  #['A', 'p', 'p', ' ', '.', '\n']
test(r'\s')  #[' ', '\n']
test(r'\S')  #['A', 'p', 'p', '8', '.', '0']
test(r'\w')  #['A', 'p', 'p', '8', '0']
test(r'\W')  #[' ', '.', '\n']
test(r'.')   #['A', 'p', 'p', ' ', '8', '.', '0']

test(r'^A')    #['A']
test(r'\n$')   #['\n']
test(r'\AA')   #['A']
test(r'\n\Z')  #['\n']
test(r'p\b')   #['p'] a p with a word boundary on its right. Without the r prefix this must be written 'p\\b', because '\b' in an ordinary string literal is the backspace character
test(r'\Bp')   #['p', 'p'] a p with no word boundary on its left
test(r'\Bp\B') #['p'] a p with no word boundary on either side
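
The doubled backslash needed for [\b] comes from Python's string literals, not from the re module. The following is a small sketch of the difference between an ordinary and a raw string literal; the variable names are only illustrative.

```python
import re

text = 'App 8.0\n'

# In an ordinary string literal, '\b' is the single backspace character,
# so the regex engine never sees a word-boundary assertion.
plain = re.findall('p\b', text)

# In a raw string, the backslash is kept, so the engine receives the two
# characters \ and b and treats them as a word boundary.
raw = re.findall(r'p\b', text)

print(len('\b'), len(r'\b'))  # 1 2
print(plain)                  # []
print(raw)                    # ['p']
```

Writing every pattern as a raw string avoids having to remember which escapes Python itself interprets.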

Escape special characters

escape(pattern)

Function: adds [\] before every character that has a special meaning in a regular expression (before Python 3.7: before every character other than ASCII letters, digits and underscores)

>>> a = re.escape('1-2')
>>> print(a)
1\-2
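
A typical use of re.escape() is searching for text that should be taken literally, for example user input. The sketch below uses made-up strings (needle, haystack) to show how the result changes.

```python
import re

# Hypothetical search text that happens to contain a regex metacharacter.
needle = '1+1=2'
haystack = 'We know that 1+1=2, not 11=2.'

# Unescaped, '+' is a quantifier: "one or more 1, then 1=2".
unescaped = re.findall(needle, haystack)

# Escaped, every character of needle is matched literally.
escaped = re.findall(re.escape(needle), haystack)

print(unescaped)  # ['11=2']
print(escaped)    # ['1+1=2']
```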

Flags

Default (flags=0)                                       Flag to change it  Abbreviation  Inline flag
Character classes (\w, \d, \s) match Unicode characters  re.ASCII           re.A          (?a)
Case sensitive                                          re.IGNORECASE      re.I          (?i)
Single-line mode                                        re.MULTILINE       re.M          (?m)
[.] does not match newline characters                   re.DOTALL          re.S          (?s)
The expression cannot be spread over several lines      re.VERBOSE         re.X          (?x)
Debug information is not displayed                      re.DEBUG           -             -
re.findall(r'\w','España')         #['E', 's', 'p', 'a', 'ñ', 'a']
re.findall(r'\w','España',re.A)    #['E', 's', 'p', 'a', 'a']

re.findall('a','Abc')             #[]
re.findall('a','Abc',re.I)        #['A']

re.findall('^a','abc\nabc')       #['a']
re.findall('^a','abc\nabc',re.M)  #['a', 'a']
re.findall('c$','abc\nabc')       #['c']
re.findall('c$','abc\nabc',re.M)  #['c', 'c']

re.findall('.','abc\n')           #['a', 'b', 'c']
re.findall('.','abc\n', re.S)     #['a', 'b', 'c', '\n']

a = re.findall("""a #Letter 1
               b   #Letter 2
               c   #Letter 3 """, 'www.abc.com')
print(a)  #[]
a = re.findall("""a #Letter 1
               b   #Letter 2
               c   #Letter 3 """, 'www.abc.com', re.X)
print(a)  #['abc']

>>> re.findall('.','abc\n',re.DEBUG)
ANY None

0. INFO 4 0b0 1 1 (to 5)
5: ANY
6. SUCCESS
['a', 'b', 'c']
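
Several flags can be applied at once, either by combining them with [|] in the flags argument or by writing the inline form at the start of the pattern. A minimal sketch, with an illustrative test string:

```python
import re

text = 'abc\nABC'

# Flags passed as an argument are combined with the bitwise OR operator.
combined = re.findall(r'^a.c$', text, re.I | re.M)

# The same settings written as inline flags at the start of the pattern.
inline = re.findall(r'(?im)^a.c$', text)

print(combined)  # ['abc', 'ABC']
print(inline)    # ['abc', 'ABC']
```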

Expression

Expression         Effect
|                  Alternation: matches the expression on either side
[express]          Bracket expression: matches a single character
(express)          Group (subexpression): matches a string
{n,m}, {n,}, {,m}  The preceding character or group appears n to m times
{n}                The preceding character or group appears exactly n times

Or [|]

re.findall('a|b','abc') #['a', 'b']

Bracket expression []

  1. Enumeration. [abc]: one of the characters a, b, c
  2. Range. [a-z]: one character in the range a to z
  3. Negation. [^abc]: one character other than a, b, c
re.findall('[ab]','abc')    #['a', 'b']
re.findall('[a-c]','abc')   #['a', 'b', 'c']
re.findall('[^ab]','abc')   #['c']
re.findall('[^a-b]','abc')  #['c']
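
The three forms can be mixed inside one pair of brackets, and most special characters lose their special meaning there. A short sketch with made-up input strings:

```python
import re

# Several ranges in one bracket expression: one hexadecimal digit, repeated.
hex_runs = re.findall(r'[0-9a-fA-F]+', '7f3A-xyz-00')

# Inside brackets, '.' is an ordinary dot rather than the wildcard.
dots = re.findall(r'[.]', 'a.b')

# A negated range: every character that is not a digit.
non_digits = re.findall(r'[^0-9]', 'a1b2')

print(hex_runs)    # ['7f3A', '00']
print(dots)        # ['.']
print(non_digits)  # ['a', 'b']
```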

Group ()

Difference between a group and an ordinary string

When the expression contains exactly one group, findall() returns the matches of that group instead of the matches of the whole expression

re.findall("ab","abab")        #['ab', 'ab']
re.findall("(ab)","abab")      #['ab', 'ab']
re.findall("(a)b","abab")      #['a', 'a']
re.findall("a(b)","abab")      #['b', 'b']

When there are multiple groups, findall() returns one tuple per match of the whole expression, and each group's match is one element of that tuple

re.findall("(a)(b)","abab")        #[('a', 'b'), ('a', 'b')]
re.findall("(a)(b)(a)","abab")     #[('a', 'b', 'a')]
re.findall("(a)(b)(a)(b)","abab")  #[('a', 'b', 'a', 'b')]

Naming and referencing matching results

When a segment repeats within the string, the match of an earlier group can be referenced directly inside the expression

Syntax         Effect
(exp)          Names the group automatically by number: 1, 2, ...
\i             References the match of the i-th group
(?P<name>exp)  Names the group [name]
(?P=name)      References the match of the named group
Default name
re.findall("(a)(b)\\1","abab")  #[('a', 'b')]
re.findall("(a)(b)a","abab")    #[('a', 'b')]
re.findall("(a)(b)\\2","abab")  #[]
re.findall("(a)(b)b","abab")    #[]

Specify name
re.findall("(?P<my_name>a)(b)(?P=my_name)","abab")  #[('a', 'b')]
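
A common use of back-references is finding a word that is accidentally written twice. The sketch below shows the numbered and the named form side by side; the sample sentence is invented.

```python
import re

text = 'this is is a test test of the the parser'

# \1 must match exactly the text that group 1 matched.
numbered = re.findall(r'\b(\w+) \1\b', text)

# The same pattern with a named group and a named back-reference.
named = re.findall(r'\b(?P<word>\w+) (?P=word)\b', text)

print(numbered)  # ['is', 'test', 'the']
print(named)     # ['is', 'test', 'the']
```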

Position limit group

These groups restrict what may appear on either side of the match. The restricting content is written in a subexpression and does not appear in the match result

[(?<=exp1) or (?<!exp2)] + [main expression] + [(?=exp3) or (?!exp4)]

Subexpression  Condition        Position relative to the match
exp1           Must appear      Left
exp2           Must not appear  Left
exp3           Must appear      Right
exp4           Must not appear  Right
def test(pattern):
    a = re.findall(pattern, 'abc')
    print(a)
test('(?<=a).+?(?=c)') #['b'] b has a on its left and c on its right
test('(?<=a).+?(?!c)') #['bc'] bc has a on its left and no c on its right
test('(?<!a).+?(?=c)') #['ab'] ab has no a on its left and c on its right
test('(?<!a).+?(?!c)') #['a', 'c'] a and c each have no a on the left and no c on the right
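
As a slightly more practical illustration, the sketch below extracts numbers depending on the currency marker next to them. The price string is made up for the example.

```python
import re

text = 'apple $30, pear 45 EUR, plum $7'

# Lookbehind: digits with '$' directly on their left; '$' is not returned.
dollars = re.findall(r'(?<=\$)\d+', text)

# Lookahead: digits with ' EUR' directly on their right.
euros = re.findall(r'\d+(?= EUR)', text)

# Negative lookbehind: digits with neither '$' nor another digit on their left.
not_dollars = re.findall(r'(?<![$\d])\d+', text)

print(dollars)      # ['30', '7']
print(euros)        # ['45']
print(not_dollars)  # ['45']
```

The extra \d in the negative lookbehind keeps the engine from starting a match in the middle of a number such as 30.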

More usage of groups

Category                        Syntax            Effect / scope
Match exp without capturing it  (?:exp)           Cannot be referenced later
Flag group                      (?aiLmsux)        Applies to the entire regular expression
                                (?imsx-imsx:exp)  Applies to the current group only
Comment                         (?#comment)       Explanatory text; does not affect matching
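
The effect of these group types is easiest to see through findall(). A minimal sketch, assuming Python 3.6 or later for the scoped flag group:

```python
import re

# A capturing group makes findall() return only the group's last repetition.
capturing = re.findall(r'(ab)+', 'abab')

# A non-capturing group leaves the whole match as the result.
non_capturing = re.findall(r'(?:ab)+', 'abab')

# A scoped flag: only the 'a' inside the group ignores case.
scoped = re.findall(r'(?i:a)b', 'ab Ab aB')

# A comment group is ignored by the matcher.
commented = re.findall(r'a(?#then the letter b)b', 'ab')

print(capturing)      # ['ab']
print(non_capturing)  # ['abab']
print(scoped)         # ['ab', 'Ab']
print(commented)      # ['ab']
```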

Frequency limit

Quantifiers are greedy by default and match as many characters as possible. To switch to non-greedy (lazy) mode, add [?] after the quantifier

Symbol             Number of matches
?                  0 to 1 times
*                  0 to N times
+                  1 to N times
{n,m}, {n,}, {,m}  n to m times (an omitted n means 0; an omitted m means no upper limit)
{n}                Exactly n times
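
The difference between greedy and non-greedy matching shows up as soon as the delimiter can occur more than once. A short sketch with invented input:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r'<.+?>', html)

# The same applies to counted repetition.
counted = re.findall(r'\d{2,3}', '1 22 333 4444')
counted_lazy = re.findall(r'\d{2,3}?', '1 22 333 4444')

print(greedy)        # ['<b>bold</b> and <i>italic</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
print(counted)       # ['22', '333', '444']
print(counted_lazy)  # ['22', '33', '44', '44']
```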

Functions under re module

Workflow: compile the pattern, match it (over a specified range), return the match result, process the result

Different functions determine which groups are returned, in what form, and whether the content (a string) or the position is returned

Compiling regular expressions

compile(pattern, flags=0): compiles a string into a regular expression object that can be reused in later calls

>>> a = re.compile('1')  #Compiling regular expressions
>>> b = a.findall('123') #Match a string with a regular expression and return the result
>>> print(b)
['1']
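
Compiling is mainly useful when the same expression is applied many times. In the sketch below the flags are given once at compile time; the names word and lines are only illustrative.

```python
import re

# Flags are fixed when the pattern is compiled.
word = re.compile(r'[a-z]+', re.I)

lines = ['Hello World', '123', 'regex 101']

# The compiled object offers the same methods as the module-level functions.
results = [word.findall(line) for line in lines]

print(results)  # [['Hello', 'World'], [], ['regex']]
```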

Specify matching range

Parameters: pattern, string, flags=0

Function     Matching range
match()      Matches only at the beginning of the string
fullmatch()  The entire string must match the regular expression
a = re.match('1', '12')
b = re.match('1', '21')
print(a)  #<re.Match object; span=(0, 1), match='1'>
print(b)  #None

a = re.fullmatch('1', '12')
b = re.fullmatch('12','12')
print(a)  #None
print(b)  #<re.Match object; span=(0, 2), match='12'>
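
fullmatch() is convenient for validating input, because nothing may be left over on either side. The helper is_integer below is a hypothetical example, not part of the re module.

```python
import re

def is_integer(s):
    """Return True if s consists only of an optionally signed integer."""
    return re.fullmatch(r'[+-]?\d+', s) is not None

print(is_integer('42'))    # True
print(is_integer('-7'))    # True
print(is_integer('4.2'))   # False
print(is_integer('42\n'))  # False

# match() with a trailing $ is more lenient: $ also matches just before
# a newline at the end of the string.
lenient = re.match(r'[+-]?\d+$', '42\n') is not None
print(lenient)             # True
```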

Return matching results

Parameters: pattern, string, flags=0

Function    Matches returned  Return form
search()    First             Match object (None if nothing matches)
findall()   All               List of strings (of tuples when there are several groups)
finditer()  All               Iterator of match objects
pattern = '1'
string = '11'
a = re.search(pattern, string)
b = re.findall(pattern, string)
c = re.finditer(pattern, string)
print(a.group())  #1
print(b)          #['1', '1']
for i in c:
    print(i.group())
#1
#1
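
Because search() returns None when nothing matches, calling group() on the result directly can raise an AttributeError. The sketch below checks the result first and uses finditer() to collect both the text and the position of each match.

```python
import re

text = 'a1b22c333'

# Guard against None before using the match object.
m = re.search(r'\d+', text)
first = m.group() if m else None

missing = re.search(r'x', text)

# finditer() yields match objects, so positions are available too.
pairs = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

print(first)    # 1
print(missing)  # None
print(pairs)    # [('1', (1, 2)), ('22', (3, 5)), ('333', (6, 9))]
```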

Methods of the match object returned by search() and match()

Return matching content

Method         Parameter      Groups returned      Return form
__getitem__()  g (required)   One specified group  String
group()        [group1], ...  Specified groups     String (tuple when several groups are given)
groups()       None required  All groups           Tuple
groupdict()    None required  All named groups     Dictionary keyed by group name
  • group(): the parameter defaults to 0, which returns the whole match. A parameter may be a group number or a group name
  • groupdict(): only includes groups with custom names
a = re.search('(a)(b)','abab')
a[0]              #'ab', shorthand for a.__getitem__(0)
a[1]              #'a'
a[2]              #'b'
a.group()         #'ab'
a.group(0)        #'ab'
a.group(1)        #'a'
a.group(2)        #'b'
a.group(1,2)      #('a', 'b')
a.groups()        #('a', 'b')
a.groupdict()     #{}

a = re.search('(?P<name1>a)(b)','abab')
a.groupdict()  #{'name1': 'a'}
a = re.search('(?P<name1>a)(?P<name2>b)','abab')
a.groupdict()  #{'name1': 'a', 'name2': 'b'}

Return match position

Method   Parameter  Returns, for the specified group
start()  [group1]   Start position
end()    [group1]   End position
span()   [group1]   (start, end) tuple
  • At most one parameter is accepted. It defaults to 0, which gives the position of the whole match
a = re.search('(ab)(cd)','abcd')
a.span()   #(0, 4)
a.span(0)  #(0, 4)
a.span(1)  #(0, 2)
a.span(2)  #(2, 4)
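
The positions are ordinary string indices, with the end position exclusive, so they can be used directly for slicing. A brief sketch:

```python
import re

m = re.search(r'(ab)(cd)', 'xxabcd')

# span() returns (start, end); slicing with it reproduces the group's text.
start, end = m.span(2)
sliced = m.string[start:end]

print(m.start(1), m.end(1))  # 2 4
print((start, end))          # (4, 6)
print(sliced == m.group(2))  # True
```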

More attributes

Attribute  Meaning
pos        Index in the string at which the search started
endpos     Index in the string at which the search stopped
re         Regular expression object used for the match
string     String that was searched
lastindex  Index of the last matched capturing group (integer)
lastgroup  Name of the last matched capturing group
a = re.search('c','abc')
a.pos        #0
a.endpos     #3
a.re         #re.compile('c')
a.string     #'abc'

a = re.search('(?P<name1>a)(?P<name2>b)','abc')
a.lastindex  #2
a.lastgroup  #'name2'
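
pos and endpos only differ from their defaults when a compiled pattern is searched over part of a string. The sketch below also shows lastindex when only one of two alternative groups takes part in the match; the variable names are illustrative.

```python
import re

pat = re.compile(r'c')

# Compiled patterns accept optional pos and endpos arguments.
m = pat.search('abcabc', 3)
print(m.pos, m.endpos, m.span())  # 3 6 (5, 6)

# Restricting the range to 'ab' leaves nothing to match.
outside = pat.search('abcabc', 0, 2)
print(outside)                    # None

# Only group 2 matched, so lastindex is 2; the group has no name.
alt = re.search(r'(a)|(b)', 'b')
print(alt.lastindex, alt.lastgroup, alt.group(1))  # 2 None None
```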

Process matching objects

Replace: sub

Function: replaces each match with repl, which can be a string or a function

Parameters: pattern, repl, string, count=0, flags=0, where count is the maximum number of replacements; the default 0 replaces all matches
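
The two forms of repl can be sketched as follows. In a replacement string, [\1], [\2] and so on refer to groups of the match; a replacement function receives the match object and returns the new text. The sample strings are invented.

```python
import re

# String replacement with group references: reorder a date.
swapped = re.sub(r'(\d+)-(\d+)-(\d+)', r'\3/\2/\1', '2021-12-21')

# Function replacement: double every number.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b20')

# count limits the number of replacements.
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)

print(swapped)     # 21/12/2021
print(doubled)     # a2 b40
print(first_only)  # a#b2c3
```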

Split: split

Function: splits the string at each match and returns the substrings as a list

Parameters: pattern, string, maxsplit=0, flags=0, where maxsplit is the maximum number of splits; the default 0 splits at every match
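
A minimal sketch of split() with a separator pattern, with maxsplit, and with a capturing group (which keeps the separators in the result); the input strings are made up:

```python
import re

text = 'a, b;c,  d'

# Split on a comma or semicolon followed by optional whitespace.
parts = re.split(r'[,;]\s*', text)

# At most two splits; the remainder stays in the last element.
limited = re.split(r'[,;]\s*', text, maxsplit=2)

# A capturing group in the pattern keeps the separators in the list.
with_separators = re.split(r'([,;])', 'a,b;c')

print(parts)            # ['a', 'b', 'c', 'd']
print(limited)          # ['a', 'b', 'c,  d']
print(with_separators)  # ['a', ',', 'b', ';', 'c']
```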

Topics: Python regex