Chapter 10: The re (regular expression) built-in module

Posted by GooberDLX on Tue, 21 Dec 2021 07:02:12 +0100

Special characters

In the string used to compile a regular expression, some characters and character sequences are given special meanings.

To restore a character's literal meaning, escape it with [\]. Each [\] escapes only the character that immediately follows it.
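
As a small illustration of the escaping rule above, the sketch below compares an unescaped and an escaped dot. The sample strings are made up for the example.

```python
import re

# An unescaped dot is a wildcard: it matches any character except a newline.
wildcard_hits = re.findall(r'.', 'a.b')

# An escaped dot matches only a literal '.'.
literal_hits = re.findall(r'\.', 'a.b')

# Each backslash escapes only the next character, so a literal backslash
# followed by a literal dot needs two separate escapes: \\ and \.
both_hits = re.findall(r'\\\.', r'a\.b')

print(wildcard_hits)  # ['a', '.', 'b']
print(literal_hits)   # ['.']
print(both_hits)      # ['\\.']
```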

Wildcard	.	\d	\D	\s	\S	\w	\W
	Any character except \n	Digit	Non-digit	Whitespace	Non-whitespace	Word character	Non-word character
Literal character	\t	\n	\r	\f	\a	\v
	Tab	Newline	Carriage return	Form feed	Bell (alert)	Vertical tab
Boundary match	^	$	\A	\Z	\b	\B
	Start of line	End of line	Start of string	End of string	Word boundary	Non-word boundary

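Mnemonics: d = digit, s = space, w = word, b = boundary
Whitespace characters include: space, tab, carriage return, form feed, line feed, etc.
Word characters include: digits 0 ~ 9, letters, and the underscore (with the default Unicode matching, letters and digits from other scripts also count)
In single-line mode (the default), [^] is equivalent to [\A]. [$] is almost equivalent to [\Z]: [$] also matches just before a newline at the very end of the string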

import re

def test(pattern):
    string = 'App 8.0\n'
    a = re.findall(pattern, string)
    print(a)

#Raw strings (r'...') stop Python from interpreting the backslashes itself
test(r'\d')  #['8', '0']
test(r'\D')  #['A', 'p', 'p', ' ', '.', '\n']
test(r'\s')  #[' ', '\n']
test(r'\S')  #['A', 'p', 'p', '8', '.', '0']
test(r'\w')  #['A', 'p', 'p', '8', '0']
test(r'\W')  #[' ', '.', '\n']
test(r'.')   #['A', 'p', 'p', ' ', '8', '.', '0']

test(r'^A')    #['A']
test(r'\n$')   #['\n']
test(r'\AA')   #['A']
test(r'\n\Z')  #['\n']
test(r'p\b')   #['p'] the p that is followed by a word boundary. Without the r prefix, '\b' is a backspace character in a Python string, so 'p\\b' would be needed
test(r'\Bp')   #['p', 'p'] each p that has no word boundary on its left
test(r'\Bp\B') #['p'] the p that has no word boundary on either side
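
The difference between [$] and [\Z], and between [^] and [\A], only shows up with a trailing newline or in multiline mode. A minimal sketch, using made-up sample strings:

```python
import re

text = 'abc\n'

# '$' matches at the very end and also just before a trailing newline.
dollar_hits = re.findall(r'c$', text)

# '\Z' matches only at the very end of the string, after the newline.
z_hits = re.findall(r'c\Z', text)

# With re.M, '^' matches at the start of every line, '\A' still only once.
caret_hits = re.findall(r'^a', 'abc\nabc', re.M)
a_hits = re.findall(r'\Aa', 'abc\nabc', re.M)

print(dollar_hits)  # ['c']
print(z_hits)       # []
print(caret_hits)   # ['a', 'a']
print(a_hits)       # ['a']
```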

Escape special characters

escape(pattern)

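Function: add [\] before special characters. Since Python 3.7 only characters that can have a special meaning in a regular expression are escaped; earlier versions escaped every character other than ASCII letters, digits and underscores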

>>> a = re.escape('1-2')
>>> print(a)
1\-2
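
A typical use of escape() is building a pattern from text that should be matched literally. The sketch below assumes a hypothetical user-supplied search string (user_text) and shows what goes wrong without escaping.

```python
import re

user_text = '1.5+2'                       # hypothetical user input
haystack = 'x = 1.5+2; y = 1x5552'

# Unescaped: '.' is a wildcard and '+' is a quantifier, so the wrong text matches.
unsafe = re.findall(user_text, haystack)

# Escaped: every special character is matched literally.
safe = re.findall(re.escape(user_text), haystack)

print(unsafe)  # ['1x5552']
print(safe)    # ['1.5+2']
```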

Flags

Default (flags=0)	Flag that changes it	Abbreviation	Inline flag
\w, \d, \s and \b match Unicode characters	re.ASCII	re.A	(?a)
Case sensitive	re.IGNORECASE	re.I	(?i)
Single-line mode: [^] and [$] match only at the start and end of the string	re.MULTILINE	re.M	(?m)
[.] does not match the newline character	re.DOTALL	re.S	(?s)
Whitespace and [#] in the pattern are matched literally, so the expression cannot be split across commented lines	re.VERBOSE	re.X	(?x)
Debug information is not displayed	re.DEBUG	-	-

re.findall(r'\w','España')         #['E', 's', 'p', 'a', 'ñ', 'a']
re.findall(r'\w','España',re.A)    #['E', 's', 'p', 'a', 'a']

re.findall('a','Abc')             #[]
re.findall('a','Abc',re.I)        #['A']

re.findall('^a','abc\nabc')       #['a']
re.findall('^a','abc\nabc',re.M)  #['a', 'a']
re.findall('c$','abc\nabc')       #['c']
re.findall('c$','abc\nabc',re.M)  #['c', 'c']

re.findall('.','abc\n')           #['a', 'b', 'c']
re.findall('.','abc\n', re.S)     #['a', 'b', 'c', '\n']

a = re.findall("""a #Letter 1
               b   #Letter 2
               c   #Letter 3 """, 'www.abc.com')
print(a)  #[]
a = re.findall("""a #Letter 1
               b   #Letter 2
               c   #Letter 3 """, 'www.abc.com', re.X)
print(a)  #['abc']

>>> re.findall('.','abc\n',re.DEBUG)
ANY None

0. INFO 4 0b0 1 1 (to 5)
5: ANY
6. SUCCESS
['a', 'b', 'c']

Expression

Expression	effect
|	Match either the expression on its left or the one on its right
[express]	Bracket expression (character set): matches a single character
(express)	Group (subexpression): matches a string
{n,m}, {n,}, {,m}	The preceding character or group appears n ~ m times (at least n, at most m)
{n}	The preceding character or group appears exactly n times

Or [|]

re.findall('a|b','abc') #['a', 'b']

Bracket expression []

Enumeration. [abc]: one of the characters a, b, c
Range. [a-z]: one character in the range a to z
Negation. [^abc]: one character other than a, b, c

re.findall('[ab]','abc')    #['a', 'b']
re.findall('[a-c]','abc')   #['a', 'b', 'c']
re.findall('[^ab]','abc')   #['c']
re.findall('[^a-b]','abc')  #['c']

Group ()

Difference between group and ordinary string

When there is only one group, findall() returns the match of that group instead of the whole match

re.findall("ab","abab")        #['ab', 'ab']
re.findall("(ab)","abab")      #['ab', 'ab']
re.findall("(a)b","abab")      #['a', 'a']
re.findall("a(b)","abab")      #['b', 'b']

When there are multiple groups, each match of the whole expression becomes a tuple whose elements are the matches of the individual groups

re.findall("(a)(b)","abab")        #[('a', 'b'), ('a', 'b')]
re.findall("(a)(b)(a)","abab")     #[('a', 'b', 'a')]
re.findall("(a)(b)(a)(b)","abab")  #[('a', 'b', 'a', 'b')]

Naming and referencing matching results

When a segment repeats within the string, a later part of the expression can refer back to what an earlier group matched

grammar	effect
(exp)	Automatically number the groups 1, 2, ... in order
\i	Reference the match of the i-th group
(?P<name>exp)	Name the group [name]
(?P=name)	Reference the match of the named group

Default name
re.findall("(a)(b)\\1","abab")  #[('a', 'b')]
re.findall("(a)(b)a","abab")    #[('a', 'b')]
re.findall("(a)(b)\\2","abab")  #[]
re.findall("(a)(b)b","abab")    #[]

Specify name
re.findall("(?P<my_name>a)(b)(?P=my_name)","abab")  #[('a', 'b')]
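
One common use of back-references is spotting a word that is typed twice in a row. A minimal sketch on a made-up sentence, once with a numbered and once with a named group (the group name "word" is chosen for the example):

```python
import re

sentence = 'this is is a test test'

# \1 must repeat exactly what group 1 matched; \b keeps whole words only.
numbered = re.findall(r'\b(\w+) \1\b', sentence)

# The same pattern with a named group and a named back-reference.
named = re.findall(r'\b(?P<word>\w+) (?P=word)\b', sentence)

print(numbered)  # ['is', 'test']
print(named)     # ['is', 'test']
```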

Position limit group

Restrict what appears on either side of the match. The surrounding content is written in subexpressions (lookbehind on the left, lookahead on the right) and does not appear in the matching result

[(?<=exp1) or (?<!exp2)] + [main expression] + [(?=exp3) or (?!exp4)]

Subexpression	condition	position relative to the match
exp1	Must appear	left
exp2	Must not appear	left
exp3	Must appear	right
exp4	Must not appear	right

def test(pattern):
    a = re.findall(pattern, 'abc')
    print(a)
test('(?<=a).+?(?=c)') #['b'] b has a on its left and c on its right
test('(?<=a).+?(?!c)') #['bc'] bc has a on its left and no c on its right
test('(?<!a).+?(?=c)') #['ab'] ab has no a on its left and c on its right
test('(?<!a).+?(?!c)') #['a', 'c'] a and c each have no a on the left and no c on the right

More usage of groups

classification	grammar	effect
Non-capturing group	(?:exp)	Matches exp but does not capture it, so it cannot be referenced later
Inline flags	(?aiLmsux)	Apply to the entire regular expression
	(?imsx-imsx:exp)	Apply to the current group only
Comment	(?#comment)	Explanatory text; does not affect the regular expression
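
The group forms in the table above can be compared side by side. The following is only a sketch with made-up sample strings. The scoped form (?i:...) requires Python 3.6 or later.

```python
import re

# Capturing group: findall returns only what the group matched (last repetition).
captured = re.findall(r'(ab)+c', 'ababc')

# Non-capturing group: nothing is captured, so findall returns the whole match.
whole = re.findall(r'(?:ab)+c', 'ababc')

# Global inline flag: case is ignored for the entire expression.
global_flag = re.findall(r'(?i)abc', 'ABC abc')

# Scoped inline flag: case is ignored only for the 'a' inside the group.
scoped = re.findall(r'(?i:a)bc', 'Abc ABC abc')

# Comment group: ignored by the matcher.
commented = re.findall(r'\d+(?#one or more digits)', 'a12b3')

print(captured)     # ['ab']
print(whole)        # ['ababc']
print(global_flag)  # ['ABC', 'abc']
print(scoped)       # ['Abc', 'abc']
print(commented)    # ['12', '3']
```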

Frequency limit (quantifiers)

Quantifiers in re are greedy by default and match as many characters as possible. To switch a quantifier to non-greedy (lazy) mode, which matches as few characters as possible, add [?] after it

Symbol	Matching times
?	0 ~ 1 times
*	0~N times
+	1~N times
{n,m}, {n,}, {,m}	n ~ m times
{n}	Exactly n times
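
The effect of greedy versus non-greedy quantifiers is easiest to see on the same input. A minimal sketch with made-up sample strings:

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: '.*' runs to the last '</b>' it can reach.
greedy = re.findall(r'<b>.*</b>', html)

# Non-greedy: '.*?' stops at the first '</b>' it can reach.
lazy = re.findall(r'<b>.*?</b>', html)

# The same applies to counted repetition.
bounded_greedy = re.findall(r'a{2,3}', 'aaaa')
bounded_lazy = re.findall(r'a{2,3}?', 'aaaa')

# '?' on its own means 0 or 1 times.
optional = re.findall(r'colou?r', 'color colour')

print(greedy)          # ['<b>one</b><b>two</b>']
print(lazy)            # ['<b>one</b>', '<b>two</b>']
print(bounded_greedy)  # ['aaa']
print(bounded_lazy)    # ['aa', 'aa']
print(optional)        # ['color', 'colour']
```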

Functions under re module

Workflow: compile the pattern, match (within a specified range), return the matching result, process the result

When returning matching results, different functions determine which groups are returned, the return form, and the return content (string or position)

Compiling regular expressions

compile(pattern, flags=0): compile a string into a regular expression object so that it can be reused in later calls

>>> a = re.compile('1')  #Compiling regular expressions
>>> b = a.findall('123') #Match a string with a regular expression and return the result
>>> print(b)
['1']

Specify matching range

Parameters: pattern, string, flags=0

function	Matching range
match()	Match from the beginning of the string
fullmatch()	The entire string is required to match the regular expression

a = re.match('1', '12')
b = re.match('1', '21')
print(a)  #<re.Match object; span=(0, 1), match='1'>
print(b)  #None

a = re.fullmatch('1', '12')
b = re.fullmatch('12','12')
print(a)  #None
print(b)  #<re.Match object; span=(0, 2), match='12'>

Return matching results

Parameters: pattern, string, flags=0

function	Matches returned	Return form
search()	First	Match object (None if there is no match)
findall()	All	list of strings (or of tuples when there are several groups)
finditer()	All	iterator of Match objects

pattern = '1'
string = '11'
a = re.search(pattern, string)
b = re.findall(pattern, string)
c = re.finditer(pattern, string)
print(a.group())  #1
print(b)          #['1', '1']
for i in c:
    print(i.group())
#1
#1

Methods of the Match object returned by search() and match()

Return matching content

function	parameter	Matching groups	Return form
__getitem__()	g (required)	One specified group	string
group()	[group1], ...	Specified groups	string (tuple when several groups are given)
groups()	none	All groups	tuple
groupdict()	none	All named groups	dictionary keyed by group name

group(): the parameter defaults to 0, which returns the whole match. A parameter can be a group number or a group name
groupdict(): only includes groups with custom names

a = re.search('(a)(b)','abab')
a.__getitem__(0)  #'ab'
a.__getitem__(1)  #'a'
a.__getitem__(2)  #'b'
a.group()         #'ab'
a.group(0)        #'ab'
a.group(1)        #'a'
a.group(2)        #'b'
a.group(1,2)      #('a', 'b')
a.groups()        #('a', 'b')
a.groupdict()     #{}

a = re.search('(?P<name1>a)(b)','abab')
a.groupdict()  #{'name1': 'a'}
a = re.search('(?P<name1>a)(?P<name2>b)','abab')
a.groupdict()  #{'name1': 'a', 'name2': 'b'}
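
Named groups can also be read by name through group() and through the square-bracket form of __getitem__(). A minimal sketch, assuming a made-up input string and group names (year, month):

```python
import re

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2021-12')

year = m.group('year')             # by name
month = m['month']                 # m[...] calls __getitem__()
pair = m.group('year', 'month')    # several groups give a tuple
by_number = m.group(1)             # numbers still work for named groups

# search() returns None when nothing matches, so check before calling methods.
missing = re.search(r'(?P<year>\d{4})', 'no digits here')

print(year, month)     # 2021 12
print(pair)            # ('2021', '12')
print(by_number)       # 2021
print(m.groupdict())   # {'year': '2021', 'month': '12'}
print(missing)         # None
```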

Return the match position

function	parameter	Returns, for the specified group
start()	[group]	Start position
end()	[group]	End position
span()	[group]	(start, end) tuple

At most one parameter is accepted. It defaults to 0, which gives the position of the whole match

a = re.search('(ab)(cd)','abcd')
a.span()   #(0, 4)
a.span(0)  #(0, 4)
a.span(1)  #(0, 2)
a.span(2)  #(2, 4)
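
The positions returned by start(), end() and span() are ordinary string indexes, so they can be used for slicing. A minimal sketch on a made-up string:

```python
import re

text = 'xxabcd'
m = re.search(r'(ab)(cd)', text)

whole_start, whole_end = m.start(), m.end()   # position of the whole match
g2_start, g2_end = m.span(2)                  # position of group 2

# Slicing with a span gives back exactly what the group matched.
sliced = text[g2_start:g2_end]

print(whole_start, whole_end)  # 2 6
print(g2_start, g2_end)        # 4 6
print(sliced)                  # cd
```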

More attributes

attribute	meaning
pos	Index in the string at which the search started
endpos	Index in the string at which the search stopped
re	Regular expression object used for matching
string	String that was matched against
lastindex	Index of the last matched capturing group (integer)
lastgroup	Name of the last matched capturing group

a = re.search('c','abc')
a.pos        #0
a.endpos     #3
a.re         #re.compile('c')
a.string     #'abc'

a = re.search('(?P<name1>a)(?P<name2>b)','abc')
a.lastindex  #2
a.lastgroup  #'name2'
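
pos and endpos are only interesting when the search range is restricted, which the methods of a compiled pattern allow through their optional pos and endpos arguments. A minimal sketch on a made-up string:

```python
import re

pattern = re.compile(r'c')
text = 'abcabc'

# Search only the part of the string between index 3 and index 6.
m = pattern.search(text, 3, 6)

# With endpos=5 the second 'c' (index 5) is outside the range.
none_found = pattern.search(text, 3, 5)

print(m.pos, m.endpos)  # 3 6
print(m.span())         # (5, 6)
print(m.re is pattern)  # True
print(none_found)       # None
```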

Process matching results

Replace: sub()

Function: replace each match with repl, which can be a string or a function that receives the Match object and returns the replacement string

Parameters: pattern, repl, string, count=0, flags=0, where count is the maximum number of replacements. The default 0 replaces all matches
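
The following sketch shows the three common ways of calling sub(): a string repl, a limited count, and a function repl. The sample text and the helper name double are made up for the example.

```python
import re

text = 'a1b22c333'

# String repl: every run of digits becomes '#'.
all_replaced = re.sub(r'\d+', '#', text)

# count=1: only the first match is replaced.
first_only = re.sub(r'\d+', '#', text, count=1)

# Function repl: called with each Match object, must return a string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, text)

# A string repl can reference groups with \1, \2, ...
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'user@host')

print(all_replaced)  # a#b#c#
print(first_only)    # a#b22c333
print(doubled)       # a2b44c666
print(swapped)       # host@user
```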

Split: split()

Function: split the string at each match and return the substrings in list form

Parameters: pattern, string, maxsplit=0, flags=0, where maxsplit is the maximum number of splits. The default 0 splits at every match
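
A minimal sketch of split() on a made-up string, including the effect of maxsplit and of putting the separator in a capturing group:

```python
import re

text = 'a, b;c,  d'

# Split on a comma or semicolon followed by optional whitespace.
parts = re.split(r'[,;]\s*', text)

# maxsplit=1: split once, the rest of the string stays in the last element.
limited = re.split(r'[,;]\s*', text, maxsplit=1)

# If the separator is a capturing group, the separators are kept in the list.
with_seps = re.split(r'([,;])', 'a,b;c')

print(parts)      # ['a', 'b', 'c', 'd']
print(limited)    # ['a', 'b;c,  d']
print(with_seps)  # ['a', ',', 'b', ';', 'c']
```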

Topics: Python regex
