Special characters
In a pattern string passed to the re module, some characters and character sequences have special meanings.
To match such a character literally, escape it with a backslash [\]; each [\] escapes only the character that immediately follows it.
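Category | Symbol | Meaning |
---|---|---|
Wildcard | . | Any character except a newline (\n) |
Wildcard | \d | Digit |
Wildcard | \D | Non-digit |
Wildcard | \s | Whitespace |
Wildcard | \S | Non-whitespace |
Wildcard | \w | Word character |
Wildcard | \W | Non-word character |
Literal escape | \t | Tab |
Literal escape | \n | Newline (line feed) |
Literal escape | \r | Carriage return |
Literal escape | \f | Form feed |
Literal escape | \a | Bell (alarm) |
Literal escape | \e | Escape in some regex flavors; Python's re rejects it, so use \x1b instead |
Boundary match | ^ | Start of string (start of each line with re.M) |
Boundary match | $ | End of string, or just before a newline that ends the string (end of each line with re.M) |
Boundary match | \A | Start of string |
Boundary match | \Z | End of string |
Boundary match | \b | Word boundary |
Boundary match | \B | Non-word boundary |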
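- Mnemonics: d = digit, s = space, w = word, b = boundary
- Whitespace characters include: space, tab, carriage return, form feed, line feed, and vertical tab
- Word characters include: digits, letters, and the underscore. For str patterns these are Unicode-wide by default; with re.ASCII only 0-9, a-z, A-Z, and _ match
- Without re.MULTILINE, [^] behaves like [\A]; [$] behaves like [\Z] except that [$] also matches just before a newline that ends the string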
```python
import re

def test(pattern):
    string = 'App 8.0\n'
    a = re.findall(pattern, string)
    print(a)

# Raw strings (r'...') pass backslashes through to the regex engine unchanged.
test(r'\d')     # ['8', '0']
test(r'\D')     # ['A', 'p', 'p', ' ', '.', '\n']
test(r'\s')     # [' ', '\n']
test(r'\S')     # ['A', 'p', 'p', '8', '.', '0']
test(r'\w')     # ['A', 'p', 'p', '8', '0']
test(r'\W')     # [' ', '.', '\n']
test(r'.')      # ['A', 'p', 'p', ' ', '8', '.', '0']
test(r'^A')     # ['A']
test(r'\n$')    # ['\n']
test(r'\AA')    # ['A']
test(r'\n\Z')   # ['\n']
# In a normal string literal '\b' is a backspace character, so the pattern
# must be written as 'p\\b' or, more simply, as the raw string r'p\b'.
test(r'p\b')    # ['p']       a p with a word boundary on its right
test(r'\Bp')    # ['p', 'p']  a p with no word boundary on its left
test(r'\Bp\B')  # ['p']       a p with no word boundary on either side
```
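The behavior of [$] and [\Z] differs only when the string ends with a newline. The following is a minimal sketch that reuses the test string from the example above; the variable names are illustrative.

```python
import re

text = 'App 8.0\n'

# `$` matches at the very end and also just before a trailing newline.
dollar = re.findall(r'0$', text)       # ['0']

# `\Z` matches only at the very end of the string.
end_only = re.findall(r'0\Z', text)    # []

# `^` and `\A` agree as long as re.MULTILINE is not set.
caret = re.findall(r'^A', text)        # ['A']
start_only = re.findall(r'\AA', text)  # ['A']

print(dollar, end_only, caret, start_only)
```

When a pattern must reject a trailing newline (for example when validating input), [\Z] or fullmatch() is usually the safer choice.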
Escape special characters
re.escape(pattern)
Function: adds [\] before every character that has a special meaning in a regular expression (before Python 3.7, before every character other than ASCII letters, digits, and underscores)
```python
>>> a = re.escape('1-2')
>>> print(a)
1\-2
```
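A common use of re.escape() is to search for text that comes from outside the program and may contain metacharacters. The sketch below assumes a hypothetical literal string `literal`; it compares the unescaped and escaped patterns on the same input.

```python
import re

# Hypothetical text to search for; it contains the metacharacters . ( )
literal = '8.0 (beta)'
haystack = '8x0 beta, 8.0 (beta)'

# Unescaped: '.' is a wildcard and '(beta)' is a capturing group,
# so findall returns the group text from the wrong place.
unescaped = re.findall(literal, haystack)            # ['beta']

# Escaped: every metacharacter is matched literally.
escaped = re.findall(re.escape(literal), haystack)   # ['8.0 (beta)']

print(unescaped)
print(escaped)
```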
Flags
Default behavior (flags=0) | Flag that changes it | Abbreviation | Inline flag |
---|---|---|---|
\w, \W, \b, \B, \d, \D, \s, \S match Unicode characters | re.ASCII | re.A | (?a) |
Case sensitive | re.IGNORECASE | re.I | (?i) |
Single-line mode: [^] and [$] match only at the start and end of the string | re.MULTILINE | re.M | (?m) |
[.] does not match a newline character | re.DOTALL | re.S | (?s) |
Whitespace and # in the pattern are matched literally, so the expression cannot be spread over several commented lines | re.VERBOSE | re.X | (?x) |
Debug information is not displayed | re.DEBUG | - | - |
```python
import re

re.findall(r'\w', 'España')          # ['E', 's', 'p', 'a', 'ñ', 'a']
re.findall(r'\w', 'España', re.A)    # ['E', 's', 'p', 'a', 'a']

re.findall('a', 'Abc')               # []
re.findall('a', 'Abc', re.I)         # ['A']

re.findall('^a', 'abc\nabc')         # ['a']
re.findall('^a', 'abc\nabc', re.M)   # ['a', 'a']
re.findall('c$', 'abc\nabc')         # ['c']
re.findall('c$', 'abc\nabc', re.M)   # ['c', 'c']

re.findall('.', 'abc\n')             # ['a', 'b', 'c']
re.findall('.', 'abc\n', re.S)       # ['a', 'b', 'c', '\n']

pattern = """a  # Letter 1
b  # Letter 2
c  # Letter 3
"""
re.findall(pattern, 'www.abc.com')         # []
re.findall(pattern, 'www.abc.com', re.X)   # ['abc']

# re.DEBUG prints the compiled program before the result; the exact
# listing depends on the Python version, for example:
#   ANY None
#    0. INFO 4 0b0 1 1 (to 5)
#    5: ANY
#    6. SUCCESS
re.findall('.', 'abc\n', re.DEBUG)   # ['a', 'b', 'c']
```
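The inline flags in the last column of the table can replace the flags argument. The sketch below compares both spellings; the sample strings are illustrative.

```python
import re

# (?i) at the start of the pattern is equivalent to passing re.I.
inline = re.findall(r'(?i)a', 'Abc')        # ['A']
by_flag = re.findall(r'a', 'Abc', re.I)     # ['A']

# Several flags combine with | as an argument, or as letters when inline.
combined = re.findall(r'^a.', 'Ab\na\n', re.I | re.M | re.S)
combined_inline = re.findall(r'(?ims)^a.', 'Ab\na\n')

print(inline, by_flag)
print(combined, combined_inline)   # ['Ab', 'a\n'] twice
```

Inline flags are handy when the pattern is stored as a plain string (for example in a config file) and the caller cannot pass a flags argument.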
Expression
Expression | Effect |
---|---|
\| | Alternation: matches either the expression on its left or the one on its right |
[expr] | Bracket expression (character class): matches a single character |
(expr) | Group (subexpression): matches a string |
{n,m}, {n,}, {,m} | The preceding character or group appears between n and m times (at least n, at most m) |
{n} | The preceding character or group appears exactly n times |
Or [|]
```python
re.findall('a|b', 'abc')   # ['a', 'b']
```
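Alternation applies to whole expressions, not only to single characters, and it binds more loosely than concatenation. A short sketch with illustrative sample text:

```python
import re

text = 'abd acd cd'

# `|` has the lowest precedence: this reads as "ab" OR "cd".
whole = re.findall(r'ab|cd', text)        # ['ab', 'cd', 'cd']

# A non-capturing group (?:...) limits the alternation to "b or c".
scoped = re.findall(r'a(?:b|c)d', text)   # ['abd', 'acd']

print(whole)
print(scoped)
```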
Bracket expression []
- Enumeration. [abc]: one of the characters a, b, c
- Range. [a-z]: one character in the range a to z
- Negation. [^abc]: one character other than a, b, c
```python
re.findall('[ab]', 'abc')     # ['a', 'b']
re.findall('[a-c]', 'abc')    # ['a', 'b', 'c']
re.findall('[^ab]', 'abc')    # ['c']
re.findall('[^a-b]', 'abc')   # ['c']
```
Group ()
Difference between group and ordinary string
When the pattern contains exactly one group, findall() returns the text matched by that group instead of the whole match
```python
re.findall("ab", "abab")     # ['ab', 'ab']
re.findall("(ab)", "abab")   # ['ab', 'ab']
re.findall("(a)b", "abab")   # ['a', 'a']
re.findall("a(b)", "abab")   # ['b', 'b']
```
When the pattern contains several groups, each match is returned as a tuple with one element per group
```python
re.findall("(a)(b)", "abab")         # [('a', 'b'), ('a', 'b')]
re.findall("(a)(b)(a)", "abab")      # [('a', 'b', 'a')]
re.findall("(a)(b)(a)(b)", "abab")   # [('a', 'b', 'a', 'b')]
```
Naming and referencing matching results
When a segment repeats within the string, a backreference lets the expression reuse the text already matched by an earlier group
Syntax | Effect |
---|---|
(exp) | Captures the match and numbers it automatically: 1, 2, ... |
\i | References the i-th captured result |
(?P<name>exp) | Captures the match under the name [name] |
(?P=name) | References the result captured under that name |
```python
# Default numbering
re.findall(r"(a)(b)\1", "abab")   # [('a', 'b')]  equivalent to the next line
re.findall("(a)(b)a", "abab")     # [('a', 'b')]
re.findall(r"(a)(b)\2", "abab")   # []            equivalent to the next line
re.findall("(a)(b)b", "abab")     # []
# Specified name
re.findall("(?P<my_name>a)(b)(?P=my_name)", "abab")   # [('a', 'b')]
```
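Backreferences become useful when the repeated text is not known in advance. As a sketch, the following finds accidentally doubled words; the sample sentence and the group name `word` are illustrative.

```python
import re

text = 'this is is a test test of of the parser'

# (\w+) captures a word; \1 requires the same text again after one space.
doubled = re.findall(r'\b(\w+) \1\b', text)

# The same idea with a named group and a named backreference.
doubled_named = re.findall(r'\b(?P<word>\w+) (?P=word)\b', text)

print(doubled)         # ['is', 'test', 'of']
print(doubled_named)   # ['is', 'test', 'of']
```

The [\b] anchors keep the pattern from matching inside longer words, such as the "is" at the end of "this".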
Position-limiting groups (lookaround)
These groups restrict what must, or must not, appear on either side of the match. The surrounding content is written in the subexpressions and is not included in the match result
[(?<=exp1) or (?<!exp2)] + [main expression] + [(?=exp3) or (?!exp4)]
Subexpression | Condition | Position relative to the match |
---|---|---|
exp1 | Must appear | Left |
exp2 | Must not appear | Left |
exp3 | Must appear | Right |
exp4 | Must not appear | Right |
```python
import re

def test(pattern):
    a = re.findall(pattern, 'abc')
    print(a)

test('(?<=a).+?(?=c)')   # ['b']       b has a on its left and c on its right
test('(?<=a).+?(?!c)')   # ['bc']      bc has a on its left and no c on its right
test('(?<!a).+?(?=c)')   # ['ab']      ab has no a on its left and c on its right
test('(?<!a).+?(?!c)')   # ['a', 'c']  a and c each have no a on the left and no c on the right
```
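A more practical sketch of the same groups, assuming an illustrative string of labeled values: the lookbehind and lookahead select numbers and labels by their neighbors without including those neighbors in the result.

```python
import re

text = 'cost: $42, qty: 7, tax: $3'

# (?<=\$) requires a dollar sign on the left but leaves it out of the match.
prices = re.findall(r'(?<=\$)\d+', text)          # ['42', '3']

# (?<!\$) rejects a dollar sign on the left; \b keeps the match
# from starting in the middle of a number.
non_prices = re.findall(r'\b(?<!\$)\d+', text)    # ['7']

# (?=:) requires a colon on the right but leaves it out of the match.
labels = re.findall(r'\w+(?=:)', text)            # ['cost', 'qty', 'tax']

print(prices, non_prices, labels)
```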
More usage of groups
Classification | Syntax | Effect |
---|---|---|
Non-capturing group | (?:exp) | Matches exp but does not capture the result, so it cannot be referenced later |
Flag scope | (?aiLmsux) | Flags apply to the entire regular expression |
Flag scope | (?imsx-imsx:exp) | Flags are switched on (before -) or off (after -) for the current group only |
Comment | (?#comment) | Explanatory text; matching is not affected |
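The following sketch exercises each row of the table on small illustrative strings. The scoped form (?i:...) assumes Python 3.6 or later.

```python
import re

# With a capturing group, findall returns only the group's text
# (the last repetition of the group).
captured = re.findall(r'(ab)+c', 'ababc')        # ['ab']

# (?:...) groups without capturing, so findall returns the whole match.
uncaptured = re.findall(r'(?:ab)+c', 'ababc')    # ['ababc']

# (?i:...) makes only the enclosed part case-insensitive.
scoped = re.findall(r'(?i:a)b', 'Ab AB ab')      # ['Ab', 'ab']

# (?#...) is a comment and does not affect matching.
commented = re.findall(r'a(?#first letter)b', 'ab')   # ['ab']

print(captured, uncaptured, scoped, commented)
```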
Frequency limit
Quantifiers are greedy by default and match as many characters as possible. To switch to reluctant (non-greedy) mode, add [?] after the quantifier, for example [*?] or [{n,m}?]
Symbol | Number of matches |
---|---|
`?` | 0 to 1 |
`*` | 0 or more |
`+` | 1 or more |
`{n,m}`, `{n,}`, `{,m}` | n to m (at least n, at most m) |
`{n}` | Exactly n |
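The difference between greedy and reluctant matching shows up as soon as a pattern can end in more than one place. A minimal sketch with illustrative input:

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: .* runs to the last possible </b>.
greedy = re.findall(r'<b>.*</b>', html)    # ['<b>one</b><b>two</b>']

# Reluctant: .*? stops at the first possible </b>.
lazy = re.findall(r'<b>.*?</b>', html)     # ['<b>one</b>', '<b>two</b>']

# The same applies to counted quantifiers.
bounded_greedy = re.findall(r'\d{2,3}', '12345')   # ['123', '45']
bounded_lazy = re.findall(r'\d{2,3}?', '12345')    # ['12', '34']

print(greedy, lazy, bounded_greedy, bounded_lazy)
```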
Functions under re module
Typical workflow: compile the pattern, match (over a specified range), return the match result, process the result.
When returning match results, different functions determine which groups are returned, the return form, and the return content (string or position).
Compiling regular expressions
re.compile(pattern, flags=0): compiles a pattern string into a regular expression object so that it can be reused in later calls
```python
>>> a = re.compile('1')    # Compile the regular expression
>>> b = a.findall('123')   # Match a string with the compiled expression and return the result
>>> print(b)
['1']
```
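Compiling pays off when the same expression is applied to many strings. In the sketch below, the pattern name `word` and the sample lines are illustrative; flags are given once at compile time.

```python
import re

# Compile once with a flag, then reuse the pattern object.
word = re.compile(r'[a-z]+', re.IGNORECASE)

lines = ('App 8.0', 'Hello, World', '123')
counts = [len(word.findall(line)) for line in lines]

print(counts)   # [1, 2, 0]
```

The pattern object offers the same functions as the module (findall, search, match, sub, split, and so on) without the pattern and flags arguments.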
Specify matching range
Parameters: pattern, string, flags=0
function | Matching range |
---|---|
match() | Match from the beginning of the string |
fullmatch() | The entire string is required to match the regular expression |
```python
a = re.match('1', '12')
b = re.match('1', '21')
print(a)   # <re.Match object; span=(0, 1), match='1'>
print(b)   # None

a = re.fullmatch('1', '12')
b = re.fullmatch('12', '12')
print(a)   # None
print(b)   # <re.Match object; span=(0, 2), match='12'>
```
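Because these functions return None when nothing matches, results should be checked before use. The sketch below contrasts match(), search(), and fullmatch() on one illustrative string.

```python
import re

text = '21'

at_start = re.match(r'1', text)       # None: '1' is not at index 0
anywhere = re.search(r'1', text)      # found at index 1
entire = re.fullmatch(r'\d+', text)   # the whole string is digits
prefix = re.match(r'\d', text)        # match() anchors only the start

for name, m in [('match', at_start), ('search', anywhere),
                ('fullmatch', entire), ('prefix', prefix)]:
    # A Match object is always truthy, so "is not None" and "if m" agree.
    print(name, m.group() if m is not None else None)
```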
Return matching results
Parameters: pattern, string, flags=0
Function | Matches returned | Return form |
---|---|---|
search() | First | Match object (None if there is no match) |
findall() | All | List of strings (or of tuples when the pattern has several groups) |
finditer() | All | Iterator of Match objects |
```python
pattern = '1'
string = '11'
a = re.search(pattern, string)
b = re.findall(pattern, string)
c = re.finditer(pattern, string)
print(a.group())   # 1
print(b)           # ['1', '1']
for i in c:
    print(i.group())
# 1
# 1
```
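finditer() is the function to reach for when both the text and the position of every match are needed. A sketch with an illustrative string:

```python
import re

text = 'a1b22c333'

first = re.search(r'\d+', text)         # Match object for '1'
everything = re.findall(r'\d+', text)   # plain strings only

# Each item from finditer is a Match object, so positions are available.
located = [(m.group(), m.start()) for m in re.finditer(r'\d+', text)]

print(first.group())   # 1
print(everything)      # ['1', '22', '333']
print(located)         # [('1', 1), ('22', 3), ('333', 6)]
```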
Methods of the Match object returned by search() and match()
Return matching content
Method | Parameter | Matching groups | Return form |
---|---|---|---|
`__getitem__()` | g (required) | The one specified group | String |
group() | [group1], ... | The specified groups | String, or a tuple when several groups are given |
groups() | None | All groups | Tuple |
groupdict() | None | All named groups | Dictionary keyed by group name |
- group(): the parameter defaults to 0, which returns the whole match. A parameter can be a group number or a group name
- groupdict(): only includes groups with custom names
```python
a = re.search('(a)(b)', 'abab')
a.__getitem__(0)   # 'ab'  (usually written a[0])
a.__getitem__(1)   # 'a'
a.__getitem__(2)   # 'b'
a.group()          # 'ab'
a.group(0)         # 'ab'
a.group(1)         # 'a'
a.group(2)         # 'b'
a.group(1, 2)      # ('a', 'b')
a.groups()         # ('a', 'b')
a.groupdict()      # {}

a = re.search('(?P<name1>a)(b)', 'abab')
a.groupdict()      # {'name1': 'a'}

a = re.search('(?P<name1>a)(?P<name2>b)', 'abab')
a.groupdict()      # {'name1': 'a', 'name2': 'b'}
```
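Named groups can be read by name as well as by number. The sketch below assumes the hypothetical group names `major` and `minor` and uses the subscript form, which is available from Python 3.6.

```python
import re

m = re.search(r'(?P<major>\d+)\.(?P<minor>\d+)', 'App 8.0')

by_index = m[1]                     # same as m.group(1)
by_name = m['minor']                # same as m.group('minor')
both = m.group('major', 'minor')    # several groups give a tuple
as_dict = m.groupdict()

print(by_index, by_name, both, as_dict)
```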
Return the match position
Method | Parameter | Returns, for the specified group |
---|---|---|
start() | [group] | Start position |
end() | [group] | End position |
span() | [group] | Tuple of start and end position |
- At most one parameter is accepted. It defaults to 0, which returns the position of the whole match
```python
a = re.search('(ab)(cd)', 'abcd')
a.span()    # (0, 4)
a.span(0)   # (0, 4)
a.span(1)   # (0, 2)
a.span(2)   # (2, 4)
```
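Positions are ordinary string indices, so they can be used to slice the original string. The sketch below also shows the value reported for a group that exists but took no part in the match.

```python
import re

text = 'abcd'
m = re.search(r'(ab)(cd)', text)

# span() returns (start, end), suitable for slicing.
start, end = m.span(2)
same_text = text[start:end] == m.group(2)   # True

positions = (m.start(1), m.end(1), m.start(2), m.end(2))   # (0, 2, 2, 4)

# A group that did not participate in the match reports -1.
unmatched = re.search(r'(a)|(b)', 'b').span(1)   # (-1, -1)

print(same_text, positions, unmatched)
```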
More attributes
Attribute | Meaning |
---|---|
pos | Index in the string at which the search started |
endpos | Index in the string beyond which the search did not go |
re | Regular expression object that produced the match |
string | String that was searched |
lastindex | Integer index of the last matched capturing group (None if no group matched) |
lastgroup | Name of the last matched capturing group (None if it has no name or no group matched) |
```python
a = re.search('c', 'abc')
a.pos      # 0
a.endpos   # 3
a.re       # re.compile('c')
a.string   # 'abc'

a = re.search('(?P<name1>a)(?P<name2>b)', 'abc')
a.lastindex   # 2
a.lastgroup   # 'name2'
```
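The pos and endpos attributes differ from their defaults only when a compiled pattern is searched over part of a string. A sketch with illustrative values:

```python
import re

pattern = re.compile(r'c')

# Pattern.search(string, pos, endpos) restricts the searched range.
m = pattern.search('abcabc', 3, 6)

print(m.pos, m.endpos)    # 3 6
print(m.span())           # (5, 6): the first 'c' at index 2 is skipped
print(m.re is pattern)    # True
print(m.lastindex)        # None: the pattern has no groups

# With alternation, lastindex and lastgroup show which branch matched.
alt = re.search(r'(?P<x>a)|(?P<y>b)', 'b')
print(alt.lastindex, alt.lastgroup)   # 2 y
```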
Process matching results
Replace: sub()
Function: replaces each match with repl, which can be a string or a function that takes a Match object and returns the replacement string
Parameters: pattern, repl, string, count=0, flags=0, where count is the maximum number of replacements; the default 0 replaces all matches
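A minimal sketch of the three common forms of repl, using illustrative input: a plain string, a string with group references, and a function.

```python
import re

text = 'a1b22c333'

# Plain string replacement, all matches and then only the first.
all_replaced = re.sub(r'\d+', '#', text)            # 'a#b#c#'
first_only = re.sub(r'\d+', '#', text, count=1)     # 'a#b22c333'

# \1 and \2 in repl refer to the captured groups.
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'user@host')   # 'host@user'

# A function receives each Match object and returns the replacement.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)   # 'a2b44c666'

print(all_replaced, first_only, swapped, doubled)
```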
Split: split()
Function: splits the string at each match and returns the substrings as a list
Parameters: pattern, string, maxsplit=0, flags=0, where maxsplit is the maximum number of splits; the default 0 splits at every match
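A short sketch of split() on illustrative input, including the effect of maxsplit and of a capturing group in the pattern:

```python
import re

text = 'a, b;c  d'

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r'[,;\s]+', text)                  # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split.
limited = re.split(r'[,;\s]+', text, maxsplit=1)    # ['a', 'b;c  d']

# A capturing group keeps the separators in the result.
with_seps = re.split(r'([,;])', 'a,b;c')            # ['a', ',', 'b', ';', 'c']

print(parts, limited, with_seps)
```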