Day 19: Regular expressions
1, Regular expressions
- Regular expressions are a tool to simplify complex string problems
Matching symbols
1. Introduction to the re module
The re module is a standard-library module that Python provides for regular-expression operations
fullmatch(regular expression, string) - checks whether the regular expression matches the entire string. On success it returns a match object; otherwise the result is None
from re import fullmatch
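Because fullmatch returns either a match object or None, the result is usually tested in a condition. A minimal sketch of that pattern; the helper name is_exact_match is hypothetical and only used for illustration:

```python
from re import fullmatch

def is_exact_match(pattern, text):
    """Return True only when the pattern covers the whole string."""
    return fullmatch(pattern, text) is not None

print(is_exact_match(r'abc', 'abc'))    # True
print(is_exact_match(r'abc', 'abcd'))   # False: the trailing 'd' is not covered
print(is_exact_match(r'a.c', 'abc'))    # True
```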
2. Matching class symbols
Regular expressions are composed of various regular symbols
2.1 ordinary characters - characters that have no special function or special meaning in regular expressions
Ordinary characters represent the symbol itself in regular expressions, such as a-z, A-Z, 0-9, Chinese characters
result=fullmatch(r'abc','abc')
print(result)
2.2 special symbols
1) . - matches any single character (except '\n' by default). Note: one dot matches exactly one character
result = fullmatch(r'a.b', 'a5b')
print(result)
2) \d - matches any digit character
result = fullmatch(r'a\db', 'a3b')
print(result)
3) \s - matches any white space character
White space characters: space, \t, \n
result = fullmatch(r'abc\s123', 'abc 123')
print(result)
4) \D \S - the negated forms of \d and \s
\D - matches any non-digit character
\S - matches any non-white-space character
result = fullmatch(r'abc\D12\S3', 'abc 12a3')
print(result)
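The special symbols above can be compared side by side. This is a small sketch; the test strings are made up for illustration, and the last case is chosen to fail on purpose:

```python
from re import fullmatch

# (pattern, text) pairs; each pattern uses one of the symbols listed above
cases = [
    (r'a.b', 'a5b'),     # . matches any single character
    (r'a\db', 'a3b'),    # \d matches a digit
    (r'a\sb', 'a\tb'),   # \s matches a tab
    (r'a\Db', 'axb'),    # \D matches a non-digit
    (r'a\Sb', 'a b'),    # \S rejects the space, so this case fails
]
results = [fullmatch(pattern, text) is not None for pattern, text in cases]
print(results)  # [True, True, True, True, False]
```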
5) [character set] - matches any character in the character set
Case:
Case 1: all ordinary characters: [xyz12] - match any one of them
Case 2: the set contains a matching symbol that begins with \. The matching symbol keeps its function inside the set; [mn\d] == [mn0123456789]
Case 3: a minus sign between two characters denotes a range from the first character to the second (ordered by code point)
[a-z] - matches any lowercase letter
[a-zA-Z] - matches any letter
[\u4e00-\u9fa5] - matches any Chinese character
result = fullmatch(r'a[xyz]c123', 'axc123')
print(result)
result = fullmatch(r'a[mn\d]b', 'a1b')
print(result)
result = fullmatch(r'1[a-z]2', '1f2')
print(result)
6) [^character set] - matches any character that is not in the character set
result = fullmatch(r'1[^a-z]2', '122')
print(result)
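The character-set cases can be checked one character at a time. A minimal sketch; the helper name in_set is hypothetical:

```python
from re import fullmatch

def in_set(char_set, ch):
    """Return True when the single character ch matches the character set."""
    return fullmatch(char_set, ch) is not None

print(in_set(r'[xyz12]', 'y'))          # True: listed explicitly
print(in_set(r'[mn\d]', '7'))           # True: \d keeps its meaning inside []
print(in_set(r'[a-zA-Z]', 'Q'))         # True: falls in the A-Z range
print(in_set(r'[\u4e00-\u9fa5]', '中'))  # True: code point inside the range
print(in_set(r'[^a-z]', 'k'))           # False: k is excluded by the negated set
```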
2, Matching times
Usage: a matching symbol followed by the number of times it may repeat
# 1. * - zero or more times (any number)
# a*b - b preceded by any number of a
# \d*b - b preceded by any number of digits
result = fullmatch(r'a*b', 'aaab')
print(result)
result = fullmatch(r'\d*b', '48798b')
print(result)
result = fullmatch(r'[xyz]*b', 'xyxzzb')
print(result)
# 2. + - one or more times (at least once)
result = fullmatch(r'a+b', 'aaab')
print(result)
result = fullmatch(r'\d+b', '48798b')
print(result)
result = fullmatch(r'[xyz]+b', 'xyxzzb')
print(result)
# 3. ? - 0 or 1 times
result = fullmatch(r'\d?abc', '8abc')
print(result)
# Write a regular expression that matches any positive integer (0 excluded)
result = fullmatch(r'[+]?[1-9]\d*', '025')
print(result)  # None: the first digit must be 1-9
# 4. {}
# {M,N} - M to N times
# {M,} - at least M times
# {,N} - at most N times
# {N} - exactly N times
result = fullmatch(r'a{3}b', 'aaab')
print(result)
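The four {} forms above can be tried against strings with different numbers of a. A short sketch; the helper name matches is hypothetical:

```python
from re import fullmatch

def matches(pattern, text):
    """Return True when the pattern matches the whole string."""
    return fullmatch(pattern, text) is not None

print(matches(r'a{2,4}b', 'aaab'))     # True: 3 lies within 2 to 4
print(matches(r'a{2,4}b', 'ab'))       # False: only one a
print(matches(r'a{2,}b', 'aaaaaab'))   # True: at least 2
print(matches(r'a{,2}b', 'b'))         # True: zero a is allowed
print(matches(r'a{3}b', 'aab'))        # False: exactly 3 required
```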
3, Greedy and non-greedy
from re import fullmatch, search, findall
# 1. Greedy and non-greedy
# When the number of repetitions is not fixed, matching is either greedy or non-greedy (greedy is the default)
# * + ? {M,N} {M,} {,N} - greedy: match as many times as possible
# *? +? ?? {M,N}? {M,}? {,N}? - non-greedy: match as few times as possible
# Note: in Python, the difference between greedy and non-greedy can show up in functions other than fullmatch (for example search and findall)
result = fullmatch(r'.+b', 'ab See the attachment abjfajfbjifajb')
print(result)
result = search(r'.+?b', 'ab See the attachment abjfajfbjifaj')
print(result)
print('===================================')
with open('./top250.html', 'r', encoding='utf-8') as f:
    res1 = f.read()
result = findall(r'<span class="inq">(.+)</span>', res1)
print(result)
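The difference is easiest to see with search and findall on a short string. The sketch below uses made-up input (the text and html strings are assumptions, not the contents of top250.html):

```python
from re import search, findall

text = 'ab-aab-b'
greedy = search(r'.+b', text).group()    # runs to the last b
lazy = search(r'.+?b', text).group()     # stops at the first b it can
print(greedy)  # ab-aab-b
print(lazy)    # ab

# Two spans on one line: the greedy group swallows everything between
# the first opening tag and the last closing tag.
html = '<span class="inq">first</span><span class="inq">second</span>'
greedy_spans = findall(r'<span class="inq">(.+)</span>', html)
lazy_spans = findall(r'<span class="inq">(.+?)</span>', html)
print(len(greedy_spans))  # 1
print(lazy_spans)         # ['first', 'second']
```

When several tags can share a line, the non-greedy form (.+?) is usually the safer choice for extracting each one separately.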
4, Grouping and branching
from re import fullmatch, findall
# 1. Grouping - ()
# Application scenario 1: enclose part of the regular expression in () so it is treated as a whole
result = fullmatch(r'([a-z]{3}\d{2}){3}', 'ffe21hfj67jki89')
print(result)
# Application scenario 2: repeat - in a regular expression with groups, \N repeats the content matched by the Nth group before it
# e.g. (\d)a\1 matches '3a3' and '9a9'
result = fullmatch(r'(\d)a\1{3}', '3a333')
print(result)
# Application scenario 3: capture - with findall, if the regular expression contains a group, only the content matched by the group is returned
str1 = 'faa=4324=432f hair Joeafah43141'
result = findall(r'[a-z](\d+)', str1)
print(result)
result = findall(r'([a-z]{2})=(\d{2})', str1)
print(result)  # [('aa', '43')]
print('==========================')
# 2. Branch - |
# regular 1|regular 2 - the match succeeds as long as either regular 1 or regular 2 matches
# abc123, abcJKH
# result = fullmatch(r'abc\d{3}|abc[A-Z]{3}', 'abc123')
result = fullmatch(r'abc(\d{3}|[A-Z]{3})', 'abc123')
print(result)
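Groups can also be read back from the match object itself. A small sketch with made-up strings, showing group(), groups(), a \1 back-reference, a branch inside a group, and findall with two groups:

```python
from re import fullmatch, findall

m = fullmatch(r'([a-z]+)(\d+)', 'abc123')
print(m.group(0))   # abc123 (the whole match)
print(m.group(1))   # abc
print(m.groups())   # ('abc', '123')

# \1 must repeat exactly what group 1 matched
same = fullmatch(r'(\d)a\1', '3a3') is not None
different = fullmatch(r'(\d)a\1', '3a9') is not None
print(same, different)  # True False

# The branch is limited to the group, so 'abc' is shared by both options
branch = fullmatch(r'abc(\d{3}|[A-Z]{3})', 'abcJKH').group(1)
print(branch)  # JKH

# With two groups, findall returns a list of tuples
pairs = findall(r'([a-z]+)=(\d+)', 'x=1, yy=22')
print(pairs)   # [('x', '1'), ('yy', '22')]
```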
5, Others
1. Escape symbol - add \ before a symbol that has a special function or special meaning to remove that function, so that it becomes an ordinary symbol
# 'a.b'
result = fullmatch(r'a\.b', 'a.b')
print(result)
result = fullmatch(r'\+?[1-9]\d', '+23')
print(result)
result = fullmatch(r'\(\d{3}\)', '(345)')
print(result)
Another way to remove a symbol's function: a single symbol that has a special function on its own loses it when placed inside square brackets []
result = fullmatch(r'[+]?[1-9]\d*', '23')
print(result)
# ^ is not first and - is last, so both are ordinary characters here
result = fullmatch(r'[ab^-]123', '^123')
print(result)
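When the text to match comes from a variable, escaping every special symbol by hand is error-prone. The re module also provides escape, which is not covered in these notes; the sketch below assumes it is acceptable to use it here, and the sample string is made up:

```python
from re import escape, fullmatch

literal = '1+1=2'                 # contains +, which is a quantifier in a pattern
safe_pattern = escape(literal)    # special symbols are backslash-escaped

unescaped_ok = fullmatch(literal, literal) is not None
escaped_ok = fullmatch(safe_pattern, literal) is not None
print(unescaped_ok)  # False: '1+' means one or more 1, not a literal plus sign
print(escaped_ok)    # True
```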
2. Ignore case: add (?i) at the front of the regular expression
result = fullmatch(r'(?i)abc', 'ABC')
print(result)
3. Single line matching and multi line matching:
# By default, . cannot match '\n' (multi-line mode (?m) changes only how ^ and $ behave, not .)
result = fullmatch(r'a.b', 'a\nb')
print(result)
# Single line matching: . can match '\n' - add (?s) at the front of the regular expression
result = fullmatch(r'(?s)a.b', 'a\nb')
print(result)
result = fullmatch(r'(?si)a.b', 'a\nB')
print(result)
# result.group() returns only the matched text instead of the <re.Match object; span=(0, 3), ...> representation
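The two inline flags affect different symbols: (?s) changes what . can match, while (?m) changes where ^ and $ can match. A short sketch with made-up strings:

```python
from re import fullmatch, findall

# (?s): the dot may now match a newline
default_dot = fullmatch(r'a.b', 'a\nb')
single_line = fullmatch(r'(?s)a.b', 'a\nb')
print(default_dot)                  # None
print(repr(single_line.group()))    # 'a\nb'

# (?m): ^ and $ match at the start and end of every line
default_anchor = findall(r'^\w+$', 'one\ntwo')
multi_line = findall(r'(?m)^\w+$', 'one\ntwo')
print(default_anchor)  # []
print(multi_line)      # ['one', 'two']
```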
4. Common functions in the re module
(commonly used) fullmatch(regular, string) - checks whether the whole string matches the regular expression (exact match). Returns a match object on success and None on failure
match(regular, string) - matches at the beginning of the string. Returns a match object on success and None on failure
search(regular, string) - finds the first substring that matches the regular expression. Returns the match object for that substring, or None if nothing is found
(commonly used) findall(regular, string) - gets all substrings that match the regular expression and returns a list. The elements of the list are strings or tuples
finditer(regular, string) - gets all substrings that match the regular expression and returns an iterator. The elements of the iterator are the match objects for those substrings
(commonly used) split(regular, string) - cuts the string at every substring that matches the regular expression and returns a list. The elements of the list are strings
(commonly used) sub(regular, string 1, string 2) - replaces every substring of string 2 that matches the regular expression with string 1 and returns the new string
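The functions listed above can be compared on one input. A minimal sketch; the sample string is made up for illustration:

```python
from re import match, search, findall, finditer, split, sub

text = 'a1b22c333'

first = match(r'[a-z]\d', text).group()              # anchored at the start
double = search(r'\d{2}', text).group()              # first place two digits meet
numbers = findall(r'\d+', text)                      # every run of digits
spans = [m.span() for m in finditer(r'\d+', text)]   # positions of those runs
pieces = split(r'\d+', text)                         # digits act as cut points
masked = sub(r'\d+', '*', text)                      # each run replaced by *

print(first)    # a1
print(double)   # 22
print(numbers)  # ['1', '22', '333']
print(spans)    # [(1, 2), (3, 5), (6, 9)]
print(pieces)   # ['a', 'b', 'c', '']
print(masked)   # a*b*c*
```

Note that split keeps an empty string at the end because the input finishes with a cut point.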