Systematic learning of regular expressions
Concept:
- Regular expressions are used to find strings that conform to a pattern. A pattern describes what characters to match, how many times they repeat, where they occur, and what additional constraints apply. Each of these aspects corresponds to a piece of regular expression syntax.
1. Seven Realms of regular expressions
# Seven Realms of regular expressions
import re

text = "Uncle Mai's height:178,Weight: 168, student number: 123456, password:9527"

# Level 1 - a fixed string
print(re.findall(r'168', text))  # findall returns a list: ['168']

# Level 2 - a certain class of character
print(re.findall(r'\d', text))  # \d matches a single digit
print(re.findall(r'\D', text))  # \D matches a single non-digit character
print(re.findall(r'\w', text))  # \w matches a letter, digit, or underscore (no punctuation or spaces)
print(re.findall(r'[1-5]', text))  # [] matches any one character from the set
print(re.findall(r'[Gao Chongfei]', text))  # any one of the characters listed inside []

# Level 3 - repeating a certain class of character
print(re.findall(r'\d+', text))  # + applies to the preceding item, 1 or more: ['178', '168', '123456', '9527']
print(re.findall(r'\d?', text))  # ? means 0 or 1: each digit separately, plus '' at every non-digit position
print(re.findall(r'\d*', text))  # * means 0 or more: each run of digits, plus '' wherever no digit matches
print(re.findall(r'\d{4,}', text))  # {} gives a repetition range; {4,} means at least 4 times: ['123456', '9527']

# Level 4 - combining the above
text = "Uncle Mai's telephone number is 18812345678. He also has a telephone number of 18887654321. His favorite number is 01234567891 and his landline is 0571-52152166"
print(re.findall(r"\d{3,4}-\d{7,8}", text))  # ['0571-52152166']: area code of 3 or 4 digits, number of 7 or 8 digits

# Level 5 - multiple alternatives
# Requirement: find the mobile phone numbers or landline numbers (same text as level 4)
pattern = r"\d{3,4}-\d{7,8}|1\d{10}"  # | means "or": either alternative may match
print(re.findall(pattern, text))

# Level 6 - restricting position
# Requirement: the mobile phone number at the beginning of the sentence, or a landline number
text = '18812345678,He also has a telephone number of 18887654321, his favorite number is 01234567891, and his landline is 0571-52152166'
pattern = r"^1\d{10}|\d{4}-\d{8}"  # ^ anchors to the start of the string; here it applies only to the first alternative
print(re.findall(pattern, text))

# Level 7 - internal constraints
# Requirement: find strings made of three letters repeated, such as barbar and carcar
text = 'barbar carcar harhel'
print(re.findall(r'(\w{3})(\1)', text))  # () defines a group; \1 must equal what group 1 matched (\2 would refer to group 2)
# Because the pattern has two groups, each result is a tuple with two parts
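Two details in levels 6 and 7 are easy to miss, and the short sketch below illustrates them. First, | has lower precedence than ^, so the anchor applies only to the first alternative unless the alternatives are wrapped in a non-capturing group (?:...). Second, when a pattern contains a group, re.findall returns the group contents rather than the whole match. The sample strings and variable names here are made up for illustration.

```python
import re

text = "18812345678, landline 0571-52152166"

# ^ binds only to the first alternative, so the landline is still found
# even though it is not at the start of the string.
loose = re.findall(r"^1\d{10}|\d{4}-\d{8}", text)

# A non-capturing group (?:...) makes ^ apply to both alternatives.
strict = re.findall(r"^(?:1\d{10}|\d{4}-\d{8})", text)

print(loose)   # ['18812345678', '0571-52152166']
print(strict)  # ['18812345678']

# With a capturing group, findall returns the group, not the whole match.
# finditer with m.group(0) yields the full matched text instead.
words = "barbar carcar harhel"
pairs = re.findall(r"(\w{3})\1", words)
whole = [m.group(0) for m in re.finditer(r"(\w{3})\1", words)]

print(pairs)  # ['bar', 'car']
print(whole)  # ['barbar', 'carcar']
```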
2. Steps to write regular expressions
How do you write a regular expression? I have summarized the process in several steps. However complicated the pattern is, working through these steps in order will basically get you there.
We again take a landline telephone number with an extension number, such as 0571-88776655-9527, as the example to demonstrate the steps:
- Determine how many sub-patterns the pattern contains
It contains three sub-patterns: 0571-88776655-9527. The three sub-patterns are joined by the fixed character -.
- Decide the character class of each part
All three sub-patterns consist of digits, so each can be written as \d. The pattern so far is: \d-\d-\d
- Decide how each sub-pattern repeats
The first sub-pattern repeats 3 to 4 times, because some area codes, such as 010 and 021 for the municipalities directly under the central government, have only three digits.
The second sub-pattern repeats 7 to 8 times, because some areas have only 7-digit telephone numbers.
The third sub-pattern repeats 3 to 4 times.
With the repetition counts added, the pattern becomes: \d{3,4}-\d{7,8}-\d{3,4}
However, some landlines do not have extension numbers, so we use the or operator | to support both forms:
\d{3,4}-\d{7,8}-\d{3,4}|\d{3,4}-\d{7,8}
- Are there external position restrictions?
No.
- Are there internal constraints?
No.
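# Requirement: match a landline number with an extension number
tex = "0571-88776655-9527.dfvgai;;sdbv;asdv;asv;iasvi;cav"
pater = r"\d{4}-\d{8}-\d{4}"  # raw string, so the backslashes reach the regex engine unchanged
result = re.findall(pater, tex)
print(result)  # ['0571-88776655-9527']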
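The steps above can be sketched in code. Instead of writing the pattern twice and joining the copies with |, this version marks the extension as optional with (?:...)?. That is a swapped-in alternative to the pattern built in the steps, not the one written there, and the sample text and variable names are assumptions for illustration.

```python
import re

# Steps 1-3: three runs of digits joined by "-", each with a repetition count.
area = r"\d{3,4}"       # area code: 3 or 4 digits
number = r"\d{7,8}"     # telephone number: 7 or 8 digits
extension = r"\d{3,4}"  # extension number: 3 or 4 digits

# The extension, together with its leading "-", is optional.
pattern = rf"{area}-{number}(?:-{extension})?"

sample = "Office: 0571-88776655-9527, home: 010-6655443."
found = re.findall(pattern, sample)
print(found)  # ['0571-88776655-9527', '010-6655443']
```

Because the group is non-capturing, findall still returns the whole matched number rather than just the extension.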
[^3-8] is a negated class: it matches any single character except the digits 3 through 8.
abc{8}: only c repeats 8 times; (abc){8}: the whole group abc repeats 8 times.
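A small sketch can make these two notes concrete. It uses {2} rather than {8} only to keep the test strings short; the inputs and variable names are made up for illustration.

```python
import re

# [^3-8] matches any single character that is NOT a digit from 3 to 8.
kept = re.findall(r"[^3-8]", "a1b5c9")
print(kept)  # ['a', '1', 'b', 'c', '9']

# In abc{2} the quantifier applies only to "c";
# in (abc){2} it applies to the whole group.
only_c = re.fullmatch(r"abc{2}", "abcc") is not None
whole_group = re.fullmatch(r"(abc){2}", "abcabc") is not None
mismatch = re.fullmatch(r"abc{2}", "abcabc") is not None
print(only_c, whole_group, mismatch)  # True True False
```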
4. Usage of Python's regular expression module re
Python's re module is relatively simple. Its main functions are:
- re.search(): finds text that matches the pattern anywhere in the string. It returns only the first match, as a Match object, so it is useful for checking whether the text contains such a match at all
- re.match(): the same as search, but the match must start at the beginning of the string (like search, it returns only one Match object, which provides many methods and attributes)
- re.findall(): returns a list of all matching strings
- re.finditer(): returns an iterator over all matches, each as a Match object
text = "Uncle Mai's telephone number is 18812345678. He also has a telephone number of 188876543210581-52152166 His favorite number is 01234567891 and his landline is 0571-52152166"
it = re.finditer(r"(\d{4})-(\d{8})", text)  # returns an iterator
for m in it:
    print(m)  # each item is a Match object
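The difference between search and match, and what a Match object exposes, may be easier to see in a short sketch. The sample text is made up for illustration; group() and span() are standard Match methods.

```python
import re

text = "landline 0571-52152166, mobile 18812345678"
pattern = r"(\d{4})-(\d{8})"

# match only succeeds at the start of the string, so it fails here.
at_start = re.match(pattern, text)
print(at_start)  # None

# search scans the string and returns the first match as a Match object.
m = re.search(pattern, text)
print(m.group(0))  # 0571-52152166 (the whole match)
print(m.group(1))  # 0571 (first group)
print(m.group(2))  # 52152166 (second group)
print(m.span())    # (9, 22), the start and end index of the match
```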
- re.sub(): replaces the matching text and returns the text after replacement
- re.subn(): replaces the matching text and returns both the new text and the number of replacements
text = "aBc,dafaa Abc ABC"
# The pattern to match comes first, the replacement text second
result = re.sub(r"abc", "***", text, flags=re.I)
print(result)  # ***,dafaa *** ***
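For comparison, here is a minimal sketch of re.subn on the same input. The variable names are illustrative.

```python
import re

text = "aBc,dafaa Abc ABC"

# subn returns a (new_text, count) tuple instead of only the new text.
new_text, count = re.subn(r"abc", "***", text, flags=re.I)
print(new_text)  # ***,dafaa *** ***
print(count)     # 3
```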
- re.split(): splits the original string, using the text that matches the expression as the separator
text = "aBc,dafaa Abc ABC"
result = re.split(r"\s*[,;/]\s*", text)  # split on , ; or / with optional surrounding whitespace
print(result)  # ['aBc', 'dafaa Abc ABC']
- re.compile(): compiles a regular expression into a pattern object that can be stored, passed around, and reused later
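As a rough sketch of why compiling is convenient, the example below compiles a landline pattern once and reuses it across several strings. The list of lines and the name landline are assumptions made for illustration.

```python
import re

# Compile once, then reuse the pattern object's own methods.
landline = re.compile(r"\d{3,4}-\d{7,8}")

lines = [
    "call 0571-52152166 today",
    "no number here",
    "fax 010-6655443",
]
hits = [landline.findall(line) for line in lines]
print(hits)  # [['0571-52152166'], [], ['010-6655443']]

# A compiled pattern offers search, match, findall, finditer, sub, and split.
masked = landline.sub("<phone>", lines[0])
print(masked)  # call <phone> today
```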