Systematic learning of regular expressions

Posted by Goon on Thu, 13 Jan 2022 13:35:58 +0100

Concept:

  • Regular expressions find strings that conform to a pattern. A pattern describes four things: which characters to match, how many times they repeat, where they appear, and what additional constraints apply. Each of these corresponds to a piece of regular expression syntax.

1. Seven Realms of regular expressions

# Level 1 - fixed string
import re

text = "Uncle Mai's height:178,Weight: 168, student number: 123456, password:9527"
print(re.findall(r'168', text))  # findall returns a list of all matches

# Level 2 - a certain type of character

print(re.findall(r'\d', text))  # \d matches a single digit
print(re.findall(r'\D', text))  # \D matches a single non-digit character
print(re.findall(r'\w', text))  # \w matches a single word character (letter, digit, or underscore)
print(re.findall(r'[1-5]', text))  # [] matches any one character from the set, here 1 to 5
print(re.findall(r'[Gao Chongfei]', text))  # [] matches any one of the listed characters, not the whole name

# Level 3 - repeating a certain type of character

text = "Uncle Mai's height:178,Weight: 168, student number: 123456, password:9527"
print(re.findall(r'\d+', text))  # + applies to the preceding item: one or more digits, giving ['178', '168', '123456', '9527']
print(re.findall(r'\d?', text))  # ? means 0 or 1: the result holds single digits, plus an empty string '' wherever no digit is present
print(re.findall(r'\d*', text))  # * means 0 or more: the result holds whole digit runs, plus an empty string '' wherever no digit is present
print(re.findall(r'\d{4,}', text))  # {} sets how many times the preceding item repeats, as a range; {4,} means at least 4 times, giving ['123456', '9527']
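
The empty strings produced by ? and * are easier to follow on a short input. The following is a minimal sketch, assuming Python 3.7 or later (older versions treat empty matches slightly differently); the `sample` string is made up for illustration.

```python
import re

# Hypothetical short input, chosen so the matches are easy to follow.
sample = 'a12b'

one_or_more = re.findall(r'\d+', sample)   # only non-empty runs of digits
zero_or_one = re.findall(r'\d?', sample)   # single digits and empty matches
zero_or_more = re.findall(r'\d*', sample)  # digit runs and empty matches

print(one_or_more)
print(zero_or_one)
print(zero_or_more)

# Filtering out the empty strings leaves only the real matches.
non_empty = [m for m in zero_or_more if m]
print(non_empty)
```

In practice + is usually the better choice when at least one character is required, because it never produces empty matches.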

# Level 4 - combining the previous levels

text = "Uncle Mai's telephone number is 18812345678. He also has a telephone number of 18887654321. His favorite number is 01234567891 and his landline is 0571-52152166"
print(re.findall(r"\d{3,4}-\d{7,8}", text))  # Output: ['0571-52152166']; the area code matches three or four digits and the number matches seven or eight digits

# Level 5 - multiple alternatives
#
# Requirement: find the mobile phone numbers or landline numbers

text = "Uncle Mai's telephone number is 18812345678. He also has a telephone number of 18887654321. His favorite number is 01234567891 and his landline is 0571-52152166"
pattern = r"\d{3,4}-\d{7,8}|1\d{10}"  # | means OR: either alternative may match
print(re.findall(pattern, text))
#
# Level 6 - restricting the position
#
# Requirement: find a mobile phone number or landline number at the beginning of the sentence

text = '18812345678,He also has a telephone number of 18887654321, his favorite number is 01234567891, and his landline is 0571-52152166'
pattern = r"^1\d{10}|^\d{4}-\d{8}"  # ^ anchors the match to the start of the string; | binds more loosely than ^, so each alternative needs its own ^
print(re.findall(pattern, text))
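
One detail of this level is worth checking by hand: | has the lowest precedence in a pattern, so a single leading ^ only anchors the first alternative. The sketch below compares the two spellings on a shortened, made-up version of the text above.

```python
import re

# Hypothetical shortened text: a mobile number at the start, a landline later.
text = '18812345678, and his landline is 0571-52152166'

# ^ belongs only to the first alternative, so a landline anywhere still matches.
loose = re.findall(r'^1\d{10}|\d{4}-\d{8}', text)

# Repeating ^ in each alternative anchors both to the start of the string.
strict = re.findall(r'^1\d{10}|^\d{4}-\d{8}', text)

print(loose)   # ['18812345678', '0571-52152166']
print(strict)  # ['18812345678']
```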


# Level 7 - internal constraints
#
# Requirement: find strings made of three word characters repeated twice, such as barbar and carcar

text = 'barbar carcar harhel'
print(re.findall(r'(\w{3})(\1)', text))  # () defines a group; \1 is a backreference meaning "the same text the first group matched" (\2 would refer to the second group)
# The pattern has two groups, so findall returns one tuple of two parts per match
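
Because findall returns tuples of groups when a pattern contains groups, the full matched text has to be recovered another way. One possible approach, sketched here with re.finditer and Match.group(0), is:

```python
import re

text = 'barbar carcar harhel'

# With groups in the pattern, findall returns one tuple per match.
pairs = re.findall(r'(\w{3})(\1)', text)
print(pairs)  # [('bar', 'bar'), ('car', 'car')]

# finditer yields Match objects; group(0) is the entire matched text.
whole = [m.group(0) for m in re.finditer(r'(\w{3})\1', text)]
print(whole)  # ['barbar', 'carcar']
```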

2. Steps to write regular expressions

How do you write a regular expression? I have summarized several steps. However complicated the pattern, these steps will basically get you there.

We again take a landline telephone number with an extension, such as 0571-88776655-9527, as the example to demonstrate the steps:

  • Determine how many sub-patterns the pattern contains

    It contains three sub-patterns: 0571, 88776655, and 9527, joined by the fixed character -.

  • Determine the character class of each part

    All three sub-patterns are digits, so each can be written as \d. The pattern so far is: \d-\d-\d

  • Determine how each sub-pattern repeats

    The first sub-pattern repeats 3 to 4 times, because some area codes, such as 010 and 021 for the municipalities directly under the central government, have only three digits

    The second sub-pattern repeats 7 to 8 times, because some areas have only 7-digit telephone numbers

    The third sub-pattern repeats 3 to 4 times

    With the repetition counts added, the pattern becomes: \d{3,4}-\d{7,8}-\d{3,4}

    However, some landlines do not have extension numbers, so we use the or operator to support both:

    \d{3,4}-\d{7,8}-\d{3,4}|\d{3,4}-\d{7,8}

  • Are there external position restrictions?

    No.

  • Are there internal constraints?

    No.

# Requirement: match a landline number with an extension
text = "0571-88776655-9527.dfvgai;;sdbv;asdv;asv;iasvi;cav"
pattern = r"\d{4}-\d{8}-\d{4}"
result = re.findall(pattern, text)
print(result)
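
The alternation pattern derived in the steps can be tried out as follows. This is a minimal sketch; the sample text is hypothetical and contains one number with an extension and one without. It also shows why the longer alternative is listed first: the regex engine takes the first alternative that succeeds.

```python
import re

# Hypothetical sample text with one number of each kind.
text = 'Office: 0571-88776655-9527, home: 010-6655443.'

# Pattern from the steps: the form with an extension first, then without.
with_ext_first = re.findall(r'\d{3,4}-\d{7,8}-\d{3,4}|\d{3,4}-\d{7,8}', text)
print(with_ext_first)  # ['0571-88776655-9527', '010-6655443']

# With the order reversed, the shorter alternative wins and the extension is lost.
short_first = re.findall(r'\d{3,4}-\d{7,8}|\d{3,4}-\d{7,8}-\d{3,4}', text)
print(short_first)  # ['0571-88776655', '010-6655443']
```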

[^3-8] is a negated character class: it matches any single character except 3 through 8.

abc{8}: only c repeats 8 times; (abc){8}: the whole group abc repeats 8 times.
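
Both notes can be checked with a short sketch. The input strings below are made up for illustration, and re.fullmatch is used so that the whole string has to match the pattern.

```python
import re

# [^3-8] matches any single character that is NOT in the range 3-8.
outside_range = re.findall(r'[^3-8]', '1234589')
print(outside_range)  # ['1', '2', '9']

# abc{8}: the quantifier applies only to 'c'.
only_c = bool(re.fullmatch(r'abc{8}', 'ab' + 'c' * 8))
only_c_on_groups = bool(re.fullmatch(r'abc{8}', 'abc' * 8))
print(only_c, only_c_on_groups)  # True False

# (abc){8}: the quantifier applies to the whole group 'abc'.
whole_group = bool(re.fullmatch(r'(abc){8}', 'abc' * 8))
print(whole_group)  # True
```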

4. Usage of Python regular module re

Python's re module is relatively simple. Its main functions are:

  • re.search(): scans the string for the pattern and returns a Match object for the first match only (or None if there is none); useful for checking whether a pattern occurs anywhere in a text
  • re.match(): like search, but the match must start at the beginning of the string (it also returns a single Match object, which has many methods and attributes)
  • re.findall(): returns a list of all matching strings
  • re.finditer(): returns an iterator over all matches, each as a Match object
text = "Uncle Mai's telephone number is 18812345678. He also has a telephone number of 188876543210581-52152166 His favorite number is 01234567891 and his landline is 0571-52152166"
it = re.finditer(r"(\d{4})-(\d{8})", text)
for m in it:
    print(m)  # finditer returns an iterator; each m is a Match object
  • re.sub(): replaces the matching strings and returns the text after replacement

  • re.subn(): replaces the matching strings and returns a tuple of the new text and the number of replacements

text = "aBc,dafaa Abc ABC"
result = re.sub(r"abc", "***", text, flags=re.I)  # Output: ***,dafaa *** ***; the pattern comes first, then the replacement, then the text
print(result)
  • re.split(): splits the original string, using the substrings that match the expression as separators

    text = "aBc,dafaa Abc ABC"
    result = re.split(r"\s*[,;/]\s*", text)  # Output: ['aBc', 'dafaa Abc ABC']
    print(result)

  • re.compile(): compiles a regular expression into a pattern object that can be stored, passed around, and reused later
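
The functions in the list that have no example yet (search, match, subn, and compile) can be sketched as follows. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing two landline numbers.
text = 'landline 0571-52152166, backup 0581-52152166'

# search scans the whole string and returns the first Match (or None).
first = re.search(r'(\d{4})-(\d{8})', text)
print(first.group(0), first.group(1), first.group(2))  # 0571-52152166 0571 52152166

# match only succeeds if the pattern matches at the very beginning.
at_start = re.match(r'\d{4}-\d{8}', text)
print(at_start)  # None, because the text starts with 'landline'

# subn returns a tuple: (new text, number of replacements).
replaced = re.subn(r'abc', '***', 'aBc,dafaa Abc ABC', flags=re.I)
print(replaced)  # ('***,dafaa *** ***', 3)

# compile builds a reusable pattern object that offers the same methods.
phone = re.compile(r'\d{4}-\d{8}')
all_numbers = phone.findall(text)
print(all_numbers)  # ['0571-52152166', '0581-52152166']
```

Compiling is mostly a convenience when the same pattern is used in several places, since the pattern object can be given a name and passed to other functions.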

Topics: Python