[Python task oriented] text content matching (regular expression, re module)

Posted by psyion on Tue, 11 Jan 2022 11:46:50 +0100

[Python] advanced notes Part 1: file reading and writing and regular expressions
For the full range of navigation, see: Python tutorial collation

🍀 Broken thoughts 🍀
Hello, Mina sang, this is Yang Si'er studying in the UK. The keywords of my blog focus on algorithms, robots, artificial intelligence, mathematics and so on. Let's pay attention to it and keep high-quality output.
🌸 Nagging QQ group: Rabbit Ji's magic workshop (942848525)
⭐ Station B account number: Yang Si'er is also practicing immortality in science today (please pay attention to the UP main cross station)

🌟 regular expression

⭐ Overview

  • learning motivation
    1. Text data processing has become one of the common programming tasks
    2. The search, location and extraction of text content are logically complex work
    3. In order to solve the above problems quickly and conveniently, regular expression technology is produced
  • definition
    That is, the high-level matching pattern of text. Its essence is a string composed of a series of characters and special symbols, which is a regular expression.
  • principle
    Through ordinary characters and characters with specific meaning, a string is formed to describe certain string rules, such as repetition, position, etc., to express a certain kind of specific string, and then match.
  • Learning objectives
    1. Master regular expression metacharacters
    2. Be able to read common regular expressions and edit simple regular rules
    3. Be able to skillfully use re module to operate regular expressions

⭐ Use of metacharacters

categoryMetacharacter
Match character. [...] [^...] \d \D \w \W \s \S
Match repeat* + ? {n} {m,n}
Match location^ $ \b \B
other| () \

Ordinary character
Matching rule: each ordinary character matches its corresponding character

e.g.
In : re.findall('ab',"abcdefabcd")
Out: ['ab', 'ab']

Note: regular expressions can also match Chinese in python

Or relationship
Metacharacter:|
Matching rule: match any regular expression on both sides

e.g.
In : re.findall('com|cn',"www.baidu.com/www.tmooc.cn")
Out: ['com', 'cn']

Match single character
Metacharacter:
Matching rule: matches any character except line feed

e.g.
In : re.findall('Zhang.Abundant',"Zhang Sanfeng,Airpp ,Wu Feng Zhang")
Out: ['Zhang Sanfeng', 'Airpp ', 'Wu Feng Zhang']

Match character set
Metacharacter: [character set]
Matching rule: matches any character in the character set

Expression form
[aeiou you I he] represents any character in []
[0-9],[a-z],[A-Z] represents any character in the interval
[#? 0-9a-z] mixed writing, general interval expression is written later

e.g.
In : re.findall('[aeiou]',"How are you!")
Out: ['o', 'a', 'e', 'o', 'u']

Match character set inverse
Metacharacter: [^ character set]
Matching rule: matches any character except the character set

e.g.
In : re.findall('[^0-9]',"Use 007 port")
Out: ['U', 's', 'e', ' ', ' ', 'p', 'o', 'r', 't']

Duplicate matching character

Metacharacter:*
Matching rule: matches the previous character 0 or more times

e.g.
In : re.findall('wo*',"wooooo~~w!")
Out: ['wooooo', 'w']

Metacharacter:+
Matching rule: the character before matching appears 1 or more times. The greedy principle is adopted to return long strings first. If you add? Then abandon the greedy principle and directly return the shortest matching string.

e.g.
In : re.findall('[A-Z][a-z]+',"Hello World")
Out: ['Hello', 'World']

Metacharacter:?

Matching rule: 0 or 1 occurrences of characters before matching

e.g. Match integer
In [28]: re.findall('-?[0-9]+',"Jame,age:18, -26")
Out[28]: ['18', '-26']

Metacharacter: {n}
Matching rule: matches the previous character n times

e.g. Match mobile phone number
In : re.findall('1[0-9]{10}',"Jame:13886495728")
Out: ['13886495728']

Metacharacter: {m,n}
Matching rule: match the previous character m to n times.

e.g. matching qq number
In : re.findall('[1-9][0-9]{5,10}',"Baron:1259296994") 
Out: ['1259296994']

Match string start position
Metacharacter:^
Matching rule: matches the beginning of the target string

e.g.
In : re.findall('^Jame',"Jame,hello")
Out: ['Jame']

End of matching string
Metacharacter:$
Matching rule: matches the end of the target string

e.g.
In : re.findall('Jame$',"Hi,Jame")
Out: ['Jame']

Rule tip: ^ and $must appear at the beginning and end of regular expressions. If both occur at the same time, the middle part must match the entire content of the entire target string.

Match any (non) numeric character
Metacharacters: \ d \D
Matching rules: \ D matches any numeric character, \ D matches any non numeric character

e.g. Matching port
In : re.findall('\d{1,5}',"Mysql: 3306, http:80")
Out: ['3306', '80']

Match any (non) ordinary character
Metacharacters: \ w \W
Matching rules: \ wmatch normal characters, \ wmatch non normal characters
Note: ordinary characters refer to numbers, letters, underscores and Chinese characters. (you can extract the numbers in the formula)

e.g.
In : re.findall('\w+',"server_port = 8888")
Out: ['server_port', '8888']

Match any (non) null character
Metacharacters: \ s \S
Matching rules: \ S matches null characters, \ S matches non null characters
Description: a null character refers to a space \ r \n \t \v \f character

e.g.
In : re.findall('\w+\s+\w+',"hello    world")
Out: ['hello    world']

Boundary position of matching (non) words
Metacharacters: \ b \B
Matching rules: \ B represents word boundary, and B represents non word boundary
Note: the word boundary refers to the junction between the number letter (Chinese character) underline and other characters.

e.g.
In : re.findall(r'\bis\b',"This is a test.")
Out: ['is']

Note: when the metacharacter symbol conflicts with the escape character in the Python string, you need to use r to declare the regular expression string as the original string. If you are not sure that those are the escape characters of the Python string, you can add r to all regular expressions.

⭐ Matching rules

# Special symbol matching
result = re.findall(r'-?\d+\.?\d*',"-5.4 3.28 45 76.3 -21")
print(result)

# greedy
result = re.findall("ab+","abbbbbbbbbc")
print(result)
# Non greedy (lazy mode) + -- > +?
result = re.findall("ab+?","abbbbbbbbbc")
print(result)

# Subgroup
result = re.search("(ab)+","abababab")
result = re.search(r'(?P<xing>king|Lee)\w{1,3}', "Glory of Kings")
print(result.group())  # Get matching content

✨ Special character matching

  • Purpose: if the matched target string contains regular expression special characters, escape processing is required when the metacharacter in the expression wants to express its own meaning.
  • Special characters:. * +? ^$ [] () {} | \
  • Operation method: add \ before the regular expression metacharacter, then the metacharacter is to remove its special meaning, which means the character itself
e.g. Match special characters . When using \. Meaning of expression itself
In : re.findall('-?\d+\.?\d*',"123,-123,1.23,-1.23")
Out: ['123', '-123', '1.23', '-1.23']

✨ Greedy mode and non greedy mode

  • definition

    Greedy mode: by default, matching repeated metacharacters always matches as much backward as possible. For example: * +? {m,n}
    Non greedy mode (lazy mode): make matching repeated metacharacters as little backward matching content as possible.

  • Convert greedy mode to non greedy mode
    Add '?' after the corresponding matching repeated metacharacter No

    *  ->  *?
    +  ->  +?
    ?  ->  ??
    {m,n} -> {m,n}?
    
e.g.
In : re.findall(r'\(.+?\)',"(abcd)efgh(higk)")
Out: ['(abcd)', '(higk)']

✨ Regular expression grouping

  • definition
    In regular expressions, the internal grouping of regular expressions is established with (). Subgroups are part of regular expressions and can be used as internal overall operation objects.
  • Function: it can be operated as a whole to change the operation object of metacharacter
e.g.  change +Duplicate object number
In : re.search(r'(ab)+',"ababababab").group()
Out: 'ababababab'

e.g. change |No. operation object
In : re.search(r'(king|Lee)\w{1,3}',"Glory of Kings").group()
Out: 'Glory of Kings'

Capture group
Capture group is also a subgroup in essence, but it has a name to express the meaning of the subgroup. This named subgroup is called capture group.
Format: (? P < name > pattern)

e.g. Name the subgroup "pig"
In : re.search(r'(?P<pig>ab)+',"ababababab").group('pig')
Out: 'ab'

matters needing attention

  • A regular expression can contain multiple subgroups
  • Subgroups can be nested, but should not be too complex
  • Subgroup serial numbers are generally counted from outside to inside and from left to right

✨ Regular expression matching principle

  1. Correctness, can correctly match the target string
  2. Exclusivity, which matches as little as possible other than the target string
  3. Comprehensiveness, taking into account all situations of the target string as far as possible without omission

⭐ Python re module usage

✨ Basic function usage

 re.findall(pattern,string)
 function: Match the contents of the target string according to the regular expression
 parameter: pattern  regular expression 
      string Target string
 Return value: Matching content list,If the regular expression has subgroups, only the contents corresponding to the subgroups can be obtained
 re.split(pattern,string,max)
 function: Use regular expressions to match content,Cut target string
 parameter: pattern  regular expression 
      string Target string
      max How many parts can be cut at most
 Return value: Cut content list
 re.sub(pattern,replace,string,count)
 function: Replace the regular expression with a string
 parameter: pattern  regular expression 
      replace  Replaced string
      string Target string
      count  How many places can be replaced at most,Replace all by default
 Return value: Replaced string

✨ Generate match object

 re.finditer(pattern,string)
 function: Match the contents of the target string according to the regular expression
 parameter: pattern  regular expression 
      string Target string
 Return value: Matching results**iterator **
re.match(pattern,string)
Function: match the start position of a target string
 Parameters: pattern regular
	   string  Target string
 Return value: matching content match object
re.search(pattern,string)
Function: match the first matching content of the target string
 Parameters: pattern regular
	   string  Target string
 Return value: matching content match object

✨ match object usage

  • match.span() get the start and end position of the matching content
  • match.group(n = 0)
    Function: get matching content of match object
    Parameter: the default value is 0, which means to obtain the content of the whole match object. If it is a serial number or group name (capture group), it means to obtain the content of the corresponding subgroup.
    Return value: matching string

Topics: Python regex