Python - regular expressions (re module) explained

Posted by Guardian-Mage on Sat, 19 Feb 2022 22:10:44 +0100


When you need to match strings with regular expressions in Python, you can use the re module that ships with Python.

The matching process for a regular expression is roughly:
1. Compare each character of the expression with the characters of the text in turn.
2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.

r: In string literals prefixed with 'r', backslashes do not receive any special handling. So r'\n' denotes a string containing two characters, '\' and 'n', while '\n' denotes a string containing a single line break.
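A minimal sketch of the difference between a raw string and a normal string literal. The variable names are only illustrative.

```python
# Compare a raw string with a normal string literal.
raw = r'\n'      # two characters: a backslash and the letter n
normal = '\n'    # one character: a line break

print(len(raw))       # 2
print(len(normal))    # 1
print(raw == '\\n')   # True: the raw string equals an escaped backslash plus n
```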

Use of the re module: import re

re.match function

Syntax: re.match(pattern, string, flags=0)

pattern  The regular expression to match
string   The string to match against
flags    Flag bits that control how the regular expression is matched, such as case sensitivity, multi-line matching, and so on

  1. re.I Ignore case
  2. re.L Make the special character sets \w, \W, \b, \B, \s, \S depend on the current locale
  3. re.M Multi-line mode
  4. re.S Make . match any character, including line breaks (by default . does not match line breaks)
  5. re.U Make the special character sets \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character property database
  6. re.X Verbose mode: for readability, whitespace and comments after # in the pattern are ignored
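The flags above can be tried out with a small sketch like the following. The sample strings and variable names are made up for illustration and are not from the original article.

```python
import re

# re.I: ignore case
m_case = re.match(r"hello", "HELLO world", re.I)

# re.S: let "." match a line break as well
m_plain = re.match(r"a.b", "a\nb")           # None: "." does not match the line break
m_dotall = re.match(r"a.b", "a\nb", re.S)    # matches all three characters

# re.M: let "^" match at the start of every line
starts = re.findall(r"^\w+", "one 1\ntwo 2", re.M)

# re.X: whitespace and "#" comments inside the pattern are ignored
m_verbose = re.match(r"""
    \d{3}   # area code
    -       # separator
    \d{4}   # number
""", "010-1234", re.X)

print(m_case.group())     # HELLO
print(m_plain)            # None
print(starts)             # ['one', 'two']
print(m_verbose.group())  # 010-1234
```

Several flags can be combined with the | operator, for example re.I | re.M.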

re.match attempts to match the pattern from the start of the string. If the match succeeds, re.match returns a match object; otherwise it returns None.

If the previous step matched, you can use the group method to extract the data: call the match object's group(num) or groups() method to get the matched text.

In the pattern string, () is used for grouping. group() and group(0) both return the whole match, group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair.

Give an example:

>>> import re
>>> result = re.match("itcast","itcast.cn")
>>> result.group()
'itcast'

re.match matches the pattern from the start of the string. Once the pattern "itcast" has been matched completely, matching ends; the trailing ".cn" is not matched, and a match object describing the successful match is returned.
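A short sketch of how group(), group(num) and groups() relate to each other. The date-like sample string and the variable names are assumptions chosen for illustration.

```python
import re

m = re.match(r"(\d+)-(\d+)-(\d+)", "2022-02-19 22:10")

whole = m.group()           # same as m.group(0): the whole match
year = m.group(1)           # text matched by the first pair of parentheses
month_day = m.group(2, 3)   # several groups at once, returned as a tuple
all_groups = m.groups()     # every group, as a tuple

# When the start of the string does not match, re.match returns None
no_match = re.match(r"(\d+)-(\d+)", "no digits here")

print(whole)       # 2022-02-19
print(year)        # 2022
print(month_day)   # ('02', '19')
print(all_groups)  # ('2022', '02', '19')
print(no_match)    # None
```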

Match a single character

character  function                                   position
.          Match any 1 character except \n
[ ]        Match any character listed in []
\d         Match a digit, 0-9                         Can be written in a character set [...]
\D         Match a non-digit                          Can be written in a character set [...]
\s         Match whitespace: space, tab, and so on    Can be written in a character set [...]
\S         Match a non-whitespace character           Can be written in a character set [...]
\w         Match a word character: a-z, A-Z, 0-9, _   Can be written in a character set [...]
\W         Match a non-word character                 Can be written in a character set [...]

[...] is a character set: the corresponding position can be any character in the set. Characters in a set can be listed individually or given as a range, such as [abc] or [a-c]. If the first character is ^, the set is negated, so [^abc] matches any character other than a, b, and c. All special characters lose their special meaning inside a character set. To use ], - or ^ literally inside a set, escape them with a backslash, or put ] or - as the first character and ^ as a non-first character.

Give an example:

import re
ret = re.match(".","M")
print(ret.group())
ret = re.match("t.o","too")
print(ret.group())
ret = re.match("t.o","two")
print(ret.group())
# If the first letter of hello is lowercase, the regular expression needs a lowercase h
ret = re.match("h","hello Python")
print(ret.group())
# If the first letter of hello is uppercase, the regular expression needs an uppercase H
ret = re.match("H","Hello Python")
print(ret.group())
# Either an uppercase or a lowercase h works
ret = re.match("[hH]","hello Python")
print(ret.group())
ret = re.match("[hH]","Hello Python")
print(ret.group())
ret = re.match("[hH]ello Python","Hello Python")
print(ret.group())
# Several ways of writing a pattern that matches 0 to 9
ret = re.match("[0123456789]Hello Python","7Hello Python")
print(ret.group())
ret = re.match("[0-9]Hello Python","7Hello Python")
print(ret.group())
# Match 0 to 3 and 5 to 9
ret = re.match("[0-35-9]Hello Python","7Hello Python")
print(ret.group())
# This pattern cannot match the digit 4, so ret is None
ret = re.match("[0-35-9]Hello Python","4Hello Python")
#print(ret.group())
ret = re.match(r"Chang'e \d","Chang'e 1 successfully launched")
print(ret.group())
ret = re.match(r"Chang'e \d","Chang'e 2 successfully launched")
print(ret.group())

Result:

M
too
two
h
H
h
H
Hello Python
7Hello Python
7Hello Python
7Hello Python
Chang'e 1
Chang'e 2
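The character-set rules (negation with ^ and special characters losing their meaning inside []) can be checked with a small sketch. The sample strings are invented for illustration.

```python
import re

# "^" as the first character negates the set
not_digit = re.match(r"[^0-9]", "a1")    # matches 'a'
digit_first = re.match(r"[^0-9]", "1a")  # None: the first character is a digit

# "." and "*" are ordinary characters inside a set
literal = re.findall(r"[.*]", "a.b*c")

# "-" as the first character and "^" as a non-first character are literal
dash_caret = re.findall(r"[-^]", "a-b^c")

# "]" as the first character is literal
bracket = re.findall(r"[]x]", "a]xb")

print(not_digit.group())  # a
print(digit_first)        # None
print(literal)            # ['.', '*']
print(dash_caret)         # ['-', '^']
print(bracket)            # [']', 'x']
```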

Match multiple characters

character  function                                                       position                         expression  fully matched string
*          Match the previous character 0 or more times (it is optional)  Used after a character or (...)  abc*        abccc
+          Match the previous character 1 or more times (at least once)   Used after a character or (...)  abc+        abccc
?          Match the previous character 0 or 1 times (at most once)       Used after a character or (...)  abc?        ab,abc
{m}        Match the previous character exactly m times                   Used after a character or (...)  ab{2}c      abbc
{m,n}      Match the previous character m to n times                      Used after a character or (...)  ab{1,2}c    abc,abbc

For {m,n}, omitting m matches 0 to n times, and omitting n matches m to unlimited times.

Give an example:

import re
# Requirement: match a string whose first letter is uppercase, followed by lowercase letters that are optional
ret = re.match("[A-Z][a-z]*","M")
print(ret.group())
ret = re.match("[A-Z][a-z]*","MnnM")
print(ret.group())
ret = re.match("[A-Z][a-z]*","Aabcdef")
print(ret.group())
# Check whether a variable name is valid
names = ["name1", "_name", "2_name", "__name__"]
for name in names:
    ret = re.match(r"[a-zA-Z_]+[\w]*",name)
    if ret:
        print("Variable name %s meets requirements" % ret.group())
    else:
        print("Variable name %s is invalid" % name)
# Match numbers between 0 and 99
ret = re.match(r"[1-9]?[0-9]","7")
print(ret.group())
ret = re.match(r"[1-9]?\d","33")
print(ret.group())
# This result is not what we want; using $ solves it
ret = re.match(r"[1-9]?\d","09")
print(ret.group())
ret = re.match("[a-zA-Z0-9_]{6}","12a3g45678")
print(ret.group())
# Match a password of 8 to 20 characters, which may be uppercase or lowercase English letters, digits, or underscores
ret = re.match("[a-zA-Z0-9_]{8,20}","1ad12f23s34455ff66")
print(ret.group())

Result:
M
Mnn
Aabcdef
Variable name name1 meets requirements
Variable name _name meets requirements
Variable name 2_name is invalid
Variable name __name__ meets requirements
7
33
0
12a3g4
1ad12f23s34455ff66
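The "fully matched string" column of the quantifier table can be checked with re.fullmatch, which succeeds only when the whole string matches. This is a sketch: re.fullmatch is not used in the original article, and the negative cases are added here for contrast.

```python
import re

# (pattern, text, expected) rows; expected says whether the whole text should match
cases = [
    ("abc*", "abccc", True),
    ("abc*", "ab", True),        # "*" allows zero occurrences of c
    ("abc+", "abccc", True),
    ("abc+", "ab", False),       # "+" needs at least one c
    ("abc?", "ab", True),
    ("abc?", "abc", True),
    ("abc?", "abcc", False),     # "?" allows at most one c
    ("ab{2}c", "abbc", True),
    ("ab{2}c", "abc", False),
    ("ab{1,2}c", "abc", True),
    ("ab{1,2}c", "abbc", True),
    ("ab{1,2}c", "abbbc", False),
]

results = [re.fullmatch(pattern, text) is not None for pattern, text, _ in cases]

for (pattern, text, expected), ok in zip(cases, results):
    print(pattern, text, ok)
```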

Match start and end

character  function
^          Match the beginning of the string
$          Match the end of the string

Example: match email addresses at 163.com

import re
email_list = ["xiaoWang@163.com", "xiaoWang@163.comheihei", ".com.xiaowang@qq.com"]
for email in email_list:
    ret = re.match(r"[\w]{4,20}@163\.com$", email)
    if ret:
        print("%s is a valid email address, the match result is: %s" % (email, ret.group()))
    else:
        print("%s does not meet the requirements" % email)

Result:

xiaoWang@163.com is a valid email address, the match result is: xiaoWang@163.com
xiaoWang@163.comheihei does not meet the requirements
.com.xiaowang@qq.com does not meet the requirements
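One detail worth checking in a pattern like this is the dot: an unescaped . matches any character, so a literal dot should be written as \. in the pattern. A small sketch with a made-up address:

```python
import re

address = "xiaoWang@163Xcom"   # hypothetical bad address: X where the dot should be

# Unescaped ".": accepts any character in that position, so the bad address matches
loose = re.match(r"\w{4,20}@163.com$", address)

# Escaped "\.": only a literal dot is accepted, so the bad address is rejected
strict = re.match(r"\w{4,20}@163\.com$", address)

print(loose is not None)   # True
print(strict)              # None
```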

Match Grouping

character    function
|            Match either the left or the right expression
(ab)         Treat the characters in parentheses as a group
\num         Refer to the string matched by group num
(?P<name>)   Give the group an alias, so the matched substring can be retrieved by that name
(?P=name)    Refer to the string matched by the group aliased name

Example: |

# Match numbers between 0 and 100
import re
ret = re.match(r"[1-9]?\d$|100","8")
print(ret.group()) # 8
ret = re.match(r"[1-9]?\d$|100","78")
print(ret.group()) # 78
ret = re.match(r"[1-9]?\d$|100","08")
# print(ret.group()) # ret is None: 08 is not accepted as a number between 0 and 100
ret = re.match(r"[1-9]?\d$|100","100")
print(ret.group()) # 100

Example: ()

# Requirement: match 163, 126, and qq email addresses
import re
ret = re.match(r"\w{4,20}@163\.com", "test@163.com")
print(ret.group()) # test@163.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@126.com")
print(ret.group()) # test@126.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@qq.com")
print(ret.group()) # test@qq.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@gmail.com")
if ret:
    print(ret.group())
else:
    print("Not a 163, 126, or qq email address") # Not a 163, 126, or qq email address
# Match 11-digit mobile numbers that do not end in 4 or 7
tels = ["13100001234", "18912344321", "10086", "18800007777"]
for tel in tels:
    ret = re.match(r"1\d{9}[0-35-68-9]", tel)
    if ret:
        print(ret.group())
    else:
        print("%s is not the mobile number we want" % tel)
# Extract the area code and the phone number
ret = re.match(r"([^-]*)-(\d+)","010-12345678")
print(ret.group())
print(ret.group(1))
print(ret.group(2))

Example: \number

Matches the contents of the group with the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but does not match 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is three octal digits long, it is not interpreted as a group reference but as the character with that octal value. Inside the '[' and ']' of a character set, all numeric escapes are treated as characters.
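The (.+) \1 example from the paragraph above can be run as a minimal sketch. The variable names are only illustrative.

```python
import re

# \1 must repeat exactly what the first group matched
pattern = re.compile(r"(.+) \1")

same_words = pattern.match("the the")
same_digits = pattern.match("55 55")
no_space = pattern.match("thethe")   # None: there is no space after the group

print(same_words.group())    # the the
print(same_words.group(1))   # the
print(same_digits.group())   # 55 55
print(no_space)              # None
```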

Example 1: match <html>hh</html>

\1, ..., \9 match the content of the nth group. For example, \1 refers to the contents of the first group.

import re
# The right approach: whatever appears in the first pair of <> should also appear in the later pair. Referencing the data matched by the group achieves this; note that the pattern is a raw string, in the form r"".
ret = re.match(r"<([a-zA-Z]*)>\w*</\1>", "<html>hh</html>")
# The second label below does not match, because the data in its two pairs of <> differ
test_label = ["<html>hh</html>","<html>hh</htmlbalabala>"]
for label in test_label:
    ret = re.match(r"<([a-zA-Z]*)>\w*</\1>", label)
    if ret:
        print("%s is a correct label pair" % ret.group())
    else:
        print("%s is an incorrect label pair" % label)

Result:

<html>hh</html> is a correct label pair
<html>hh</htmlbalabala> is an incorrect label pair

Example 2: match <html><h1>www.itcast.cn</h1></html>

import re
labels = ["<html><h1>www.itcast.cn</h1></html>", "<html><h1>www.itcast.cn</h2></html>"]
for label in labels:
    ret = re.match(r"<(\w*)><(\w*)>.*</\2></\1>", label)
    if ret:
        print("%s is a label that meets the requirements" % ret.group())
    else:
        print("%s does not meet the requirements" % label)

Result:

<html><h1>www.itcast.cn</h1></html> is a label that meets the requirements
<html><h1>www.itcast.cn</h2></html> does not meet the requirements

Example: (?P<name>) (?P=name)

The first form names a group; the second reuses that group within the same regular expression.

import re
ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>","<html><h1>www.itcast.cn</h1></html>")
ret.group()
# The closing tag h2 does not equal the group named name2, so ret is None here
ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>","<html><h1>www.itcast.cn</h2></html>")
#ret.group()

re.compile function

The compile function compiles a regular expression and generates a Pattern object for use by the match() and search() functions.

prog = re.compile(pattern)
result = prog.match(string)

Equivalent to

result = re.match(pattern, string)
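Compiling is mostly useful when the same pattern is applied many times. A minimal sketch, assuming a made-up list of lines:

```python
import re

# Compile once, reuse for every line
digits = re.compile(r"\d+")

lines = ["python = 9999", "c = 7890", "no number here"]
found = []
for line in lines:
    m = digits.search(line)
    # Keep the matched text, or None when the line has no digits
    found.append(m.group() if m else None)

print(found)  # ['9999', '7890', None]
```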

Give an example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.match('one12twothree34four', 3, 10) # Start matching at the position of '1'; the match succeeds
>>> print(m)                                        # A Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # The 0 can be omitted
'12'
>>> m.start(0)   # The 0 can be omitted
3
>>> m.end(0)     # The 0 can be omitted
5
>>> m.span(0)    # The 0 can be omitted
(3, 5)

Above, when the match succeeds, a Match object is returned, where:

  • The group([group1,...]) method is used to obtain one or more grouped matching strings, and group() or group(0) can be used directly when the entire matching substring is to be obtained;
  • The start([group]) method is used to get the starting position (index of the first character of the substring) of the grouping match throughout the string, and the default value of the parameter is 0.
  • The end([group]) method is used to get the end position of the grouped matching substring in the entire string (index + 1 of the last character of the substring), with a default parameter of 0;
  • The span([group]) method returns (start(group), end(group))
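The methods in the list above also accept a group number. A short sketch with an invented sample string:

```python
import re

text = "mail: test@163.com"
m = re.search(r"(\w+)@(\w+)\.com", text)

whole_span = m.span()                 # span of the whole match
user_span = (m.start(1), m.end(1))    # span of group 1
domain_span = m.span(2)               # span of group 2

# Slicing the original string with a span gives back the group's text
domain_text = text[m.start(2):m.end(2)]

print(whole_span)    # (6, 18)
print(user_span)     # (6, 10)
print(domain_span)   # (11, 14)
print(domain_text)   # 163
```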

re.search function

re.search scans the entire string and returns the first successful match, or a None if no match exists.

The difference between re.match and re.search: re.match only matches at the beginning of the string. If the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.

Give an example:

import re
ret = re.search(r"\d+", "9999 reads")
print(ret.group())

Result:

9999
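The difference between re.match and re.search shows up as soon as the digits are not at the start of the string. A minimal sketch with a made-up sample string:

```python
import re

text = "read count: 9999"

at_start = re.match(r"\d+", text)    # None: the string does not start with a digit
anywhere = re.search(r"\d+", text)   # scans forward until it finds 9999

print(at_start)           # None
print(anywhere.group())   # 9999
```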

re.findall function

Finds all substrings matched by the regular expression in the string and returns them as a list, or an empty list if no match is found. Note: match and search find one match, while findall finds all of them.

Give an example:

import re
ret = re.findall(r"\d+", "python = 9999, c = 7890, c++ = 12345")
print(ret)

Result:

['9999', '7890', '12345']

re.finditer function

Similar to findall, all substrings matching the regular expression are found in the string and returned as an iterator.

import re
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Result:

12
32
43
3
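One practical difference between the two functions: findall returns plain strings, while finditer yields match objects, so positions are available as well. A sketch reusing the string from the example above:

```python
import re

text = "12a32bc43jf3"

strings = re.findall(r"\d+", text)                     # list of matched strings
spans = [m.span() for m in re.finditer(r"\d+", text)]  # positions from the match objects

print(strings)  # ['12', '32', '43', '3']
print(spans)    # [(0, 2), (3, 5), (7, 9), (11, 12)]
```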

re.sub function

sub is short for substitute: it replaces the matched data.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

parameter  description
pattern    Required. The pattern string of the regular expression
repl       Required. The replacement: a string to substitute, or a function
string     Required. The string in which replacements are made
count      Optional. The maximum number of replacements; must be a non-negative integer. If omitted or set to 0, all matches are replaced
flags      Optional. Flag bits that control how the regular expression is matched, such as case sensitivity, multi-line matching, and so on

Example: add 1 to the matched read count

Method 1:

import re
ret = re.sub(r"\d+", '998', "python = 997")
print(ret)

Result: python = 998

Method 2:

import re
def add(temp):
    # temp is a re.Match object; int() needs a string, a bytes-like object or a number, so extract the matched text with group() first
    strNum = temp.group()
    num = int(strNum) + 1
    return str(num)
ret = re.sub(r"\d+", add, "python = 997")
print(ret)
ret = re.sub(r"\d+", add, "python = 99")
print(ret)

Result:

python = 998
python = 100
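The count parameter and group references in the replacement string can be sketched as follows. The sample text is made up for illustration.

```python
import re

text = "python = 997, c = 997"

# count=1 replaces only the first match; the default replaces all of them
first_only = re.sub(r"\d+", "998", text, count=1)
all_matches = re.sub(r"\d+", "998", text)

# \1 and \2 in the replacement refer to the groups of the pattern
swapped = re.sub(r"(\w+) = (\d+)", r"\2 = \1", "python = 997")

print(first_only)   # python = 998, c = 997
print(all_matches)  # python = 998, c = 998
print(swapped)      # 997 = python
```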

re.subn function

The behavior is the same as sub(), but returns a tuple (string, number of substitutions).

re.subn(pattern, repl, string, count=0, flags=0)

Return: (new string, number of substitutions)

import re
pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
print(re.subn(pattern, r'\2 \1', s))
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()
print(re.subn(pattern, func, s))
### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

re.split function

Split the string at each match and return a list.

re.split(pattern, string, maxsplit=0, flags=0)

parameter  description
pattern    The regular expression to match
string     The string to split
maxsplit   Maximum number of splits; maxsplit=1 splits once. The default is 0, which means no limit

Give an example:

import re
ret = re.split(r":| ","info:xiaoZhang 33 shandong")
print(ret)

Results: ['info','xiaoZhang','33','shandong']
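The effect of maxsplit can be seen by reusing the same string with different limits, as a small sketch:

```python
import re

text = "info:xiaoZhang 33 shandong"

unlimited = re.split(r":| ", text)              # default maxsplit=0: split everywhere
once = re.split(r":| ", text, maxsplit=1)       # stop after the first split
twice = re.split(r":| ", text, maxsplit=2)      # stop after the second split

print(unlimited)  # ['info', 'xiaoZhang', '33', 'shandong']
print(once)       # ['info', 'xiaoZhang 33 shandong']
print(twice)      # ['info', 'xiaoZhang', '33 shandong']
```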

Python greedy and non-greedy matching

Quantifiers in Python are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy quantifiers, on the other hand, always try to match as few characters as possible.

For example, the regular expression "ab*" finds "abbb" when searched against "abbbc", while the non-greedy version "ab*?" finds "a".

Note: we usually use non-greedy mode for extraction.

Adding ? after "*", "+", "?", or {m,n} turns a greedy quantifier into a non-greedy one.
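The "abbbc" case above, plus the other non-greedy forms, as a minimal sketch:

```python
import re

text = "abbbc"

greedy_star = re.search(r"ab*", text).group()      # as many b as possible
lazy_star = re.search(r"ab*?", text).group()       # as few b as possible: zero
lazy_plus = re.search(r"ab+?", text).group()       # at least one b, but no more
lazy_range = re.search(r"ab{1,3}?", text).group()  # the lower bound of the range
lazy_optional = re.search(r"ab??", text).group()   # prefers zero occurrences

print(greedy_star)    # abbb
print(lazy_star)      # a
print(lazy_plus)      # ab
print(lazy_range)     # ab
print(lazy_optional)  # a
```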

Example 1:

import re
s="This is a number 234-235-22-423"
# When a pattern contains wildcards, it is evaluated from left to right and tries to "grab" the longest string that still satisfies the match. In this example, ".+" grabs as many characters as it can from the start of the string, including most of the first integer field we want. "\d+" needs only one digit to match, so it matches the digit "4", while ".+" matches everything from the start of the string up to that digit 4
r=re.match(r".+(\d+-\d+-\d+-\d+)",s)
print(r.group(1))
# Solution: the non-greedy operator "?". It can be used after "*", "+", "?" and asks the regular expression to match as few characters as possible
r=re.match(r".+?(\d+-\d+-\d+-\d+)",s)
print(r.group(1))

Result:

4-235-22-423
234-235-22-423

Example 2:

>>> re.match(r"aa(\d+)","aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)","aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd","aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)ddd","aa2343ddd").group(1)
'2343'

Example 3: Extracting picture addresses

import re
test_str="<img data-original=https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973.jpg>"
ret = re.search(r"https://.*?\.jpg", test_str)
print(ret.group())

Result: https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973.jpg

The role of r

Like most programming languages, regular expressions use "\" as the escape character, which can cause a backslash plague. If you need to match the character "\" in text, the regular expression written in the programming language needs four backslashes, "\\\\": the first two and the last two are each escaped to a single backslash by the programming language, giving two backslashes, which the regular expression then escapes to one backslash. Raw strings in Python solve this problem well: a string prefixed with r is a raw string.

import re
mm = "c:\\a\\b\\c"
print(mm)#c:\a\b\c
ret = re.match("c:\\\\",mm).group()
print(ret)#c:\

ret = re.match("c:\\\\a",mm).group()
print(ret)#c:\a
ret = re.match(r"c:\\a",mm).group()
print(ret)#c:\a
ret = re.match(r"c:\a",mm)
print(ret)#None, because the regex reads \a as the bell character; calling ret.group() would raise AttributeError: 'NoneType' object has no attribute 'group'
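The four-backslash rule can be checked directly: the normal literal with four backslashes and the raw literal with two are the same two-character pattern, and both match a single backslash in the text. A sketch with illustrative variable names:

```python
import re

text = "a\\b"        # the text is three characters: a, one backslash, b

plain = "\\\\"       # four backslashes in source code = two characters
raw = r"\\"          # the same two characters, written as a raw string

same_pattern = (plain == raw)
found = re.search(raw, text).group()   # the regex "\\" matches one backslash

print(len(text))      # 3
print(len(plain))     # 2
print(same_pattern)   # True
print(len(found))     # 1
```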
