Python regular expressions II

Posted by seikan on Tue, 08 Feb 2022 21:39:41 +0100

Special meaning of punctuation

Special character	Meaning
.	(Dot) In the default mode, matches any character except a newline. If the DOTALL flag is specified, it matches any character, including a newline.
^	(Caret) Matches the start of the string; in MULTILINE mode it also matches immediately after each newline.
$	Matches the end of the string or just before the newline at the end of the string; in MULTILINE mode it also matches just before each newline.
*	Matches 0 or more repetitions of the preceding regular expression, as many as possible.
+	Matches 1 or more repetitions of the preceding regular expression.
?	Matches 0 or 1 repetitions of the preceding regular expression.
*?, +?, ??	The '*', '+', and '?' quantifiers are greedy; they match as much text as possible. Adding '?' after the quantifier makes it non-greedy, so it matches as little as possible.
{m}	Matches exactly m repetitions of the preceding regular expression; fewer than m causes the match to fail.
{m,n}	Matches from m to n repetitions of the preceding regular expression, taking as many as possible.
{m,n}?	The non-greedy version of the previous quantifier; matches as few repetitions as possible.
\	Escaping Special Characters
[]	Used to represent a set of characters.
|	A|B matches either A or B. Alternatives are tried from left to right; once A matches, B is not tried.
(...)	(Group) Matches the regular expression inside the parentheses and marks the start and end of a group.
(?...)	Extension notation (a '?' directly after '(' would otherwise have no meaning). The first character after the '?' determines the meaning and syntax of the construct.
(?aiLmsux)	(One or more letters from 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags for the whole regular expression.
(?:...)	Non-capturing version of regular parentheses. Matches the regular expression inside the parentheses, but the matched substring cannot be retrieved after the match or referenced later in the pattern.
(?aiLmsux-imsx:...)	(Zero or more letters from 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' and one or more letters from 'i', 'm', 's', 'x'.) The letters set or remove the corresponding flags for that part of the expression.
(?P<name>...)	(Named group) Like regular parentheses, but the matched substring can also be accessed through the group name.
(?P=name)	Backreference to a named group; matches the same text that was matched by the earlier group called name.
(?#...)	A comment; the contents are ignored.
(?=...)	(Lookahead) Matches if ... matches next, but does not consume any of the string. For example, Isaac (?=Asimov) matches 'Isaac ' only when it is followed by 'Asimov'.
(?!...)	(Negative lookahead) Matches if ... does not match next.
(?<=...)	(Lookbehind) Matches if the current position is preceded by a match for ... that ends at the current position.
(?<!...)	(Negative lookbehind) Matches if the current position is not preceded by a match for ....
(?(id/name)yes-pattern|no-pattern)	If the group with the given id or name exists, tries to match yes-pattern, otherwise no-pattern. no-pattern is optional and can be omitted.
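
A few of the constructs in the table are easier to see in action. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = 'first line\nsecond line'

# '^' matches only at the start of the string by default,
# and also right after every newline in MULTILINE mode.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)
print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']

# '.' does not match a newline unless DOTALL is given.
dot_default = re.search(r'line.second', text)
dot_all = re.search(r'line.second', text, re.DOTALL)
print(dot_default)          # None
print(dot_all is not None)  # True

# (?=...) is a lookahead: it checks what follows without consuming it.
lookahead = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
print(repr(lookahead.group()))  # 'Isaac '

# (?P<name>...) names a group; (?P=name) refers back to what it matched.
repeated = re.search(r'(?P<word>\w+) (?P=word)', 'it is is fine')
print(repeated.group('word'))  # is
```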

No.	[] usage
1	Characters can be listed individually. For example, [amk] matches 'a', 'm', or 'k'.
2	A range of characters is written as two characters joined by '-'. For example, [a-z] matches any lowercase ASCII letter, [0-5][0-9] matches any two-digit number from 00 to 59, and [0-9A-Fa-f] matches any hexadecimal digit. If '-' is escaped (as in [a\-z]) or placed first or last (as in [-a] or [a-]), it matches a literal '-'.
3	Special characters lose their special meaning inside a set. For example, [(+*)] matches any of the literal characters '(', '+', '*', or ')'.
4	Character classes such as \w or \S are accepted inside a set. The characters they match depend on whether ASCII or LOCALE mode is in effect.
5	Characters outside a range can be matched by complementing the set. If the first character of the set is '^', every character not in the set is matched. For example, [^5] matches any character except '5', and [^^] matches any character except '^'. A '^' that is not the first character of the set has no special meaning.
6	To match a literal ']' inside a set, either put a backslash before it or place it at the start of the set. For example, both [()[\]{}] and []()[{}] match a parenthesis, bracket, or brace.
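
The set rules above can be checked with a few short calls. This is only a sketch; the test strings are arbitrary.

```python
import re

# A range inside a set: [0-5][0-9] matches two digits from 00 to 59.
in_range = re.fullmatch(r'[0-5][0-9]', '42') is not None
out_of_range = re.fullmatch(r'[0-5][0-9]', '61') is not None
print(in_range, out_of_range)  # True False

# Special characters are literal inside a set.
literals = re.findall(r'[(+*)]', '(1+2)*3')
print(literals)  # ['(', '+', ')', '*']

# A leading '^' complements the set.
non_digits = re.findall(r'[^0-9]', 'a1b2')
print(non_digits)  # ['a', 'b']

# ']' is literal when it is the first character of the set.
brackets = re.findall(r'[]()[{}]', 'f(x)[0]{}')
print(brackets)  # ['(', ')', '[', ']', '{', '}']
```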

practice

User name matching: the name consists of digits, letters, underscores and hyphens, is 4 to 14 characters long, and cannot start with a digit.

import re

# Compile the pattern once, outside the loop
x = re.compile(r'^[a-zA-Z_-][\w-]{3,13}$')

while True:
    user_name = input('Enter a user name: ')
    if x.match(user_name):
        print('User name accepted: {}'.format(user_name))
        break
    else:
        print('The user name is invalid, please re-enter')
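
Because the loop above waits for keyboard input, the rule is hard to try out quickly. The sketch below wraps the same pattern in a function so it can be checked against a list of candidates; the names `USER_NAME_RE` and `is_valid_user_name` are hypothetical helpers, not part of the original script.

```python
import re

# Same pattern as in the interactive version above.
USER_NAME_RE = re.compile(r'^[a-zA-Z_-][\w-]{3,13}$')


def is_valid_user_name(name):
    """Return True if name satisfies the user name rule."""
    return USER_NAME_RE.match(name) is not None


for candidate in ['tom', 'tom_01', '1tom', 'a-very-long-user-name']:
    print(candidate, is_valid_user_name(candidate))
```

Note that `$` also matches just before a trailing newline, so `re.fullmatch` would be the stricter choice for strings that do not come from `input()`. In Python 3, `\w` also accepts non-ASCII letters and digits unless the `re.ASCII` flag is used.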

Matching an email address
r'^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,4}$'
Matching a mobile number
r'^1\d{10}$'
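
A quick way to sanity-check both patterns is to run them over a few strings. The sketch below uses `[\w.-]+` for the domain part and `[a-zA-Z]` for the top-level domain, which is an assumption about the intended rule; the names `EMAIL_RE` and `MOBILE_RE` and the test strings are made up for illustration.

```python
import re

EMAIL_RE = re.compile(r'^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,4}$')
MOBILE_RE = re.compile(r'^1\d{10}$')

emails = ['caoyan@baidu.com', 'no-at-sign.com', 'a@b']
mobiles = ['13811661805', '23811661805', '1381166180']

email_results = [bool(EMAIL_RE.match(e)) for e in emails]
mobile_results = [bool(MOBILE_RE.match(m)) for m in mobiles]
print(email_results)   # [True, False, False]
print(mobile_results)  # [True, False, False]
```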

Regular substitution

import re

a = 'ad234ds3dsf2'
print(re.sub(r'\d+', '0', a))  # Replace each run of digits with 0: ad0ds0dsf0


# Double each number by passing a function as the replacement
def double(x):
    data = x.group()
    return str(int(data) * 2)


print(re.sub(r'\d+', double, a))  # ad468ds6dsf4

# Or use a lambda expression
print(re.sub(r'\d+', lambda x: str(int(x.group()) * 2), a))
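
Beyond a fixed string or a function, `re.sub` has a few more options that are worth knowing. The following sketch, which reuses the same sample string, shows the `count` argument, a group reference in the replacement string, and `re.subn`; the variable names are illustrative.

```python
import re

a = 'ad234ds3dsf2'

# count limits how many matches are replaced (here only the first one).
first_only = re.sub(r'\d+', '0', a, count=1)
print(first_only)  # ad0ds3dsf2

# \g<1> in the replacement string inserts the text of group 1.
bracketed = re.sub(r'(\d+)', r'[\g<1>]', a)
print(bracketed)  # ad[234]ds[3]dsf[2]

# re.subn also reports how many replacements were made.
doubled, n = re.subn(r'\d+', lambda m: str(int(m.group()) * 2), a)
print(doubled, n)  # ad468ds6dsf4 3
```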

Greedy mode

import re

a = 'ad234acv234dsf2345'
result = re.match(r'ad(.+)(\d+)', a)
print(result.group(1))  # 234acv234dsf234 (.+ leaves only one digit for \d+)
result1 = re.match(r'ad(.+)(.+)', a)
print(result1.group(1))  # 234acv234dsf234
result2 = re.match(r'ad(\d+)(.+)', a)
print(result2.group(1))  # 234

Python regular expressions are greedy by default: each quantifier matches as much text as possible.

Non-greedy mode

import re

a = 'ad234acv234dsf2345'
result = re.match(r'ad(.+?)(\d+)', a)
print(result.group(1))  # 2
result1 = re.match(r'ad(.{2,5}?)(.+)', a)
print(result1.group(1))  # 23
result2 = re.match(r'ad(\d+?)(.+)', a)
print(result2.group(1))  # 2
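
Putting both modes side by side makes the difference easier to see. The sketch below first shows how the two groups split the same string, then applies both modes to a small HTML-like string; the `html` string is a made-up example.

```python
import re

a = 'ad234acv234dsf2345'

# Greedy: (.+) takes as much as it can and leaves one digit for (\d+).
greedy_groups = re.match(r'ad(.+)(\d+)', a).groups()
# Non-greedy: (.+?) takes as little as it can, so (\d+) gets more.
lazy_groups = re.match(r'ad(.+?)(\d+)', a).groups()
print(greedy_groups)  # ('234acv234dsf234', '5')
print(lazy_groups)    # ('2', '34')

html = '<b>bold</b> and <i>italic</i>'
greedy_tags = re.findall(r'<.+>', html)
lazy_tags = re.findall(r'<.+?>', html)
print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```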

task

User name matching, requirements:
- The user name can only contain digits, letters, and underscores
- Cannot start with a digit
- The length is in the range of 6 to 16 characters
import re


def user_name():
    while True:
        name = input('Enter a user name: ')
        if re.match(r'^[a-zA-Z_]\w{5,15}$', name):
            print('Correct input')
            break
        else:
            print('Illegal input')


user_name()

Password matching, requirements:
- Cannot contain any of these special symbols: @ # ￥ % ^ & *
- Must start with a letter
- The length is 6 to 12 characters

import re


def pass_word():
    while True:
        p_word = input('Please enter a password: ')
        if re.fullmatch(r'[a-zA-Z][^!@#￥%^&*]{5,11}', p_word):
            print('Correct input')
            break
        else:
            print('Illegal input')


pass_word()
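
To check the password rule without typing at a prompt, the same idea can be written as a function that returns a boolean. This is a sketch only: `PASSWORD_RE` and `is_valid_password` are hypothetical names, and the first character class is written as `[a-zA-Z]` so that only letters are accepted at the start.

```python
import re

# One letter, then 5 to 11 characters that are none of the listed symbols.
# Inside the set, '^' is a complement marker only because it comes first;
# the later '^' is a literal caret.
PASSWORD_RE = re.compile(r'[a-zA-Z][^!@#￥%^&*]{5,11}')


def is_valid_password(p_word):
    """Return True if the whole string satisfies the password rule."""
    return PASSWORD_RE.fullmatch(p_word) is not None


for candidate in ['abc123', 'abc12', '1abcdef', 'abc@123']:
    print(candidate, is_valid_password(candidate))
```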

The file test.txt contains the following:

Chen XX Caoyan 6895 13811661805 caoyan@baidu.com
 often XX Yu Cao 8366 13911404565 caoyu@baidu.com
 firewood XX Shirley Cao 6519 13683604090 caoyue@baidu.com
 Cao XX Cao Zheng 8290 13718160690 caozheng@baidu.com
 check XX Zha Lingli 6259 13552551952 zhalingli@baidu.com
 check XX Zha Shan 8580 13811691291 zhanshan@baidu.com
 check XX Rachel 8825 13341012971 zhanyu@baidu.com

Extract all mobile phone numbers and mailboxes in the file

import re

file_name = r'H:\pythonProject\test'

# Define mobile number extraction rules
get_tel = re.compile(r'\d{11}')
# Define mailbox extraction rules
get_mail = re.compile(r'\w+@baidu\.com')


# Read the whole file into one string
with open(file_name, encoding='utf-8') as f:
    str_l = f.read()

print('All phone numbers are: {}'.format(get_tel.findall(str_l)))
print('All mailboxes are: {}'.format(get_mail.findall(str_l)))
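
The same extraction can be tried without the file by inlining a couple of lines from the sample data. The sketch below is a variation, not the original solution: it assumes mobile numbers start with 1 and are delimited by word boundaries (`\b`), and it uses a more general mailbox pattern instead of hard-coding baidu.com.

```python
import re

# Two lines copied from the sample data, inlined so the sketch runs
# without the file.
sample = (
    'Chen XX Caoyan 6895 13811661805 caoyan@baidu.com\n'
    'Cao XX Cao Zheng 8290 13718160690 caozheng@baidu.com\n'
)

# Assumption: a mobile number is exactly 11 digits starting with 1.
get_tel = re.compile(r'\b1\d{10}\b')
# Assumption: any domain is allowed, not only baidu.com.
get_mail = re.compile(r'[\w.-]+@[\w.-]+\.[a-zA-Z]{2,4}')

tels = get_tel.findall(sample)
mails = get_mail.findall(sample)
print(tels)   # ['13811661805', '13718160690']
print(mails)  # ['caoyan@baidu.com', 'caozheng@baidu.com']
```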
