Use of python crawler learning record basic library -- regular expression

Posted by ruraldev on Sun, 16 Jan 2022 19:24:33 +0100

Regular expression test tool: http://tool.oschina.net/regex

1. Common grammar

https://www.runoob.com/regexp/regexp-syntax.html

2,match()

By passing in the string and regular expression to be matched with match, you can check whether the regular expression matches the string.

import  re
content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
result = re.match('^Hello\s\d\d\d\s\d{4}\s\w{10}',content)
print(result)
print(result.group())
print(result.span())

41
<re.Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)

Regular expression ^ starts with matching string and starts with Hello; \s matches white space characters; \d matches numbers, 3 \ d matches 123; \s matches spaces; 4567 use four \ d matches, and use {4} to represent 4 matches; Followed by a blank character, and finally match 10 letters and underscores with \ w{10}.

(1) Match target

If you want to extract part of the string, you can use () to enclose the substring you want to extract. () mark the beginning and end of a sub expression. Each sub expression marked will correspond to each group at one time. Call the group () method to pass in the index of the group to obtain the extracted result.

(2) Greedy and non greedy matching

Greedy match:*

import  re
content = 'Hello 123 4567 World_This is a Regex Demo'

result = re.match('^He.*(\d+).*Demo$',content)
print(result)
print(result.group(1))

<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
7

. * will match as many characters as possible because* After that is \ d +, which means at least one number. There are no specific numbers, so we match 123456 characters as many as possible, leaving only 7

Non greedy matching:. *?

import  re
content = 'Hello 1234567 World_This is a Regex Demo'

result = re.match('^He.*?(\d+).*Demo$',content)
print(result)
print(result.group(1))

<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

When. *? Accompany the blank character after hello, and then the number, \ d + can match exactly,. *? It will no longer match. Give it to the number after \ d + match, 1234567

Try to use. *?, Avoid missing matching results

However, if the matching result appears at the end of the string, use* Match as much content as possible

import  re
content = 'http://weibo.com/comment/kEraCN'

result1 = re.match('http.*?comment/(.*?)',content)
result2 = re.match('http.*?comment/(.*)',content)
print('result1:',result1.group(1))
print('result2:',result2.group(1))

result1:
result2: kEraCN

(3) Modifier

Regular expressions can contain optional flag modifiers to control matching patterns. The modifier is specified as an optional flag.

import  re
content ='''Hello 1234567 World_This
is a Regex Demo
'''

result = re.match('^He.*?(\d+).*?Demo$',content)
print(result.group(1))

AttributeError: 'NoneType' object has no attribute 'group'

. matches any character except the newline character. When a newline is encountered. *? You can't match. You need to add re S decoration error. This can match line breaks between nodes.

import  re
content ='''Hello 1234567 World_This
is a Regex Demo
'''

result = re.match('^He.*?(\d+).*?Demo$',content,re.S)
print(result.group(1))

At this time, 1234567 can be output.

re.I # matching pairs are case insensitive

re.L ^ localization matching

re.M ^ multi line matching, affecting ^$

re.S , make Matches all characters, including line breaks

re.U \ parses characters according to the unicode character set. This flag affects \ w,\W,\b,\B

re.X) by giving more flexible formats to make regular expressions easier to understand

(4) Escape matching

Escape special characters with \

import  re
content ='(Baidu)www.baidu.com'

result = re.match('\(Baidu\)www\.baidu\.com',content)
print(result)

<re. Match object; Span = (0, 17), match = '(Baidu) www.baidu.com com'>

3,search()

match starts at the beginning of the string. search scans the whole string during matching and returns the first successful matching result. Regularization can make part of the string.

import re

content = 'Extra string Hello 1234567 World_This is a Regex Demo Extra strings'
result = re.search('Hello.*?(\d+).*?Demo',content)
print(result)
print(result[1])

<re.Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
1234567

4,findall()

Get all the contents matching the regular expression. If there is a return result, it is the list type. You need to traverse to get each group of contents at one time.

5,sub()

You can modify the text

import re

content = '54K54yr5oiR54ix5L2g'
content = re.sub('\d+','',content)
print(content)

The results are as follows

KyroiRixLg

The first parameter \ d + matches all numbers, the second parameter represents the replaced string, and the third parameter is the original string.

It can be processed by sub method and extracted by findall method.

6,compile()

Regular strings can be compiled into regular expression objects

import re

content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
pattern = re.compile('\d{2}:\d{2}')
result1 = re.sub(pattern,'',content1)
result2 = re.sub(pattern,'',content2)
result3 = re.sub(pattern,'',content3)
print(result1,result2,result3)

2016-12-15  2016-12-17  2016-12-22

We want to remove the time from the three dates. We can use the sub method to compile the regular expression into an object pattern with compile, and finally use the sub method to replace it.

Topics: Python