Python learning notes | regular expression matching

Posted by adt2007 on Tue, 18 Jan 2022 13:33:54 +0100

This content is based on Al Sweigart's Automate the Boring Stuff with Python.
Original link: [Automate the Boring Stuff with Python, Chapter 7](https://automatetheboringstuff.com/2e/chapter7/)

This article contains only the study notes of the author, Ling Wukun, and is not for commercial use.

One: introduction to regular expressions

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace the matched substring, or to extract the substrings that meet a certain condition.
For example, the pattern r'\d\d\d' finds three consecutive digits in a string.
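As a minimal sketch of that idea, the snippet below looks for the first run of three consecutive digits. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'Room 42 is on floor 7, extension 358.'

# r'\d\d\d' needs three digits in a row, so '42' and '7' are skipped.
match = re.search(r'\d\d\d', text)
three_digits = match.group() if match else None
print(three_digits)  # 358
```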

Two: the regular expression module

All regular expression functions in Python are in the re module, so you need to import the re module before using regular expressions.

# Import the re module before calling any regular expression function
import re

Three: regular expression functions

1 define a regular expression -- compile()

Passing a string value that represents a regular expression to re.compile() returns a Regex pattern object (a Regex object for short). We use this Regex object to represent the regular expression in all the operations that follow.

phone_num_Regex = re.compile(r'\d\d\d\d\d\d\d\d\d\d\d')

Here we take a phone number as an example. Mobile phone numbers in mainland China have 11 digits, so we use r'\d\d\d\d\d\d\d\d\d\d\d' as the regular expression for a phone number (the book uses American phone numbers).

2 search a string for a regular expression -- search()

The search() method of a regular expression object (Regex object) searches the string it is given for a match of the regular expression. If the pattern is not found in the string, search() returns None; otherwise it returns a Match object.

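string = input()
phone_num_Regex = re.compile(r'\d\d\d\d\d\d\d\d\d\d\d')
my_phonenum = phone_num_Regex.search(string)
# search() returns None when nothing matches, so check before calling group()
if my_phonenum is not None:
    print('Phone number found is ' + my_phonenum.group())
else:
    print('No phone number found')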

When we run the code above to look for a phone number:
When you enter "my phone number is 1234567899876"
the program reports the phone number 12345678998, because the pattern matches the first 11 consecutive digits.

Note the workflow here: after importing the re module, we build a Regex object with re.compile(), which represents the regular expression we need. Calling search() on the Regex object returns a Match object, and calling group() on that Match object returns the text that actually matched.

3 groups in a regular expression -- group()

When we use parentheses in a regular expression (see the () entry in section Four, symbols of regular expressions), we divide the regular expression into several groups. We can then call group() with a group number to get the text matched by each group. For example, change the definition line of the code above as follows:

phone_num_Regex = re.compile(r'(\d\d\d)(\d\d\d\d\d\d\d\d)')

We now have a regular expression with two groups, and my_phonenum.group(1) and my_phonenum.group(2) return the text matched by the first and second group respectively.

Note 1: my_phonenum.group(), like my_phonenum.group(0), returns the entire matched text, not the first group.
Note 2: group() works on a single match only and cannot be used to retrieve several matches in one string. To find every match, use findall() (see 4 below, findall()).
Note 3: group(n) returns the text matched by the n-th pair of parentheses, counted from left to right, wherever that pair appears in the regular expression.
Note 4: if you want to retrieve all groups at once, use the groups() method -- note the plural form of the name.
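The notes above can be checked with a short sketch. The input string is a made-up example, and the variable names are only for illustration.

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)(\d\d\d\d\d\d\d\d)')
match = phone_num_regex.search('my phone number is 12345678998')

whole = match.group()    # same as group(0): the entire match
first = match.group(1)   # text matched by the first pair of parentheses
second = match.group(2)  # text matched by the second pair of parentheses
both = match.groups()    # all groups at once, as a tuple

print(whole)   # 12345678998
print(first)   # 123
print(second)  # 45678998
print(both)    # ('123', '45678998')
```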

4 find all matches -- findall()

search() returns a Match object for the first match found in the string, whereas findall() returns every match. Unlike search(), findall() does not return Match objects but a list of strings (or a list of tuples if the regular expression contains groups).

string = input()
phone_num_Regex = re.compile(r'\d\d\d\d\d\d\d\d\d\d\d')
my_phonenum=phone_num_Regex.findall(string)
print(my_phonenum)
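To make the two return shapes concrete, here is a small sketch that runs findall() with and without groups. The phone numbers in the sample text are invented.

```python
import re

# Made-up sample text containing two 11-digit numbers.
text = 'Home: 13800000000, work: 13911112222'

plain_regex = re.compile(r'\d\d\d\d\d\d\d\d\d\d\d')
grouped_regex = re.compile(r'(\d\d\d)(\d\d\d\d\d\d\d\d)')

plain_matches = plain_regex.findall(text)      # no groups: list of strings
grouped_matches = grouped_regex.findall(text)  # groups: list of tuples

print(plain_matches)    # ['13800000000', '13911112222']
print(grouped_matches)  # [('138', '00000000'), ('139', '11112222')]
```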

Four: symbols of regular expressions

Symbol	Meaning	Example
()	Grouping; each group is retrieved with group()	re.compile(r'(\d\d\d)(\d\d\d\d\d\d\d\d)')
|	Matches one of several alternative expressions	re.compile(r'Bat(man|mobile|copter|bat)')
?	Optional match (zero or one occurrence)	re.compile(r'Bat(wo)?man')
*	Zero or more matches	re.compile(r'Bat(wo)*man')
+	One or more matches	re.compile(r'Bat(wo)+man')
{}	Specific number of repetitions	re.compile(r'(Ha){3,5}')
^	The match must appear at the beginning of the text	re.compile(r'^Hello')
$	The match must appear at the end of the text	re.compile(r'\d$') (the text ends with a digit)
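The sketch below tries each symbol from the table on a short string. The helper matched() and the test strings are hypothetical, introduced only to keep the example compact.

```python
import re

def matched(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, text)
    return m.group() if m else None

results = {
    'pipe': matched(r'Bat(man|mobile|copter|bat)', 'Batmobile lost a wheel'),
    'optional': matched(r'Bat(wo)?man', 'Batwoman'),
    'optional_twice': matched(r'Bat(wo)?man', 'Batwowoman'),  # two 'wo': no match
    'star': matched(r'Bat(wo)*man', 'Batwowowoman'),
    'plus': matched(r'Bat(wo)+man', 'Batman'),                # needs at least one 'wo'
    'braces': matched(r'(Ha){3}', 'HaHaHa'),
    'caret': matched(r'^Hello', 'Hello world'),
    'dollar': matched(r'\d$', 'Your number is 42'),
}

for name, value in results.items():
    print(name, value)
```

Here 'optional_twice' and 'plus' come back as None, which shows the difference between ?, * and + on the same kind of input.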

Greedy matching

(Ha){3,5} can match three, four or five instances of Ha in the string 'HaHaHaHaHa'.
By default, Python's regular expressions are greedy, which means that in ambiguous cases they match the longest string possible. Adding a question mark after the closing brace, as in (Ha){3,5}?, makes the match non-greedy, so it matches the shortest string possible.
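A short sketch of the difference, using the same string for both patterns (the variable names are only illustrative):

```python
import re

# Greedy: takes as many repetitions as the braces allow.
greedy = re.search(r'(Ha){3,5}', 'HaHaHaHaHa').group()

# Non-greedy: the trailing ? makes it stop at the minimum of three.
non_greedy = re.search(r'(Ha){3,5}?', 'HaHaHaHaHa').group()

print(greedy)      # HaHaHaHaHa
print(non_greedy)  # HaHaHa
```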

Appendix

1 shorthand character class

Shorthand character class	Represents
\d	Any digit from 0 to 9
\D	Any character that is not a digit
\w	Any letter, digit, or underscore character
\W	Any character that is not a letter, digit, or underscore
\s	Any space, tab, or line break (think of it as matching whitespace)
\S	Any character that is not a space, tab, or line break
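As a rough illustration, the shorthand classes can be combined into one pattern. The sample strings below are made up for the example.

```python
import re

# One or more digits, one whitespace character, then one or more word characters.
gifts = re.findall(r'\d+\s\w+', '12 drummers, 11 pipers, 10 lords')

# \D+ collects the runs of non-digit characters between the numbers.
separators = re.findall(r'\D+', '2022-01-18')

# \S+ collects the runs of non-whitespace characters.
words = re.findall(r'\S+', 'a b\tc\nd')

print(gifts)       # ['12 drummers', '11 pipers', '10 lords']
print(separators)  # ['-', '-']
print(words)       # ['a', 'b', 'c', 'd']
```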

2 make your own character class

Define your own character class with square brackets

# Defines a character class that matches any vowel, lowercase or uppercase
vowel_Regex = re.compile(r'[aeiouAEIOU]')
# Defines a character class that matches lowercase letters, uppercase letters, and digits
ascii_Regex = re.compile(r'[a-zA-Z0-9]')

Note 1: inside square brackets, ordinary regular expression symbols are not interpreted, so the character class [0-5.] matches the digits 0 to 5 and a period; there is no need to write it as [0-5\.].
Note 2: you can create a negative character class by placing a caret (^) right after the opening bracket of the character class, as in [^aeiouAEIOU].
Note 3: a period (.) matches any character except a line break, for example:

atRegex = re.compile(r'.at')

With this regular expression, we can match words such as cat, rat and sat.
Note 4: the dot means "any single character except a line break" and the asterisk means "zero or more of the preceding character", so dot-star (.*) is often used to match anything at all, for example:

nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')

With this regular expression, we can match names
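The four notes in this section can be tried out with a sketch like the following. The sample sentences are chosen only for illustration.

```python
import re

# A custom class: every vowel in the text, one character per match.
vowels = re.findall(r'[aeiouAEIOU]', 'RoboCop eats baby food.')

# Note 1: the period inside brackets is a literal period.
digits_and_dot = re.findall(r'[0-5.]', '3.7 and 9')

# Note 2: a caret right after the opening bracket negates the class.
not_vowels = re.findall(r'[^aeiou]', 'regex')

# Note 3: the dot matches exactly one character, so 'flat' yields 'lat'.
at_words = re.findall(r'.at', 'The cat in the hat sat on the flat mat.')

# Note 4: dot-star captures everything between the fixed labels.
name = re.search(r'First Name: (.*) Last Name: (.*)',
                 'First Name: Al Last Name: Sweigart').groups()

print(vowels)          # ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']
print(digits_and_dot)  # ['3', '.']
print(not_vowels)      # ['r', 'g', 'x']
print(at_words)        # ['cat', 'hat', 'sat', 'lat', 'mat']
print(name)            # ('Al', 'Sweigart')
```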

3 case insensitive matching

When we pass re.I (short for re.IGNORECASE) as the second argument to re.compile(), matching is not case sensitive.

robocop = re.compile(r'robocop', re.I)
robocop.search('RoboCop').group()
# returns 'RoboCop'

Summary

Regular expressions are a good way to find text that fits a given pattern in a string. By calling the functions of the re module and making good use of the special symbols, we can search efficiently.
Finally, I include the description of the practice project from the book here, so interested readers can try it themselves.
(The reference code for the project is given in the book.)

Project: phone number and email address extractor

Suppose you have the boring task of finding every phone number and email address in a long web page or document. If you scroll through the page manually, you may end up searching for a long time. But if you had a program that could search the text on your clipboard for phone numbers and email addresses, you could simply press CTRL-A to select all the text, press CTRL-C to copy it to the clipboard, and then run your program. It could replace the text on the clipboard with just the phone numbers and email addresses it finds.

Whenever you tackle a new project, it can be tempting to dive right into writing code. But usually it's best to take a step back and consider the bigger picture. I suggest that you first draw up a high-level plan for what your program needs to do. Don't think about the actual code yet -- you can worry about that later. For now, stick to broad strokes.

For example, your phone number and email address extractor needs to do the following:

1. Get the text from the clipboard.
2. Find all phone numbers and email addresses in the text.
3. Paste them onto the clipboard.

Now you can start thinking about how this works in your code. This code will need to do the following:

1. Use the pyperclip module to copy and paste strings.
2. Create two regular expressions, one to match phone numbers and the other to match email addresses.
3. Find all matches of both regular expressions, not just the first match.
4. Neatly format the matched strings into a single string to paste.
5. Display a message if no matches were found in the text.

This list is like a roadmap for the project. As you write the code, you can focus on each of these steps separately. Each step is fairly manageable and can be expressed in terms of things you already know how to do in Python.
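The roadmap can be sketched roughly as follows. This is not the book's reference code. Both patterns are simplified assumptions (11 consecutive digits for a phone number, a loose user@domain shape for an email address), and extract_contacts() is a hypothetical helper name. To keep the sketch runnable without extra packages, it works on a plain string; the clipboard steps are only indicated in the comments.

```python
import re

# Assumed, simplified patterns for illustration only.
phone_regex = re.compile(r'\d{11}')
email_regex = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}')

def extract_contacts(text):
    """Return every phone number and email address in text, one per line."""
    # findall() collects all matches, not just the first one.
    matches = phone_regex.findall(text) + email_regex.findall(text)
    if not matches:
        return 'No phone numbers or email addresses found.'
    return '\n'.join(matches)

# In the real project the text would come from pyperclip.paste() and the
# result would go back to the clipboard with pyperclip.copy(result).
sample = 'Call 13800000000 or write to al@example.com and bob@example.org.'
result = extract_contacts(sample)
print(result)
```

Keeping the matching logic in a function that takes and returns strings makes it easy to test without touching the clipboard.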
