Introduction to regular expressions
A regular expression is a special sequence of characters, typically used to find and replace text that matches a given pattern (rule).
Python provides regular expressions through the re module.
Characteristics
- Very flexible, expressive, and powerful
- Lets you perform complex string processing quickly and concisely
Regular expressions in Python

    import re

    word = 'h230dsdkf349dsd0)'
    # Extract the runs of digits
    print(re.findall(r'\d+', word))
    # The r prefix makes a raw string, so the backslash is not treated as a string escape.
    # Without r, the backslash has to be doubled: '\\d+'
    print(re.sub(r'\d', "@", word))

Output:

    ['230', '349', '0']
    h@@@dsdkf@@@dsd@)
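To make the role of the r prefix concrete, the sketch below writes the same pattern twice, once as a raw string and once with a doubled backslash, and checks that both behave identically. The variable names are illustrative only.

```python
import re

word = 'h230dsdkf349dsd0)'

raw_pattern = r'\d+'      # raw string: the backslash reaches the regex engine unchanged
escaped_pattern = '\\d+'  # normal string: the backslash has to be doubled

# Both spellings produce the same three-character string, so they match the same text.
same_text = raw_pattern == escaped_pattern
raw_result = re.findall(raw_pattern, word)
escaped_result = re.findall(escaped_pattern, word)

print(same_text)
print(raw_result)
print(escaped_result)
```

Raw strings are usually the more readable choice, since regular expressions tend to contain many backslashes.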
Regular search
1. search method

    import re

    word = 'h2303dsdkf349dsd0)'
    # Find a number that starts with 3.
    # word.index('3')  # A string method can only locate a fixed substring, not a pattern
    # \d matches any digit; + means one or more occurrences
    result = re.search(r'3\d+', word)
    print(result)
    # Without +, \d matches exactly one digit
    result = re.search(r'3\d', word)
    print(result)

Output:

    <re.Match object; span=(2, 5), match='303'>
    <re.Match object; span=(2, 4), match='30'>
2. match method
match only succeeds if the pattern matches at the very beginning of the string.

    print(re.match(r'3\d+', word))
    print(re.match(r'h\d+', word))

Output:

    None
    <re.Match object; span=(0, 5), match='h2303'>

fullmatch: the entire string must match

    print(re.fullmatch(r'h\d+', word))
    print(re.fullmatch(r'h\d+', 'h12312324'))

Output:

    None
    <re.Match object; span=(0, 9), match='h12312324'>
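As a side-by-side summary of the three functions, the following sketch runs search, match, and fullmatch on the same inputs and records which of them succeed. The helper name describe is hypothetical and exists only for this illustration.

```python
import re

def describe(pattern, text):
    """Report which of search / match / fullmatch succeed (hypothetical helper)."""
    return {
        'search': re.search(pattern, text) is not None,        # anywhere in the string
        'match': re.match(pattern, text) is not None,          # only at the start
        'fullmatch': re.fullmatch(pattern, text) is not None,  # whole string
    }

word = 'h2303dsdkf349dsd0)'

anywhere = describe(r'3\d+', word)        # digits starting with 3 occur mid-string
prefix = describe(r'h\d+', word)          # 'h2303' is only a prefix of word
whole = describe(r'h\d+', 'h12312324')    # the pattern covers the entire string

print(anywhere)
print(prefix)
print(whole)
```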
3. findall / finditer: find every match

    x = re.finditer(r'3\d+', word)  # The result is an iterator
    for i in x:
        print(i)

Output:

    <re.Match object; span=(2, 5), match='303'>
    <re.Match object; span=(10, 13), match='349'>

    print(re.findall(r'3\d+', word))  # The result is a list

Output:

    ['303', '349']
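The practical difference between the two is what each one hands back. A minimal sketch, reusing the same word as above: findall yields only the matched text, while finditer yields Match objects that also carry positions.

```python
import re

word = 'h2303dsdkf349dsd0)'

# findall returns a list containing only the matched strings.
texts = re.findall(r'3\d+', word)

# finditer yields Match objects, so the position of each match is available too.
positions = [(m.group(), m.span()) for m in re.finditer(r'3\d+', word)]

print(texts)
print(positions)
```

finditer is also lazy, which can help when scanning a large string for matches that are processed one at a time.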
re.Match type

    # The results returned by match / search / fullmatch / finditer in the re module are re.Match objects
    x1 = re.search(r'\d+', 'ab12cd34ef567g8901')
    print(dir(x1))

Output:

    ['__class__', '__class_getitem__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']

    print(x1.end())     # End index of the matched text
    print(x1.endpos)    # Index where the search stopped; by default the length of the whole string
    print(x1.span())    # The (start, end) indices of the matched text
    print(x1.string)    # The full string that was searched
    print(x1.group())   # The matched text

Output:

    4
    18
    (2, 4)
    ab12cd34ef567g8901
    12
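A short sketch of how these attributes relate to one another: slicing the searched string with the span gives back the matched text, and endpos only equals the string length when no explicit end position is passed. The variable names are illustrative; the explicit pos/endpos arguments are a feature of compiled pattern objects.

```python
import re

text = 'ab12cd34ef567g8901'
m = re.search(r'\d+', text)

start, end = m.span()
sliced = m.string[start:end]     # slicing with the span reproduces group()
bounds = (m.start(), m.end())    # start() and end() are the two halves of span()
default_endpos = m.endpos        # no end position given, so this is len(text)

# A compiled pattern accepts explicit pos and endpos arguments;
# here only text[0:3] ('ab1') is searched.
limited = re.compile(r'\d+').search(text, 0, 3)

print(sliced, bounds, default_endpos)
print(limited.group(), limited.span(), limited.endpos)
```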
Grouping of regular expressions

    import re

    x1 = re.search(r'b\d+cd\d+ef\d+g\d+', 'ab12cd34ef567g8901')
    print(x1.group())
    x2 = re.search(r'(b\d+)(cd\d+)(ef\d+)(g\d+)', 'ab12cd34ef567g8901')
    print(x2.group())
    # Both results are the same, but the first pattern has no capture groups
    # (only the whole match), while the second has four
    print(x2.group(1))
    print(x2.group(2))
    print(x2.group(0))  # group(0) is equivalent to group()

Output:

    b12cd34ef567g8901
    b12cd34ef567g8901
    b12
    cd34
    b12cd34ef567g8901

    print(x1.groups())
    print(x2.groups())  # A tuple of the strings matched by all subgroups

Output:

    ()
    ('b12', 'cd34', 'ef567', 'g8901')
Regular expressions use the (?P<name>...) syntax to name groups

    x3 = re.search(r'(?P<name1>ab\d+)(?P<name2>cd\d+)(?P<name3>ef\d+)(?P<name4>g\d+)', 'ab12cd34ef567g8901')
    print(x3.lastgroup)    # Name of the last matched group
    print(x3.groupdict())  # Dictionary mapping each group name to the text it matched

Output:

    name4
    {'name1': 'ab12', 'name2': 'cd34', 'name3': 'ef567', 'name4': 'g8901'}
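Named groups can also be read individually and reused in a replacement string. The sketch below assumes a hypothetical date format and hypothetical group names (year, month, day); it only illustrates the mechanism.

```python
import re

# Hypothetical pattern and input, chosen for illustration.
date_pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
text = 'released on 2021-03-15'

m = re.search(date_pattern, text)
year_by_name = m.group('year')   # look a group up by its name
year_by_number = m.group(1)      # named groups are still numbered from 1

# In a replacement string, \g<name> inserts the text of a named group.
swapped = re.sub(date_pattern, r'\g<day>/\g<month>/\g<year>', text)

print(year_by_name, year_by_number)
print(swapped)
```

Names tend to keep a pattern readable once it has more than two or three groups, because inserting a new group does not shift the meaning of existing references.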
re.compile
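re.compile turns a pattern into a reusable pattern object. Its methods (search, match, findall, and so on) can be used in place of the module-level functions, without passing the pattern each time.

    pattern = re.compile(r"ab\d+")
    print(pattern.search('ab1234'))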
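Compiling is mostly useful when the same pattern is applied to many strings. A minimal sketch, with a hypothetical list of sample inputs:

```python
import re

ab_number = re.compile(r'ab\d+')

# Hypothetical sample inputs for illustration.
samples = ['ab1234', 'xxab5', 'abc']

found = []
for s in samples:
    m = ab_number.search(s)                   # same call as re.search, minus the pattern argument
    found.append(m.group() if m else None)    # None when nothing matched

print(found)
print(ab_number.pattern)   # the original pattern text is kept on the object
```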
Regex modifiers (flags)

    print(re.search(r'a', 'ABC'))
    print(re.search(r'a', 'ABC', re.I))  # re.I is a modifier (flag) that ignores case

Output:

    None
    <re.Match object; span=(0, 1), match='A'>
Modifier | Full name | Meaning |
---|---|---|
re.A | re.ASCII | Make \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only |
re.I | re.IGNORECASE | Ignore case when matching |
re.L | re.LOCALE | Make \w, \W, \b, \B and case-insensitive matching depend on the current locale |
re.M | re.MULTILINE | '^' matches at the beginning of the string and at the beginning of each line (immediately after each newline); '$' matches at the end of the string and at the end of each line (immediately before each newline). By default, '^' matches only at the beginning of the string and '$' only at the end of the string. |
re.S | re.DOTALL | Make the '.' special character match any character at all, including a newline |
re.X | re.VERBOSE | Whitespace in the pattern is ignored, and a # that is not inside a character class or escaped starts a comment that runs to the end of the line |
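The effect of the multi-line, dot-all, and verbose modifiers is easiest to see on a two-line string. The following is a small sketch with a made-up input; flags are combined with the bitwise OR operator.

```python
import re

text = 'first line\nsecond line'

# re.M: '^' also matches immediately after each newline.
default_starts = re.findall(r'^\w+', text)
multiline_starts = re.findall(r'^\w+', text, re.M)

# re.S: '.' also matches the newline character.
default_dot = re.search(r'first.*second', text)
dotall_dot = re.search(r'first.*second', text, re.S)

# re.X: whitespace and comments inside the pattern are ignored.
# Two flags are combined here with the bitwise OR operator.
two_words = re.compile(r'''
    (\w+)   # a word
    \s+     # one or more whitespace characters
    (\w+)   # another word
''', re.X | re.I)
words = two_words.search(text).groups()

print(default_starts, multiline_starts)
print(default_dot, dotall_dot.group() == 'first line\nsecond')
print(words)
```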
Regular expression matching rules
- Digits and letters match themselves and have no special meaning
- \ has a special meaning: it is the escape character
- Many letters take on a special meaning when they are preceded by \
- Many punctuation marks have special meanings
- To match such a punctuation mark literally, escape it with \
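The rules above can be sketched with two small comparisons: an unescaped versus an escaped dot, and a plain letter versus the same letter preceded by a backslash. The input strings are made up for illustration.

```python
import re

text = 'version 3.14 and 3x14'

# An unescaped '.' is special: it matches any character except a newline.
loose = re.findall(r'3.14', text)
# An escaped '\.' matches only a literal dot.
strict = re.findall(r'3\.14', text)

# A plain letter matches itself; with a backslash it can take on a special meaning.
literal_d = re.findall(r'd', 'd1')
any_digit = re.findall(r'\d', 'd1')

print(loose, strict)
print(literal_d, any_digit)
```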
Special meaning of escaped letters
Pattern | Description |
---|---|
\w | Matches a letter, digit, or underscore |
\W | Matches any character that is not a letter, digit, or underscore |
\s | Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v] |
\S | Matches any non-whitespace character |
\d | Matches any digit; for ASCII text, equivalent to [0-9] |
\D | Matches any non-digit character |
\A | Matches at the start of the string |
\Z | Matches at the end of the string. In Python's re module it matches only at the very end; in some other regex flavors it also matches just before a trailing newline. |
\z | Matches at the very end of the string. Only recent versions of Python's re module accept it, as a synonym for \Z; older versions reject it. |
\G | Matches at the position where the previous match ended. Python's re module does not support it. |
\b | Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb'. |
\B | Matches a position that is not a word boundary. For example, 'er\B' matches the 'er' in 'verb' but not the 'er' in 'never'. |
\n, \t, etc. | Match a newline, a tab, and so on |
\1...\9 | Matches the same text that the nth group matched |
\10 | Matches the text of group 10 if that group exists. Python's re module raises an error otherwise; some other regex flavors interpret it as an octal character code. |
print(re.findall(r'\s', 'hello world, hello python'))
[' ', ' ', ' ']
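The word-boundary and group-reference entries of the table are the least intuitive ones, so here is a small sketch of both. The sample sentences are made up for illustration.

```python
import re

text = 'never verb'

# \b: 'er' must be followed by a word boundary, as at the end of 'never'.
at_boundary = re.search(r'er\b', text).span()
# \B: 'er' must NOT be followed by a word boundary, as inside 'verb'.
inside_word = re.search(r'er\B', text).span()

# \1 matches the same text that group 1 matched, which finds doubled words.
# With one capture group, findall returns the text of that group.
doubled = re.findall(r'\b(\w+) \1\b', 'this is is a test test')

print(at_boundary, inside_word)
print(doubled)
```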
Special meaning of punctuation
re.escape no longer escapes '_'.
re.escape escapes only characters that can have a special meaning in a regular expression. As a result, '!', '"', '%', "'", ',', '/', ':', ';', '<', '=', '>', '@' and '`' are no longer escaped.
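The note above mirrors how the Python documentation describes re.escape. Assuming that is the function being referred to, and assuming Python 3.7 or later, the sketch below shows it turning arbitrary text into a pattern that matches that text literally. The input string is hypothetical.

```python
import re

# Hypothetical text containing characters that are special in a pattern.
user_input = 'price: $5.00 (approx.)'

escaped = re.escape(user_input)

# The escaped pattern matches the original text literally.
literal_match = re.fullmatch(escaped, user_input) is not None
# Special characters such as '$' and '.' receive a backslash.
dollar_escaped = r'\$' in escaped
dot_escaped = r'\.' in escaped
# Characters without a special meaning, such as '_' and '@', are left alone
# (assumes Python 3.7 or later).
untouched = re.escape('a_b@c') == 'a_b@c'

print(escaped)
print(literal_match, dollar_escaped, dot_escaped, untouched)
```

This is handy when part of a pattern comes from user input and must not be interpreted as regular expression syntax.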