- Regular expression
- Overview
- A regular expression defines a string pattern that can be used to search, edit, and process text. Regular expressions are not tied to one language, although each language has subtle differences. JDK 1.4 introduced regular expressions in the java.util.regex package.
- Syntax
Metacharacter | Description |
\ | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "\\n" matches the two characters "\n", while "\n" matches a newline character. The sequence "\\" matches "\" and "\(" matches "(". This corresponds to the "escape character" concept in many programming languages. |
^ | Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r". |
$ | Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r". |
* | Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo", and "zoo". * is equivalent to {0,}. |
+ | Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}. |
? | Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}. |
{n} | n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food". |
{n,} | n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "fooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*". |
{n,m} | m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers. |
. (dot) | Matches any single character except line terminators such as "\r" and "\n". To match any character including "\r" and "\n", use a pattern like "[\s\S]". |
(pattern) | Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection: VBScript uses the SubMatches collection, and JScript uses the $0...$9 properties. To match parenthesis characters, use "\(" or "\)". |
(?:pattern) | Matches pattern but does not capture the match; that is, it is a non-capturing group and the result is not stored for later use. This is useful when combining parts of a pattern with the "or" character "|". For example, "industr(?:y|ies)" is a more concise expression than "industry|industries". |
(?=pattern) | Positive lookahead: matches the search string at any position where a string matching pattern begins. This is a non-capturing match; the match is not stored for later use. For example, "Windows(?=95|98|NT|2000)" matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead. |
(?!pattern) | Negative lookahead: matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; the match is not stored for later use. For example, "Windows(?!95|98|NT|2000)" matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000". |
(?<=pattern) | Positive lookbehind: like positive lookahead, but looking in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches "Windows" in "2000Windows", but not "Windows" in "3.1Windows". |
(?<!pattern) | Negative lookbehind: like negative lookahead, but looking in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches "Windows" in "3.1Windows", but not "Windows" in "2000Windows". |
x|y | Matches x or y. For example, "z|food" matches "z" or "food" (be careful here: the alternation covers the whole expression, so in "zoo" only the "z" is matched). "(z|f)ood" matches "zood" or "food". |
[xyz] | Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain". |
[^xyz] | Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches each of "p", "l", "i", and "n" in "plain". |
[a-z] | Character range. Matches any character within the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z". Note: the hyphen denotes a range only when it is inside the character set and sits between two characters; at the start of the set it stands for the hyphen itself. |
[^a-z] | Negated character range. Matches any character that is not within the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z". |
\b | Matches a word boundary, that is, the position between a word and a space. (A regular expression "match" can mean either matching characters or matching a position; \b matches a position.) For example, "er\b" matches "er" in "never", but not "er" in "verb". |
\B | Matches a non-word boundary. "er\B" matches "er" in "verb", but not "er" in "never". |
\cx | Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return. The value of x must be in A-Z or a-z. Otherwise, c is treated as a literal "c" character. |
\d | Matches a digit character. Equivalent to [0-9]. |
\D | Matches a non-digit character. Equivalent to [^0-9]. |
\f | Matches a form feed. Equivalent to \x0c and \cL. |
\n | Matches a newline character. Equivalent to \x0a and \cJ. |
\r | Matches a carriage return. Equivalent to \x0d and \cM. |
\s | Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]. |
\t | Matches a tab. Equivalent to \x09 and \cI. |
\v | Matches a vertical tab. Equivalent to \x0b and \cK. |
\w | Matches any word character, including the underscore. Similar but not equivalent to "[A-Za-z0-9_]": the "word" characters here use the Unicode character set. |
\W | Matches any non-word character. The complement of \w; similar to "[^A-Za-z0-9_]". |
\xn | Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions. |
\num | Matches num, where num is a positive integer: a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters. |
\n | Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value. |
\nm | Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm. |
\nml | If n is an octal digit (0-3) and m and l are both octal digits (0-7), matches the octal escape value nml. |
\un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©). |
\< \> | Match the beginning (\<) and end (\>) of a word. For example, the regular expression \<the\> matches "the" in the string "for the wise", but not "the" in the string "otherwise". Note: this metacharacter is not supported by all software. |
\( \) | Define the expression between \( and \) as a "group" and save the characters matched by it in a temporary area (a regular expression can save up to 9 groups), which can be referenced as \1 through \9. |
| | Logical OR of two match conditions. For example, the regular expression (him|her) matches "it belongs to him" and "it belongs to her", but not "it belongs to them." Note: this metacharacter is not supported by all software. |
+ | Matches one or more of the character just before it. For example, the regular expression 9+ matches 9, 99, 999, and so on. Note: this metacharacter is not supported by all software. |
? | Matches zero or one of the character just before it. Note: this metacharacter is not supported by all software. |
{i} {i,j} | Matches a specified number of occurrences of the expression before it. For example, the regular expression A[0-9]{3} matches the character "A" followed by exactly three digits, such as A123 or A348, but does not match A1234. The regular expression [0-9]{4,6} matches any four, five, or six consecutive digits. |
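The lookahead and lookbehind rows in the table can be checked with a short Java sketch. The class name `LookaroundDemo` and its two helper methods are hypothetical names chosen for illustration; only `Pattern` and `Matcher` from java.util.regex are assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Returns true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: "Windows" only when followed by 95, 98, NT or 2000
        System.out.println(found("Windows(?=95|98|NT|2000)", "Windows2000"));      // true
        System.out.println(found("Windows(?=95|98|NT|2000)", "Windows3.1"));       // false
        // The lookahead text is not part of the match itself
        System.out.println(firstMatch("Windows(?=95|98|NT|2000)", "Windows2000")); // Windows
        // Negative lookahead
        System.out.println(found("Windows(?!95|98|NT|2000)", "Windows3.1"));       // true
        System.out.println(found("Windows(?!95|98|NT|2000)", "Windows2000"));      // false
        // Positive and negative lookbehind
        System.out.println(found("(?<=95|98|NT|2000)Windows", "2000Windows"));     // true
        System.out.println(found("(?<!95|98|NT|2000)Windows", "2000Windows"));     // false
    }
}
```

None of these patterns needs a backslash, so they can be written in Java string literals exactly as they appear in the table.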
A backslash inside a string literal in Java source code is interpreted as a Unicode escape or another character escape. Therefore, a backslash that belongs to the regular expression must be doubled in the string literal, so that it is protected from interpretation by the Java compiler. For example, interpreted as a regular expression, the string literal "\b" matches a single backspace character, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), use the string literal "\\(hello\\)".
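The doubling rule can be seen in a minimal sketch. The class name `EscapeDemo` and its helper methods are hypothetical; `Pattern.quote` is shown as one possible alternative to escaping each metacharacter by hand.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // "er\\b" in source is the regex er\b: "er" followed by a word boundary.
    static boolean endsWordWithEr(String input) {
        return Pattern.compile("er\\b").matcher(input).find();
    }

    // "\\(hello\\)" in source is the regex \(hello\): the literal text (hello).
    static boolean isParenHello(String input) {
        return Pattern.matches("\\(hello\\)", input);
    }

    public static void main(String[] args) {
        System.out.println(endsWordWithEr("never"));   // true
        System.out.println(endsWordWithEr("verb"));    // false
        System.out.println(isParenHello("(hello)"));   // true
        // "\b" in source is a single backspace character, not a word boundary,
        // so it only matches an input that contains a backspace.
        System.out.println(Pattern.compile("er\b").matcher("never").find()); // false
        // Pattern.quote turns a whole string into a literal pattern.
        System.out.println(Pattern.matches(Pattern.quote("(hello)"), "(hello)")); // true
    }
}
```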
(), [], {}
1>. Parentheses (): match the string inside the parentheses as a unit; it can be one or more characters
2>. Brackets []: match any one character from the character set, such as [0-9a-zA-Z.*?!]. Inside [], characters are ordinary characters rather than metacharacters. The exception is the "-" in the middle of "0-9" and "a-z": it is a connecting symbol, a metacharacter that denotes a range. Written as [-!?*(], it is an ordinary character
3>. Braces {}: a repetition count for the element matched by the expression before them. {n} matches exactly n times, {n,} matches at least n times, and {n,m} matches at least n times and at most m times
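The three kinds of brackets can be combined in one pattern. The sketch below is illustrative only: the class name `BracketKindsDemo`, the helper names, and the "two letters, hyphen, three digits" format are assumptions made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketKindsDemo {
    // () captures a group, [] lists allowed characters, {} sets the repetition count.
    static final Pattern CODE = Pattern.compile("([A-Z]{2})-([0-9]{3})");

    // Returns the two captured parts, or null if the whole input does not match.
    static String[] parse(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // A leading "-" inside [] is an ordinary character, like ! ? * and ( here.
    static boolean hasLiteralSeparator(String input) {
        return input.matches("a[-!?*(]b");
    }

    public static void main(String[] args) {
        String[] parts = parse("AB-123");
        System.out.println(parts[0] + " / " + parts[1]);   // AB / 123
        System.out.println(parse("AB-12") == null);        // true: {3} needs three digits
        System.out.println(hasLiteralSeparator("a-b"));    // true
        System.out.println(hasLiteralSeparator("a*b"));    // true
        System.out.println(hasLiteralSeparator("acb"));    // false
    }
}
```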
Regular expressions in Java
java.util.regex package
PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern
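Because the exception is unchecked, the compiler does not force a try/catch around Pattern.compile. A minimal sketch of validating a pattern, for example one supplied by a user, might look like this; the class name `SyntaxErrorDemo` and the helper `check` are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxErrorDemo {
    // Returns null if the regex compiles, otherwise a short error description.
    static String check(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() report what went wrong and where
            return e.getDescription() + " (index " + e.getIndex() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(check("\\d{11}")); // null: the pattern is valid
        System.out.println(check("(abc"));    // an error description: the group is never closed
    }
}
```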
Pattern
- Represents a compiled regular expression, that is, a matching pattern. Its constructor is private, so it cannot be instantiated directly
- Use Pattern.compile(String regex) to create a compiled regular expression
- On its own it supports only some simple matching operations
Usage:
Pattern.split(CharSequence input): an instance method that splits the input around matches of the pattern
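As a rough illustration of the simple operations Pattern offers by itself, the sketch below splits a string on runs of digits and uses the static convenience method Pattern.matches. The class name `SplitDemo`, the helper name, and the sample input are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // Splits the input around every run of one or more digits.
    static String[] splitOnDigits(String input) {
        Pattern p = Pattern.compile("\\d+");
        return p.split(input);
    }

    public static void main(String[] args) {
        String[] words = splitOnDigits("one1two22three333four");
        System.out.println(Arrays.toString(words));          // [one, two, three, four]
        // Pattern.matches compiles the regex and tests the whole input in one call
        System.out.println(Pattern.matches("\\d+", "2024")); // true
        System.out.println(Pattern.matches("\\d+", "20x4")); // false
    }
}
```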
- Matcher
- Its constructor is not public either, so it cannot be instantiated directly. An instance is obtained only through the Pattern.matcher(CharSequence input) method, for example Matcher m = p.matcher("aaaaab");
- Supports convenient and powerful matching operations, including groups and multiple matches
- String matching
Matcher.matches(): matches the pattern against the entire string and returns true only if the whole string matches
Matcher.lookingAt(): matches from the beginning of the string and returns true only if a match starts at the beginning
Matcher.find(): searches for a match; the matched text can be at any position in the string
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // 1. matches: the whole input must match
    public static void test1() {
        String input = "13113113111";
        String regex = "\\d{11}";
        // Compile the regular expression
        Pattern pattern = Pattern.compile(regex);
        // Create the matcher
        Matcher matcher = pattern.matcher(input);
        // Match
        boolean result = matcher.matches();
        System.out.println(result);
    }

    // 2. lookingAt: matches from the start; true if the beginning of the input satisfies the pattern
    public static void test2() {
        String input = "13113113111asd";
        String regex = "\\d{11}";
        // Compile the regular expression
        Pattern pattern = Pattern.compile(regex);
        // Create the matcher
        Matcher matcher = pattern.matcher(input);
        // Match
        boolean result = matcher.lookingAt();
        System.out.println(result);
    }

    // 3. find: a match at any position is enough
    public static void test3() {
        String input = "asd13113113111asd";
        String regex = "\\d{11}";
        // Compile the regular expression
        Pattern pattern = Pattern.compile(regex);
        // Create the matcher
        Matcher matcher = pattern.matcher(input);
        // Match
        boolean result = matcher.find();
        System.out.println(result);
    }
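Since find() resumes after the previous match each time it is called, it can be used in a loop to visit every match. The following is a minimal sketch of that idea; the class name `FindAllDemo`, the helper `findAll`, and the "text@start-end" output format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllDemo {
    // Collects every match together with its start (inclusive) and end (exclusive) index.
    static List<String> findAll(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            results.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return results;
    }

    public static void main(String[] args) {
        // Each find() continues where the previous match ended
        System.out.println(findAll("\\d{3}", "ab123cd4567")); // [123@2-5, 456@7-10]
    }
}
```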
- Data extraction
    // 4. group: find() and group() are used together to extract data
    public static void test4() {
        // Sample text in Chinese: "<name>的电话是<number>" means "<name>'s phone number is <number>".
        // The names are Zhang Xiaosan, Li Si and Wang Wu; the 11-digit numbers are illustrative.
        String input = "张小三的电话是13111111111s@##李四的电话是13112222222,王五的电话是13113333333";
        // [\u4E00-\u9FFF] is the range of Chinese characters
        String regex = "([\u4E00-\u9FFF]{2,3})的电话是(\\d{11})";
        // Compile the regular expression
        Pattern pattern = Pattern.compile(regex);
        // Create the matcher
        Matcher matcher = pattern.matcher(input);
        // Match repeatedly
        while (matcher.find()) {
            // Both group() and group(0) return the whole match
            // group(1) is the first group (the first pair of parentheses), group(2) is the second
            // System.out.println(matcher.group());
            // System.out.println(matcher.group(0));
            System.out.println(matcher.group(1) + " : " + matcher.group(2));
        }
    }
- Reduplication
    // Goal: restore the stuttered input to 我要学编程 ("I want to learn programming")
    String input = "我我...我我...我要..要要...要要...学学学....学学...编编编..编程..程.程程...程...程";
    // 1. Remove everything that is not a Chinese character
    input = input.replaceAll("[^\u4E00-\u9FFF]", "");
    // Prints the stuttered text with the dots removed
    System.out.println(input);
    // (.): any character
    // \1: refers back to the text captured by group 1
    // (\d)\1: two identical consecutive digits, such as 11, 22, 33, 44
    // (\d)(a)\1: the first and third characters are the same digit with an "a" in between, such as 1a1, 9a9
    // (\d)(a)\2: a digit followed by two a's, such as 1aa, 3aa
    String regex = "(.)(\\1+)";
    // Compile the regular expression
    Pattern pattern = Pattern.compile(regex);
    // Create the matcher
    Matcher matcher = pattern.matcher(input);
    // Alternative using find():
    // while (matcher.find()) {
    //     // Replace group(0) with group(1)
    //     input = input.replaceAll(matcher.group(), matcher.group(1));
    // }
    // In the replacement string, $1 is group(1); the regex matches group()
    input = input.replaceAll(regex, "$1");
    // Prints 我要学编程
    System.out.println(input);
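The backreference patterns mentioned in the comments can be tried out on plain ASCII input. This is only a sketch: the class name `BackreferenceDemo`, the helper `collapse`, and the sample strings are hypothetical. Note that \1 must be written as \\1 inside a Java string literal.

```java
import java.util.regex.Pattern;

public class BackreferenceDemo {
    // Collapses every run of a repeated character into a single occurrence.
    static String collapse(String input) {
        return input.replaceAll("(.)\\1+", "$1");
    }

    public static void main(String[] args) {
        // (\d)\1: two identical consecutive digits
        System.out.println(Pattern.matches("(\\d)\\1", "11"));       // true
        System.out.println(Pattern.matches("(\\d)\\1", "12"));       // false
        // (\d)(a)\1: same digit before and after the "a"
        System.out.println(Pattern.matches("(\\d)(a)\\1", "1a1"));   // true
        System.out.println(Pattern.matches("(\\d)(a)\\1", "1a2"));   // false
        // (\d)(a)\2: a digit followed by two a's
        System.out.println(Pattern.matches("(\\d)(a)\\2", "1aa"));   // true
        // $1 in the replacement string inserts the text of group 1
        System.out.println(collapse("aaabbbccd"));                   // abcd
    }
}
```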