Regular Expressions for Pattern Matching
What is a regular expression?
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern and match the corresponding characters in the subject. Regular expressions are also described in the Perl documentation and in a number of books and online resources, some of which have copious examples. Many web sites serve as online repositories of useful regular expressions. The description here is intended as introductory documentation only.
Introduction
A regular expression, or regex for short, is a pattern describing a certain amount of text. In this document, regular expressions are highlighted in bold red as regex. The term "string" refers to the text that a regular expression is applied to. Text strings will be highlighted as follows: “Text string”.
The simplest form of regular expression is literal text. For example, the regex Chapter matches any text string that contains the sub-string Chapter. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in a special way.
Character types
Backslash can be used to specify generic character types:
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any character that is not a whitespace character
\w any "word" character (A "word" character is any letter or digit or the underscore character)
\W any "non-word" character
For example: \d{8} matches a run of exactly 8 consecutive digits.
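The character types above can be tried out with a short sketch. It uses Python's standard re module, which is an assumption of this example (the document itself does not name an engine), and the sample text is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 20240131 shipped to dock_7"

# \d{8} finds a run of eight consecutive decimal digits.
date_match = re.search(r"\d{8}", text)
print(date_match.group())  # 20240131

# \w+ collects runs of "word" characters: letters, digits and underscores.
words = re.findall(r"\w+", text)
print(words)

# \s splits the text at each single whitespace character.
parts = re.split(r"\s", text)
print(parts)
```

Note that the underscore in dock_7 counts as a "word" character, so \w+ keeps dock_7 together as one item.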
Matching alternatives
Vertical bar characters are used to separate alternative patterns. For example, the pattern Configuration|Settings matches either "Configuration" or "Settings". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used.
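A minimal sketch of alternatives, assuming Python's standard re module (the sample strings are hypothetical). The last lines illustrate the left-to-right rule: when an earlier alternative succeeds, a later and longer one is never tried.

```python
import re

# Either alternative may match anywhere in the subject string.
pattern = re.compile(r"Configuration|Settings")
found_settings = pattern.search("Open the Settings dialog") is not None
found_options = pattern.search("Open the Options dialog") is not None
print(found_settings, found_options)  # True False

# Alternatives are tried left to right and the first success is used,
# so "Set" wins here even though "Settings" would also match.
first = re.match(r"Set|Settings", "Settings").group()
print(first)  # Set
```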
Sub-Patterns
Sub-patterns are delimited by parentheses (round brackets), which can be nested. For example, the pattern ((red|white) (BMW|Volvo)) matches every combination of "red" or "white" with "BMW" or "Volvo" (e.g. "red BMW" or "white Volvo"). Another example: (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". The meta-character \1 here serves as a back reference to the first sub-pattern: it matches the same text that the sub-pattern actually matched. Such references must, however, follow the sub-pattern to which they refer. If instead the pattern (sens|respons)e and (?1)ibility is used, it does match "sense and responsibility" as well as the other two strings, because (?1) re-applies the first sub-pattern itself rather than the text it matched.
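The sketch below illustrates nested sub-patterns and the \1 back reference using Python's standard re module, which is an assumption of this example. That module does not accept the (?1) syntax, so the last pattern only emulates it by writing the sub-pattern out a second time; an engine that supports (?1) directly would not need this workaround.

```python
import re

# Nested sub-patterns are numbered by their opening parenthesis.
cars = re.fullmatch(r"((red|white) (BMW|Volvo))", "white Volvo")
print(cars.group(1), "|", cars.group(2), "|", cars.group(3))

phrases = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat the exact text matched by the first sub-pattern.
back_ref = re.compile(r"(sens|respons)e and \1ibility")
back_ref_results = [back_ref.fullmatch(p) is not None for p in phrases]
print(back_ref_results)  # [True, True, False]

# Emulation of (?1): the sub-pattern is repeated, so either alternative
# may appear in the second position regardless of the first.
repeated = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")
repeated_results = [repeated.fullmatch(p) is not None for p in phrases]
print(repeated_results)  # [True, True, True]
```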
Matching whole words
Simple text patterns such as Alert will also match inside longer words such as Alerts and Alerted. If you want your pattern to match only whole words, surround it with \b meta-characters. For example, use \bAlert\b to match only the word Alert and exclude all other words that contain it as a sub-string.
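As a small illustration, assuming Python's standard re module and a made-up sample sentence, the difference between the plain pattern and the whole-word pattern looks like this:

```python
import re

# Hypothetical sample text containing the word and two longer variants.
text = "Alert: two Alerts were raised, one Alerted user"

# The plain pattern also matches inside "Alerts" and "Alerted".
plain = re.findall(r"Alert", text)
print(len(plain))  # 3

# With \b on both sides only the standalone word is matched.
whole = re.findall(r"\bAlert\b", text)
print(len(whole))  # 1
```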
Matching sub-string
If the text that you want to match should appear only inside a longer word, use the \B meta-character. For example, the pattern \Bword\B will match inside the word "swordfish", but will ignore the words "word", "words" and "password".
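The four words from the example above can be checked with a short sketch, again assuming Python's standard re module:

```python
import re

# \B on both sides requires a word character directly before and after "word".
inner = re.compile(r"\Bword\B")

candidates = ["swordfish", "word", "words", "password"]
inner_results = {w: inner.search(w) is not None for w in candidates}
print(inner_results)
```

Only "swordfish" is reported as a match: in "words" the pattern starts at a word boundary, and in "password" it ends at one.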
Repetitions
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: z{2,4} matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches.
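The three forms of the quantifier can be compared side by side. This is a minimal sketch assuming Python's standard re module; fullmatch is used so that the whole string must consist of the repeated character.

```python
import re

samples = ["z", "zz", "zzz", "zzzz", "zzzzz"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

bounded = matching(r"z{2,4}")    # between two and four repetitions
open_ended = matching(r"z{2,}")  # two or more, no upper limit
exact = matching(r"z{3}")        # exactly three

print(bounded)     # ['zz', 'zzz', 'zzzz']
print(open_ended)  # ['zz', 'zzz', 'zzzz', 'zzzzz']
print(exact)       # ['zzz']
```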
Character Classes or Character Sets
A "character class" matches only one out of several characters. To match an “a” or an “e”, use [ae]. You could use this in gr[ae]y to match either gray or grey. A character class matches only a single character. gr[ae]y will not match graay, graey or any such thing. The order of the characters inside a character class does not matter. You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X.
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. q[^x] matches qu in question. It does not match Iraq since there is no character after the q for the negated character class to match.
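The character class examples above can be reproduced with a short sketch, assuming Python's standard re module:

```python
import re

# A class matches exactly one character, so "graay" and "graey" are skipped.
greys = re.findall(r"gr[ae]y", "gray grey graay graey")
print(greys)  # ['gray', 'grey']

# Several ranges combined: one or more hexadecimal digits in either case.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aF9") is not None
not_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aG9") is not None
print(is_hex, not_hex)  # True False

# A negated class still has to consume a character after the q.
in_question = re.search(r"q[^x]", "question")
in_iraq = re.search(r"q[^x]", "Iraq")
print(in_question.group(), in_iraq)  # qu None
```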
Using Anchors to Match Text Lines
Anchors do not match any characters. They match only a particular position in the string. The meta-character ^ matches at the start of the string, and $ matches at the end of the string. For example, ^b matches only the first b in bob. The meta-character \b matches at a word boundary. A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w. \b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters. \B matches at every position where \b cannot match.
Examples:
Chapter \d$ - matches "Chapter 1", but does not match "Chapter 1 Appendix"
^Chapter \d - matches "Chapter 1", but does not match "In the Chapter 1"
Chapter\b - matches "Chapter" or "Chapter 1", but does not match "Chapters"
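The anchor examples above can be verified with a small sketch, assuming Python's standard re module. The helper name hits is hypothetical and exists only to keep the example short.

```python
import re

def hits(pattern, subject):
    """Return True if the pattern matches somewhere in the subject."""
    return re.search(pattern, subject) is not None

# $ anchors the match to the end of the string.
end_results = [hits(r"Chapter \d$", s) for s in ("Chapter 1", "Chapter 1 Appendix")]

# ^ anchors the match to the start of the string.
start_results = [hits(r"^Chapter \d", s) for s in ("Chapter 1", "In the Chapter 1")]

# \b requires a word boundary directly after "Chapter".
boundary_results = [hits(r"Chapter\b", s) for s in ("Chapter", "Chapter 1", "Chapters")]

# ^b matches only the b at position 0 of "bob".
first_b = [m.start() for m in re.finditer(r"^b", "bob")]

print(end_results)       # [True, False]
print(start_results)     # [True, False]
print(boundary_results)  # [True, True, False]
print(first_b)           # [0]
```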