Today I want to talk about regular expressions (usually referred to as regex or regexp). No matter what application you are creating, chances are you will need to parse text in some way. It might be for validating user input or for extracting information from a string of data in some arbitrary format. I have yet to work on any project where regex was not required.
About Regular Expressions
Regex is a powerful language used to process text. It allows you to define a pattern that a regex engine uses to examine a string of data. The engine applies your pattern to the supplied string and matches the text that was specified in the pattern. What you do with regular expressions depends on the circumstance:
- Matching / Counting: Check if a string matches a pattern. For example, check if the user inputted a correctly formatted email address.
- Replacement: Replacing parts of a string with another. For example, parsing BB-Code into HTML.
- Extraction: Extracting parts of a string. For example, you might want to extract all of the href's in an HTML document.
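To make those three uses concrete, here is a minimal sketch using Python's built-in re module. Python is chosen only for illustration, and the sample strings and patterns are made up for this example rather than taken from a real project.

```python
import re

# Matching: does the text contain something shaped like a date?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", "Posted on 2008-03-14") is not None

# Replacement: turn a BB-Code bold tag into HTML.
html = re.sub(r"\[b\](.*?)\[/b\]", r"<b>\1</b>", "This is [b]bold[/b] text")

# Extraction: pull every href value out of a fragment of HTML.
links = re.findall(r'href="([^"]*)"', '<a href="/home">Home</a> <a href="/about">About</a>')

print(has_date)  # True
print(html)      # This is <b>bold</b> text
print(links)     # ['/home', '/about']
```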
In this post I am only going over the regex language briefly. See the end for some links to further reading. This article is simply a precursor for another post I wanted to write on using regex with PHP.
The Basics
A regular expression pattern is made up of several simple parts:
- Characters: The actual characters you want to match. You can insert literal strings like "chris", or define a list of characters ("only a, e, i, o and u"), or use the wildcard meta-character to match anything. There are also sets of character types you can use like "any digit" or "any whitespace character".
- Alternation: Used to define a set of alternatives like "chris or christopher or christoph".
- Quantification: Used to explain how many times a character or characters should appear. For example, "only a, e, i, o and u once".
- Grouping: Group a part of a pattern into larger chunks, to define scope, and to specify quantity of a larger chunk.
- Assertions: An expression that is applied to the left or right of (that is, before or after) the current matching position. This makes it possible to do patterns like "chris not followed by 'topher'".
- Anchoring: Anchoring a pattern to the start or end of a string lets you define the context, that is, where you want the pattern to match. For example, "match 'chris' at the beginning of the string".
Characters
There are several ways you can define characters that you want to match.
The first way of course is a literal string:
-
chris
This would match the string "chris", "christopher", "thechris" etcetera.
The second way is to define a character class. A character class is a list of characters that can (or can not) be matched:
-
[aeiou]
-
[^aeiou]
-
[a-z0-9]
-
[a-z0123456789]
-
[a-z0-9\-]
The first will match any lowercase vowel. The caret (^) at the start of the second class means "not", so it makes the pattern match any character except a vowel. The third example uses a dash, which creates a range of characters: "a-z" means "any lowercase letter from a to z", just like "0-9" means "any digit from 0 to 9". Thus, the third and fourth patterns mean exactly the same thing. If you want to insert a literal dash (i.e. "match a dash character") you should escape it, as demonstrated in the last example.
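As a quick check of the five classes, here is a small sketch in Python's re module (used only for illustration). The sample string "chris-98" is an assumption picked so that every class has something to match and something to skip.

```python
import re

text = "chris-98"  # made-up sample input

vowels = re.findall(r"[aeiou]", text)              # only the vowel
non_vowels = re.findall(r"[^aeiou]", text)         # everything except the vowel
alnum = re.findall(r"[a-z0-9]", text)              # letters and digits, no dash
alnum_long = re.findall(r"[a-z0123456789]", text)  # same class, digits spelled out
with_dash = re.findall(r"[a-z0-9\-]", text)        # the escaped dash is matched too

print(vowels)               # ['i']
print(alnum == alnum_long)  # True
print("".join(with_dash))   # chris-98
```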
The third way is to use the wildcard character, or a pre-defined character type:
-
.
-
\w
-
\d
The first pattern (the dot) matches any single character (in most engines, any character except a newline unless a "dot matches all" mode is switched on). The second pattern is the special escape sequence that means "any word character" (a letter, a digit or the underscore). The final pattern is another escape sequence that means "any digit" (that is, any number 0-9).
You can combine these to make up a fairly complex pattern:
-
\wchris\d
This would match "tchris9" and "zchris7", but not "chris", "chris98" or "zchris". You might wonder why the latter strings would not match. It is because we have not defined any rules for repetition or made anything optional, so the pattern literally means "a single word character followed by the string 'chris' followed by a single digit". "chris" and "chris98" have no character in front of "chris", and "zchris" has no digit after it.
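The same five strings can be run through the pattern with a short Python sketch (Python's re module is an assumption here, used only to demonstrate). It also checks two details of the escape sequences: \w accepts an underscore, and the dot does not match a newline by default.

```python
import re

pattern = re.compile(r"\wchris\d")
samples = ["tchris9", "zchris7", "chris", "chris98", "zchris"]

# search() looks for the pattern anywhere in the string.
results = {s: pattern.search(s) is not None for s in samples}
for s, matched in results.items():
    print(f"{s!r}: {matched}")

# A longer string still matches, but only the part the pattern describes.
partial = pattern.search("xchris98").group()  # 'xchris9'

underscore_is_word = re.search(r"\w", "_") is not None  # True
dot_matches_newline = re.search(r".", "\n") is not None  # False by default
```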
Alternation
Alternation is a simpler concept. You simply use the pipe character to separate alternate expressions:
-
chris|christopher|christoph
-
color|colour
You will see some more useful examples of alternation soon when we talk about grouping.
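One detail worth seeing in action is that the order of the alternatives can matter. The sketch below uses Python's re module, which, like other backtracking engines, tries the alternatives from left to right and keeps the first one that works; engines that follow the POSIX leftmost-longest rule may behave differently. The reordered pattern is my own variation, not one from the list above.

```python
import re

# "chris" is listed first, so it wins even though a longer alternative fits.
first_listed = re.search(r"chris|christopher|christoph", "christopher").group()

# Listing the longest alternative first changes the result.
longest_first = re.search(r"christopher|christoph|chris", "christopher").group()

# Both spellings are found; "color" fails on "colour", so "colour" is tried next.
spellings = re.findall(r"color|colour", "color and colour")

print(first_listed)   # chris
print(longest_first)  # christopher
print(spellings)      # ['color', 'colour']
```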
Quantification
There are three ways you can define the quantity of characters.
The first way is by providing no quantification at all. When there is none, it means exactly once:
-
[a-z]
-
[a-zA-Z]
The first pattern means "one lowercase letter" and the second means "one lower or uppercase letter".
The second way is by using a meta-character. There are three different meta-characters to choose from:
-
[a-z]*
-
u?
-
\d+
- The asterisk (*) means "zero or more times". The first pattern means "any lowercase letter, any number of times".
- The question mark (?) means "zero or one". The second pattern means "u is optional".
- The plus sign (+) means "one or more". The third pattern means "any digit one or more times".
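Here is a small sketch of the three meta-characters in Python's re module (an illustrative choice). The helper whole() is a hypothetical convenience for this example, and "colou?r" is my own way of putting the "u is optional" pattern into a full word.

```python
import re

def whole(pattern, text):
    """Return True when the pattern accounts for the entire string."""
    return re.fullmatch(pattern, text) is not None

# * accepts zero repetitions, so even the empty string matches.
star = [whole(r"[a-z]*", s) for s in ["", "a", "chris"]]

# ? makes the "u" optional, but does not allow two of them.
optional = [whole(r"colou?r", s) for s in ["color", "colour", "colouur"]]

# + needs at least one digit.
plus = [whole(r"\d+", s) for s in ["", "7", "2008"]]

print(star)      # [True, True, True]
print(optional)  # [True, True, False]
print(plus)      # [False, True, True]
```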
The third way is by explicitly defining the minimum and maximum times the character can be repeated:
-
[a-z]{1,5}
-
u{1}
-
\d{,5}
The format is {min,max}. The maximum can be left out: for example, {2,} means "at least twice". Leaving out the minimum, as in the third pattern, is not supported everywhere; some engines read {,5} as "0 to 5" while others treat it as literal text, so {0,5} is the safer spelling. The first pattern means "any lowercase letter 1 to 5 times". The second pattern means "exactly 1 u". The third pattern means "at most 5 digits" (since the minimum is zero, this would also match no digit at all!).
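The explicit ranges can be tried out with the sketch below. It uses Python's re module, where {,5} happens to be accepted as "0 to 5"; that is a property of Python and not something to rely on in other engines. The whole() helper and the {2,} pattern are hypothetical additions for this example.

```python
import re

def whole(pattern, text):
    """Return True when the pattern accounts for the entire string."""
    return re.fullmatch(pattern, text) is not None

one_to_five = [whole(r"[a-z]{1,5}", s) for s in ["chris", "christopher"]]
exactly_one = [whole(r"u{1}", s) for s in ["u", "uu"]]
at_least_two = [whole(r"\d{2,}", s) for s in ["1", "1988"]]
at_most_five = [whole(r"\d{0,5}", s) for s in ["", "12345", "123456"]]

# Python reads {,5} as {0,5}; other engines may not.
python_short_form = whole(r"\d{,5}", "123")

print(one_to_five)   # [True, False]
print(exactly_one)   # [True, False]
print(at_least_two)  # [False, True]
print(at_most_five)  # [True, True, False]
```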
Grouping
Grouping characters is done with parentheses. There are three situations where you might want to group characters together.
The first is to define the scope for an alternation:
-
\wchris|christopher\d
-
\w(chris|christopher)\d
Compare these two patterns. Without a group, the pipe splits the entire pattern in two, so the first means "a word character followed by 'chris', OR 'christopher' followed by a digit". The only way to limit the alternation to the "chris" part is by grouping it. The second pattern means "a word character, followed by 'chris' or 'christopher', followed by a digit".
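The difference is easy to see by printing what each pattern actually matches. This is a sketch in Python's re module with sample strings I made up for the comparison.

```python
import re

ungrouped = re.compile(r"\wchris|christopher\d")
grouped = re.compile(r"\w(chris|christopher)\d")

def matched_text(pattern, text):
    """Return the matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None

samples = ["xchris", "christopher5", "xchris5", "xchristopher5"]
without_group = {s: matched_text(ungrouped, s) for s in samples}
with_group = {s: matched_text(grouped, s) for s in samples}

print(without_group["xchristopher5"])  # xchris
print(with_group["xchristopher5"])     # xchristopher5
print(with_group["xchris"])            # None
```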
You can also group entire subpatterns for quantification:
-
([a-z]{1,5}\d+)+
This means "any letter 1-5 times followed by at least one digit, at least once". It would match "a5", "a5b2", "zs49bf9" etcetera.
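A short Python sketch of the repeated group follows (Python's re module is assumed for illustration; the two non-matching strings are my own additions). It also shows a side effect of repeating a capturing group: only the last repetition is kept in the group.

```python
import re

pattern = re.compile(r"([a-z]{1,5}\d+)+")

# fullmatch() requires the whole string to fit the pattern.
full = {s: pattern.fullmatch(s) is not None
        for s in ["a5", "a5b2", "zs49bf9", "abc", "abcdef1"]}

# The group matched "a5" and then "b2"; it remembers the last one.
last_chunk = pattern.fullmatch("a5b2").group(1)

print(full)        # "abc" has no digit and "abcdef1" has six letters in a row
print(last_chunk)  # b2
```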
The final use is for capturing. When you group an expression, the matching characters are saved and can be re-used later in the pattern:
-
<(b|strong)>\w*</\1>
The first group matches either "b" or "strong". Then later in the pattern you see "\1" (an escaped '1') to represent that match. So that pattern will match a properly formatted "b" or "strong" tag. Here's another example you might use to extract a single or double quoted string:
-
('|")\w*\1
This would match things like:
- "hello"
- 'world'
But not:
- "hello'
- 'world"
The second set doesn't match because the quote characters are not the same, so the match would fail.
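Both backreference patterns can be checked with the sketch below, written with Python's re module as an illustrative assumption. The tag samples are made up for the example.

```python
import re

quoted = re.compile(r"""('|")\w*\1""")
tag = re.compile(r"<(b|strong)>\w*</\1>")

# \1 must repeat whatever the first group matched, so the quotes have to agree.
quote_results = {s: quoted.fullmatch(s) is not None
                 for s in ['"hello"', "'world'", "\"hello'", "'world\""]}

# The closing tag has to use the same name as the opening tag.
tag_results = {s: tag.fullmatch(s) is not None
               for s in ["<b>bold</b>", "<strong>bold</strong>", "<b>bold</strong>"]}

print(list(quote_results.values()))  # [True, True, False, False]
print(list(tag_results.values()))    # [True, True, False]
```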
Assertion (aka Lookahead and Lookbehind)
Assertions are used to test the preceding or following characters against some expression, without actually consuming the characters. Let me explain that a bit more.
When the regex engine tries to apply a pattern to a string, it "consumes" the string as it goes. It has an internal pointer that moves along the string to keep the current position. Matching "(chris|christopher)" against the string "chris98" would put the internal pointer right after the "s", because that's where the pattern stops. An assertion simply checks back or forward, without moving the internal pointer.
For example, let's say I want to match my name "Chris" only when it's part of "Christopher". That is, I don't want to match the "Chris" in "Christoph" or "Christine" or anything else. Here's how I might do it with a so-called lookahead:
-
(?=Christopher)Chris
This would match the "Chris" part of "Christopher Nadeau", but would not match "Christine Doe".
The way the regex engine applies this pattern is to look ahead from the starting point to see if the next characters spell "Christopher". The internal pointer is not moved at all. So once the lookahead has succeeded, the engine applies the rest of the pattern, "Chris", as normal, starting from that same position. When the whole pattern has been applied to "Christopher Nadeau", the internal pointer is after the "s".
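The "pointer stays put" behaviour can be observed through the span of the match. This is a minimal sketch in Python's re module (an illustrative choice); span() reports where the consumed text starts and ends.

```python
import re

pattern = re.compile(r"(?=Christopher)Chris")

m = pattern.search("Christopher Nadeau")
matched = m.group()  # only "Chris" is consumed
span = m.span()      # ends right after the "s"

no_match = pattern.search("Christine Doe")

print(matched)   # Chris
print(span)      # (0, 5)
print(no_match)  # None
```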
As programmers, we are used to using escape sequences. For example, to insert a double-quote character within a double-quoted string, we escape it like so: "Hello Chris \"Chroder\" Nadeau".
As an example, let's say we are writing some custom parser and need to do the same thing. We want to capture all of the double-quoted strings. This is easy, right?
-
"(.*)"
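That captures any character (the dot meta-character means "anything", remember), any number of times, when it appears within double-quotes. Because the asterisk is greedy, it runs on to the last double-quote it can find, so two quoted strings on the same line would be swallowed as one. Making it lazy, as in "(.*?)", stops at the first closing quote instead. But what if we wanted to allow the user to escape the double-quote so strings like the one above would be read correctly? The lazy pattern would stop too early and match "Hello Chris \", which isn't what we want.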
We can use a lookbehind to make sure the preceding character is not a backslash:
-
"(.*?)(?<=[^\\])"
By using the lookbehind, we make it so the regex engine will not match the ending quote when it is preceded by a backslash.
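Here is a sketch comparing the quoted-string patterns in Python's re module (chosen for illustration; the input strings are made up). Note that this is only a sketch: it does not handle an escaped backslash sitting right before the closing quote.

```python
import re

# Made-up input: one quoted string that contains escaped quotes.
text = r'say "Hello Chris \"Chroder\" Nadeau" now'

# A lazy match without the lookbehind stops at the first (escaped) quote.
lazy = re.search(r'"(.*?)"', text).group(1)

# With the lookbehind, a quote preceded by a backslash cannot close the string.
guarded = re.search(r'"(.*?)(?<=[^\\])"', text).group(1)

# Greedy matching has a different problem: it swallows neighbouring strings.
greedy = re.search(r'"(.*)"', '"one" and "two"').group(1)

print(repr(lazy))
print(repr(guarded))
print(repr(greedy))  # 'one" and "two'
```

The lazy result ends at the backslash in front of the first escaped quote, while the guarded result runs through to "Nadeau".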
There are four types of assertions, two of which I have already demonstrated:
- Positive lookahead: (?=expression)
The pattern is successful if the expression matches the characters to the right of the current position.
- Negative lookahead: (?!expression)
The pattern is successful if the expression does not match the characters to the right of the current position.
- Positive lookbehind: (?<=expression)
The pattern is successful if the expression matches the characters to the left of the current position.
- Negative lookbehind: (?<!expression)
The pattern is successful if the expression does not match the characters to the left of the current position.
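All four assertion types can be tried side by side. The sketch below uses Python's re module; the "chris not followed by topher" case comes from the overview earlier, while the price text and the lookbehind patterns are assumptions invented for this example.

```python
import re

names = ["christine", "christopher"]

# Positive lookahead: "chris" only when "topher" follows.
only_topher = [bool(re.search(r"chris(?=topher)", s)) for s in names]

# Negative lookahead: "chris" only when "topher" does not follow.
not_topher = [bool(re.search(r"chris(?!topher)", s)) for s in names]

text = "cost $40, qty 12"  # made-up sample text

# Positive lookbehind: digits that come right after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that do not follow a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(only_topher)       # [False, True]
print(not_topher)        # [True, False]
print(after_dollar)      # ['40']
print(not_after_dollar)  # ['12']
```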
Anchoring
The last concept, anchoring, is very simple to understand. Say you wanted to match "chris", but only when it was at the very beginning of the string. You do this by anchoring the regular expression to the beginning of the string. What if you wanted to match only at the end of the string? Yup, you need to anchor to the end of the string. Here are three examples:
-
^(chris|christopher)
-
(chris|christopher)$
-
^(chris|christopher)$
The caret (^), when it is the first thing in the pattern, anchors it to the beginning of the string. The dollar sign ($), when the last thing in the pattern, anchors it to the end of the string.
The first pattern means "chris or christopher at the start". The second pattern means "chris or christopher at the end". The third pattern means "chris or christopher at the beginning and end", which just means "the string is exactly chris or exactly christopher".
You might be thinking, "why is anchoring important?". Well, let's say you want to validate a date that is in the form of ####-##-## (i.e. year-month-day). You might write:
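The three anchored patterns can be compared against a few sample strings. This is a sketch in Python's re module, and the sample strings are made up; each tuple holds the result for the start, end and both-ends pattern in that order.

```python
import re

start = re.compile(r"^(chris|christopher)")
end = re.compile(r"(chris|christopher)$")
both = re.compile(r"^(chris|christopher)$")

samples = ["chris nadeau", "the chris", "christopher", "the christopher nadeau"]

# (matches at start, matches at end, is exactly the name)
table = {s: (bool(start.search(s)), bool(end.search(s)), bool(both.search(s)))
         for s in samples}

for s, row in table.items():
    print(f"{s!r}: {row}")
```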
-
\d{4}-\d{2}-\d{2}
This would work. It matches "1988-12-30". Success? No! It also matches "Blah blah 1988-12-30 blah". The pattern only tells the regex engine to look for that one expression somewhere in the string, so it will skip over all of the non-matching text. To properly validate the string, you need to anchor the pattern to the beginning and end:
-
^\d{4}-\d{2}-\d{2}$
By anchoring to both the beginning and end, you are essentially saying "the entire string must match this pattern".
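As a sketch of the validation case, here is the anchored pattern wrapped in a hypothetical helper, is_date_shaped(), using Python's re module. Two caveats show up in the output: the pattern checks only the shape, not whether the date exists, and in Python (as in several other engines) the dollar sign also matches just before a trailing newline, which re.fullmatch() avoids.

```python
import re

def is_date_shaped(s):
    """Hypothetical helper: True when s looks like ####-##-##."""
    return re.search(r"^\d{4}-\d{2}-\d{2}$", s) is not None

# Without anchors the date is found inside the surrounding text.
unanchored = re.search(r"\d{4}-\d{2}-\d{2}", "Blah blah 1988-12-30 blah") is not None

plain = is_date_shaped("1988-12-30")
embedded = is_date_shaped("Blah blah 1988-12-30 blah")
impossible = is_date_shaped("1988-99-99")       # shape only, so this passes
trailing_newline = is_date_shaped("1988-12-30\n")  # $ tolerates a final newline
strict = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "1988-12-30\n") is not None

print(unanchored, plain, embedded)           # True True False
print(impossible, trailing_newline, strict)  # True True False
```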
Further Reading
I think one of the best sites available on the subject of regular expressions is regular-expressions.info. You might find their reference page particularly useful.
If you're a book person, you should definitely pick up a copy of Mastering Regular Expressions from O'Reilly.