Basic syntax
A regular expression is a set of rules that describes a generalised string. If the characters that make up a particular string conform to the rules of a particular regular expression, the regular expression is said to match that string.
A few concrete examples usually help after an overblown definition like that one. The regular expression b. matches the strings bovine, above, Bobby, and Bob Jones, but not the strings Bell, b, or Bob. That's because the expression insists that the letter b (lowercase) must be in the string and must be followed immediately by another character.
The regular expression b+, on the other hand, requires the lowercase letter b at least once. This expression matches b and Bob in addition to the example matches for b. in the preceding paragraph. The regular expression b* requires zero or more bs, so it matches any string. That seems to be fairly useless, but it makes more sense as part of a larger regular expression. Bob*y, for example, matches all of Boy, Boby, and Bobby but not Boboby.
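The examples so far can be checked with a short sketch. It assumes Python's `re` module as a stand-in for the regular-expression engine described here, and `matches` is a hypothetical helper that searches case-sensitively anywhere in the string.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in text (case-sensitive search)."""
    return re.search(pattern, text) is not None

samples = ["bovine", "above", "Bobby", "Bob Jones", "Bell", "b", "Bob"]

# b. needs a lowercase b followed immediately by another character.
dot_hits = [s for s in samples if matches(r"b.", s)]

# b+ only needs at least one lowercase b, so "b" and "Bob" now match too.
plus_hits = [s for s in samples if matches(r"b+", s)]

# b* accepts zero bs, so it matches every string, even the empty one.
star_hits_everything = all(matches(r"b*", s) for s in samples + ["", "xyz"])

# Bob*y: "Bo", then any number of bs, then "y".
boby_hits = [s for s in ["Boy", "Boby", "Bobby", "Boboby"] if matches(r"Bob*y", s)]

print(dot_hits)
print(plus_hits)
print(star_hits_everything)
print(boby_hits)
```

Bell never matches because the search is case-sensitive: the uppercase B does not count as the lowercase b that the patterns ask for.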
Assertions
Several so-called assertions are used to anchor parts of the pattern to word or string boundaries. The ^ assertion matches the start of a string, so the regular expression ^fool matches fool and foolhardy but not tomfoolery or April fool. The following table lists the assertions.
Regular-Expression Assertions

| Assertion | Matches | Example | Example matches | Example doesn't match |
|---|---|---|---|---|
| ^ | Start of string | ^fool | foolish | tomfoolery |
| $ | End of string | fool$ | April fool | foolish |
| \b | Word boundary | be\b | be side | beside |
| \B | Nonword boundary | be\Bside | beside | be side |
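The assertions can be tried out with a minimal sketch, again assuming Python's `re` module and a hypothetical `matches` helper. The word-boundary check uses the pattern be\b, because a \b placed directly between two letters (as in be\bside) can never succeed: there is no boundary between two word characters.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in text (case-sensitive search)."""
    return re.search(pattern, text) is not None

# ^ anchors the match to the start of the string.
start_hits = [s for s in ["fool", "foolhardy", "tomfoolery", "April fool"]
              if matches(r"^fool", s)]

# $ anchors the match to the end of the string.
end_hits = [s for s in ["April fool", "foolish"] if matches(r"fool$", s)]

# \b succeeds only where a word character meets a nonword character (or an edge).
boundary_hits = [s for s in ["be side", "beside"] if matches(r"be\b", s)]

# \B succeeds only where there is no such boundary.
nonboundary_hits = [s for s in ["be side", "beside"] if matches(r"be\Bside", s)]

print(start_hits, end_hits, boundary_hits, nonboundary_hits)
```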
Atoms
The . (period) that you saw in b. earlier in this chapter is an example of a regular-expression atom. Atoms are, as the name suggests, the fundamental building blocks of a regular expression. A full list of atoms appears in the following table.
Regular-Expression Atoms

| Atom | Matches | Example | Example matches | Example doesn't match |
|---|---|---|---|---|
| Period (.) | Any character except new line | b.b | bob | bb |
| List of characters in brackets | Any one of those characters | ^[Bb] | Bob, bob | rbob |
| Regular expression in parentheses | Anything that regular expression matches | ^a(b.b)c$ | abobc | abbc |
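A short sketch of the three kinds of atom, assuming Python's `re` module; `matches` is a hypothetical helper. Matching is case-sensitive, so the test strings are written in lowercase wherever the pattern asks for a lowercase letter.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in text (case-sensitive search)."""
    return re.search(pattern, text) is not None

# Period: exactly one character (other than a new line) between the two bs.
period_hits = [s for s in ["bob", "bb"] if matches(r"b.b", s)]

# Bracketed list: any one of the listed characters, here at the start of the string.
bracket_hits = [s for s in ["Bob", "bob", "rbob"] if matches(r"^[Bb]", s)]

# Parenthesised expression: behaves as a single atom inside the larger pattern.
group_hits = [s for s in ["abobc", "abbc"] if matches(r"^a(b.b)c$", s)]

# The parentheses also capture whatever the inner expression matched.
captured = re.search(r"^a(b.b)c$", "abobc").group(1)

print(period_hits, bracket_hits, group_hits, captured)
```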
Quantifiers
A quantifier is a modifier for an atom. It can be used to specify that a particular atom must appear at least once, as in b+. The atom quantifiers are listed in the following table.
Regular-Expression Atom Quantifiers

| Quantifier | Matches | Example | Example matches | Example doesn't match |
|---|---|---|---|---|
| * | Zero or more instances of the atom | ab*c | ac, abc | abb |
| + | One or more instances of the atom | ab+c | abc | ac |
| ? | Zero or one instances of the atom | ab?c | ac, abc | abbc |
| {n} | n instances of the atom | ab{2}c | abbc | abbbc |
| {n,} | At least n instances of the atom | ab{2,}c | abbc, abbbc | abc |
| {n,m} | At least n, at most m instances of the atom | ab{2,3}c | abbc | abbbbc |
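The quantifier rows can be checked mechanically. This is a minimal sketch assuming Python's `re` module; the `cases` list and the `check` helper are illustrative names, and the test strings are lowercase because matching is case-sensitive.

```python
import re

# (pattern, strings expected to match, strings expected not to match)
cases = [
    (r"ab*c",     ["ac", "abc"],     ["abb"]),
    (r"ab+c",     ["abc"],           ["ac"]),
    (r"ab?c",     ["ac", "abc"],     ["abbc"]),
    (r"ab{2}c",   ["abbc"],          ["abbbc"]),
    (r"ab{2,}c",  ["abbc", "abbbc"], ["abc"]),
    (r"ab{2,3}c", ["abbc"],          ["abbbbc"]),
]

def check(pattern: str, should_match: list, should_fail: list) -> bool:
    """True if every expected match succeeds and every expected failure fails."""
    hits = all(re.search(pattern, s) for s in should_match)
    misses = all(re.search(pattern, s) is None for s in should_fail)
    return hits and misses

results = {pattern: check(pattern, good, bad) for pattern, good, bad in cases}
for pattern, ok in results.items():
    print(pattern, "behaves as described" if ok else "differs from the table")
```

A quantifier applies only to the atom immediately before it, which is why ab{2}c asks for two bs and not for two copies of ab.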
Special Characters
Several special characters are denoted by backslashed letters; \n is perhaps especially familiar to C programmers. The following table lists the special characters.
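Regular-Expression Special Characters

| Symbol | Matches | Example | Example matches | Example doesn't match |
|---|---|---|---|---|
| \d | Any digit | b\dd | b4d | bad |
| \D | Nondigit | b\Dd | bdd | b4d |
| \n | New line | | | |
| \r | Carriage return | | | |
| \t | Tab | | | |
| \f | Form feed | | | |
| \s | White-space character | | | |
| \S | Non-white-space character | | | |
| \w | Alphanumeric character | a\wb | a2b | a^b |
| \W | Nonalphanumeric character | a\Wb | aa^b | aabb |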
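The character classes can be exercised with a short sketch, assuming Python's `re` module. Two details are specific to that module rather than to this text: its \w also accepts the underscore, and the `re.ASCII` flag is passed here so that \d, \w, and \s are limited to ASCII characters. The `matches` helper is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in text, using ASCII-only classes."""
    return re.search(pattern, text, re.ASCII) is not None

digit_hits = [s for s in ["b4d", "bad"] if matches(r"b\dd", s)]
nondigit_hits = [s for s in ["bdd", "b4d"] if matches(r"b\Dd", s)]
word_hits = [s for s in ["a2b", "a^b"] if matches(r"a\wb", s)]
nonword_hits = [s for s in ["aa^b", "aabb"] if matches(r"a\Wb", s)]

# \s accepts a tab, new line, carriage return, form feed, or space.
space_hits = [repr(s) for s in ["a\tb", "a\nb", "a b", "ab"] if matches(r"a\sb", s)]

print(digit_hits, nondigit_hits, word_hits, nonword_hits, space_hits)
```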
Backslashed Tokens
It is essential that regular expressions be capable of using all characters, so that all possible strings that occur in the real world can be matched. With so many characters having special meanings, a mechanism is required that allows you to represent any arbitrary character in a regular expression. This mechanism is a backslash (\) followed by a token, which can take any of the following forms:
- Single or double digit: matched quantities after a match. These matched quantities are called backreferences and are explained in a separate section.
- Two- or three-digit octal number: the character with that number as its character code, unless it's possible to interpret it as a backreference.
- x, followed by two hexadecimal digits: the character with that number as its character code. \x3e, for example, is >.
- c, followed by a single character: the control character. \cG, for example, matches <Ctrl+G>.
- Any other character: the character itself. \&, for example, matches the & character.
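Most of these forms can be illustrated with a sketch that assumes Python's `re` module. That module does not accept the \c form, so the sketch writes Ctrl+G as \x07 instead; this substitution is an assumption of the example, not part of the syntax described above.

```python
import re

# Hexadecimal form: \x3e is the character with code 0x3e, which is ">".
hex_ok = re.fullmatch(r"\x3e", ">") is not None

# Octal form: 076 in octal is 62 in decimal, again ">".
octal_ok = re.fullmatch(r"\076", ">") is not None

# Any other character stands for itself: \& matches a literal "&".
literal_ok = re.fullmatch(r"\&", "&") is not None

# Ctrl+G is character code 7 (BEL); written as \x07 because Python's re rejects \cG.
ctrl_g_ok = re.fullmatch(r"\x07", "\a") is not None

# Backreference: \1 must repeat exactly what the first group matched.
backref = re.search(r"(b.)\1", "bobo")
backref_text = backref.group(0) if backref else None
no_backref = re.search(r"(b.)\1", "bobby") is None

print(hex_ok, octal_ok, literal_ok, ctrl_g_ok, backref_text, no_backref)
```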