Basic syntax

A regular expression is a set of rules that describes a generalised string. If the characters that make up a particular string conform to the rules of a particular regular expression, the regular expression is said to match that string.

 

A few concrete examples usually help after an overblown definition like that one. The regular expression b. matches the strings bovine, above, Bobby, and Bob Jones, but not the strings Bell, b, or Bob. That's because the expression insists that the letter b (lowercase) must be in the string and must be followed immediately by another character.

 

The regular expression b+, on the other hand, requires the lowercase letter b at least once. This expression matches b and Bob in addition to the example matches for b. in the preceding paragraph. The regular expression b* requires zero or more bs, so it matches any string. That seems to be fairly useless, but it makes more sense as part of a larger regular expression. Bob*y, for example, matches all of Boy, Boby, and Bobby but not Boboby.
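The examples above can be checked mechanically. What follows is a minimal sketch that assumes Python's re module, whose basic syntax overlaps with the dialect described here but is not guaranteed to be identical to it. The helper name found is hypothetical. It relies on re.search, which looks for a match anywhere in the string, as the examples in the text do.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text (case-sensitive)."""
    return re.search(pattern, text) is not None


# b. needs a lowercase b followed immediately by one more character.
candidates = ["bovine", "above", "Bobby", "Bob Jones", "Bell", "b", "Bob"]
print([s for s in candidates if found(r"b.", s)])

# b+ only needs at least one lowercase b, so "b" and "Bob" now match too.
print(found(r"b+", "b"), found(r"b+", "Bob"), found(r"b+", "Bell"))

# b* accepts zero b characters, so it matches every string, even an empty one.
print(found(r"b*", ""), found(r"b*", "Bell"))

# Inside a larger pattern b* becomes useful: B, o, any number of b, then y.
print([s for s in ["Boy", "Boby", "Bobby", "Boboby"] if found(r"Bob*y", s)])
```

Boboby fails because the optional run of b characters must be followed directly by y, and here it is followed by o.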

 

Assertions

Several so-called assertions are used to anchor parts of the pattern to word or string boundaries. The ^ assertion matches the start of a string, so the regular expression ^fool matches fool and foolhardy but not tomfoolery or April fool. The following table lists the assertions.

 

Regular-Expression Assertions

 

Assertion        Meaning        Example        Matches        Doesn't Match

^        Start of string        ^fool        foolish        tomfoolery

$        End of string        fool$        April fool        foolish

\b        Word boundary        be\b        be side        beside

\B        Nonword boundary        be\Bside        beside        be side
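The four assertions can be tried out with a short script. This is a sketch under the assumption that Python's re module treats ^, $, \b, and \B the same way as the dialect described here; the helper name found is hypothetical, and the \b pattern is chosen for illustration.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# ^ anchors the pattern to the start of the string.
print(found(r"^fool", "foolish"), found(r"^fool", "tomfoolery"))

# $ anchors the pattern to the end of the string.
print(found(r"fool$", "April fool"), found(r"fool$", "foolish"))

# \b requires a word boundary: "be" must end a word here.
print(found(r"be\b", "be side"), found(r"be\b", "beside"))

# \B requires that there is no word boundary between "be" and "side".
print(found(r"be\Bside", "beside"), found(r"be\Bside", "be side"))
```

Assertions consume no characters. They only test the position between two characters, which is why a \b placed between two letters of a pattern can never succeed.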

 

Atoms

The . (period) that you saw in b. earlier in this chapter is an example of a regular-expression atom. Atoms are, as the name suggests, the fundamental building blocks of a regular expression. A full list of atoms appears in the following table.

 

Regular-Expression Atoms

 

Atom        Meaning        Example        Matches        Doesn't Match

Period (.)        Any character except new line        b.b        bob        bb

List of characters in brackets        Any one of those characters        ^[Bb]        Bob, bob        rbob

Regular expression in parentheses        Anything that regular expression matches        ^a(b.b)c$        abobc        abbc
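The three kinds of atom can be exercised as follows. This is a sketch that assumes Python's re module; the helper names found and captured are hypothetical. The example strings are written in lowercase where the pattern is lowercase, because matching is case-sensitive.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


def captured(pattern: str, text: str):
    """Return what the first parenthesized group matched, or None on no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


# Period: exactly one character (not a new line) between the two b characters.
print(found(r"b.b", "bob"), found(r"b.b", "bb"), found(r"b.b", "b\nb"))

# Bracketed list: the string must start with either B or b.
print(found(r"^[Bb]", "Bob"), found(r"^[Bb]", "bob"), found(r"^[Bb]", "rbob"))

# Parentheses: the inner expression b.b is treated as a single atom.
print(found(r"^a(b.b)c$", "abobc"), found(r"^a(b.b)c$", "abbc"))
print(captured(r"^a(b.b)c$", "abobc"))
```

In Python, the text matched by the parenthesized part is also retrievable afterwards, which is what captured shows.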

 

 

Quantifiers

A quantifier is a modifier for an atom. It can be used to specify that a particular atom must appear at least once, as in b+. The atom quantifiers are listed in the following table.

 

Regular-Expression Atom Quantifiers

 

Quantifier        Meaning        Example        Matches        Doesn't Match

*        Zero or more instances of the atom        ab*c        ac, abc        abb

+        One or more instances of the atom        ab+c        abc        ac

?        Zero or one instance of the atom        ab?c        ac, abc        abbc

{n}        Exactly n instances of the atom        ab{2}c        abbc        abbbc

{n,}        At least n instances of the atom        ab{2,}c        abbc, abbbc        abc

{n,m}        At least n and at most m instances of the atom        ab{2,3}c        abbc        abbbbcat
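The quantifier examples lend themselves to a table-driven check. The sketch below assumes Python's re module; the names CASES and check are hypothetical. The example strings are written in lowercase because matching is case-sensitive.

```python
import re

# Each entry: pattern -> (strings that should match, strings that should not).
CASES = {
    r"ab*c": (["ac", "abc"], ["abb"]),
    r"ab+c": (["abc"], ["ac"]),
    r"ab?c": (["ac", "abc"], ["abbc"]),
    r"ab{2}c": (["abbc"], ["abbbc"]),
    r"ab{2,}c": (["abbc", "abbbc"], ["abc"]),
    r"ab{2,3}c": (["abbc"], ["abbbbcat"]),
}


def check(cases):
    """Return the (pattern, text) pairs whose result differs from the expectation."""
    failures = []
    for pattern, (should_match, should_not_match) in cases.items():
        for text in should_match:
            if re.search(pattern, text) is None:
                failures.append((pattern, text))
        for text in should_not_match:
            if re.search(pattern, text) is not None:
                failures.append((pattern, text))
    return failures


# An empty list means every example behaved as expected.
print(check(CASES))
```

In each pattern the quantifier applies only to the atom directly before it, the b, and not to the whole of ab.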

 

Special Characters

Several special characters are denoted by backslashed letters, with \n being especially familiar to C programmers, perhaps. The following table lists the special characters.

 

Regular-Expression Special Characters

 

Symbol        Meaning        Example        Matches        Doesn't Match

\d        Any digit        b\dd        b4d        bad

\D        Nondigit        b\Dd        bdd        b4d

\n        New line

\r        Carriage return

\t        Tab

\f        Form feed

\s        White-space character

\S        Non-white-space character

\w        Alphanumeric character        a\wb        a2b        a^b

\W        Nonalphanumeric character        a\Wb        aa^b        aabb
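The special characters can be tested in the same way, including the rows that have no example in the table. This sketch assumes Python's re module, where these backslashed letters carry the same meanings; the helper name found and the patterns a\sb and a\Sb are chosen for illustration.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# Digit and nondigit.
print(found(r"b\dd", "b4d"), found(r"b\dd", "bad"))
print(found(r"b\Dd", "bdd"), found(r"b\Dd", "b4d"))

# Alphanumeric and nonalphanumeric.
print(found(r"a\wb", "a2b"), found(r"a\wb", "a^b"))
print(found(r"a\Wb", "aa^b"), found(r"a\Wb", "aabb"))

# White space and non-white space (illustrative patterns).
print(found(r"a\sb", "a\tb"), found(r"a\sb", "a-b"))
print(found(r"a\Sb", "a-b"), found(r"a\Sb", "a b"))

# Each single-character escape matches exactly the control character it names.
for escape, char in [(r"\n", "\n"), (r"\r", "\r"), (r"\t", "\t"), (r"\f", "\f")]:
    print(repr(char), re.fullmatch(escape, char) is not None)
```

The patterns are written as raw strings (r"...") so that Python passes the backslashes through to the regular-expression engine unchanged.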

 

Backslashed Tokens

It is essential that regular expressions be capable of using all characters, so that all possible strings that occur in the real world can be matched. With so many characters having special meanings, a mechanism is required that allows you to represent any arbitrary character in a regular expression. This mechanism is a backslash (\) followed by a token, which can take any of the following formats:

 

Single or double digit        The matched quantity with that number. These matched quantities are called backreferences and are explained in a separate section.

Two- or three-digit octal number        The character with that number as its character code, unless it's possible to interpret it as a backreference.

x, followed by two hexadecimal digits        The character with that number as its character code. \x3e, for example, is >.

c, followed by a single character        The control character. \cG, for example, matches <Ctrl+G>.

Any other character        The character itself. \&, for example, matches the & character.
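Most of these token formats can be tried out directly. The sketch below assumes Python's re module, which differs from the dialect described here in at least one respect: it does not accept the \c form, so the example writes Ctrl+G as its hexadecimal character code, \x07, instead. The helper name first_match is hypothetical, and re.escape is a Python convenience rather than part of the syntax described above.

```python
import re


def first_match(pattern: str, text: str):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# x plus two hexadecimal digits: 0x3e is the character code of ">".
print(first_match(r"\x3e", "a > b"))

# Three-digit octal number: octal 076 is the same character code.
print(first_match(r"\076", "a > b"))

# Any other character stands for itself.
print(first_match(r"\&", "R&D"))

# Single digit: a backreference to whatever group 1 matched.
print(first_match(r"(\w)\1", "Bobby"))

# Python's re has no \cG, so Ctrl+G (character code 7) is written in hex.
print(repr(first_match(r"\x07", "ring\x07")))

# re.escape backslashes the special characters in a literal string for you.
print(first_match(re.escape("1+1=2?"), "is 1+1=2? yes"))
```

Escaping a whole string with a library helper is usually safer than adding backslashes by hand when the text to be matched comes from user input.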

 

See also:

 

Regular Expressions Syntax (Advanced)