Item name | Conventional regular expression sign | Usage examples and explanations |
---|---|---|
Any character | . | c.t - denotes words like "cat", "cot" |
Character from a character range | [] | [b-d]ell - denotes words like "bell", "cell", "dell"; [ty]ell - denotes the words "tell" and "yell" |
Character out of a character range | [^] | [^y]ell - denotes words like "dell", "cell", "tell", but forbids "yell"; [^n-s]ell - denotes words like "bell", "cell", but forbids "nell", "oell", "pell", "qell", "rell" and "sell" |
Or | \| | c(a\|u)t - denotes the words "cat" and "cut" |
0 or more occurrences in a row | * | 10* - denotes numbers 1, 10, 100, 1000 etc. |
1 or more occurrences in a row | + | 10+ - it allows numbers 10, 100, 1000 etc., but forbids 1. |
Letter or digit | [0-9a-zA-Zа-яА-Я] | [0-9a-zA-Zа-яА-Я] - it allows a single character; [0-9a-zA-Zа-яА-Я]+ - it allows any word |
Capital Latin letter | [A-Z] | |
Small Latin letter | [a-z] | |
Capital Cyrillic letter | [А-Я] | |
Small Cyrillic letter | [а-я] | |
Digit | [0-9] | |
Space | \s | |
Character used by the system | @ | |
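
The examples in the table can be checked with a short script. This is a minimal sketch that assumes Python's `re` module treats these constructs the same way as the program's own engine, which may not hold in every detail; the `check` and `run_examples` helpers are hypothetical names introduced only for illustration.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not).
# The patterns and most sample words are taken from the table above.
EXAMPLES = [
    (r"c.t", ["cat", "cot"], ["ct"]),
    (r"[b-d]ell", ["bell", "cell", "dell"], ["tell"]),
    (r"[ty]ell", ["tell", "yell"], ["bell"]),
    (r"[^y]ell", ["dell", "cell", "tell"], ["yell"]),
    (r"[^n-s]ell", ["bell", "cell"], ["nell", "sell"]),
    (r"c(a|u)t", ["cat", "cut"], ["cot"]),
    (r"10*", ["1", "10", "100", "1000"], ["0"]),
    (r"10+", ["10", "100", "1000"], ["1"]),
    (r"[0-9a-zA-Zа-яА-Я]+", ["word", "слово", "abc123"], ["two words"]),
]


def check(pattern, text):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def run_examples():
    """Verify every table example; raise AssertionError on a mismatch."""
    for pattern, good, bad in EXAMPLES:
        assert all(check(pattern, s) for s in good), pattern
        assert not any(check(pattern, s) for s in bad), pattern
    return True
```

`re.fullmatch` is used so that the pattern must cover the entire word, which mirrors how the table describes whole words being allowed or forbidden.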
Note:
Suppose you need to recognize "personal data" tables (we'll use a sample Russian personal data document), and suppose these tables contain fields such as the passport issue date, the first and last names, and the passport series and number. You can create new Date and Passport languages and define regular expressions for them.
The number denoting the day may consist of one digit (e.g. 1, 2) or two digits (e.g. 02, 12), but it cannot be zero (00 or 0). The regular expression for the day should then look like this: ((|0)[1-9])|([12][0-9])|(30)|(31).
The regular expression for the month should look like this: ((|0)[1-9])|(10)|(11)|(12).
The regular expression for the year, which allows four-digit years beginning with 19 or 20 as well as two-digit years, should look like this: ((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9]).
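
The three component expressions can be tried out separately before they are combined. The sketch below assumes Python's `re` module as a stand-in for the program's engine, writes the "1 or 2" class as `[12]`, and uses a parenthesis-balanced form of the year expression; `DAY`, `MONTH`, `YEAR` and `matches` are hypothetical names.

```python
import re

# Assumption: Python's `re` accepts the same syntax, including the empty
# alternative in (|0), which makes the leading zero optional.
DAY = r"((|0)[1-9])|([12][0-9])|(30)|(31)"
MONTH = r"((|0)[1-9])|(10)|(11)|(12)"
YEAR = r"((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9])"


def matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None
```

With these definitions, `matches(DAY, "02")` and `matches(DAY, "31")` succeed, while `matches(DAY, "0")`, `matches(DAY, "00")` and `matches(DAY, "32")` fail, as the description of the day requires.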
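What is left is to combine all this together and separate the numbers with periods (like 1.03.1999). The period is a special character (it denotes any character), so we must put a backslash (\) before it to use it as an ordinary period. Each of the three parts must also be enclosed in its own pair of parentheses, so that the "or" signs inside one part do not split the whole expression. The regular expression for the full date should then look like this:
(((|0)[1-9])|([12][0-9])|(30)|(31))\.(((|0)[1-9])|(10)|(11)|(12))\.(((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9]))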
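
In conventional regular expressions the "or" sign binds more loosely than everything else, so each part generally needs its own enclosing parentheses when the parts are joined. The following minimal sketch illustrates the difference, assuming Python's `re` module as a stand-in for the program's engine; `UNGROUPED`, `GROUPED` and `is_date` are hypothetical names, and the component expressions use `[12]` for the "1 or 2" class and a balanced year expression.

```python
import re

DAY = r"((|0)[1-9])|([12][0-9])|(30)|(31)"
MONTH = r"((|0)[1-9])|(10)|(11)|(12)"
YEAR = r"((19)[0-9][0-9])|((20)[0-9][0-9])|([0-9][0-9])"

# Joined directly: the top-level "|" signs split the whole expression into
# unrelated alternatives such as "a single day" or "31 followed by a month".
UNGROUPED = DAY + r"\." + MONTH + r"\." + YEAR

# Joined with one enclosing group per part: day, period, month, period, year.
GROUPED = "(" + DAY + r")\.(" + MONTH + r")\.(" + YEAR + ")"


def is_date(text, pattern=GROUPED):
    """Return True if the whole string matches the date pattern."""
    return re.fullmatch(pattern, text) is not None
```

With the grouped form, `is_date("1.03.1999")` succeeds and `is_date("32.01.2000")` fails. With the ungrouped form, the full date is rejected, while a lone day such as "5" is accepted.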
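The Russian passport series is a Roman numeral from the following set: 1-10, 20, and 30, then a hyphen, then two capital Cyrillic letters, like VI-СБ or XXX-МЮ. The expression below builds the numeral from an optional tens part (X, XX or XXX) and an optional units part (I to IX), so it is somewhat more permissive than this set. The regular expression for the passport series should look like this: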
((|X|XX|XXX)(|I|II|III|IV|V|VI|VII|VIII|IX))-[А-Я][А-Я]
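
The passport series expression can be tried out in the same way. This is a minimal sketch assuming Python's `re` module as a stand-in for the program's engine; `SERIES` and `is_series` are hypothetical names.

```python
import re

# Optional tens (X, XX, XXX), optional units (I..IX), a hyphen,
# then two capital Cyrillic letters.
SERIES = r"((|X|XX|XXX)(|I|II|III|IV|V|VI|VII|VIII|IX))-[А-Я][А-Я]"


def is_series(text):
    """Return True if the whole string matches the series pattern."""
    return re.fullmatch(SERIES, text) is not None
```

Both sample series from the text, VI-СБ and XXX-МЮ, match. Because the tens and the units parts are each optional, the expression also accepts a string with no numeral at all, such as "-СБ"; a stricter rule would need to exclude that case.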
"E-mail address" language
You can easily create a language for e-mail addresses. Because @ is a character used by the system, it is preceded by a backslash. The regular expression for an e-mail address should look like this:
[a-zA-Z0-9_\-\.]+\@[a-z0-9\.\-]+
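
A quick way to see what this expression accepts is to test it against a few strings. The sketch below assumes Python's `re` module as a stand-in for the program's engine; `EMAIL` and `is_email` are hypothetical names, and the sample addresses are made up for illustration.

```python
import re

# Local part: Latin letters, digits, underscore, hyphen, period.
# Domain part: small Latin letters, digits, period, hyphen.
EMAIL = r"[a-zA-Z0-9_\-\.]+\@[a-z0-9\.\-]+"


def is_email(text):
    """Return True if the whole string matches the e-mail pattern."""
    return re.fullmatch(EMAIL, text) is not None
```

The expression is deliberately simple: it accepts "john.doe@example.com" but also a short string such as "a@b", and it rejects capital letters after the @ sign because the second character range lists only small letters.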