Symbolic regular expressions
Overview
Symbolic regular expressions are similar to regular expressions. Both attempt to match sequences of items defined by patterns involving sequences, repetition, counting, and so forth. However, symbolic regular expressions are designed specifically to work with the sequences of symbols processed by the Pattern Match tool, instead of sequences of characters.
The Pattern Match tool employs a list of symbolic regular expressions to perform its processing. Each symbolic regular expression in the list attempts to match the sequence of symbols belonging to each record, and if the sequence of symbols is matched, it further assigns a class to each token.
Symbolic regular expression syntax overview
Let's examine some records containing a tokenized email address after it has been passed through the Token Creation and Symbol Creation tools:
ID | TOKEN | SYMBOL |
---|---|---|
1 | | WORD |
1 | | AT |
1 | | WORD |
1 | | DOT |
1 | | WORD |
The Pattern Match tool processes only the SYMBOL column of these records, ignoring everything else. The tool extracts the list of symbols from the records, which in this case is WORD AT WORD DOT WORD.
Think of each symbol as an atomic unit, much as you would think of a character in a standard regular expression. To match sequences of symbols like this, you build symbolic regular expressions that operate on those symbols. A symbolic regular expression that matches the above example is simply the sequence itself: WORD AT WORD DOT WORD.
Of course, this is not very useful, because it matches only that one specific case and doesn't do anything with the results of the match. So, let's look at the additional constructs available in symbolic regular expressions, and what they can do:
The period
Also known as the dot. This is the wildcard: it represents any single symbol. The regular expression WORD . WORD matches any sequence of three symbols that starts and ends with WORD. For example, it will match WORD DOT WORD or WORD AT WORD, but not WORD DOT DOT WORD.
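One way to experiment with these patterns outside the Pattern Match tool is to encode each symbol as a single character and hand the result to an ordinary character-based regex engine. The sketch below does this with Python's re module. The one-letter codes, the helper names to_char_regex and matches, and the requirement that a pattern match the entire symbol sequence are assumptions made for illustration; they do not describe how the tool itself is implemented.

```python
import re

# Hypothetical one-character code for each symbol (an assumption of this sketch).
CODES = {"WORD": "W", "AT": "A", "DOT": "D"}

def to_char_regex(symbolic: str) -> str:
    """Rewrite a symbolic pattern as a character regex over the codes."""
    # Swap every symbol name for its code, then drop the separating spaces.
    coded = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], symbolic)
    return coded.replace(" ", "")

def matches(symbolic: str, symbols: str) -> bool:
    """True if the pattern matches the whole space-separated symbol sequence."""
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(to_char_regex(symbolic), encoded) is not None

print(to_char_regex("WORD . WORD"))                 # W.W
print(matches("WORD . WORD", "WORD DOT WORD"))      # True
print(matches("WORD . WORD", "WORD AT WORD"))       # True
print(matches("WORD . WORD", "WORD DOT DOT WORD"))  # False
```

Because every symbol becomes exactly one character, the wildcard and the other metacharacters keep the same meaning after translation.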
The question mark
Matches the preceding subexpression zero or one time, making it optional. For example, WORD DOT? AT WORD will match WORD AT WORD or WORD DOT AT WORD, but not WORD AT AT WORD.
The asterisk
Any subexpression followed by an asterisk matches that subexpression zero or more times. The regular expression WORD* DOT WORD will match DOT WORD, WORD DOT WORD, or WORD WORD WORD DOT WORD.
The plus sign
Matches the preceding subexpression one or more times. For example, WORD+ DOT WORD will match WORD DOT WORD or WORD WORD WORD DOT WORD, but not DOT WORD.
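The three quantifiers differ only in how many repetitions they allow, which is easiest to see by running them against the same inputs. The following sketch again assumes a one-letter encoding of symbols, a hypothetical matches helper built on Python's re module, and whole-sequence matching; the pattern WORD? DOT WORD is added here purely for comparison.

```python
import re

CODES = {"WORD": "W", "AT": "A", "DOT": "D"}  # illustrative one-letter codes

def matches(symbolic: str, symbols: str) -> bool:
    # Encode pattern and input as characters, then require a full match.
    pattern = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], symbolic).replace(" ", "")
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(pattern, encoded) is not None

candidates = ["DOT WORD", "WORD DOT WORD", "WORD WORD WORD DOT WORD"]
results = {
    pattern: [matches(pattern, c) for c in candidates]
    for pattern in ("WORD? DOT WORD", "WORD* DOT WORD", "WORD+ DOT WORD")
}
for pattern, flags in results.items():
    print(pattern, flags)
# WORD? DOT WORD [True, True, False]
# WORD* DOT WORD [True, True, True]
# WORD+ DOT WORD [False, True, True]
```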
The vertical bar or pipe
This is used to specify alternatives. For example, WORD DOT WORD | AT DOT WORD will match WORD DOT WORD or AT DOT WORD, but not WORD DOT WORD DOT WORD.
Note that | binds with very low priority, lower than sequencing. Thus, the above example is treated as if you had specified:
(WORD DOT WORD) | (AT DOT WORD)
Parentheses
These can be used to group subexpressions, for example to limit the scope of the alternation operator |. For example, WORD DOT (WORD | AT) DOT WORD groups WORD | AT together, overriding the default grouping. This expression will match WORD DOT WORD DOT WORD or WORD DOT AT DOT WORD.
Grouping with parentheses also allows subexpressions to be operated on as a whole. For example, WORD AT WORD (DOT WORD)+ will match WORD AT WORD DOT WORD or WORD AT WORD DOT WORD DOT WORD, but not WORD AT WORD DOT DOT WORD.
This expression matches the symbol sequence of a typical email address, allowing any number of DOT WORD repetitions in the "domain" portion of the address.
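The effect of parentheses on alternation and repetition can be checked with a small experiment. The sketch below is an approximation in Python's re module, not the tool's own engine: the one-letter codes and the matches helper are hypothetical, and whole-sequence matching is assumed.

```python
import re

CODES = {"WORD": "W", "AT": "A", "DOT": "D"}  # illustrative one-letter codes

def matches(symbolic: str, symbols: str) -> bool:
    # Encode pattern and input as characters, then require a full match.
    pattern = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], symbolic).replace(" ", "")
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(pattern, encoded) is not None

# Without parentheses, | splits the whole pattern into two alternatives.
flat = "WORD DOT WORD | AT DOT WORD"
# With parentheses, | only chooses between WORD and AT in the middle.
grouped = "WORD DOT (WORD | AT) DOT WORD"
# A parenthesized group can be repeated as a unit.
email = "WORD AT WORD (DOT WORD)+"

print(matches(flat, "AT DOT WORD"))                      # True
print(matches(flat, "WORD DOT WORD DOT WORD"))           # False
print(matches(grouped, "WORD DOT AT DOT WORD"))          # True
print(matches(grouped, "AT DOT WORD"))                   # False
print(matches(email, "WORD AT WORD DOT WORD DOT WORD"))  # True
print(matches(email, "WORD AT WORD DOT DOT WORD"))       # False
```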
Square brackets
These represent any single symbol from the list between the square brackets. For example, the expression WORD [AT DOT]+ WORD will match WORD AT AT WORD or WORD DOT AT DOT DOT WORD, but not WORD WORD AT WORD.
The caret
If the first character between the brackets is a caret, it means "not these symbols": any symbol except those within the brackets can match. Thus, the expression [^WORD]+ WORD will match AT DOT WORD or DOT DOT WORD, but not WORD DOT WORD.
Counted repetition
If you want a subexpression to appear an exact number of times, follow the subexpression with that number within {}. For example, WORD{2} DOT WORD{2} matches only WORD WORD DOT WORD WORD.
Range of repetition
If you want a subexpression to appear between a lower bound N and an upper bound M times, follow the subexpression with {N,M}. For example, WORD (DOT WORD){1,3} matches WORD DOT WORD or WORD DOT WORD DOT WORD DOT WORD, but not WORD or WORD DOT DOT WORD.
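The bracket, caret, and repetition-count examples above can be verified in the same way. As before, this is a sketch that assumes a hypothetical one-letter encoding, a matches helper based on Python's re module, and whole-sequence matching; it is not the Pattern Match tool's implementation.

```python
import re

CODES = {"WORD": "W", "AT": "A", "DOT": "D"}  # illustrative one-letter codes

def matches(symbolic: str, symbols: str) -> bool:
    # Encode pattern and input as characters, then require a full match.
    pattern = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], symbolic).replace(" ", "")
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(pattern, encoded) is not None

# (pattern, input symbols, expected result) taken from the examples in the text.
checks = [
    ("WORD [AT DOT]+ WORD", "WORD DOT AT DOT DOT WORD", True),
    ("WORD [AT DOT]+ WORD", "WORD WORD AT WORD", False),
    ("[^WORD]+ WORD", "AT DOT WORD", True),
    ("[^WORD]+ WORD", "WORD DOT WORD", False),
    ("WORD{2} DOT WORD{2}", "WORD WORD DOT WORD WORD", True),
    ("WORD (DOT WORD){1,3}", "WORD DOT WORD DOT WORD DOT WORD", True),
    ("WORD (DOT WORD){1,3}", "WORD", False),
]
for pattern, symbols, expected in checks:
    print(pattern, "|", symbols, "->", matches(pattern, symbols) == expected)
```

Every line of output ends in True when the sketch agrees with the behavior described in the text.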
Actions
Within the Pattern Match tool, actions assign a class to each matched token. An action is written as the name of the class followed by an equal sign (=), placed immediately before the subexpression whose matched tokens should receive that class.
For example, consider the input records:
ID | TOKEN | SYMBOL |
---|---|---|
1 | joe | WORD |
1 | @ | AT |
1 | mycompany | WORD |
1 | . | DOT |
1 | com | WORD |
Let's start with the pattern WORD AT WORD (DOT WORD)+. Adding actions to it defines classes for the parts we want: LOGIN=WORD AT DOMAIN=(WORD (DOT WORD)+).
This pattern produces the result table:
ID | TOKEN | SYMBOL | CLASS |
---|---|---|---|
1 | joe | WORD | LOGIN |
1 | @ | AT | |
1 | mycompany | WORD | DOMAIN |
1 | . | DOT | DOMAIN |
1 | com | WORD | DOMAIN |
Note that the DOMAIN= action includes the entire parenthetic subexpression that follows.
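Actions behave much like named groups in a character-based regex engine: the class is applied to every token that falls inside the group. The sketch below reproduces the result table above with Python's re module. The one-letter symbol codes and the hand-translated pattern are assumptions made for illustration, and the approach only works when each class name appears once, since Python does not allow duplicate group names.

```python
import re

tokens = ["joe", "@", "mycompany", ".", "com"]
symbols = ["WORD", "AT", "WORD", "DOT", "WORD"]

# Hypothetical one-character code for each symbol (an assumption of this sketch).
CODES = {"WORD": "W", "AT": "A", "DOT": "D"}

# Hand-translated equivalent of LOGIN=WORD AT DOMAIN=(WORD (DOT WORD)+).
pattern = re.compile(r"(?P<LOGIN>W)A(?P<DOMAIN>W(?:DW)+)")

encoded = "".join(CODES[s] for s in symbols)  # one character per token
match = pattern.fullmatch(encoded)

classes = [""] * len(tokens)
if match:
    for name in pattern.groupindex:
        start, end = match.span(name)
        # Character positions line up with token positions.
        for i in range(start, end):
            classes[i] = name

for token, symbol, cls in zip(tokens, symbols, classes):
    print(f"{token} | {symbol} | {cls}")
# joe | WORD | LOGIN
# @ | AT | 
# mycompany | WORD | DOMAIN
# . | DOT | DOMAIN
# com | WORD | DOMAIN
```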
We could also have specified LOGIN=WORD AT DOMAIN=WORD (DOMAIN=DOT DOMAIN=WORD)+ to achieve exactly the same effect. Or, we could specify LOGIN=WORD AT DOMAIN=WORD (DOT DOMAIN=WORD)+ to exclude the DOTs from DOMAIN.
This would produce the following table:
ID | TOKEN | SYMBOL | CLASS |
---|---|---|---|
1 | joe | WORD | LOGIN |
1 | @ | AT | |
1 | mycompany | WORD | DOMAIN |
1 | . | DOT | |
1 | com | WORD | DOMAIN |
If we want to treat the top-level and second-level domains separately from any prefixes in the domain, we could even add an optional PREFIX: LOGIN=WORD AT (PREFIX=WORD PREFIX=(DOT WORD)* DOT)? SECOND=WORD DOT TOP=WORD.
Note that the actions can be at any level in the Symbolic Regular Expression, and the same class name can be used in multiple places. In this case, the expression will produce the following output because there is no prefix, and the prefix is optional (the parenthetic expression is followed by a question mark):
ID | TOKEN | SYMBOL | CLASS |
---|---|---|---|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
However, suppose the input were:
ID | TOKEN | SYMBOL |
---|---|---|
1 |
|
|
1 |
|
|
1 |
|
|
1 |
|
|
1 |
|
|
1 |
|
|
1 |
|
|
Then the same expression would produce:
ID | TOKEN | SYMBOL | CLASS |
---|---|---|---|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
1 |
|
|
|
Symbolic regular expression syntax summary
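Metacharacter | Matches | Example |
---|---|---|
. | Any symbol | WORD . WORD |
? | Zero or one occurrences of a subexpression | WORD DOT? AT WORD |
+ | One or more occurrences of a subexpression | WORD+ DOT WORD |
* | Zero or more occurrences of a subexpression | WORD* DOT WORD |
( ) | Grouped pattern | WORD AT WORD (DOT WORD)+ |
[ ] | One of the specified symbols | WORD [AT DOT]+ WORD |
[^ ] | Not one of the specified symbols | [^WORD]+ WORD |
\| | One expression or another | WORD DOT WORD \| AT DOT WORD |
{m} | Repeat exactly m times | WORD{2} DOT WORD{2} |
{m,n} | Repeat m to n times | WORD (DOT WORD){1,3} |
CLASS= | Placed before subexpression to assign a class to matched tokens | LOGIN=WORD AT DOMAIN=(WORD (DOT WORD)+) |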