Symbolic regular expressions

Overview

Symbolic regular expressions are similar to regular expressions. Both attempt to match sequences of items defined by patterns involving sequences, repetition, counting, and so forth. However, symbolic regular expressions are designed specifically to work with the sequences of symbols processed by the Pattern Match tool, instead of sequences of characters.

The Pattern Match tool employs a list of symbolic regular expressions to perform its processing. Each symbolic regular expression in the list attempts to match the sequence of symbols belonging to each record, and if the sequence of symbols is matched, it further assigns a class to each token.

Symbolic regular expression syntax overview

Let's examine some records containing a tokenized email address after it has been passed through the Token Creation and Symbol Creation tools:

ID	TOKEN	SYMBOL
1	`joe`	`WORD`
1	`@`	`AT`
1	`mycompany`	`WORD`
1	`.`	`DOT`
1	`com`	`WORD`

The Pattern Match tool processes only the SYMBOL column of these records, ignoring everything else. The tool extracts the list of symbols from the records, which in this case is WORD AT WORD DOT WORD.

Think of each symbol as an atomic unit, much like you'd think of characters in standard regular expressions. In order to match sequences of symbols like this, you must build symbolic regular expressions that operate on those symbols. A symbolic regular expression that matches the above example is simply the sequence itself: WORD AT WORD DOT WORD.

Of course this is not very useful, because it matches only that one specific case and doesn't do anything with the results of the match. So, let's look at the additional constructs available in symbolic regular expressions, and what they can do:
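
As a rough illustration of the extraction step described above (not the Pattern Match tool's actual implementation), the Python sketch below groups records by ID and joins their SYMBOL values into one sequence per record. The tuple layout and the `symbol_sequences` helper are assumptions made for this example only.

```python
from itertools import groupby

# (ID, TOKEN, SYMBOL) rows, as in the table above.
records = [
    (1, "joe", "WORD"),
    (1, "@", "AT"),
    (1, "mycompany", "WORD"),
    (1, ".", "DOT"),
    (1, "com", "WORD"),
]

def symbol_sequences(rows):
    """Return {id: "SYMBOL SYMBOL ..."}; assumes rows with the same ID are adjacent."""
    return {
        rec_id: " ".join(symbol for _, _, symbol in group)
        for rec_id, group in groupby(rows, key=lambda row: row[0])
    }

sequences = symbol_sequences(records)
print(sequences[1])  # WORD AT WORD DOT WORD

# The simplest possible pattern is the literal sequence itself.
print(sequences[1] == "WORD AT WORD DOT WORD")  # True
```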

The period

Also known as the dot. This is the wildcard character and represents any single symbol. The regular expression WORD . WORD matches any sequence of three symbols that starts and ends with WORD. For example, it will match WORD DOT WORD or WORD AT WORD, but would not match WORD DOT DOT WORD.

The question mark

Matches the preceding sub-expression zero or one times, making it optional. For example, WORD DOT? AT WORD will match WORD AT WORD or WORD DOT AT WORD but not WORD AT AT WORD.

The asterisk

Any sub-expression followed by an asterisk will match that sub-expression zero or more times. The regular expression WORD* DOT WORD will match DOT WORD, WORD DOT WORD, or WORD WORD WORD DOT WORD.

The plus sign

Matches the preceding sub-expression one or more times. For example WORD+ DOT WORD will match WORD DOT WORD or WORD WORD WORD DOT WORD but not DOT WORD.
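
One way to experiment with the wildcard and the ?, * and + operators outside the tool is to translate a symbolic pattern into an ordinary character regular expression. The Python sketch below does this under stated assumptions: the `matches` helper and the one-letter `CODES` table are hypothetical, only the symbols WORD, AT and DOT are supported, and a pattern is taken to match only when it covers the whole symbol sequence. It is an emulation for checking the examples above, not a description of how the Pattern Match tool works internally.

```python
import re

# Hypothetical one-letter codes for the symbols used in this document.
CODES = {"WORD": "w", "AT": "a", "DOT": "d"}

def matches(pattern, symbols):
    """Return True if the symbolic pattern matches the whole symbol sequence."""
    # Swap each symbol name for its one-letter code, then drop the spaces,
    # leaving an ordinary character regex such as "wd?aw".
    char_regex = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], pattern)
    char_regex = char_regex.replace(" ", "")
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(char_regex, encoded) is not None

print(matches("WORD . WORD", "WORD AT WORD"))                  # True
print(matches("WORD . WORD", "WORD DOT DOT WORD"))             # False
print(matches("WORD DOT? AT WORD", "WORD DOT AT WORD"))        # True
print(matches("WORD DOT? AT WORD", "WORD AT AT WORD"))         # False
print(matches("WORD* DOT WORD", "DOT WORD"))                   # True
print(matches("WORD+ DOT WORD", "WORD WORD WORD DOT WORD"))    # True
print(matches("WORD+ DOT WORD", "DOT WORD"))                   # False
```

Encoding each symbol as a single character is what lets the character-level operators act on whole symbols.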

The vertical bar or pipe

This is used to specify alternatives. For example WORD DOT WORD | AT DOT WORD will match WORD DOT WORD or AT DOT WORD but not WORD DOT WORD DOT WORD.

Note that | has very low precedence: it binds more loosely than a sequence of symbols. Thus the above example is treated as if you had specified:

(WORD DOT WORD) | (AT DOT WORD)

Parentheses

These can be used to limit the scope of the alternation operator | so that an alternative can be concatenated with other sub-expressions. For example, WORD DOT (WORD | AT) DOT WORD groups WORD | AT together, overriding the default grouping. This expression will match WORD DOT WORD DOT WORD or WORD DOT AT DOT WORD.

Grouping with parentheses also allows subexpressions to be operated on as a whole. For example WORD AT WORD (DOT WORD)+ will match WORD AT WORD DOT WORD or WORD AT WORD DOT WORD DOT WORD but not WORD AT WORD DOT DOT WORD.

The expression WORD AT WORD (DOT WORD)+ is a symbolic regular expression that matches a realistic tokenized email address, allowing any number of repetitions of the DOT WORD part of the "domain" portion of the address.
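
The alternation and grouping examples above can be checked with a small Python emulation. This is a sketch under assumptions, not the tool's implementation: the `matches` helper and the one-letter `CODES` table are hypothetical, only WORD, AT and DOT are supported, and a pattern must cover the whole symbol sequence to count as a match.

```python
import re

# Hypothetical one-letter codes for the symbols used in this document.
CODES = {"WORD": "w", "AT": "a", "DOT": "d"}

def matches(pattern, symbols):
    """Return True if the symbolic pattern matches the whole symbol sequence."""
    # Translate symbol names to single characters and remove the spaces.
    char_regex = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], pattern)
    char_regex = char_regex.replace(" ", "")
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(char_regex, encoded) is not None

# Alternation binds loosely: (WORD DOT WORD) | (AT DOT WORD).
print(matches("WORD DOT WORD | AT DOT WORD", "AT DOT WORD"))             # True
print(matches("WORD DOT WORD | AT DOT WORD", "WORD DOT WORD DOT WORD"))  # False

# Parentheses restrict the alternation to a single position.
print(matches("WORD DOT (WORD | AT) DOT WORD", "WORD DOT AT DOT WORD"))  # True

# A repeated group: the email pattern.
email = "WORD AT WORD (DOT WORD)+"
print(matches(email, "WORD AT WORD DOT WORD DOT WORD"))  # True
print(matches(email, "WORD AT WORD DOT DOT WORD"))       # False
```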

Square brackets

These represent any single symbol contained between the square brackets. For example, the expression WORD [AT DOT]+ WORD will match WORD AT AT WORD or WORD DOT AT DOT DOT WORD but not WORD WORD AT WORD.

The caret

If the first character between brackets is a caret, it means "not these symbols." Any symbol except one within the brackets can match. Thus, the expression [^WORD]+ WORD will match AT DOT WORD or DOT DOT WORD but not WORD DOT WORD.

Counted repetition

If you want a sub-expression to appear a given number of times, follow the sub-expression with a number within {}. For example, WORD{2} DOT WORD{2} matches only WORD WORD DOT WORD WORD.

Range of repetition

If you want a sub-expression to appear a given number of times, where the number of times is between a lower bound N and upper bound M, follow the sub-expression with {N,M}. For example WORD (DOT WORD){1,3} matches WORD DOT WORD or WORD DOT WORD DOT WORD DOT WORD but not WORD or WORD DOT DOT WORD.
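
The bracket, caret and repetition-count examples can be exercised the same way. The Python sketch below is an emulation under assumptions rather than the tool's own logic: the `matches` helper and the one-letter `CODES` table are hypothetical, only WORD, AT and DOT are supported, and a pattern must cover the whole symbol sequence.

```python
import re

# Hypothetical one-letter codes for the symbols used in this document.
CODES = {"WORD": "w", "AT": "a", "DOT": "d"}

def matches(pattern, symbols):
    """Return True if the symbolic pattern matches the whole symbol sequence."""
    # Translate symbol names to single characters and remove the spaces,
    # so "[AT DOT]+" becomes the character class "[ad]+".
    char_regex = re.sub(r"[A-Z]+", lambda m: CODES[m.group()], pattern)
    char_regex = char_regex.replace(" ", "")
    encoded = "".join(CODES[s] for s in symbols.split())
    return re.fullmatch(char_regex, encoded) is not None

print(matches("WORD [AT DOT]+ WORD", "WORD DOT AT DOT DOT WORD"))   # True
print(matches("WORD [AT DOT]+ WORD", "WORD WORD AT WORD"))          # False
print(matches("[^WORD]+ WORD", "AT DOT WORD"))                      # True
print(matches("[^WORD]+ WORD", "WORD DOT WORD"))                    # False
print(matches("WORD{2} DOT WORD{2}", "WORD WORD DOT WORD WORD"))    # True
print(matches("WORD (DOT WORD){1,3}", "WORD DOT WORD DOT WORD DOT WORD"))  # True
print(matches("WORD (DOT WORD){1,3}", "WORD"))                      # False
```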

Actions

Within the Pattern Match tool, actions assign a class to each matched token. Actions in Symbolic Regular Expressions are denoted by an equal sign = preceded by the name of the class you want to assign to tokens matched by that part of the sub-expression.

For example, consider the input records:

ID	TOKEN	SYMBOL
1	`joe`	`WORD`
1	`@`	`AT`
1	`mycompany`	`WORD`
1	`.`	`DOT`
1	`com`	`WORD`

Let's start with the pattern WORD AT WORD (DOT WORD)+. Adding actions to it defines classes for the parts we want: LOGIN=WORD AT DOMAIN=(WORD (DOT WORD)+).

This pattern produces the result table:

ID	TOKEN	SYMBOL	CLASS
1	`joe`	`WORD`	`LOGIN`
1	`@`	`AT`
1	`mycompany`	`WORD`	`DOMAIN`
1	`.`	`DOT`	`DOMAIN`
1	`com`	`WORD`	`DOMAIN`

Note that the DOMAIN= action includes the entire parenthetic subexpression that follows.
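
To make the effect of actions concrete, the Python sketch below emulates LOGIN=WORD AT DOMAIN=(WORD (DOT WORD)+) with named groups. Everything here is an assumption made for illustration: the one-letter `CODES` table, the hand-translated regular expression, and the `assign_classes` helper are hypothetical and do not reflect how the Pattern Match tool is implemented.

```python
import re

tokens = ["joe", "@", "mycompany", ".", "com"]
symbols = ["WORD", "AT", "WORD", "DOT", "WORD"]

# Hypothetical one-letter codes: one character per symbol, so character
# offsets in the encoded string are also token positions.
CODES = {"WORD": "w", "AT": "a", "DOT": "d"}
encoded = "".join(CODES[s] for s in symbols)  # "wawdw"

# Hand translation of LOGIN=WORD AT DOMAIN=(WORD (DOT WORD)+).
regex = re.compile(r"(?P<LOGIN>w)a(?P<DOMAIN>w(?:dw)+)")

def assign_classes(regex, encoded):
    """Return one class name (or None) per token position."""
    classes = [None] * len(encoded)
    match = regex.fullmatch(encoded)
    if match is None:
        return classes
    for name in regex.groupindex:
        start, end = match.span(name)  # (-1, -1) if the group did not take part
        for i in range(start, end):
            classes[i] = name
    return classes

classes = assign_classes(regex, encoded)
for token, symbol, cls in zip(tokens, symbols, classes):
    print(token, symbol, cls or "")
```

The class list mirrors the CLASS column of the result table: LOGIN for the first token, nothing for the AT token, and DOMAIN for the remaining three.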

We could have also specified LOGIN=WORD AT DOMAIN=WORD (DOMAIN=DOT DOMAIN=WORD)+ to achieve exactly the same effect. Or, we could specify LOGIN=WORD AT DOMAIN=WORD (DOT DOMAIN=WORD)+ to exclude the DOTs from DOMAIN.

The second form, which excludes the DOTs from DOMAIN, would produce the following table:

ID	TOKEN	SYMBOL	CLASS
1	`joe`	`WORD`	`LOGIN`
1	`@`	`AT`
1	`mycompany`	`WORD`	`DOMAIN`
1	`.`	`DOT`
1	`com`	`WORD`	`DOMAIN`

If we want to treat the top-level and second-level domains separately from any prefixes in the domain, we could even add an optional PREFIX: LOGIN=WORD AT (PREFIX=WORD PREFIX=(DOT WORD)* DOT)? SECOND=WORD DOT TOP=WORD.

Note that the actions can be at any level in the symbolic regular expression, and the same class name can be used in multiple places. For the input above, which has no prefix, the expression produces the following output; the match still succeeds because the prefix is optional (the parenthesized expression is followed by a question mark):

ID	TOKEN	SYMBOL	CLASS
1	`joe`	`WORD`	`LOGIN`
1	`@`	`AT`
1	`mycompany`	`WORD`	`SECOND`
1	`.`	`DOT`
1	`com`	`WORD`	`TOP`

However, suppose the input were:

ID	TOKEN	SYMBOL
1	`joe`	`WORD`
1	`@`	`AT`
1	`mailserver`	`WORD`
1	`.`	`DOT`
1	`mycompany`	`WORD`
1	`.`	`DOT`
1	`com`	`WORD`

Then the same expression would produce:

ID	TOKEN	SYMBOL	CLASS
1	`joe`	`WORD`	`LOGIN`
1	`@`	`AT`
1	`mailserver`	`WORD`	`PREFIX`
1	`.`	`DOT`
1	`mycompany`	`WORD`	`SECOND`
1	`.`	`DOT`
1	`com`	`WORD`	`TOP`
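
The optional-prefix behavior shown in the two result tables can be reproduced with a small Python emulation. This is a sketch under assumptions: the one-letter `CODES` table, the `classify` helper, and the hand-translated regular expression are hypothetical, and the two adjacent PREFIX= actions are merged into a single named group because Python does not allow a group name to be reused.

```python
import re

# Hypothetical one-letter codes: one character per symbol.
CODES = {"WORD": "w", "AT": "a", "DOT": "d"}

# Hand translation of
# LOGIN=WORD AT (PREFIX=WORD PREFIX=(DOT WORD)* DOT)? SECOND=WORD DOT TOP=WORD
regex = re.compile(
    r"(?P<LOGIN>w)a(?:(?P<PREFIX>w(?:dw)*)d)?(?P<SECOND>w)d(?P<TOP>w)"
)

def classify(symbols):
    """Return one class name (or None) per symbol in a space-separated sequence."""
    encoded = "".join(CODES[s] for s in symbols.split())
    classes = [None] * len(encoded)
    match = regex.fullmatch(encoded)
    if match is None:
        return classes
    for name in regex.groupindex:
        start, end = match.span(name)  # (-1, -1) when the optional group is skipped
        for i in range(start, end):
            classes[i] = name
    return classes

# joe @ mycompany . com
print(classify("WORD AT WORD DOT WORD"))
# ['LOGIN', None, 'SECOND', None, 'TOP']

# joe @ mailserver . mycompany . com
print(classify("WORD AT WORD DOT WORD DOT WORD"))
# ['LOGIN', None, 'PREFIX', None, 'SECOND', None, 'TOP']
```

In both cases the last two WORD symbols end up as SECOND and TOP, and PREFIX is assigned only when extra domain components precede them.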

Symbolic regular expression syntax summary

Metacharacter	Matches	Example
`.`	Any symbol	`WORD . WORD` matches `WORD AT WORD`.
`?`	Zero or one occurrences of a subexpression	`(WORD DOT)?` matches the empty sequence or `WORD DOT`.
`+`	One or more occurrences of a subexpression	`(WORD DOT)+` matches `WORD DOT WORD DOT`.
`*`	Zero or more occurrences of a subexpression	`(WORD DOT)*` matches the empty sequence or `WORD DOT WORD DOT`.
`( )`	Grouped pattern	`(WORD DOT)+ WORD` matches `WORD DOT WORD` and `WORD DOT WORD DOT WORD` but not `WORD WORD DOT WORD`.
`[ ]`	One of the specified symbols	`[WORD DOT]+` matches `WORD WORD DOT`.
`[^ ]`	Not one of the specified symbols	`[^WORD]+` matches `AT DOT DOT` but not `AT WORD DOT`.
`|`	One expression or another	`WORD | DOT AT` matches `WORD` or `DOT AT`.
`{m}`	Repeat exactly m times	`.{3}` matches any three symbols.
`{m,n}`	Repeat m to n times	`WORD{1,3}` matches `WORD WORD`.
`=`	Preceded by a class name and placed before a subexpression to assign that class to matched tokens	`LNAME=WORD COMMA FNAME=WORD` gets first and last names from (for example) Smith, Mary.