Regular expressions (Data Management variant)

Overview

This section describes Data Management's variant on the standard regular-expression language. It applies to the following tools:

And to the following deprecated functions:

See Using regular expressions for documentation of standard regular expressions and the tools and functions that use them.

Data Management variant regular expressions

A regular expression is a sequence of characters that represents a pattern. The pattern can be a fixed word, like John or purple, or can describe something more general, like any word starting with the letter T and ending in a vowel, or any five-digit number which doesn't contain a 9. Data Management's pattern tools use regular expressions to describe the text strings being matched.

Data Management actually uses two closely related types of regular expressions. The first, which we'll simply call regular expressions, is designed to match and process text strings, and is similar to what you may have seen in UNIX system tools like grep and libraries like regexp. The second is a special case of regular expressions, which operates on sets of symbols instead of text, and is specific to the Pattern Match tool. We call these symbolic regular expressions.

Differences from standard regexp (Data Management variant)

Data Management's regular expression language is similar to others you may be familiar with, with a few significant differences:

Each character literal must be enclosed in single quotes
Sequences of character literals (strings) must be enclosed in double quotes: anything in the target text that consists of exactly those characters in exactly the order listed will match. A lower case character is not identical to its upper case version, and vice versa. For example, "John" will match John but not john.
Some characters must be preceded by a backslash (\): a backslash combined with the character r or n signifies a carriage return or newline character. For example, the regexp "AB\r\nCD" will match AB separated from CD by a CR-LF sequence. To match the backslash character itself, simply double it (\\).
No "anchor" characters: standard regular expression languages have special constructs to specify the position at which a particular pattern occurs—the beginning or end of an input string, for example. However, Data Management's regular expressions implicitly anchor both the front and end of the string being matched. To "unanchor" the patterns, add the sequence .* to the beginning or end of the pattern.
No backreferences: standard regular expression languages have "backreferences," which allow you to reference a part of the matched string later in the pattern. Data Management does not support backreferences.
Substring references are different: standard regular expressions only allow you to reference substrings by using top-level subexpressions (parts of the regular expression grouped with parentheses () at the top level of parentheses nesting). Data Management supports a much more flexible model, in which any portion of a string matched by any part of a pattern may be tagged by an "action" for later processing.
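The implicit anchoring described above can be illustrated with a minimal Python sketch. Python's re module is not the Data Management engine; the helper name dm_match and the hand-translated patterns are assumptions made purely for illustration. Here re.fullmatch stands in for "anchored at both ends", and adding .* on either side unanchors the pattern.

```python
import re

def dm_match(pattern: str, text: str) -> bool:
    """Hypothetical helper: the whole string must match, as with implicit anchoring."""
    return re.fullmatch(pattern, text) is not None

# "John" from the Data Management variant, hand-translated to a Python pattern.
exact = dm_match(r"John", "John")                 # whole string is John
anchored = dm_match(r"John", "John Smith")        # trailing text is not covered
unanchored = dm_match(r".*John.*", "John Smith")  # .* absorbs the surrounding text
case_sensitive = dm_match(r"John", "john")        # case must agree

print(exact, anchored, unanchored, case_sensitive)  # True False True False
```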

Regular expression syntax (Data Management variant)

Regular expressions are constructed by combining quoted strings of characters (representing text to be matched) with some special characters, termed metacharacters.

Single quotes

These are used to enclose each character literal. A quoted character matches exactly that character in the target text. A lower case character is not identical to its upper case version, and vice versa. For example: 'J''o''h''n' matches John but not john.

Note that spaces are ignored unless they are enclosed in single quotes.

Double quotes

These are used to enclose each string literal. Anything in the target text that consists of exactly those characters in exactly the order listed will match. A lower case character is not identical to its upper case version, and vice versa. For example: "John" matches John but not john.

The period

Also known as the dot. This is the wildcard: it represents any single character. The regular expression 'p' . 'n' matches the words pin, pan, pen, or any string where a p and an n are separated by exactly one character. For instance, "Peters" . 'n' matches Peterson and Petersen.

Users familiar with DOS command-line wildcards know the question mark as filling the role of "some character" in command masks. But in regular expressions, the question mark has a different meaning, and the period is used as a wildcard.

The question mark

Matches the preceding subexpression zero or one time, which makes that subexpression optional. For example: 'c' ? "an" matches an and can.

The asterisk

Any one-character expression followed by an asterisk will match that character zero or more times. The regular expression: 'h' 'i' 's' * matches his, hiss, or hissssssssssss—but also hi (an h, followed by an i, followed by zero s characters). Similarly: "tech" .* "support" would locate references to techsupport as well as technical advisors and support, so long as tech precedes support on the same line.

The plus sign

Matches the preceding subexpression one or more times. For example: 'z' 'o'+ matches zo and zoo, but not z.
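The three repetition operators described so far (?, * and +) can be compared side by side. The following is a rough Python sketch, assuming that Python's re quantifiers behave like the ones described here; the patterns are hand-translated from the examples above, and the helper name full is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Hypothetical helper: whole-string match, mimicking implicit anchoring."""
    return re.fullmatch(pattern, text) is not None

# 'c' ? "an"  ->  optional leading c
optional = [w for w in ["an", "can", "ccan"] if full(r"c?an", w)]

# 'h' 'i' 's' *  ->  zero or more s characters after hi
star = [w for w in ["h", "hi", "his", "hisss"] if full(r"his*", w)]

# 'z' 'o'+  ->  one or more o characters after z
plus = [w for w in ["z", "zo", "zoo"] if full(r"zo+", w)]

print(optional)  # ['an', 'can']
print(star)      # ['hi', 'his', 'hisss']
print(plus)      # ['zo', 'zoo']
```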

The vertical bar or pipe

This is used to specify alternatives. For example: 'a''b''c'|['0'-'9']+ matches either abc or an integer. The pipe has the lowest precedence of all the regular expression operators.

Parentheses

These group subexpressions, which is most often needed around the alternation operator |. For example, 'g''r'('a'|'e')'y' groups the 'a'|'e' together, matching both the American and British spellings of the color (gray and grey). Without these explicit parentheses, the expression would be grouped as if you had specified ('g''r''a') | ('e''y'). Grouping with parentheses also allows subexpressions to be operated on as a whole. For example, ('a''b')+'c' matches abc, ababc, and so on.
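The effect of parentheses on alternation can be checked with a small Python sketch. It assumes that Python's re gives | the same lowest precedence described above; the patterns are hand-translated equivalents, not Data Management syntax.

```python
import re

words = ["gray", "grey", "gra", "ey"]

# 'g''r'('a'|'e')'y'  ->  the alternation is limited to the middle letter
grouped_hits = [w for w in words if re.fullmatch(r"gr(a|e)y", w)]

# 'g''r''a'|'e''y'  ->  without parentheses the pipe splits the whole pattern
ungrouped_hits = [w for w in words if re.fullmatch(r"gra|ey", w)]

# ('a''b')+'c'  ->  the group is repeated as a unit
repeated_hits = [w for w in ["c", "abc", "ababc", "abab"] if re.fullmatch(r"(ab)+c", w)]

print(grouped_hits)    # ['gray', 'grey']
print(ungrouped_hits)  # ['gra', 'ey']
print(repeated_hits)   # ['abc', 'ababc']
```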

The dash

Used with square brackets to indicate a range of characters (see below).

Square brackets

These represent any single character contained between the square brackets. For example 'p'['i' 'a' 'e']'n' matches the words pin or pan or pen, but not pun. It will also not match pain, even though both a and i appear between the square brackets.

Square brackets may also be used to match a range of characters, when two characters between the brackets are separated by a hyphen. So the regular expression ['A'-'Z'] matches any uppercase letter, and ['a'-'f' 'h'-'z'] matches any lowercase letter except g.

The caret

If the first character between brackets is a caret, it means "not these characters"; any character except one within the brackets can match. The expression [^'1'-'5'] matches any character except the digits one through five.
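Character sets, ranges, and negated sets map closely onto Python's bracket expressions. The sketch below uses hand-translated Python patterns as a stand-in (an assumption; the Data Management engine itself is not involved) to reproduce the examples from this and the preceding sections.

```python
import re

# 'p'['i' 'a' 'e']'n'  ->  exactly one character from the set
set_hits = [w for w in ["pin", "pan", "pen", "pun", "pain"] if re.fullmatch(r"p[iae]n", w)]

# ['a'-'f' 'h'-'z']  ->  any lowercase letter except g
range_hits = [c for c in "efgh" if re.fullmatch(r"[a-fh-z]", c)]

# [^'1'-'5']  ->  any character except the digits one through five
negated_hits = [c for c in "0356x" if re.fullmatch(r"[^1-5]", c)]

print(set_hits)      # ['pin', 'pan', 'pen']
print(range_hits)    # ['e', 'f', 'h']
print(negated_hits)  # ['0', '6', 'x']
```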

Shorthands

There are a number of single-letter shorthands for the common cases of matching digits, letters, or other characters and symbols. These shorthands are as follows.

Shorthand	Meaning
d	Any digit
D	Anything not a digit
a	Any letter
A	Anything not a letter
s	Any whitespace
S	Anything not whitespace
w	Any alphanumeric character
W	Anything not an alphanumeric character

For example a+ d+ a+ matches digits sandwiched between letters.
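For readers more familiar with other regular-expression dialects, the shorthands can be approximated with Python character classes. The mapping below is a hypothetical sketch: the name SHORTHAND, the ASCII-only classes, and the exclusion of the underscore from w are all assumptions, and the real engine may treat non-ASCII text differently.

```python
import re

# Hypothetical mapping from each shorthand to a Python character class.
# ASCII-only classes are an assumption made for this illustration.
SHORTHAND = {
    "d": r"[0-9]",         # any digit
    "D": r"[^0-9]",        # anything not a digit
    "a": r"[A-Za-z]",      # any letter
    "A": r"[^A-Za-z]",     # anything not a letter
    "s": r"\s",            # any whitespace
    "S": r"\S",            # anything not whitespace
    "w": r"[A-Za-z0-9]",   # any alphanumeric character
    "W": r"[^A-Za-z0-9]",  # anything not alphanumeric
}

# a+ d+ a+  ->  digits sandwiched between letters
sandwich = f"{SHORTHAND['a']}+{SHORTHAND['d']}+{SHORTHAND['a']}+"

print(bool(re.fullmatch(sandwich, "ab123cd")))  # True
print(bool(re.fullmatch(sandwich, "123cd")))    # False
```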

Counted repetition

If you want a sub-expression to appear a given number of times, follow the sub-expression with a number in curly brackets {}. For example a{2} ' ' d{5} matches exactly two letters separated from five digits by a space, such as AB 12345.

Range of repetition

If you want a sub-expression to appear a given number of times, where the number of times is between a lower bound N and upper bound M, follow the sub-expression with {N,M}. For example (a d){2,6} matches two-to-six repetitions of a letter followed by a digit, such as a1z9 or b4j2k0.
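Both forms of counted repetition can be tried out with a short Python sketch. The patterns are hand-translated, with [A-Za-z] and [0-9] standing in for the a and d shorthands (an ASCII-only assumption).

```python
import re

LETTER = r"[A-Za-z]"  # stand-in for the a shorthand (assumption: ASCII only)
DIGIT = r"[0-9]"      # stand-in for the d shorthand

# a{2} ' ' d{5}  ->  exactly two letters, a space, exactly five digits
exact = f"{LETTER}{{2}} {DIGIT}{{5}}"

# (a d){2,6}  ->  two to six repetitions of letter-then-digit
ranged = f"(?:{LETTER}{DIGIT}){{2,6}}"

exact_hits = [t for t in ["AB 12345", "A 12345", "AB 1234"] if re.fullmatch(exact, t)]
ranged_hits = [t for t in ["a1", "a1z9", "b4j2k0"] if re.fullmatch(ranged, t)]

print(exact_hits)   # ['AB 12345']
print(ranged_hits)  # ['a1z9', 'b4j2k0']
```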

Actions

In some tools, you want the regular-expression function or tool to perform an action when it matches sub-parts of a regular expression. For example, in the Token Creation tool you use actions to split out parts of a token into new tokens. In the RegularExpressionExtract and RegularExprFormat functions, you use actions to designate portions of the matched string to be extracted to the function result.

You indicate an action by placing the equal sign = before an expression term.

Actions in the Token Creation tool

In the Token Creation tool, use regular expressions with actions to split apart complex tokens. For example, in catalogs it is common to encounter an "amount" which is a quantity followed by a unit, like 1000ml. To split text like this into separate tokens, use an expression such as =(d+) =(a+).

This tells the token creation tool to split the text into two parts, and assign each part to a token. In this example, 1000ml would be split into 1000 and ml. Don't forget the equal sign—any part of the text matching a sub-expression but without an action will be ignored.
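The splitting behavior can be approximated outside the tool with Python capture groups, where each parenthesized group plays the role of an =( ) action. This is only an analogy under that assumption; the function name split_amount is hypothetical and the Token Creation tool itself is not being called.

```python
import re

def split_amount(text: str, pattern: str = r"([0-9]+)([A-Za-z]+)"):
    """Hypothetical stand-in for =(d+) =(a+): return the captured parts, or None."""
    match = re.fullmatch(pattern, text)
    return list(match.groups()) if match else None

both_tagged = split_amount("1000ml")

# Dropping the first "action" (capture group) means the digits are matched
# but not kept, mirroring the warning about forgetting the equal sign.
unit_only = split_amount("1000ml", r"[0-9]+([A-Za-z]+)")

print(both_tagged)  # ['1000', 'ml']
print(unit_only)    # ['ml']
```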

Actions in the RegularExprFormat function

In the RegularExprFormat function, you place a number before the equal sign = both to designate that expression term as an action and to give that action a reference number that can be used in the format string. For example, RegularExprFormat("123456789", "1=d{3} 2=d{2} 3=d{4}", "%1%-%2%-%3%") returns 123-45-6789.
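The numbered-action idea can be sketched in Python by treating each numbered action as a numbered capture group and substituting %N% placeholders in the format string. The function regex_format below is a hypothetical stand-in, not the real RegularExprFormat; in particular, returning None when the input does not match is an assumption.

```python
import re

def regex_format(text: str, pattern: str, fmt: str):
    """Hypothetical emulation: fill %N% in fmt with the text captured by group N."""
    match = re.fullmatch(pattern, text)
    if match is None:
        return None  # assumption: behavior on a failed match is not documented here
    # Replace each %N% placeholder with the corresponding captured group.
    return re.sub(r"%(\d+)%", lambda ph: match.group(int(ph.group(1))), fmt)

# 1=d{3} 2=d{2} 3=d{4}, hand-translated to three Python capture groups.
ssn = regex_format("123456789", r"([0-9]{3})([0-9]{2})([0-9]{4})", "%1%-%2%-%3%")

print(ssn)  # 123-45-6789
```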

Regular expression syntax summary (Data Management variant)

Metacharacter	Matches	Example
' '	Character literal	'B''l''u''e' matches blue.
" "	String literal	"Bat" matches Bat.
.	Any single character	'g'.'t' finds get, got, gut, g t.
?	Preceding expression, zero or one time	"do" ("es") ? matches do and does.
+	Preceding expression, one or more times	'w' .+ 'e' finds wide, white, write but not we.
*	Preceding expression, zero or more times	'w' .* 'e' finds wide, white, write and we.
( )	Grouped pattern	('a' 'b')+ matches ab and abab but not aab.
(?>...)	Grouped pattern with no backtracking. This is an optimization for advanced users who already understand regular expressions in depth.	d+a+ will run faster if expressed as (?>d+)(?>a+).
[ ]	One of the specified characters	'g'['e''o']'t' finds get and got but not gut.
[ - ]	One of the characters in a range	['b'-'p']"at" finds bat, cat, fat, hat, mat but not rat or sat.
[^ - ]	One of the characters not in a range	[^'b'-'p'] "at" matches rat and sat but not bat, cat, or hat.
|	One expression or another	'W'("in"|"indows") will find Win or Windows.
d	Any digit	d+ matches an integer.
D	Anything not a digit	D+ matches john but not 3john.
a	Any letter	a+ matches susan but not sus-an.
A	Anything not a letter	A+ matches 123-456.
s	Any whitespace	a+sa+sa+ matches kilroy was here.
S	Anything not whitespace	S+sS+sS+ matches we're number 1.
w	Any alphanumeric	w+ matches 1000ml but not 1000#.
W	Anything not alphanumeric	w+Ww+ matches Lotus-123.
{m}	Repeat exactly m times	a{3} matches any three-letter sequence.
{m,n}	Repeat m to n times	a{2,3} matches any two- or three-letter sequence.
=	Placed before sub-part of regular expression to denote an action	=(d+) =(a+) splits matched string 1000ml into two tokens: 1000 and ml.

Regular expression examples (Data Management variant)

Email address: (w|'-'|'.'|'_')+ '@' ((w|'-')+'.')+ (w|'-')+
10-digit phone number with optional parentheses/dashes: '('? d{3} ')'? d{3} '-'? d{4}
Social security number with optional dashes: d{3} '-'? d{2} '-'? d{4}
ZIP Code (+4 optional): d{5} ('-'d{4})?
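To experiment with these patterns outside Data Management, they can be hand-translated into Python. The sketch below is an approximation under stated assumptions: the names W, EXAMPLES, and matches are hypothetical, the w and d shorthands are taken to be ASCII-only, and re.fullmatch stands in for implicit anchoring.

```python
import re

W = r"[A-Za-z0-9]"  # stand-in for the w shorthand (assumption: ASCII alphanumerics)

# Hand-translated Python equivalents of the four examples above.
EXAMPLES = {
    "email": rf"(?:{W}|-|\.|_)+@(?:(?:{W}|-)+\.)+(?:{W}|-)+",
    "phone": r"\(?[0-9]{3}\)?[0-9]{3}-?[0-9]{4}",
    "ssn": r"[0-9]{3}-?[0-9]{2}-?[0-9]{4}",
    "zip": r"[0-9]{5}(?:-[0-9]{4})?",
}

def matches(name: str, text: str) -> bool:
    """Whole-string match against one of the translated example patterns."""
    return re.fullmatch(EXAMPLES[name], text) is not None

print(matches("email", "jane.doe@example.com"))  # True
print(matches("phone", "(555)123-4567"))         # True
print(matches("ssn", "123-45-6789"))             # True
print(matches("zip", "12345-6789"))              # True
print(matches("zip", "1234"))                    # False
```

Note that the phone pattern, like the original, allows a dash only before the last four digits, so 555-123-4567 does not match.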