Regular expressions

Overview

Data Management uses several different forms of regular expressions.

Perl regex

The primary regular expression syntax supported by Data Management follows Perl syntax and is implemented with the open-source PCRE library. This library is well supported, actively maintained, and amply documented. It also handles many advanced constructs that the Data Management variant used in the Pattern Matching tools does not, such as:

Unicode character classes
Non-backtracking groups
Forward and reverse assertions
Matching back-references
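
The four constructs listed above can be sketched as follows. This is a minimal illustration written against Java's java.util.regex rather than PCRE itself, on the assumption that the two engines accept the same Perl-style syntax for these particular patterns. The class and method names are hypothetical and are not part of Data Management.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not a Data Management API.
final class AdvancedConstructsDemo {

    // Unicode character class: \p{L}+ matches a run of letters in any script.
    static boolean isAllLetters(String s) {
        return Pattern.matches("\\p{L}+", s);
    }

    // Non-backtracking (atomic) group: (?>a+) never gives back an 'a' it has
    // consumed, so the trailing 'a' cannot match on input made only of 'a'.
    static boolean atomicMatches(String s) {
        return Pattern.matches("(?>a+)a", s);
    }

    // Forward and reverse assertions: digits that are preceded by '$'
    // (reverse assertion) and followed by " USD" (forward assertion).
    static String dollarAmount(String s) {
        Matcher m = Pattern.compile("(?<=\\$)\\d+(?= USD)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Back-reference: \1 must repeat exactly what group 1 captured.
    static boolean hasDoubledWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllLetters("\u0141\u00f3d\u017a")); // a Polish place name
        System.out.println(atomicMatches("aaa"));
        System.out.println(dollarAmount("Total: $125 USD"));
        System.out.println(hasDoubledWord("this is is a test"));
    }
}
```

The atomic-group case is the least intuitive: the ordinary pattern a+a matches "aaa" because the engine backtracks one character, while (?>a+)a does not match it.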

This regular expression syntax is used by the Regex tool, as well as by the Data Management functions prefixed with Regex.

Java regex

The Regex Match Table tool uses the Java regular expression API rather than Perl PCRE. Java regular expressions are very similar to the Perl syntax. See Comparison to Perl 5 for a list of differences between the two.

Data Management variant regex

A few tools use a non-standard implementation developed to meet Data Management's specific requirements for parsing and matching. Use Data Management variant regular expressions with those tools.

Data Management variant regular expressions are also used in the Data Management functions prefixed with RegularExpr. These functions have been retained for backwards compatibility, but you should use the newer equivalents prefixed with Regex instead. See About regular expressions (Data Management variant) for more information on the Data Management variant regular expression language.

Using regular expressions in Data Management

In Data Management, you use regular expressions as arguments to the functions prefixed with Regex:

In an expression, regular expressions can be entered as strings: RegexMatch(FIELD, "\d+").

They can also be entered using the /.../ syntax. This has the advantage that the \r, \n, \t and \\ escape sequences are not "double-applied" as they would be for strings: RegexMatch(FIELD, /\d+/).

Or, the regular expression can be stored in a field or computed using an expression, like: RegexMatch(FIELD, REGEX+"[a-z]").
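
The string form and the computed form both pass the pattern through a string layer before the regular expression engine sees it. The sketch below shows the same two-layer effect in Java, whose string literals also process backslashes first. Java has no /.../ literal, so this is only an analogy for the behavior described above. The class and method names are hypothetical, and the use of an unanchored search is an assumption, since the text does not say whether RegexMatch anchors its match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not a Data Management API.
final class EscapingLayersDemo {

    // True when the regex matches anywhere in the field value (an assumption;
    // the source does not say whether RegexMatch anchors the match).
    static boolean found(String field, String regex) {
        return Pattern.compile(regex).matcher(field).find();
    }

    // Analogue of REGEX+"[a-z]": pattern text held in a variable is extended
    // by plain string concatenation before it is compiled.
    static boolean foundWithLetterSuffix(String field, String storedRegex) {
        return found(field, storedRegex + "[a-z]");
    }

    public static void main(String[] args) {
        // The Java literal "\\d+" holds three characters: backslash, d, plus.
        System.out.println("\\d+".length());
        // Four backslashes in source become a regex that matches one backslash.
        System.out.println(found("C:\\temp", "\\\\"));
        System.out.println(foundWithLetterSuffix("42a", "\\d+"));
    }
}
```

When a stored value is meant to be matched literally rather than as a pattern, Java offers Pattern.quote for that purpose. Whether Data Management provides an equivalent is not stated here.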

Unicode and regular expressions

Data Management's regular expressions automatically handle Unicode text, and any international text can be entered as a literal in a regular expression. However, the standard regular expression character classes \d, \s, and \w match only ASCII characters, so if you are processing international text, use the Unicode character class notation instead.

ASCII form	Unicode form
\d	\p{N}
\D	\P{N}
\s	\p{Z}
\S	\P{Z}
\w	(\p{L}|\p{N}|_)
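
The difference between the two columns can be checked with a short sketch. It uses Java's java.util.regex, whose default \d, \s, and \w are also ASCII-only, as a stand-in for the behavior described above. That Data Management's engine gives the same results for these inputs is an assumption. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not a Data Management API.
final class UnicodeFormsDemo {

    // ASCII forms: each must match the entire input.
    static boolean asciiDigits(String s) { return Pattern.matches("\\d+", s); }
    static boolean asciiSpace(String s)  { return Pattern.matches("\\s+", s); }
    static boolean asciiWord(String s)   { return Pattern.matches("\\w+", s); }

    // Unicode forms taken from the table above.
    static boolean unicodeDigits(String s) { return Pattern.matches("\\p{N}+", s); }
    static boolean unicodeSpace(String s)  { return Pattern.matches("\\p{Z}+", s); }
    static boolean unicodeWord(String s) {
        return Pattern.matches("(\\p{L}|\\p{N}|_)+", s);
    }

    public static void main(String[] args) {
        String arabicIndic = "\u0663\u0664\u0665"; // Arabic-Indic digits 3, 4, 5
        String surname = "M\u00fcller";            // contains u with diaeresis
        String nbsp = "\u00a0";                    // no-break space

        System.out.println(asciiDigits(arabicIndic) + " vs " + unicodeDigits(arabicIndic));
        System.out.println(asciiWord(surname) + " vs " + unicodeWord(surname));
        System.out.println(asciiSpace(nbsp) + " vs " + unicodeSpace(nbsp));
    }
}
```

One caveat when substituting \p{Z} for \s: \p{Z} covers Unicode separator characters but not control characters such as tab or newline, so the two are not exact equivalents.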

Several other Unicode character classes are also very useful.

Notation	Meaning
\p{L}	Letter
\p{Ll}	Lowercase letter
\p{Lu}	Uppercase letter
\p{Sc}	Currency symbol
\p{P}	Punctuation
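
As a rough illustration of the classes in this table, the sketch below extracts every match of a given class from a piece of text. It is written in Java, which accepts the same \p{...} notation. The helper findAll and the class name are hypothetical and are not Data Management functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not a Data Management API.
final class UnicodeCategoriesDemo {

    // Collects every non-overlapping match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String school = "\u00c9cole Normale";          // starts with E-acute
        String prices = "12 \u20ac, \u00a58 or $5!";   // euro sign, yen sign, dollar sign

        System.out.println(findAll("\\p{Lu}", school));   // uppercase letters
        System.out.println(findAll("\\p{Ll}+", school));  // runs of lowercase letters
        System.out.println(findAll("\\p{Sc}", prices));   // currency symbols
        System.out.println(findAll("\\p{P}", prices));    // punctuation
    }
}
```

The dollar sign belongs to \p{Sc}, not \p{P}, so the punctuation search on the price text returns only the comma and the exclamation mark.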

See Unicode Regular Expressions for additional character class descriptions.