Regex

Overview

The Regular Expression (Regex) tool can split and match text strings using regular expressions. It is useful for finding and extracting portions strings that match (or don't match) a regular expression or a sub-group in a regular expression. Some common uses of the Regex tool are.

Finding numbers

Input	Regex	Output1	Output2	Output3
123 Main St #45	\d+	123	45
Acme Software:(303)541-1515	\d+	303	541	1516

Matching patterns

Input	Regex	Output1	Output2
Acme Software:(303)541-1515	\(\d{3}\)\d{3}-\d{4}	(303)541-1515
111-22-3333/444-55-6666	\d{3}-\d{2}-\d{4}	111-22-3333	444-55-6666

Splitting words

Input	Regex	Out1	Out2	Out3	Out4	Out5	Out6	Out7	Out8
Now, is the... time for all good men?	\w+	Now	is	the	time	for	all	good	men

The Regex tool operates in several different modes. Each has its own uses, depending on what you want to achieve.

Splitting

Use splitting when you have a string, and you want to split out substrings that match the regex. Splitting can produce either "wide" output (many columns in one record) or "tall" output (one column and many records). For example, a "wide" split on the regex \d+ can produce the following.

Input	Output1	Output2	Output3
(303)541-1515	303	541	1516

Whereas a "tall" split would produce the following.

Input	Output
(303)541-1515	303
(303)541-1515	541
(303)541-1515	1516

Splitting can also be configured to produce the unmatched parts of the string.

Input	Output1	Output2	Output3
(303)541-1515	(	)	-

Or both matched and unmatched.

Input	Out1	Out2	Out3	Out4	Out5	Out6
(303)541-1515	(	303	)	541	-	1516

Extracting

Use extracting when you have a string, and you want to extract portions of the string that match sub-parts of the regex—what are known as the "capturing groups" of the regex. Data Management treats the entire regex as the first "capturing group". Subsequent capturing groups are those portions of the regex enclosed in parentheses. For example, in the regex: (ab(cd)(ef)) there are four groups. It is possible to write regexes where some groups do not capture; see the external references for more details.

Regex tool configuration parameters

The Regex tool has one set of configuration parameters in addition to the standard execution options.

Parameter	Description
Input field	The input field to process.
Regular expression	The regular expression to compare to Input field.
Case insensitive	If selected, perform case-insensitive matching.
Operation	The operation to perform upon the input field. This is optional and defaults to Split (wide output).
Output field	If Operation is Split (wide output), the base name of the output field. This is optional and defaults to `OUTPUT`.
Number of outputs	If Operation is Split (wide output), the number of output fields to generate. This is optional and defaults to 10.
Capture	If Operation is Split (tall output) or Extract (repeating), determines which data to output. This is optional and defaults to Matched.
Include empty matches when splitting	If selected, includes empty strings in output when Operation is Split (wide output) or Split (tall output).
Generate ID	If selected, sequentially numbers the output records according to the input record they came from when Operation is Split (tall output) or Extract (repeating).
ID field	If Generate ID is selected, the ID field. This is optional and defaults to ID.
Generate sequence	If selected, generates a sequence number containing the position of the output record within its group when Operation is Split (tall output) or Extract (repeating).
Sequence field	If Generate sequence is selected, the sequence field. This is optional and defaults to `SEQUENCE`.

Configure the Regex tool

Select the Regex tool.
Go to the Configuration tab on the Properties pane.
Choose an Input field.
Enter a regular expression in the Regular Expression box.
Optionally, select Case insensitive to ignore case when matching.
Choose the Operation:
- Split (wide output)
- Split (tall output)
- Extract (first)
- Extract (repeating)
Optionally, specify an output field name other than the default OUTPUT.
If Operation is Split (wide output), specify the Number of outputs. This will be the size of your "wide" output. Data Management will generate up to 10 sequentially numbered fields based on your output field name.
Optionally, select Include empty matches when splitting. Normally, empty strings are omitted from the output. This option is rarely needed.
For the Split (tall output) and Extract (repeating) operations, you can optionally generate IDs to help match the multiple output records to each input record:
- Select Generate ID and specify a field name to output the input record number, starting at one.
- Select Generate Sequence and supply a field name to number records within each ID, starting at one.
Optionally, go to the Execution tab, and then set Web service options.