Regex
The Regular Expression (Regex) tool can split and match text strings using regular expressions. It is useful for finding and extracting portions strings that match (or don't match) a regular expression or a sub-group in a regular expression. Some common uses of the Regex tool are:
Finding numbers
Input | Regex | Output1 | Output2 | Output3 |
---|---|---|---|---|
123 Main St #45 | \d+ | 123 | 45 |
|
Acme Software: (303)541-1515 | \d+ | 303 | 541 | 1516 |
Matching patterns
Input | Regex | Output1 | Output2 |
---|---|---|---|
Acme Software: (303)541-1515 | \(\d{3}\)\d{3}-\d{4} | (303)541-1515 |
|
111-22-3333/444-55-6666 | \d{3}-\d{2}-\d{4} | 111-22-3333 | 444-55-6666 |
Splitting words
Input | Regex | Out1 | Out2 | Out3 | Out4 | Out5 | Out6 | Out7 | Out8 |
---|---|---|---|---|---|---|---|---|---|
Now, is the... time for all good men? | \w+ | Now | is | the | time | for | all | good | men |
The Regex tool operates in several different modes. Each has its own uses, depending on what you want to achieve:
Splitting
Use splitting when you have a string, and you want to split out substrings that match the regex. Splitting can produce either "wide" output (many columns in one record) or "tall" output (one column and many records). For example, a "wide" split on the regex \d+ can produce:
Input | Output1 | Output2 | Output3 |
---|---|---|---|
(303)541-1515 | 303 | 541 | 1516 |
whereas a "tall" split would produce:
Input | Output |
---|---|
(303)541-1515 | 303 |
(303)541-1515 | 541 |
(303)541-1515 | 1516 |
Splitting can also be configured to produce the unmatched parts of the string:
Input | Output1 | Output2 | Output3 |
---|---|---|---|
(303)541-1515 | ( | ) | - |
... or both matched and unmatched:
Input | Out1 | Out2 | Out3 | Out4 | Out5 | Out6 |
---|---|---|---|---|---|---|
(303)541-1515 | ( | 303 | ) | 541 | - | 1516 |
Extracting
Use extracting when you have a string, and you want to extract portions of the string that match sub-parts of the regex—what are known as the "capturing groups" of the regex. Data Management treats the entire regex as the first "capturing group". Subsequent capturing groups are those portions of the regex enclosed in parentheses. For example, in the regex:
(ab(cd)(ef))
there are four groups. It is possible to write regexes where some groups do not capture; see the external references for more details.
Regex tool configuration parameters
The Regex tool has one set of configuration parameters in addition to the standard execution options:
Input field | The input field to process. |
Regular expression | The regular expression to compare to Input field. |
Case insensitive | If selected, perform case-insensitive matching. |
Operation | The operation to perform upon the input field. This is optional and defaults to Split (wide output). |
Output field | If Operation is Split (wide output), the base name of the output field. This is optional and defaults to OUTPUT. |
Number of outputs | If Operation is Split (wide output), the number of output fields to generate. This is optional and defaults to 10. |
Capture | If Operation is Split (tall output) or Extract (repeating), determines which data to output. This is optional and defaults to Matched. |
Include empty matches when splitting | If selected, includes empty strings in output when Operation is Split (wide output) or Split (tall output). |
Generate ID | If selected, sequentially numbers the output records according to the input record they came from when Operation is Split (tall output) or Extract (repeating). |
ID field | If Generate ID is selected, the ID field. This is optional and defaults to ID. |
Generate sequence | If selected, generates a sequence number containing the position of the output record within its group when Operation is Split (tall output) or Extract (repeating). |
Sequence field | If Generate sequence is selected, the sequence field. This is optional and defaults to SEQUENCE. |
Configure the Regex tool
Select the Regex tool, and then go to the Configuration tab on the Properties pane.
Choose an Input field.
Enter a regular expression in the Regular Expression box.
Optionally, select Case insensitive to ignore case when matching.
Choose the Operation:
Split (wide output)
Split (tall output)
Extract (first)
Extract (repeating)
Optionally, specify an output field name other than the default OUTPUT.
If Operation is Split (wide output), specify the Number of outputs. This will be the size of your "wide" output. Data Management will generate up to 10 sequentially numbered fields based on your output field name.
Optionally, select Include empty matches when splitting. Normally, empty strings are omitted from the output. This option is rarely needed.
For the Split (tall output) and Extract (repeating) operations, you can optionally generate IDs to help match the multiple output records to each input record:
Select Generate ID and specify a field name to output the input record number, starting at one.
Select Generate Sequence and supply a field name to number records within each ID, starting at one.
Optionally, go to the Execution tab, and then set Web service options.