Pattern Match
Overview
The Pattern Match tool is used as the third step of textual parsing. It allows you to search record groups for patterns of tokens and re-classify each token according to its position within a matched pattern. For example, a token that was assigned the NUMBER symbol in the Symbol Creation tool may be re-classified as an ORDER_QUANTITY based on its position within a larger context.
All of the tokens in a record group are analyzed together and matched against your list of symbol patterns. These are Symbolic Regular Expressions constructed using the symbols assigned in the Symbol Creation and/or Table Lookup tools. Within each pattern you can specify multiple assignment actions, which associate tokens matching a sub-part of the pattern with an output class.
A list of symbols is created from the record group and processed using the symbol patterns.
Each assignment in a symbol pattern maps to a class value. You specify assignments by preceding part of the regular expression with CLASS=, where CLASS is the name of the class you want to assign to the token. Symbols that match part of the expression not covered by an assignment are not assigned a class value.
By default, a new field named CLASS receives the class values. You can assign the class values to an existing field or define a new field name.
Examples:
Suppose you are matching email addresses, and you have defined the following symbols:
WORD: any sequence of letters.
AT: the @ sign.
DOT: a period.
In this example we've split the email address joe@mycompany.com into tokens (using the Token Creation tool) and assigned symbols to the tokens (using the Symbol Creation tool). This results in the following record group.
ID | TOKEN | SYMBOL |
---|---|---|
1 | joe | WORD |
1 | @ | AT |
1 | mycompany | WORD |
1 | . | DOT |
1 | com | WORD |
In general, matching emails will require a complex pattern, or a list of simple patterns. This example uses a single simple pattern: WORD AT WORD DOT WORD.
To match the full spectrum of email addresses, you would need to extend this pattern.
To extract the login and domain portions of email addresses that match this pattern, add assignments to associate them with classes.
Grouping parentheses surround the part of the pattern that describes the domain name, ensuring that all DOMAIN symbols are included in the assignment: LOGIN=WORD AT DOMAIN=(WORD DOT WORD).
This would result in the following output record set.
ID | TOKEN | SYMBOL | CLASS |
---|---|---|---|
1 | joe | WORD | LOGIN |
1 | @ | AT | |
1 | mycompany | WORD | DOMAIN |
1 | . | DOT | DOMAIN |
1 | com | WORD | DOMAIN |
The @ symbol is not assigned any class, because it was not associated with an assignment.
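The effect of the assignments can be imitated outside the tool. The sketch below is not the tool's implementation: it uses Python's standard `re` module, and the pattern is translated by hand into named groups (an assumption made for illustration). Each symbol is written with a trailing space so that group boundaries line up with token boundaries.

```python
import re

# Tokens and symbols for one record group (the email example above).
tokens = ["joe", "@", "mycompany", ".", "com"]
symbols = ["WORD", "AT", "WORD", "DOT", "WORD"]

# Hand-translated stand-in for: LOGIN=WORD AT DOMAIN=(WORD DOT WORD)
# Every symbol is followed by one space, so spans align with whole symbols.
PATTERN = re.compile(r"(?P<LOGIN>WORD )AT (?P<DOMAIN>WORD DOT WORD )")


def classify(symbols, pattern):
    """Return one class name (or "") per symbol, based on named-group spans."""
    text = "".join(s + " " for s in symbols)
    classes = [""] * len(symbols)
    match = pattern.fullmatch(text)
    if match is None:
        return classes  # no match: nothing is classified
    for name in pattern.groupindex:
        start, end = match.span(name)
        if start == -1:
            continue  # group did not participate in the match
        first = text[:start].count(" ")  # index of the first symbol in the group
        last = text[:end].count(" ")     # one past the last symbol in the group
        for i in range(first, last):
            classes[i] = name
    return classes


classes = classify(symbols, PATTERN)
for token, symbol, cls in zip(tokens, symbols, classes):
    print(token, symbol, cls)
```

The AT symbol falls outside both named groups, so its class stays empty, mirroring the output table above.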
When you tackle more realistic situations, you may find that in some circumstances you want to treat a token as a "special case" and at other times as a more general case. For example, suppose that you have a record consisting of an account number, followed by a name, followed by a phone number: 654 John Smith (123)456-7890.
The account number may have a varying number of digits. Because the number of digits is important when matching a phone number, you want to distinguish numeric tokens based on the number of digits. You might assign symbols to the tokens as follows.
ID | TOKEN | SYMBOL |
---|---|---|
1 | 654 | NUM3 |
1 | John | WORD |
1 | Smith | WORD |
1 | ( | LPAREN |
1 | 123 | NUM3 |
1 | ) | RPAREN |
1 | 456 | NUM3 |
1 | - | DASH |
1 | 7890 | NUM4 |
A pattern matching this symbol sequence is: NUM3 WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4.
However, this pattern will only match an account number with three digits. Assuming that you have defined the NUMBER symbol to signify 1-2 digits and 5+ digits, the following pattern will match account numbers with a variable number of digits: (NUMBER|NUM3|NUM4) WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4.
But this strategy quickly becomes tedious when the number of symbol variants is large and used in multiple places in the pattern. To manage this situation, use Symbol Sets.
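To make the digit-count distinction concrete, the following sketch assigns symbols by token length and then tests the alternation pattern. It is an illustration in plain Python, not the Symbol Creation or Pattern Match tools themselves; the `symbol_for` helper and the OTHER fallback symbol are hypothetical names introduced here.

```python
import re


def symbol_for(token):
    """Assign a symbol to a token; numeric tokens are split by digit count."""
    if token.isdigit():
        # NUM3 and NUM4 are the special cases; NUMBER covers every other length.
        return {3: "NUM3", 4: "NUM4"}.get(len(token), "NUMBER")
    if token.isalpha():
        return "WORD"
    # OTHER is a hypothetical catch-all for tokens not covered above.
    return {"(": "LPAREN", ")": "RPAREN", "-": "DASH"}.get(token, "OTHER")


# Hand-translated stand-in for:
# (NUMBER|NUM3|NUM4) WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4
PATTERN = re.compile(
    r"(?:NUMBER |NUM3 |NUM4 )(?:WORD ){2}LPAREN NUM3 RPAREN NUM3 DASH NUM4 "
)


def matches(tokens):
    """True if the whole symbol sequence for these tokens matches PATTERN."""
    text = "".join(symbol_for(t) + " " for t in tokens)
    return PATTERN.fullmatch(text) is not None


phone = ["(", "123", ")", "456", "-", "7890"]
print(matches(["654", "John", "Smith"] + phone))    # three-digit account number
print(matches(["98765", "John", "Smith"] + phone))  # five-digit account number
```

Both calls succeed because the first position accepts any of the three numeric symbols, while the phone number positions still require exact digit counts.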
The Pattern Analyzer in the Pattern Match tool can discover symbol sequences in the input data and suggest patterns that match these symbol sequences.
Pattern Match tool configuration parameters
The Pattern Match tool has three sets of configuration parameters in addition to the standard execution options.
Patterns
Parameter | Description |
---|---|
Input group | Field that uniquely identifies each original record, as specified in the upstream Token Creation tool. |
Input token | Field containing the tokens previously assigned in the Token Creation tool. |
Input symbol | Field containing the symbols previously assigned in the Symbol Creation and/or Table Lookup tools. |
Output class | If specified, text field in the output record that will receive the class values. This is optional and defaults to a new field named CLASS. |
Perform pattern analysis | If selected, performs frequency analysis to discover patterns in the input data. |
Output matched pattern index | If selected, outputs the index numbers corresponding to the patterns specified in the Pattern grid. |
Output all fields | If selected, sends all input fields to output. |
Specify patterns | Patterns (Symbolic Regular Expressions) and their associated names. |
Symbol sets
Parameter | Description |
---|---|
Name | Identifier to replace in any pattern in which it is found. |
Symbol | Replacement symbol set to be substituted for Name. |
Library
Parameter | Description |
---|---|
Name | Identifier to replace in any pattern in which it is found. |
Symbol | Replacement sub-pattern to be substituted for Name. |
Configure the Pattern Match tool
Select the Pattern Match tool.
Go to the Patterns tab on the Properties pane.
Select Input group and choose the unique ID field you specified in the Token Creation tool.
Select Input token and choose the field containing the token values you want to match. These are the tokens previously assigned in the Token Creation tool.
Select Input symbol and choose the field containing the symbol values you want to match. These are the symbols previously assigned in the Symbol Creation and/or Table Lookup tools.
Optionally, select Output class and specify an existing or new field to receive the generated class values. By default, a new field named CLASS receives class values.
Optionally, select Perform pattern analysis to discover symbol sequences in the input data and suggest patterns that match these symbol sequences. See Pattern Analysis for details.
Optionally, select Output matched pattern index to output the index numbers corresponding to the patterns specified in the Pattern grid. This can aid in debugging pattern matching.
Optionally, select Output all fields to send all input fields to output. If it is cleared, only the Group, Token, and Class fields are output. Normally these additional fields are not needed, and you should clear this option to avoid processing extra data.
Select the Pattern grid and enter patterns and their associated names (descriptive text values). Each pattern must be a Symbolic Regular Expression composed of one or more top-level sub-expressions representing patterns of input symbols.
Optionally, select the Symbol Sets tab and define commonly used sets of symbols so that you can refer to the entire set by name in the patterns.
Optionally, select the Library tab and define commonly used sub-patterns so that you can refer to them by name in the patterns.
Optionally, go to the Execution tab, and then set Web service options.
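The Output matched pattern index option can be pictured with a small sketch. It assumes that patterns are tried in the order listed and that the index of the first matching pattern is reported, counting from zero; this page does not state how the tool numbers patterns or resolves several matching patterns, so treat both as assumptions. The pattern names and the `matched_pattern_index` helper are hypothetical.

```python
import re

# Hypothetical pattern grid: (name, hand-translated regex over "SYMBOL " units).
PATTERNS = [
    ("email", re.compile(r"WORD AT WORD DOT WORD ")),
    ("phone", re.compile(r"LPAREN NUM3 RPAREN NUM3 DASH NUM4 ")),
]


def matched_pattern_index(symbols, patterns=PATTERNS):
    """Return the position of the first pattern matching the whole group, or None."""
    text = "".join(s + " " for s in symbols)
    for index, (_name, regex) in enumerate(patterns):
        if regex.fullmatch(text):
            return index
    return None


print(matched_pattern_index(["WORD", "AT", "WORD", "DOT", "WORD"]))
print(matched_pattern_index(["LPAREN", "NUM3", "RPAREN", "NUM3", "DASH", "NUM4"]))
print(matched_pattern_index(["WORD", "WORD"]))
```

Seeing which pattern claimed a record group is what makes the index useful for debugging a long pattern list.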
Symbol sets
Symbol sets let you represent any number of symbols with a single name. You define symbol sets on the Symbol Sets tab of the Pattern Match tool.
For example, suppose that you have a record consisting of an account number, followed by a name, followed by a phone number: 654 John Smith (123)456-7890.
The account number may have a varying number of digits. Because the number of digits is important when matching a phone number, you want to distinguish numeric tokens based on the number of digits. You could assign symbols to the tokens as follows.
ID | TOKEN | SYMBOL |
---|---|---|
1 | 654 | NUM3 |
1 | John | WORD |
1 | Smith | WORD |
1 | ( | LPAREN |
1 | 123 | NUM3 |
1 | ) | RPAREN |
1 | 456 | NUM3 |
1 | - | DASH |
1 | 7890 | NUM4 |
A pattern matching this symbol sequence is: NUM3 WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4.
However, this pattern will only match an account number with three digits. Assuming that you have defined the NUMBER symbol to signify 1-2 digits and 5+ digits, the following pattern will match account numbers with a variable number of digits: (NUMBER|NUM3|NUM4) WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4.
But this strategy quickly becomes tedious when the number of symbol variants is large and used in multiple places in the pattern. Instead, create Symbol Sets. In this example, you could configure the first symbol set as follows.
NAME | SYMBOL1 | SYMBOL2 | SYMBOL3 | SYMBOL4 |
---|---|---|---|---|
ANYNUMBER | NUM3 | NUM4 | NUMBER | |
Then you could rewrite the pattern as: ANYNUMBER WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4.
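One way to think about a symbol set is as a textual substitution applied to the pattern before matching. The sketch below shows that idea in plain Python; how the tool resolves symbol sets internally is not described here, and the `expand_symbol_sets` helper is a hypothetical name.

```python
import re

# The symbol set from the table above.
SYMBOL_SETS = {"ANYNUMBER": ["NUM3", "NUM4", "NUMBER"]}


def expand_symbol_sets(pattern, symbol_sets):
    """Replace each symbol set name in a pattern with an alternation of its members."""

    def replace(match):
        name = match.group(0)
        if name in symbol_sets:
            return "(" + "|".join(symbol_sets[name]) + ")"
        return name  # ordinary symbol: leave unchanged

    # Identifiers start with a letter or underscore, so quantifiers such as {2} are skipped.
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", replace, pattern)


expanded = expand_symbol_sets(
    "ANYNUMBER WORD{2} LPAREN NUM3 RPAREN NUM3 DASH NUM4", SYMBOL_SETS
)
print(expanded)
```

The expanded text is equivalent to the hand-written alternation shown earlier, with the set members listed in the order they were defined.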
Create symbol sets
Select the Pattern Match tool.
Go to the Symbol Sets tab.
In the Symbol set list, type the name of the first symbol set you wish to define. Define the symbols to associate with that name by typing them into the Symbols list.
Repeat the previous step to create additional symbol sets.
Pattern analysis
When working with the pattern tools to parse new data, often you won't know exactly what patterns are found within the data until you perform some analysis. Although the pattern-matching language is very powerful and expressive, you cannot use the language effectively until you know what is required by the data itself.
The Pattern Match tool operates on symbols assigned by the Symbol Creation or Table Lookup tools. The Pattern Match tool's Analysis module analyzes the patterns of symbol sequences found in the input data and the frequency of occurrence of those sequences. Often pattern analysis will reveal unexpected (and previously unknown) data elements. You can create a pattern expression directly from each symbol sequence reported on the Analysis tab.
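The frequency analysis described above amounts to grouping symbols by record and counting how often each sequence occurs. A minimal sketch, assuming rows arrive in token order as (ID, TOKEN, SYMBOL) tuples; the sample rows for groups 2 and 3 are made up for illustration, and `symbol_sequence_counts` is a hypothetical helper, not part of the tool.

```python
from collections import Counter, defaultdict

# Sample input rows: (ID, TOKEN, SYMBOL). Groups 2 and 3 are invented examples.
rows = [
    (1, "joe", "WORD"), (1, "@", "AT"), (1, "mycompany", "WORD"),
    (1, ".", "DOT"), (1, "com", "WORD"),
    (2, "ann", "WORD"), (2, "@", "AT"), (2, "example", "WORD"),
    (2, ".", "DOT"), (2, "org", "WORD"),
    (3, "bob", "WORD"), (3, "@", "AT"), (3, "localhost", "WORD"),
]


def symbol_sequence_counts(rows):
    """Count how many record groups share each symbol sequence."""
    groups = defaultdict(list)
    for record_id, _token, symbol in rows:
        groups[record_id].append(symbol)
    return Counter(" ".join(symbols) for symbols in groups.values())


counts = symbol_sequence_counts(rows)
for sequence, count in counts.most_common():
    print(count, sequence)
```

Listing the most frequent sequences first shows which patterns would cover the largest share of records.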
Create patterns
The Pattern Analysis tab of the Pattern Match tool assists you in creating patterns that match the discovered symbol sequences in the input data.
Before you can use the results of pattern analysis, you must enable it.
To enable the Pattern Analysis tab
Select the Pattern Match tool.
Go to the Patterns tab on the Properties pane.
Select Perform pattern analysis.
Select Commit.
Run the project.
After execution has completed, select the Pattern Match tool, and then select the Analysis tab on the Properties pane.
For performance reasons, you should clear Perform pattern analysis once your project is finalized for production.
To create a pattern using the Analysis tab
Select Show and choose Unmatched. Unmatched symbol sequences are those that are not matched by existing patterns; you must create additional patterns to match them. Matched symbol sequences are those that are already matched.
In the Symbol sequences grid, select a symbol sequence by clicking on its row. A sample of records matching the selected symbol sequence appears in the box below.
Browse the sample records to make sure that all tokens in each sample record correspond to the correct symbols. You may find that you must create additional symbols using the Symbol Creation or Table Lookup tools in order to precisely represent the data patterns.
The Pattern creation helper grid displays symbols listed in the Symbols column and sample record tokens in the SampleN columns. If you're satisfied with the symbol/token correspondence, select the Class column in the Pattern creation helper and type the class names you wish to assign.
Select Create Pattern. The new pattern appears in the Pattern grid on the Patterns tab.
Repeat steps 2 through 5 for each unmatched symbol sequence that you want to match.
In the Symbol sequences grid, start with the top rows (most frequent symbol sequences) and work your way down.
Often you must repeat the sequence of running the project for analysis and creating patterns several times in order to capture all sequences. This is especially true if you change any upstream Token Creation, Symbol Creation, or Table Lookup tools as a result of your analysis.
Think of this process as iterative. The Pattern Match tool will give you information about the input data, and that information may cause you to re-think your configuration for Token Creation or Symbol Creation. Repeating the cycle helps you define an accurate "model" for the data, in which a very high percentage of records can be parsed correctly.
Pattern libraries
The Library tab lets you associate sub-patterns with a name. It is more general than the Symbol sets tab, because it manages entire sub-expressions, not just sets of symbols. For example, if you were writing a custom parser for addresses, you might want to define a sub-pattern to match address numbers including fractions. You could define ADDRESS_NUMBER as: ( LETTER | NUM | NUM SLASH NUM | NUM LETTER SLASH NUM ).
Then you could use ADDRESS_NUMBER in other patterns by adding a # prefix. One such address pattern might be: #ADDRESS_NUMBER DIRECTIONAL? WORD+ DIRECTIONAL? SUFFIX?
The library is useful for keeping complex lists of patterns more concise and readable, and avoiding unintended divergence among copies of sub-patterns.
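Library references can likewise be read as substitution of a named sub-pattern wherever its #-prefixed name appears. The following sketch illustrates that reading in plain Python; it is not the tool's own expansion logic, and the `expand_library` helper is a hypothetical name.

```python
import re

# The library entry from the example above.
LIBRARY = {
    "ADDRESS_NUMBER": "( LETTER | NUM | NUM SLASH NUM | NUM LETTER SLASH NUM )",
}


def expand_library(pattern, library):
    """Replace each #NAME reference with the sub-pattern registered under NAME."""

    def replace(match):
        # Raises KeyError if a pattern refers to an undefined library name.
        return library[match.group(1)]

    return re.sub(r"#([A-Za-z_][A-Za-z0-9_]*)", replace, pattern)


expanded = expand_library(
    "#ADDRESS_NUMBER DIRECTIONAL? WORD+ DIRECTIONAL? SUFFIX?", LIBRARY
)
print(expanded)
```

Keeping the sub-pattern in one place means a correction to ADDRESS_NUMBER reaches every pattern that references it.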
Create a pattern library
Select the Pattern Match tool.
Go to the Library tab.
Enter the Name of the first sub-pattern you wish to define. Define the sub-pattern to associate with that Name by typing it into the Pattern column.
Repeat the previous step to create additional sub-patterns.