Matching with the AOs
Overview
Data Management's Advanced Objects (AOs) are optimized for matching specific entity types, and include tools for fine-tuning the matching process and outputs.
Know your data
The Advanced Objects incorporate many best practices, but they aren't magic. Highly variant data can cause a matching process to run slowly, produce inaccurate results, or both. To prevent problems, you should perform some data analysis, or profiling, before creating a matching process.
Fields containing non-address data, such as phone numbers or email addresses, often have default values embedded in the data. Matching processes typically filter these values out before performing matching. For example, when matching phone numbers, the AOs automatically filter out values whose digits are all the same (`0000000000`, `7777777777`). However, suppose your default value is `9876543210`. This looks like a real phone number and thus is not suppressed, but if you have 10,000 of these in a file, matching time will increase dramatically.
You can get a feel for whether you have lots of "big segments" or invalid values you don't want to match by applying AO Match Segment Analyze to fields you want to use for segmentation and looking at the top 10-25 values. If you find problems, you have two options:
Modify the data in place
Make a copy of the data, remove the "junk," and then use the new field for the matching process
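This kind of profiling amounts to counting the most frequent values in a field. A minimal sketch in Python (the sample values here are hypothetical; in practice you would load the real column from your input file):

```python
from collections import Counter

# Hypothetical sample of a PHONE field.
phones = (
    ["3035551234", "7205559876"]      # plausible real numbers
    + ["9876543210"] * 10_000         # an embedded default value
    + ["0000000000"] * 500            # an all-same filler value
)

# Profile the field: default values dominate the top of the list,
# which is the signal to filter them out before matching.
for value, count in Counter(phones).most_common(3):
    print(value, count)
```

A default value that "looks real" will still stand out by its sheer frequency, which is exactly what the top-10-to-25 review is meant to catch.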
For example, if you found a large number of phone numbers with a valid area code but the same value for the group and exchange (such as `303-555-5555`), you would implement a rule to ensure that this "bad" data is not used in the matching process. A simple way to do this is:
1. Using a Calculate tool, create a new `TEMP_PHONE` field the same size as the original one.
2. Add a calculation of the form: `If right(PHONE,8)="555-5555" then "" else PHONE endif`.
3. Use the `TEMP_PHONE` field in your matching process instead of the `PHONE` field.
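The Calculate rule above can be sketched in Python, assuming phone values are stored as dash-formatted strings:

```python
def temp_phone(phone: str) -> str:
    """Python sketch of the Calculate rule:
    If right(PHONE,8)="555-5555" then "" else PHONE endif
    i.e. blank out numbers whose last eight characters are "555-5555".
    """
    return "" if phone.endswith("555-5555") else phone

print(temp_phone("303-555-5555"))  # suppressed: ""
print(temp_phone("303-555-1234"))  # kept: "303-555-1234"
```

Blank values are then ignored by the matching process, so the "bad" numbers never generate spurious matches.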
Segmentation
These are guidelines for optimizing segmentation settings on all matching AOs except AO Business Name-Only Match.
In matching, segmentation means comparing two records only if some specified value matches exactly. For example, if you have a set of new records from Colorado that you want to match against a national data set using the address field, you can improve the efficiency of the matching process by comparing only records from Colorado. If the new records are all from Boulder, you would remove all non-Boulder records from the national data set prior to matching. Reducing the number of candidate comparisons greatly reduces processing time.
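The idea can be sketched in Python; the toy records and `(state, address)` layout are assumptions for illustration only:

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (state, address).
records = [
    ("CO", "100 Main St"),
    ("CO", "100 Main Street"),
    ("NY", "100 Main St"),
    ("CO", "42 Oak Ave"),
]

# Segment on state: only records sharing the exact segment value
# are ever compared with each other.
segments = defaultdict(list)
for state, address in records:
    segments[state].append(address)

pairs = [pair for seg in segments.values() for pair in combinations(seg, 2)]
print(len(pairs))  # 3 within-segment pairs instead of 6 all-pairs
```

Even in this tiny example, segmentation halves the number of comparisons; the savings grow quadratically with data volume.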
How do you decide what data to use for segmentation? For address-based matching, there are four pre-defined segmentation strategies:
ZIP
ZIP + Part of Street Name
ZIP + Part/All of Street Number
ZIP + Part of Street Name + Part/All of Street Number
To decide which strategy to use, ask yourself the following questions:
What data must match for a record to be considered a match candidate?
Have I standardized my data?
Can I afford to take a little more time to get more "fuzzy" matches?
Your choice will depend on what types of variations you want to find. For example, if the street number must match exactly, use some or all of it for part of the segment. If the street name is of great importance, make part of the name a reason not to compare records. If both are required to match exactly, then use a combination of the two. If you want to define a segment in some other way, use a Calculate or Splice tool to create it as a new field and then select the custom field segment option to use it.
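The four strategies can be pictured as key functions, one per strategy. This is an illustrative sketch only: the field names (`zip_code`, `street_name`, `street_number`) and the three-character prefix length are assumptions, and the AOs build their keys internally:

```python
# One key function per pre-defined segmentation strategy.
def seg_zip(r):
    return r["zip_code"]

def seg_zip_name(r, n=3):
    return r["zip_code"] + r["street_name"][:n].upper()

def seg_zip_number(r):
    return r["zip_code"] + r["street_number"]

def seg_zip_name_number(r, n=3):
    return r["zip_code"] + r["street_name"][:n].upper() + r["street_number"]

rec = {"zip_code": "80301", "street_name": "Pearl", "street_number": "1777"}
print(seg_zip_name(rec))  # "80301PEA"
```

Records are only compared when their keys are identical, which is why standardization matters: "Pearl" and "PEARL ST" must be normalized first or they land in different segments.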
How much does segmentation affect processing time? The following example is for illustration purposes and does not reflect actual benchmarks.
Number of rows: 17,460,659
Option 1: Segmenting by ZIP and the first three bytes of the street name
| Metric | Value |
|---|---|
| Total number of segments | 2,660,143 |
| Average size per segment | 7 |
| Number of segments >1,000 rows | 4 |
| Total processing time | 14:54 |
Option 2: Segmenting by ZIP only
| Metric | Value |
|---|---|
| Total number of segments | 42,380 |
| Average size per segment | 412 |
| Number of segments >1,000 rows | 6,016 |
| Number of segments >2,000 rows | 2,155 |
| Number of segments >4,000 rows | 123 |
| Total processing time | 5:55:18 |
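A rough back-of-envelope calculation shows why Option 2 is so much slower. Assuming a naive all-pairs comparison of s×(s−1)/2 inside each segment (the real matcher is more sophisticated, so these are not actual comparison counts):

```python
def pair_count(avg_size: float, num_segments: int) -> float:
    # Within-segment comparisons under a naive all-pairs model:
    # each segment of size s costs s*(s-1)/2 comparisons.
    return num_segments * avg_size * (avg_size - 1) / 2

opt1 = pair_count(7, 2_660_143)   # ZIP + first 3 bytes of street name
opt2 = pair_count(412, 42_380)    # ZIP only
print(f"{opt1:,.0f} vs {opt2:,.0f} (~{opt2 / opt1:.0f}x more work)")
```

Because the cost inside a segment grows with the square of its size, many small segments beat a few large ones even when the total row count is identical.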
Tuning
In matching, there is a trade-off between quality and quantity: more matches or better matches? You can use match score settings to balance these two goals. The higher the match score number (0-100), the more precisely the records need to match. If you're looking for more matches, set lower match scores. A little trial and error can help you find the optimal settings.
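The trade-off is easy to see with hypothetical pair scores; the scores and thresholds below are invented for illustration:

```python
# Hypothetical pair scores from a matching run (0-100 scale).
scores = [98, 93, 87, 84, 76, 61]

loose_threshold, tight_threshold = 80, 90

loose_matches = [s for s in scores if s >= loose_threshold]
tight_matches = [s for s in scores if s >= tight_threshold]

print(len(loose_matches), len(tight_matches))  # 4 2
```

The loose setting accepts more pairs (quantity), while the tight setting accepts only high-confidence pairs (quality); tuning is the process of finding where the extra loose matches stop being true duplicates.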
Example: Tuning a consumer data matching process
Create a new Data Management project in which data is first selected, prioritized, and standardized, and then fed into the first AO Consumer Match process. Set the Match scores to a loose configuration. (There is no "correct" way to configure the low or high end of the settings; we've assigned descriptors such as Medium and Tight, with corresponding values, to make tuning decisions simpler.) In the Reporting section, turn on the Output match info option. Finally, attach a Select tool to label all the ID-related information as results of the "Loose" matching process. Connect this to a second AO Consumer Match, and set its Match scores to a tight configuration. As with the loose matching, attach a Select tool to tag the ID-related data as "Tight."
If looser/tighter match thresholds are the only difference between the two match processes, you can insert a Filter tool before the second AO Consumer Match to filter out unmatched records before the second matching process.
In the final step, the second matching process is attached to AO Match Group Differences. The "Loose" and "Tight" fields passed through from the matching processes are specified as the First ID and Second ID parameters. Set Same or different counts? to Differences, and sort the output. Attach a DLD Output tool to receive the results.
Copy the matching and match group differences elements into a new project. Attach the DLD file produced by the initial project as input, and connect a Select tool to it. Remove all ID-related fields. Replace the DLD Output tool with a Data Viewer tool, and then run the project. Examine the results in the Data View tab, where the matches appear in sorted order. As you review the data, determine whether you are missing duplicates or getting spurious matches. If you are missing dupes, loosen the match thresholds on the second AO Match. If you are getting matches you should not be getting, raise the match thresholds on the first AO Match. Rerun the project. Once you are satisfied with the results, copy the optimized AO Match and paste it into your main project. Remove the two existing match processes (and their Select tools) and the AO Match Group Differences. Connect the optimized AO Match to the input and output streams.
Tuning other matching processes
AO Business Name-Only Match and the word-by-word algorithm in AO Match Building Block use a different system to structure matching. Instead of a "Loose to Tight" matching scale, these AOs let you weight or adjust the importance of variation in name component values. The parameters Initials, Acronyms, Abbreviations, and Missing Words accommodate more variation as the numbers get higher. Treat as Same is assigned a weight of 100; Treat as Very Similar a weight of 85, and so on down to Treat as Different (0). These settings indicate how much "help" variant values get during scoring.
The following example shows how Abbreviations match scores vary with different Adjustment values.
| Record 1 | Record 2 | Adjustment Value | Match Score |
|---|---|---|---|
|  |  | 0 | 80 |
|  |  | 0 | 70 |
|  |  | 50 | 90 |
|  |  | 50 | 83 |
|  |  | 75 | 95 |
|  |  | 75 | 91 |
|  |  | 100 | 100 |
|  |  | 100 | 100 |
Record 2 is an abbreviation of Record 1. As we increase the Adjustment value, we are "taking back" some of the score lost due to the variation. In a sense, with an Adjustment value of 50, we take back 50% of the match score lost due to variation.
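For the table rows with a base score of 80, this "take back" behavior can be modeled as recovering a fraction of the lost score. This is a rough model, not the AO's actual scoring formula; the rows with a base score of 70 deviate slightly from it:

```python
def adjusted_score(base: float, adjustment: float) -> float:
    # Recover `adjustment` percent of the score lost to variation.
    return base + (adjustment / 100) * (100 - base)

print(adjusted_score(80, 0))    # 80.0
print(adjusted_score(80, 50))   # 90.0
print(adjusted_score(80, 75))   # 95.0
print(adjusted_score(80, 100))  # 100.0
```

At an Adjustment value of 100 the variation is forgiven entirely and the pair scores a perfect match, which matches the last two rows of the table.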