
Fuzzy matching and deduplication


The projects in the repository folder Samples\Basic\Fuzzy matching and deduplication demonstrate techniques for:

Aggregating data from removed duplicates

The example aggregate duplicates matches records on PHONE, removes duplicates, and aggregates some information from the removed records into the preserved records. For each unique record preserved, it also stores the number of duplicates that were rejected in the field DUP_COUNT, and stores the FIRSTNAME and LASTNAME of the first duplicate record in the fields FIRST_DUP and LAST_DUP.

This task is often a necessary step in deduplication because frequently you want to preserve some information from the discarded records. This may be aggregate information (as in the DUP_COUNT field) or single-field information (as in the FIRST_DUP field). Note that this technique can be combined with fuzzy-deduplication logic as well.
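The aggregation step can be sketched as follows. This is a minimal illustration in Python, not the sample project itself: records are assumed to be plain dicts, and only the field names (PHONE, DUP_COUNT, FIRST_DUP, LAST_DUP) follow the sample's conventions.

```python
def dedup_with_aggregation(records):
    """Deduplicate on PHONE, keeping the first record per phone number.

    Each survivor is annotated with DUP_COUNT (how many records were
    discarded) and FIRST_DUP / LAST_DUP (name fields carried over from
    the first discarded record), mirroring the sample's output fields.
    """
    survivors = {}  # PHONE -> surviving record (insertion-ordered)
    for rec in records:
        key = rec["PHONE"]
        if key not in survivors:
            survivors[key] = dict(rec, DUP_COUNT=0, FIRST_DUP="", LAST_DUP="")
        else:
            kept = survivors[key]
            kept["DUP_COUNT"] += 1
            if kept["DUP_COUNT"] == 1:  # first duplicate seen for this key
                kept["FIRST_DUP"] = rec["FIRSTNAME"]
                kept["LAST_DUP"] = rec["LASTNAME"]
    return list(survivors.values())
```

The same pattern extends to other aggregates (sums, min/max dates, and so on) by updating more fields on the surviving record.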

Using address parsing to improve deduplication results

The sample project dedup addr parsing builds on the "core" deduplication logic, augmenting it with address parsing. Address parsing is not address standardization. Instead of correcting addresses in accordance with the USPS database, Data Management normalizes certain common variants in the address, parses the address into its component parts, and generates one or more plausible candidate versions of an address. It generates multiple candidates because there is ambiguity in an address that cannot be resolved without referring to the USPS database (and sometimes not even then). However, multiple candidates are perfect for matching projects, because they give the same record multiple opportunities to match against other records—one opportunity for each candidate.

Address parsing is useful when you have very dirty addresses, or when you are performing matches based solely on the address.
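The candidate-generation idea can be illustrated with a toy sketch. This is not Data Management's parser: real address parsing handles far more variants and ambiguities. The abbreviation table and the directional-token ambiguity below are illustrative assumptions.

```python
def address_candidates(address):
    """Generate plausible normalized candidates for an address string.

    Normalizes a few common variants (STREET -> ST, NORTH -> N, ...),
    then emits one candidate per ambiguous interpretation, so a record
    gets several chances to match during comparison.
    """
    abbrev = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD",
              "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
    tokens = [abbrev.get(t, t)
              for t in address.upper().replace(".", "").split()]
    candidates = {" ".join(tokens)}
    # Ambiguity: a trailing directional may be a post-directional or part
    # of the street name; emit a candidate with it dropped as well.
    if tokens and tokens[-1] in {"N", "S", "E", "W"}:
        candidates.add(" ".join(tokens[:-1]))
    return sorted(candidates)
```

A record carrying "123 Main Street West" would then be compared under both "123 MAIN ST W" and "123 MAIN ST", matching counterparts that recorded the address either way.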

Consolidating child records when parent records are removed

The sample project dedup consolidate children builds on the "core" deduplication logic, extending it by consolidating the child records of the duplicates identified. This example uses the customer.csv table and the associated cust_sales.csv table. The two tables are linked by the CUSTID field. We discard duplicates from the customer.csv table and link the cust_sales.csv records of the discarded customers to the surviving customer records.

Note that for simplicity we only use one sort/compare pass in this example. Normally, you would use multiple passes to improve match quality. See the deduplicate example for more on the use of multiple passes.
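The relinking step can be sketched as follows, assuming dict records and an exact match on PHONE for brevity (the sample uses fuzzy comparison). The CUSTID/PHONE field names mirror the sample's tables; the logic is illustrative.

```python
def consolidate_children(customers, sales):
    """Deduplicate customers, then relink child sales rows.

    Each discarded customer's CUSTID is mapped to the surviving
    customer's CUSTID, and the cust_sales rows are rewritten to point
    at the survivor so no child record is orphaned.
    """
    survivor_for = {}  # CUSTID -> surviving CUSTID
    seen_phone = {}    # PHONE  -> surviving CUSTID
    kept = []
    for cust in customers:
        phone = cust["PHONE"]
        if phone in seen_phone:
            survivor_for[cust["CUSTID"]] = seen_phone[phone]  # discarded
        else:
            seen_phone[phone] = cust["CUSTID"]
            survivor_for[cust["CUSTID"]] = cust["CUSTID"]
            kept.append(cust)
    relinked = [dict(row, CUSTID=survivor_for[row["CUSTID"]])
                for row in sales]
    return kept, relinked
```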

Basic deduplication with "fuzzy matching"

The sample project deduplicate demonstrates the core concepts of fuzzy-logic duplicate-detection projects. The Data Management approach to duplicate-detection involves the following steps:

  1. Mark each record with a unique ID, and select fields for comparison.

  2. Sort and compare the records using multiple passes. Each pass sorts the records so as to put likely duplicates next to each other, then performs a sliding-window fuzzy comparison down the list of sorted records. Each sort/compare pass produces a table of ID pairs, one pair for each record match detected.

  3. Combine all ID pairs from the sort/compare passes, and turn them into ID groups.

  4. Join the ID groups to the original data. This step may be omitted if you just want a list of IDs of the duplicate records.
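Steps 2 and 3 can be sketched in Python. This is illustrative, not the sample project: the fuzzy score here is `difflib.SequenceMatcher`, standing in for Data Management's comparison functions, and the window size and threshold are arbitrary.

```python
from difflib import SequenceMatcher

def window_pass(records, sort_key, field, threshold=0.85, window=5):
    """One sort/compare pass (step 2): sort the records, then compare
    each record against the next `window` records with a fuzzy score,
    emitting an (ID, ID) pair for each match detected."""
    rows = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(rows):
        for other in rows[i + 1 : i + 1 + window]:
            sim = SequenceMatcher(None, rec[field], other[field]).ratio()
            if sim >= threshold:
                pairs.append((rec["ID"], other["ID"]))
    return pairs

def group_ids(pairs):
    """Turn the combined ID pairs from all passes into ID groups
    (step 3), using union-find with path halving."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=min)
```

Because pairs from different passes are unioned together, a record matched under a name sort in one pass and an address sort in another still lands in a single group.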

Deduplication with confidence factor

The sample project deduplicate confidence differentiates matches by the quality (or tightness) of the match. It splits the sort/compare phases into "high" and "low" quality comparisons. After the match pairs have been grouped, the groups are augmented by a CONFIDENCE field, which is set to HIGH for those records that matched using tighter match thresholds.
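The tight/loose split can be illustrated with a simple sketch. Note the simplifications: this compares every pair brute-force rather than in separate sort/compare passes, uses `difflib` as a stand-in fuzzy score, and the thresholds are arbitrary.

```python
from difflib import SequenceMatcher

def confidence_pairs(records, field="NAME", high=0.95, low=0.80):
    """Label each detected match HIGH or LOW by comparison tightness.

    Pairs at or above the `high` threshold get CONFIDENCE = HIGH;
    pairs between `low` and `high` get CONFIDENCE = LOW."""
    out = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            sim = SequenceMatcher(None, a[field], b[field]).ratio()
            if sim >= high:
                out.append((a["ID"], b["ID"], "HIGH"))
            elif sim >= low:
                out.append((a["ID"], b["ID"], "LOW"))
    return out
```

Downstream, a HIGH group might be merged automatically while a LOW group is routed for review.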

Householding multiple tables using "fuzzy-matching"

The sample project matching demonstrates fuzzy-matching between multiple databases, a process sometimes termed "householding". A common use of inter-table matching is appending information from a reference database to a customer or prospect list. The augmented list can then be used for targeted marketing. Reference databases for consumers are available from companies such as Experian and Trans Union. For businesses, the reference databases may be from InfoUSA or Dun & Bradstreet.

Fuzzy-matching is very similar to deduplication, except that the project seeks matches between tables instead of within a single table.

Once the matches are identified, you would normally perform additional processing. This example simply appends key fields to both tables. These new key fields define matching relationships between the tables that can be used for querying or further processing.
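Appending the key fields can be sketched as follows. For brevity this matches on an exact PHONE value rather than fuzzy comparison, and the MATCH_KEY field name is an illustrative assumption.

```python
def append_match_keys(left, right, key="PHONE"):
    """Append a shared MATCH_KEY to records in both tables that match
    on `key`. Records with no counterpart get no MATCH_KEY, so the new
    field defines the matching relationship between the tables."""
    keys = {}  # matched key value -> assigned MATCH_KEY
    right_vals = {rec[key] for rec in right}
    for rec in left:
        if rec[key] in right_vals:
            keys.setdefault(rec[key], len(keys) + 1)
            rec["MATCH_KEY"] = keys[rec[key]]
    for rec in right:
        if rec[key] in keys:
            rec["MATCH_KEY"] = keys[rec[key]]
    return left, right
```

A later step can then join the two tables on MATCH_KEY to pull reference fields onto the customer or prospect records.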

Merge/purging a set of mailing lists

The sample project merge purge demonstrates a typical "merge-purge" project, in which multiple mailing lists are input into the system, merged, and deduplicated using "fuzzy-matching" techniques.

Often a merge-purge job must support prioritization of input lists—some are deemed more important than others, so duplicates are removed from lower-ranking lists before higher-ranking ones. This sample project uses a PRIORITY ranking to preferentially retain records from the left input over records from the right input.
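The prioritization logic reduces to "first list wins" when the lists are processed in priority order. A minimal sketch, assuming exact duplicates on PHONE (the sample uses fuzzy matching) and dict records:

```python
def merge_purge(lists):
    """Merge several mailing lists and purge duplicates on PHONE,
    preferring records from higher-priority lists.

    `lists` is ordered highest priority first; because a key is only
    stored on first sight, lower-priority duplicates are discarded.
    """
    best = {}  # PHONE -> surviving record
    for priority, records in enumerate(lists):  # 0 = highest priority
        for rec in records:
            key = rec["PHONE"]
            if key not in best:
                best[key] = dict(rec, PRIORITY=priority)
    return list(best.values())
```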

Removing exact duplicates

Removing exact duplicates is a simple process. The sample project unique1 demonstrates identifying and removing duplicates from a mailing list using the fields AREA_CODE and PHONE.

For name/address lists, this type of deduping is not as effective as fuzzy-matching duplicate removal. However, removing duplicates on a "hard" key (e.g., Social Security Number or Tax ID) is quite effective.
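Exact deduplication on a composite key amounts to a single set-membership check. A minimal sketch, with field names following the unique1 sample:

```python
def unique_records(records, key_fields=("AREA_CODE", "PHONE")):
    """Remove exact duplicates on a composite key, keeping the first
    occurrence of each key."""
    seen = set()
    out = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```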
