Cleaning messy data
Overview
The projects in the repository folder Samples\Basic\Cleaning messy data
demonstrate techniques for:
Enforcing consistent field values
The sample project cleansing1 is a "discovery" step that you'll often need to perform before field-value cleansing. This example reads a sample "clothing order" table with typical data entry problems—inconsistent case and spelling errors. The project first converts the ITEM, SIZE, and COLOR fields to upper-case, and then generates CSV files containing the frequency counts of all unique values found in those fields. You can edit the resulting CSV files using Excel or a text editor, and transform them from a frequency-counts file into a lookup table file. The frequency counts help suggest which field values are real and which are invalid (presumably the common ones are real and the infrequent ones are invalid).
You need to change the output file paths to something writable on your system.
Once you create the lookup tables for field cleansing, the next step is to create and run a project to perform the field value replacement. The sample project cleansing1b performs a "value cleansing" on a database of clothing orders, using lookup tables created using the process described in the first step above. The project uses lookup tables for the ITEM, SIZE, and COLOR fields to replace invalid field values with the corresponding correct values.
Finding all records matching a table of values
In cleansing operations, you often want to find non-conforming records and flag them for review. The sample project filter records on value set finds all records in which a specified column contains one of a predefined set of "acceptable values". The project also produces the complementary set of records—those not containing one of the values. There are two ways to accomplish this, both shown in the example. Both methods rely on a file containing a list of all acceptable values (provided for this example).
In the first method, a Table Lookup tool determines whether each field value is in the list of acceptable values. Then the project filters records based on the result of the table lookup. Records that do not have acceptable values get a lookup result of Null. This method is good when your list of acceptable values is small—25,000 values or less. However, if your list is too large, the Table Lookup tool consumes too much memory.
The second method uses a Join tool to match your records against the list of valid values. The J (inner join) output of the Join tool contains all of the acceptable records. The L (left outer join) output contains all of the unacceptable records. This method works even with very large lists of acceptable values.
This example is case-sensitive. To make it case-insensitive:
Edit the list of values in
acceptable_values.csv
, making all values uppercase.Insert a Calculate tool before the D input of the Table Lookup tool (in the first example) or the L input of the Join tool (in the second example).
Configure the Calculate tool to convert the COLOR field to uppercase using an expression like
UpperCase(COLOR)
.
Dealing with legacy composite keys
If you've worked with data long enough, you've seen database files with ugly composite keys. For example, the KEY field in a child table might have the key to the parent record as its first six characters, the date added as its next six characters, and a sequence number within the date as the final two characters. The sample project legacy composite keys breaks apart a "composite key" from a legacy table into its constituent parts.
Using Join for table lookup
The sample project join table lookup demonstrates looking up a city name using a ZIP Code. Lookup tables are handy for conversions of the form "given A, produce B". For example:
Given an SIC code, find the description.
Given a city, find the population.
If the reference tables for such lookups are very large, the Table Lookup tool may use more memory than you like, because it stores the entire table in memory. Using Join for table lookups is a more memory-efficient approach. It's slower, but if you pre-sort the lookup table on the key field, it will be reasonably fast.
Using Join to verify relational integrity
The sample project verify relation shows how to verify the relational integrity of a database by observing the left and right "outer" join outputs of the Join tool. The Join tool provides three outputs:
The join results
The records of the left input that did not join to records of the right input
The records of the right input that did not join to records of the left input
When you join two related tables on a key field, you expect all records to join. Failure to join indicates errors in the data. By exposing unmatched records, you can verify that table relationships are sound.