Optimize based on record order
Data Management automatically applies sort optimization rules, so usually you need do nothing to optimize performance based on record order. But if you are after the highest performance, there are some cases where you can make a difference by changing your project.
The Summarize, Join, Rank, and Unique tools sort the input records as a part of their functions. If the input record order matches their sort keys, they run faster.
When processing a large volume of data with these tools, compare the record order displayed on the input connection's schema to the tool's order-sensitive properties:
Tool | Order-sensitive properties |
---|---|
Sort | Sort specification |
Join | Join keys (ascending) |
Unique | Unique-by specification |
Rank | Rank field name (ascending) |
Summarize | Group field names (ascending) |
If tool properties don't match the input record order but you think that they should, there are several things that you can do to make them match, and thus run faster:
First, realize that most tools output newly-created or edited fields in no particular order. The following tools are the exceptions:
Tool | Ordered output |
---|---|
Generate Sequence | ID field, if Repeat is not selected |
Number Records | Record ID field and Group/Member ID fields |
Sort | Sort specification fields |
Unique | Unique-by fields |
Join | Join key fields, as altered by selection and renaming |
Summarize | Group field names |
Unique | Unique-by fields |
Rank | Rank field name |
Token Creation | ID field, if defined |
Table to Stream | ID field, if defined |
Pair to Group | Order output (key or group) |
In all of the cases below, Data Management loses record order information.
Input data: Data Management cannot infer sort order of records being read from files and databases, except for DLD files.
Calculate tool: Fields assigned in the Calculate tool lose all record order information.
Change Field Types tool: Fields changed in the Change Field Types tool lose all record order information except for some special cases.
If you know that your data is sorted in a specific record order, and you want to rely on that order to speed things up, you can use a Validate Order tool to signal downstream tools that the data is in the specified order. In addition, the following techniques may help to keep or create record orders:
Order of operations: If you are using multiple sorting tools in sequence (for example, a Join followed by another Join, or a Join followed by a Summarize), you may be able to reorder these tools to match the record order present on input while retaining the processing logic.
Generate Sequence tool: The ID field created by the Generate Sequence tool will not be ordered if the Repeat after end value is reached option is selected. Do not enable this option unless you really need it.
Merge tool: Where possible, preserve a known record order by using a Sequence or Sorted merge type.