Summarize and Summarize2
Overview
The Summarize2 tool is similar to the Summarize tool, but is optimized for different uses:
Summarize2 does not produce running summaries.
There is no Ignore errors and nulls option. Error and null values are always ignored.
The Summarize2 tool performs better and uses less temporary disk space than the Summarize tool if the output data size is small compared to the input.
Use the Summarize2 tool rather than Summarize if one of the following applies:
The expected output data is small compared to the input (a 50:1 reduction or better).
You are using it for field/value frequency counts on columns with many repeated values.
Use the Summarize tool if one of the following applies:
The expected output data size is larger than 2% of the input data size.
You need running summaries.
You are mostly performing concatenate operations.
The performance advantage of the Summarize2 tool over the Summarize tool derives solely from the fact is that it does not need to sort the input records by the group keys. If you have no group keys, or if the input is already ordered by the group keys, Summarize will usually be faster and use less memory than Summarize2.
Summarize
The Summarize tool groups records by a set of keys, and aggregates all records with the same key according to the operations that you specify. For example, suppose you have a sales history file containing Customer, Territory, Product, and Amount fields. You could group by Customer and Product, while totaling Amount, giving you a summary of sales by customer and product.
Summarize is a general-purpose tool that can be used for:
Performing RFM (Recency, Frequency, Monetary) customer analysis.
Generating pre-aggregated tables to assist in reports on large databases.
The Summarize tool does not find the record with the maximum or minimum value for a field. See Finding the maximum or minimum record.
Summarize tool configuration parameters
The Summarize tool has one set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Ignore null and error values | Select to ignore nulls and errors encountered when computing Total, Average, Minimum, or Maximum. This is optional and enabled by default. |
Output all fields | If selected, all input fields are included on output. Otherwise, only the group and summary fields are output. This is optional and disabled by default. |
Running summaries | If selected, generates running summaries instead of group summaries; all records are included in the output. This is optional and disabled by default. |
Keep input field types | If selected, fields are not expanded in size to avoid potential overflows. This is optional and disabled by default. |
Group-by field | List of field names that define how records are grouped. All contiguous records having the same value across all of the Group-by field belong to the same group. If you specify no fields, the entire table is summarized, producing a single group. |
Input | Name of the field to be summarized. This is both the Input and Output, unless Output is also specified. |
Operation | Type of summary operation to be performed. Can be one of: Total, Average, Minimum, Maximum, Standard Deviation, Variance, Count, Count nulls, Count errors, Longest, First valid, Last valid, or Concatenate. Defaults to Total. |
Output | If specified, name of the output field. This allows multiple operations to be performed on one field. If omitted, the output field is the same as the input field. |
Configure the Summarize Tool
Select the Summarize tool, and then go to the Configuration tab on the Properties pane.
Clear Ignore null and error values if you want the summary output value for any group containing a null or error value to be null or error (the exception is the Count summary operation: if you are counting records in a group, nulls and error values are always counted).
Select Output all fields if you want to output all fields, rather than only the fields being grouped and summarized.
If you select this option without checking the Running summaries option, the extra output fields will be selected from the first record of each group. Because of this, you usually want to select this option only when you also check the Running summaries option. See About running summaries.
Select Running summaries if you want a running total of the field being summarized.
Running summaries can be used to produce a "checkbook register" style output, where every transaction is output with the results added to the running total.
Select Keep input field types to require that input fields used in a summary have the same type on output as on input. If this is cleared, output fields will be made larger than the input fields where it makes sense (the Count operation is an exception; it always produces integers). For example, Total of a 4-byte integer will create an 8-byte integer.
Under Group-by fields, select in the Field grid and select one or more fields to be summarized.
If you select more than one field, grouping will be primarily by the field on the first row, secondarily by the second row, and so on. If you specify no fields, the entire table is summarized, producing a single group.
In the Field operations grid, define each summary operation on a separate row:
Select the Input column of the row. Select a field from the pull-down list, or type a new field name.
Select the Operation of the row and select the operation to perform on the specified field.
Operation | Description | Allowed field types |
---|---|---|
Total | Totals the values within a group. | Fixed-point, Floating-point. |
Average | Averages the values within a group. | Fixed-point, Floating-point. |
Minimum | Finds the minimum valid within the group. The minimum value is the one that would sort first. | Fixed-point, Floating-point, Date, Time, DateTime, and Text. |
Maximum | Finds the maximum valid within the group. The minimum value is the one that would sort last. | Fixed-point, Floating-point, Date, Time, DateTime, and Text. |
Standard Deviation | Computes the standard deviation of the group*. | Numeric. |
Variance | Computes the variance of the group. | Numeric. |
Count | Counts the number of records in a group. | Fixed-point, Floating-point. You can specify a new field for Count. |
Count nulls | Counts the number of Null-valued fields records in each group. | Any type. |
Count errors | Counts the number of Error-valued fields records in each group. | Any type. |
Longest | Finds the longest string in the group. | Variable-length text. |
First valid | Find the first valid (not Null or Error) value in the group. Blank text fields are treated as Null. | Any type. |
Last valid | Find the last valid (not Null or Error) value in the group. Blank text fields are treated as Null. | Any type. |
Concatenate | Combine all the text values of the group into one large string by joining them sequentially. | Text. |
* The standard deviation formula used is:
Total and Average operations can only be performed on numeric fields. (Dates are not considered numeric.) You can find the Maximum and Minimum for fields of types Date, Time, DateTime, and Text. For Text types, Minimum and Maximum are interpreted lexically.
Optionally, select the Output column of the row and specify a field name for the operation result.
Optionally, go to the Execution tab, and then set Web service options.
The Count operation can seem confusing, since you must specify a input field even though the field value will be ignored. The easiest way to handle counting is to type a new field name for the count field (for example, NUMBER_OF_SALES
). The Summarize tool will automatically create a new count field that is large enough to hold the results.
Note that the Summarize tool will sort the input on the grouping fields (if input is not already sorted). Because of this, you must have enough temporary disk space to hold the sorted table.
About running summaries
Running summaries are most intuitive when you think of running totals. In a running total, if the input is:
1.1
2.0
12.1
then the running total will be:
1.1
3.1
15.2
(A non-running total would simply output the final 15.2
)
Running averages, running counts, and running maximums/minimums work in the same way.
A running total always generates one output record for each input record, whereas a non-running total generates a record only at the end of a group, as defined by a set of records with identical group-field values.
With a normal ("non-running") summary, you need only output two sets of fields: the grouping fields, and the fields that were summarized. Since only one record is being generated per group of input records, outputting all fields results in a somewhat arbitrary sample of the input records (the field values for the last record in each group). This may be okay if you want to select the last values per group, or if the other field values are the same.
With a running summary, by contrast, it makes more sense to output all the fields, because one output record is generated for each input record. You should select the Output all fields option in the Summarize tool's configuration dialog box.
Summarize2
The Summarize2 tool is similar to the Summarize tool, but is optimized for different uses:
Summarize2 does not produce running summaries.
There is no Ignore errors and nulls option. Error and null values are always ignored.
The Summarize2 tool performs better and uses less temporary disk space than the Summarize tool if the output data size is small compared to the input.
Use the Summarize2 tool rather than Summarize if one of the following applies:
The expected output data is small compared to the input (a 50:1 reduction or better).
You are using it for field/value frequency counts on columns with many repeated values.
Use the Summarize tool if one of the following applies:
The expected output data size is larger than 2% of the input data size.
You need running summaries.
You are mostly performing concatenate operations.
The performance advantage of the Summarize2 tool over the Summarize tool derives solely from the fact is that it does not need to sort the input records by the group keys. If you have no group keys, or if the input is already ordered by the group keys, Summarize will usually be faster and use less memory than Summarize2.
Summarize2 tool configuration parameters
The Summarize2 tool has one set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Output all fields | If selected, all input fields are included on output. Otherwise, only the group and summary fields are output. This is optional and disabled by default. |
Keep input field types | If selected, fields are not expanded in size to avoid potential overflows. This is optional and disabled by default. |
Treat null as blank/zero | If selected, null and error values are treated as blank or zero, rather than as missing data. Enabling this option will change the results of some Operations:
|
Group-by fields | List of field names that define how records are grouped. All contiguous records having the same value across all of the Group-by field belong to the same group. If you specify no fields, the entire table is summarized, producing a single group. |
Input | Name of the field to be summarized. This is both the Input and Output, unless Output is also specified. |
Operation | Type of summary operation to be performed. Can be one of: Total, Average, Minimum, Maximum, Standard Deviation, Variance, Count, Count nulls, Count errors, Longest, First valid, Last valid, or Concatenate. Defaults to Total. |
Output | If specified, name of the output field. This allows multiple operations to be performed on one field. If omitted, the output field is the same as the input field. |
Configure the Summarize2 Tool
Select the Summarize2 tool, and then go to the Configuration tab on the Properties pane.
Select Output all fields if you want to output all fields, rather than only the fields being grouped and summarized.
If you select this option without checking the Running summaries option, the extra output fields will be selected from the first record of each group. Because of this, you usually want to select this option only when you also select the Running summaries option. See running summaries.
Select Keep input field types to require that input fields used in a summary have the same type on output as on input. If this is cleared, output fields will be made larger than the input fields where it makes sense. (The Count operation is an exception; it always produces integers.) For example, Total of a 4-byte integer will create an 8-byte integer.
Select Treat null as blank/zero to interpret null and error values as blank or zero, rather than treating them as missing data and ignoring them.
Under Group-by fields, select the Field grid and choose one or more fields to be summarized.
If you choose more than one field, grouping will be primarily by the field on the first row, secondarily by the second row, and so on. If you specify no fields, the entire table is summarized, producing a single group.
In the Field operations grid, define each summary operation on a separate row:
Select the Input column of the row. Select a field from the pull-down list, or type a new field name.
Select the Operation of the row and select the operation to perform on the specified field.
Operation | Description | Allowed field types |
---|---|---|
Total | Totals the values within a group. | Fixed-point, Floating-point. |
Average | Averages the values within a group. | Fixed-point, Floating-point. |
Minimum | Finds the minimum valid within the group. The minimum value is the one that would sort first. | Fixed-point, Floating-point, Date, Time, DateTime, and Text. |
Maximum | Finds the maximum valid within the group. The minimum value is the one that would sort last. | Fixed-point, Floating-point, Date, Time, DateTime, and Text. |
Standard Deviation | Computes the standard deviation of the group*. | Numeric. |
Variance | Computes the variance of the group. | Numeric. |
Count | Counts the number of records in a group. | Fixed-point, Floating-point. You can specify a new field for Count. |
Count nulls | Counts the number of Null-valued fields records in each group. | Any type. |
Count errors | Counts the number of Error-valued fields records in each group. | Any type. |
Longest | Finds the longest string in the group. | Variable-length text. |
First valid | Find the first valid (not Null or Error) value in the group. Blank text fields are treated as Null. | Any type. |
Last valid | Find the last valid (not Null or Error) value in the group. Blank text fields are treated as Null. | Any type. |
Concatenate | Combine all the text values of the group into one large string by joining them sequentially. | Text. |
* The standard deviation formula used is:
Total and Average operations can only be performed on numeric fields (dates are not considered numeric). You can find the Maximum and Minimum for fields of types Date, Time, DateTime, and Text. For Text types, Minimum and Maximum are interpreted lexically.
Optionally, select the Output column of the row and specify a field name for the operation result.
Optionally, go to the Execution tab, and then set Web service options.