Basic tasks
Overview
The projects in the repository folder Samples\Basic\Basic tasks demonstrate techniques for:
Creating a unique ID
Many data processing operations require that each record in a database have a unique identifier. The Generate Sequence tool provides a simple way to create unique IDs, as the sample project unique key demonstrates.
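In spirit, the Generate Sequence step appends a consecutive integer to each record. A minimal sketch of the equivalent logic (the field name "id" and the sample data are illustrative, not taken from the unique key project):

```python
# Append a unique, consecutive integer ID to each record,
# analogous to what a sequence-generation step produces.
records = [{"name": "Ann"}, {"name": "Bob"}, {"name": "Cal"}]

for i, rec in enumerate(records, start=1):
    rec["id"] = i  # IDs are 1, 2, 3, ... and therefore unique
```

Because the counter never repeats, the resulting IDs are guaranteed unique within the run.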
Changing field order
The sample project change field order demonstrates changing the order of fields within a record. You may want to change field order if:
You are exporting data to a flat file or CSV file and sending it to someone who expects data with a particular record layout.
You are exporting data to a flat file or CSV file which will be loaded into an RDBMS.
You are integrating several data sources and want them all to have the same layout.
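Reordering fields amounts to rewriting each record against a target column list. A sketch in Python, assuming illustrative field names and data (not the change field order project's actual layout):

```python
import csv
import io

# Rewrite a CSV so its columns match a required target order.
source = "amount,date,account\n10.00,2024-01-05,A1\n"
target_order = ["account", "date", "amount"]  # layout the recipient expects

reader = csv.DictReader(io.StringIO(source))
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=target_order)
writer.writeheader()
for row in reader:
    writer.writerow(row)  # DictWriter emits values in target_order
result = out.getvalue()
```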
Changing field types
The sample projects change field types and change field types2 demonstrate changing the types of fields. You may want to change field types if:
You are exporting data to a flat file or CSV file and sending it to someone who expects data with a particular record layout.
You are exporting data to a flat file or CSV file which will be loaded into an RDBMS.
You are integrating several data sources and want them all to have the same schema.
The sample project change field types performs a simple type conversion, while change field types2 demonstrates a more complex conversion.
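The two cases can be sketched in Python: a simple conversion (text to number) and a more complex one (parsing a formatted date). Field names and formats here are illustrative, not taken from the sample projects:

```python
from datetime import datetime

# Convert text fields to typed values.
row = {"account": "A1", "amount": "10.50", "date": "01/05/2024"}

# Simple conversion: numeric text to a float.
row["amount"] = float(row["amount"])

# More complex conversion: formatted text to a date value.
row["date"] = datetime.strptime(row["date"], "%m/%d/%Y").date()
```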
Denormalizing a data set
Relational databases typically store data in normalized form to minimize the inclusion of duplicate information. Sometimes, however, you may want to denormalize or "flatten" a set of data.
In the sample project denormalize, we have a customer address table where multiple addresses are present for each unique account ID. We want to denormalize or "flatten" the table so that each account ID is associated with up to four addresses—address1, address2, and so on. The example assumes that the ID field correctly identifies groups of customers. If not, you can use Data Management's fuzzy-matching capabilities to find duplicate groups.
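The flattening step described above can be sketched as grouping rows by account ID and fanning the group's addresses out into numbered columns. This sketch assumes input sorted by ID and uses illustrative data, not the denormalize project's actual table:

```python
from itertools import groupby

# Flatten multiple address rows per account into one row with
# up to four address columns (address1..address4).
rows = [
    {"id": "A1", "address": "1 Elm St"},
    {"id": "A1", "address": "9 Oak Ave"},
    {"id": "B2", "address": "5 Pine Rd"},
]

flat = []
for acct, group in groupby(rows, key=lambda r: r["id"]):
    rec = {"id": acct}
    for n, r in enumerate(group, start=1):
        if n > 4:
            break  # keep at most four addresses per account
        rec[f"address{n}"] = r["address"]
    flat.append(rec)
```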
Generating a layout file
The sample project extract layout generates a CSV output file and a companion "layout" file containing the layout information (the metadata) for the output file. This is useful if you are sending the file to someone else and want to document the file layout for the recipient.
This is also a useful technique if the resulting file will be read by Data Management, because both the CSV Input and Flat File Input tools can directly import layout files using the "Import Layout" feature.
This example does not include type information in the layout file. However, you can include type information by setting the appropriate options in the Extract Layout tool.
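Conceptually, the project writes two files: the data itself and a small companion file describing its fields. A sketch, with a hypothetical layout-file format (field name plus type) that is not the tool's exact output:

```python
import csv
import io

# Write a data file and a companion "layout" file that documents
# the data file's fields (and, optionally, their types).
fields = [("account", "TEXT"), ("amount", "FLOAT")]
data = [["A1", "10.50"]]

data_out = io.StringIO()
w = csv.writer(data_out)
w.writerow([name for name, _ in fields])  # header row
w.writerows(data)

layout_out = io.StringIO()
w = csv.writer(layout_out)
w.writerow(["field", "type"])
w.writerows(fields)  # one row of metadata per data field
```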
Processing a subset of data within a table
The sample project filter_merge_sequence shows how to perform an operation on a subset of data within a table, using a Filter tool to select the desired data and assign consecutive numbers to all records. A downstream Merge tool set to Sequence merge type uses these numbers to reassemble the split data stream in the original order.
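The pattern can be sketched as: number every record, split the stream with a filter, transform only the matching subset, then merge the streams back in sequence order. The data and the transformation are illustrative:

```python
# Number records, split with a filter, process one subset,
# then merge back into the original order by sequence number.
records = [{"seq": i, "val": v} for i, v in enumerate([5, -2, 7, -1])]

matched = [r for r in records if r["val"] > 0]    # subset to process
unmatched = [r for r in records if r["val"] <= 0]

for r in matched:
    r["val"] *= 10  # some operation applied only to the subset

# The sequence numbers restore the original record order.
merged = sorted(matched + unmatched, key=lambda r: r["seq"])
```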
Using expressions to split input data
The sample project filter_multi shows how to define expressions in the Filter tool that split input data into multiple data streams.
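Splitting by expression amounts to routing each record to the stream whose condition it satisfies. A sketch with illustrative expressions (the filter_multi project's actual expressions may differ):

```python
# Split one input stream into multiple streams using
# filter expressions.
records = [{"amount": 5}, {"amount": 150}, {"amount": 40}]

small = [r for r in records if r["amount"] < 50]   # expression 1
large = [r for r in records if r["amount"] >= 50]  # expression 2
```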
Merging multiple tables of different formats
The sample project merge demonstrates merging two CSV tables that contain the same data fields, but with different formats and layouts:
The fields are in different orders.
DATE formatting is different.
One AMOUNT field has a decimal, while the other has an implied scale-by-100.
This example reconciles the data into a coherent format and generates a single output.
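The reconciliation described above can be sketched as normalizing each table to a common schema before combining them: harmonize the date formats and divide the scaled AMOUNT by 100. Data values and formats here are illustrative:

```python
from datetime import datetime

# Reconcile two tables with the same fields but different
# formats into one coherent output.
table_a = [{"DATE": "2024-01-05", "AMOUNT": "10.50"}]  # decimal amount
table_b = [{"AMOUNT": "1050", "DATE": "01/05/2024"}]   # implied scale-by-100

def normalize_a(r):
    return {"DATE": datetime.strptime(r["DATE"], "%Y-%m-%d").date(),
            "AMOUNT": float(r["AMOUNT"])}

def normalize_b(r):
    return {"DATE": datetime.strptime(r["DATE"], "%m/%d/%Y").date(),
            "AMOUNT": int(r["AMOUNT"]) / 100}  # undo implied scaling

merged = [normalize_a(r) for r in table_a] + [normalize_b(r) for r in table_b]
```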
Overwriting data files
Sometimes it is desirable to read a file, alter it, and write it back out (for example, purging some records from the file).
Be cautious when reading and writing the same file in a single project!
There are two ways this operation can go wrong:
Your project is not programmed correctly, and you overwrite the file with the wrong data.
There is no "buffer" between the input and output, so the project attempts to read and write the file simultaneously. The job will fail, because as soon as the output tool starts writing, it will truncate the file that the input is still reading.
To avoid these problems:
Always have your project create a backup copy of the file.
Ensure that there is a "buffer" between the input and output—a tool that reads the entire input before writing the output. Tools that do this are Sort, Rank, Unique, Join, and Summarize.
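Both precautions can be sketched in plain code: make a backup copy first, and read the entire file into memory (the "buffer") before opening it for writing. File names and contents are illustrative:

```python
import os
import shutil
import tempfile

# Safely overwrite a file: back it up, read it fully, then write.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("keep\npurge\nkeep\n")

shutil.copy(path, path + ".bak")  # 1. backup copy of the original

with open(path) as f:
    lines = f.readlines()         # 2. entire input read before writing

with open(path, "w") as f:        # safe: input is no longer being read
    f.writelines(l for l in lines if l != "purge\n")
```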
Ranking input records by value
The sample project RankPercentiles shows how to rank input records by value, identifying the first records ranked at the 5th and 95th percentiles, and then replacing values falling within the top and bottom fifth percentiles with standardized values.
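The idea can be sketched as follows: sort the values, take the 5th- and 95th-percentile values as cutoffs, and cap anything outside them (a simple capping scheme; the percentile-index method and data here are illustrative, not necessarily what RankPercentiles computes):

```python
# Rank values, find 5th/95th percentile cutoffs, and replace
# values outside them with the cutoff values.
values = list(range(1, 101))  # 1..100

ranked = sorted(values)
lo = ranked[int(0.05 * (len(ranked) - 1))]  # 5th percentile value
hi = ranked[int(0.95 * (len(ranked) - 1))]  # 95th percentile value

capped = [min(max(v, lo), hi) for v in values]
```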