CSV Direct/Input/Output
CSV Direct
The CSV Direct tool behaves almost identically to the CSV Input tool except that instead of reading data from a file, it reads from data that is stored in the project itself. Therefore, all of the options defined for the CSV Input tool also apply to the CSV Direct tool, and they are configured identically except for the input source.
CSV Direct tool configuration parameters
The CSV Direct tool has two sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Data | CSV-formatted text. You are limited to 32k of data. |
Use repository schema | If selected, configure the field layout using the specified Schema instead of configuring fields directly. |
Schema | If Use repository schema is selected, a schema must be specified. |
Separator | The field delimiter. Can be specified by Character or Code. This is optional and defaults to a comma. |
Quote character | If defined, one or more characters that quote or escape the data. If no quote character is specified, no escaping is performed. If one of {, [, or < is specified, the corresponding closing bracket is treated as the closing quote. |
Skip lines | If defined, skip leading lines in the file, such as field names contained in a first line. This is optional and defaults to 0. This is not intended for skipping data; use the Last N tool for that. |
Options
Parameter | Description |
---|---|
Allow newlines in fields | If selected, files with embedded newlines in the fields are accepted as input and any newlines encountered are ignored. |
Trim | If selected, leading/trailing whitespace is removed from input data. |
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
Error handling | Configures how errors in the input data will be handled, using the error types described below. |
Parsing errors | Specify how to handle input containing malformed CSV data that cannot be parsed. The Issue warning option is not available for this type of error. |
Conversion errors | Specify how to handle input data that cannot be correctly converted to its specified data type (such as converting "Jane Smith" to a Date). |
Truncation errors | Specify how to handle input with truncated field values (such as storing "John Smithson" in a Unicode(1) field). |
Missing fields | Specify how to handle input with fields that are missing or contain empty values. |
Extra fields | Specify how to handle input with extra fields that are not defined in the tool's configuration. |
Send errors to Message Viewer | If selected, all error types configured to either Issue warning or Reject record will send warnings to the Message Viewer, even if the Report warnings option on the Execution tab is not selected. |
Send errors to Error Report connector | If selected, creates an E "Error Report" output connector. All error types configured to either Issue warning or Reject record will send detailed error messages to that connector. Attach a downstream Data Viewer or other output tool to review these messages. |
Configure the CSV Direct tool
Select the CSV Direct tool.
Go to the Configuration tab on the Properties pane.
Enter CSV-formatted text directly into the edit box. You are limited to 32k of data; for larger volumes you should use an input file or repository data object.
Configure properties as for a CSV Input tool.
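To make the roles of Separator, Quote character, and Skip lines concrete, here is a minimal Python sketch (not the CSV Direct tool itself) that parses a small block of in-memory CSV text of the kind you might paste into the Data box. The sample text, delimiter, and field names are hypothetical.

```python
import csv
import io

# Hypothetical in-project CSV text (the kind of content pasted into the Data box).
data = """# comment line to be skipped
id;name;city
1;"Smith; Jane";Boston
2;"O'Brien; Pat";Chicago
"""

SKIP_LINES = 1      # analogous to the Skip lines parameter
SEPARATOR = ";"     # analogous to the Separator parameter
QUOTE_CHAR = '"'    # analogous to the Quote character parameter

buffer = io.StringIO(data)
for _ in range(SKIP_LINES):          # discard leading lines
    buffer.readline()

reader = csv.reader(buffer, delimiter=SEPARATOR, quotechar=QUOTE_CHAR)
header = next(reader)                # first remaining line holds field names
for record in reader:
    print(dict(zip(header, record)))
```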
CSV Input
The CSV Input tool reads a delimited text file (including gzip-compressed CSV files with .gzip and .gz extensions), and outputs records from the file to the downstream tools. You can configure the number of fields, and the size and type of each field. To automate field definition, you can specify a layout file or schema containing field names, types, and sizes. The CSV Input tool supports any delimiter (tabs or semicolons, for example), not just commas. In addition to specifying the fields and the delimiter, you can set options to accommodate common variations in this format. Some examples are:
Field names on the first line of the file.
"Short" records that truncate trailing fields.
Missing fields in some records.
Newline characters in some fields.
Text encoding other than ASCII or Latin-1.
Files that do not exactly match the expected schema.
You should always check the Report warnings option on the Execution tab of the CSV Input tool so that you will be advised of short records, truncated fields, and other problems common to the CSV format.
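As a rough illustration of the kind of input the tool handles, the following Python sketch reads a hypothetical gzip-compressed CSV file whose quoted fields may contain embedded newlines. The file name and field layout are assumptions, and this is only an approximation of the tool's behavior, not its implementation.

```python
import csv
import gzip

# Hypothetical gzip-compressed input; the tool reads .gz/.gzip files directly.
path = "customers.csv.gz"

with gzip.open(path, mode="rt", encoding="latin-1", newline="") as handle:
    reader = csv.reader(handle, delimiter=",", quotechar='"')
    header = next(reader)            # field names on the first line of the file
    for record in reader:
        # A quoted field may contain an embedded newline; csv.reader keeps it
        # as a single value, comparable in spirit to Allow newlines in fields.
        print(dict(zip(header, record)))
```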
CSV Input tool configuration parameters
The CSV Input tool has three sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Compression type | Compression strategy to use for the input file. If not otherwise specified, Data Management may attempt to autodetect the compression format upon Analyze. |
Input file | The file containing the records, or a wildcard pattern matching multiple files. |
Use repository schema | If selected, configure the field layout using the specified Schema instead of configuring fields directly. |
Schema | If Use repository schema is selected, a schema must be specified. |
Code page | Name of the encoding of the source data. This is optional and defaults to Latin-1. See About code pages. |
Separator | The field delimiter. May be specified by String or Code. This is optional and defaults to a comma. |
Quote characters | If defined, one or more characters that quote or escape the data. If no quote character is specified, no escaping is performed. If one of {, [, or < is specified, the corresponding closing bracket is treated as the closing quote. |
Allow newlines in fields | If selected, files with embedded newlines in the fields are accepted as input and any newlines encountered are ignored. |
Skip lines | If defined, skip leading lines in the file, such as field names contained in a first line. This is optional and defaults to 0. |
Trim | If selected, leading/trailing whitespace is removed from input data. |
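The Code page, Skip lines, and Trim parameters combine roughly as in this Python sketch, which assumes a hypothetical input.csv and overrides the default Latin-1 encoding with UTF-8.

```python
import csv

CODE_PAGE = "utf-8"   # the tool defaults to Latin-1; override for other encodings
SKIP_LINES = 0        # leading lines to discard before parsing
TRIM = True           # strip leading/trailing whitespace from each value

with open("input.csv", encoding=CODE_PAGE, newline="") as handle:
    for _ in range(SKIP_LINES):
        handle.readline()
    for record in csv.reader(handle, delimiter=","):
        if TRIM:
            record = [value.strip() for value in record]
        print(record)
```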
Flexibility
Parameter | Description |
---|---|
Enable flexibility | If selected, allows the tool to read files that do not exactly match the schema, provided the file has a header row. |
New field action | If Flexibility is enabled, specifies how fields not present in the schema are handled. Choose Ignore if the file is allowed to have new fields that were not defined in the schema, and to ignore them if found. Choose Error if the input file is not allowed to have fields not defined in the schema. |
Missing | If Flexibility is enabled, indicates the action to take for each field when that field is not present in the CSV file header. Options are Null value, Default value, and Fail. |
Aliases | If Flexibility is enabled, a string containing a comma-separated list of allowed alternate field names for the field. |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, express path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, the name of the column to be used for the file name. This is optional; a default name is used if left blank. |
Field size | If Produce file name field is selected, the size of the field to be used for the file name. This is optional; a default size is used if left blank. |
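The Produce file name field options amount to tagging each record with the file it came from. A minimal Python sketch, using a hypothetical wildcard pattern and a hypothetical field name (SourceFile), might look like this:

```python
import csv
import glob
import os

OUTPUT_FULL_PATH = False        # analogous to the Output full path option
FILE_NAME_FIELD = "SourceFile"  # hypothetical value for the Field name parameter

for path in sorted(glob.glob("data/*.csv")):   # wildcard input pattern
    with open(path, newline="") as handle:
        for record in csv.DictReader(handle):
            record[FILE_NAME_FIELD] = path if OUTPUT_FULL_PATH else os.path.basename(path)
            print(record)
```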
Options
Parameter | Description |
---|---|
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, express path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, the name of the column to be used for the file name. This is optional; a default name is used if left blank. |
Field size | If Produce file name field is selected, the size of the field to be used for the file name. This is optional; a default size is used if left blank. |
Error handling | Configures how errors in the input data will be handled, using the error types described below. |
Parsing errors | Specify how to handle input containing malformed CSV data that cannot be parsed. The Issue warning option is not available for this type of error. |
Conversion errors | Specify how to handle input data that cannot be correctly converted to its specified data type (such as converting "Jane Smith" to a Date). |
Truncation errors | Specify how to handle input with truncated field values (such as storing "John Smithson" in a Unicode(1) field). |
Missing fields | Specify how to handle input with fields that are missing or contain empty values. |
Extra fields | Specify how to handle input with extra fields that are not defined in the tool's configuration. |
Send errors to Message Viewer | If selected, all error types configured to either Issue warning or Reject record will send warnings to the Message Viewer, even if the Report warnings option on the Execution tab is not selected. |
Send errors to Error Report connector | If selected, creates an E "Error Report" output connector. All error types configured to either Issue warning or Reject record will send detailed error messages to that connector. Attach a downstream Data Viewer or other output tool to review these messages. |
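The error dispositions behave roughly like the policies in this Python sketch, which applies a warn / reject / fail choice to a conversion error (parsing a hypothetical BirthDate column as a date). File names and field names are illustrative, and the rejected list stands in only loosely for an Error Report stream.

```python
import csv
import warnings
from datetime import datetime

CONVERSION_ERRORS = "reject"   # one of "warn", "reject", "fail" (illustrative names)

def convert(record):
    # Try to convert the BirthDate column to a date; a value like "Jane Smith" fails here.
    record["BirthDate"] = datetime.strptime(record["BirthDate"], "%Y-%m-%d").date()
    return record

good, rejected = [], []
with open("input.csv", newline="") as handle:
    for record in csv.DictReader(handle):
        try:
            good.append(convert(record))
        except ValueError as err:
            if CONVERSION_ERRORS == "warn":
                warnings.warn(f"conversion error, keeping record: {err}")
                good.append(record)
            elif CONVERSION_ERRORS == "reject":
                rejected.append(record)   # comparable in spirit to an error output stream
            else:  # "fail"
                raise
```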
Configure the CSV Input tool
To configure a CSV Input tool, you need the following information about the input file:
The input file name
The field delimiter character
The quote character(s), if any
The list of fields and their types
Data Management provides several ways to specify this information.
If you have... | Do this |
---|---|
A schema for this file already defined in the repository. | Configure the CSV Input tool using a repository schema. |
A layout file. | Configure the CSV Input tool using a layout file. |
A CSV file with a field-name header. | Configure the CSV Input tool using analysis. |
None of the above. | Configure the CSV Input manually. |
You can use wildcards to configure a single CSV Input tool to read a sequence of files with the same layout and format.
...using a repository schema
To use this method of configuring a CSV Input tool, you must have a CSV schema for the input file already defined in the repository.
To configure the CSV Input tool using a repository schema:
Select the CSV Input tool.
Go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Select Use repository schema and select the schema from the drop-down list.
Examine the field grid and the Sample records grid to verify that the format is correct and the sample data is correctly aligned.
Optionally, select the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
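For a sense of what the Limit records / Read only the first options in the steps above amount to, here is a minimal Python sketch that stops after a fixed number of records; the file name and limit are hypothetical.

```python
import csv
from itertools import islice

READ_ONLY_THE_FIRST = 1000    # analogous to the Read only the first parameter

with open("input.csv", newline="") as handle:
    reader = csv.DictReader(handle)
    for record in islice(reader, READ_ONLY_THE_FIRST):
        print(record)
```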
...using a layout file
To use this method of configuring a CSV Input tool, you must have a multi-column layout file for the input file.
To configure the CSV Input tool using a layout file:
Select the CSV Input tool.
Go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Specify the field Separator.
Select Layout and then choose Load (text), Load (text with types), or Load (XML).
Browse to the layout file that describes the input file, and then select Open.
Examine the fields grid and Sample records grid to verify that the format is correct and the sample data is correctly aligned. If the fields and sample data appear incorrect, choose or create a different layout file appropriate for the input file, or use another method to configure the CSV Input tool.
Optionally, select the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...using analysis
Use this method to configure a CSV Input tool if the input file has a header line, but no appropriate schema or layout file.
If the input file is changed after you configure the CSV Input tool using analysis, you must reanalyze the file.
To configure the CSV Input tool using analysis:
Select the CSV Input tool.
Go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Select Analyze and analyze the file.
Examine the fields. If the field names are of the form FIELD1, FIELD2, FIELD3 but your CSV file has a header, edit the CSV file header to remove spaces and special characters from the field names and then analyze it again (a sketch of this kind of header cleanup follows these steps). If you cannot get the Analyzer to scan correctly, contact support, or enter the field names manually and set Skip lines to 1.
Review the field types in the field grid. If a type is incorrect, select the cell in the Type column, and specify the field's data type and size. You can also use the Change selected types and Change all types commands on the Schema menu to edit field types.
You don't need to set the data type unless you will perform calculations or comparisons using the field, or intend to export the information to a strongly-typed database.
Optionally, select the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
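If you do need to clean up a header line as described above, a small helper like the following Python sketch (hypothetical, not a product feature) strips spaces and special characters from each field name.

```python
import re

def sanitize_header(header_line, separator=","):
    """Strip spaces and special characters from each field name in a CSV header."""
    names = header_line.rstrip("\r\n").split(separator)
    return separator.join(re.sub(r"[^0-9A-Za-z_]", "", name) for name in names)

# Example: "First Name,Last-Name,ZIP Code" -> "FirstName,LastName,ZIPCode"
print(sanitize_header("First Name,Last-Name,ZIP Code"))
```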
...manually
Use this method to configure a CSV Input tool if the other available configuration options do not correctly describe your input data.
To configure the CSV Input tool manually:
Select the CSV Input tool.
Go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Analyze the file:
If your file has standard encoding (ASCII or Latin-1), select Analyze and analyze the file. Let the analyzer run as long as you can; the longer it runs, the better the results will be.
If your file contains an encoding other than ASCII or Latin-1, select the correct code page from the Code page list box and then select Analyze and analyze the file.
If you don't know the encoding, select Analyze and analyze the file. Data Management will "guess" the code page for you. If the code page is correctly specified, the sample data will display correctly after analysis.
Examine the fields. They may have generic names of the form FIELD1, FIELD2, FIELD3. For each field listed:
Select the cell in the Field name column, and type the desired name.
Select the cell in the Type column, and specify the field's data type and size. Available options depend on the type selected. (See Configuring data types.) You can also use the Change selected types and Change all types commands on the Schema menu to edit field types. You don't need to set the data type unless you will perform calculations or comparisons using the field, or intend to export the information to a strongly-typed database.
After you make changes, select Refresh sample to see the effect on the sample records displayed.
You can configure multiple fields at once: hold the Shift key and select the first and last fields to select a range of fields, then change each field name in the range to the correct name.
Optionally, go to the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...for flexibility
The Flexibility option on the CSV Input tool allows the tool to read files that do not exactly match the schema, provided that the file has a header row.
To configure the CSV Input tool for flexibility:
Go to the Flexibility tab on the Properties pane.
Select Enable flexibility.
Select a New field action to specify how fields not present in the schema are handled.
Choose Ignore if the file is allowed to have new fields that are undefined in the schema, and to ignore them if found.
Choose Error if the input file is not allowed to have fields undefined in the schema.
For each Field displayed in the grid, you may specify the action to take when that field is Missing from the CSV file header.
Choose Null value to set the field value to null for all records.
Choose Default value to set the field value to the default value for all records.
Choose Fail to issue an error and stop.
For each Field displayed in the grid, you may specify one or more Aliases as a comma-separated list of allowed alternate field names for the field.
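The flexibility behavior described above, resolving each schema field against the file's header by name or alias and applying a Null value, Default value, or Fail action when it is missing, can be sketched in Python as follows. The schema, aliases, and file name are hypothetical, and the sketch is only an approximation of the feature.

```python
import csv

# Hypothetical schema: field -> (aliases, action when missing, default value)
SCHEMA = {
    "CustomerID": (["CustID", "ID"], "fail", None),
    "Email":      (["EmailAddress"], "null", None),
    "Country":    ([],               "default", "US"),
}

with open("input.csv", newline="") as handle:
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])

    # Resolve each schema field to the header name or one of its aliases.
    resolved = {}
    for field, (aliases, action, default) in SCHEMA.items():
        match = next((name for name in [field] + aliases if name in header), None)
        if match is None and action == "fail":
            raise ValueError(f"required field {field} missing from header")
        resolved[field] = (match, action, default)

    for record in reader:
        row = {}
        for field, (match, action, default) in resolved.items():
            if match is not None:
                row[field] = record[match]
            else:
                row[field] = None if action == "null" else default
        print(row)
```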
CSV Output
The CSV Output tool writes data into CSV format. You can specify output file options for maximum portability across applications and operating systems. CSV output does not encode field types. Use this tool when you want to create a portable data file, and the system or application that will read the file doesn't accept a more strongly-typed file format (such as DBF). You may optionally output the data as a gzip-compressed file with a .gz or .gzip extension.
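As a rough analogue of what the tool produces, this Python sketch writes a small set of records as CSV, optionally gzip-compressed, with a field name header on the first line. The records, file names, and Latin-1 encoding are illustrative assumptions.

```python
import csv
import gzip

records = [
    {"Name": "Jane Smith", "City": "Boston"},
    {"Name": "Pat O'Brien", "City": "Chicago"},
]

COMPRESS = True
path = "output.csv.gz" if COMPRESS else "output.csv"
opener = gzip.open if COMPRESS else open

with opener(path, mode="wt", encoding="latin-1", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Name", "City"])
    writer.writeheader()                 # field names on the first line
    writer.writerows(records)
```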
CSV Output tool configuration parameters
The CSV Output tool has one set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Compression type | Optional compression format used for output file. |
Output file | The output file name. |
Open file time | Specifies when the output file will be opened. Options are Default, When project is started, When the first record is read, and After the last record is read. |
Write empty file if no records are read | If selected, writes an output file even when no records are read. This is unavailable if Open file time is When project is started. |
Layout | If selected, saves the output as the specified layout file or repository schema. |
Code page | Name of the encoding of the source data. This is optional and defaults to Latin-1. See About code pages. |
Output BOM (preamble) | If selected, a Byte Order Marker will be output for code pages starting with UTF. |
Separator | The field delimiter. Can be specified by String or Code. This is optional and defaults to a comma. |
Quote character | The character used to enclose strings that would otherwise be ambiguous. This is optional and defaults to a double-quote character. |
Quote condition | The conditions under which the quote character is applied. Must be one of: Automatic (as needed), Always (text only), Always (everything), or Never. |
Quote missing fields | If selected, Missing fields have quotes applied according to the specified Quote condition. |
Write a field name header | Field names will be output as the first line of the CSV file. This is optional and enabled by default. |
End of line | The character sequence identifying the line termination. Must be one of: Cr Lf (Windows), Lf (UNIX), or Cr (Mac). |
Compression type | If gzip is selected, writes output as a gzip-compressed file. |
Split files | If selected, splits the output file by Size, Record count, or Data. |
Split size (MB) | If Split files by size is selected, specifies the maximum size of the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1024 MB (1 GB). |
Split count | If Split files by record count is selected, specifies the maximum number of records in the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1,000,000. |
Split field | If Split files by data is selected, name of the field to be used to split the data. A separate file will be created for each unique value of the specified field. Data must be grouped by the split field. |
Suppress split field | If selected, the Split field is omitted from files created using Split files by data. |
Treat output file as folder | If selected, the Split field value is used as file name for files created using Split files by data. |
Write header on split files | If selected, writes a field name header to each file created using Split files by size, Split files by record count, or Split files by data. |
Append to existing file | If selected, appends the new output to existing output file rather than overwriting it. If Write a field name header is also selected, the tool writes a field name header only when the output file does not already exist or is empty. The tool does not check to see if the field name header in an existing file is compatible with the fields being written. Note that the Append option is ignored if you specify Split files options after selecting Append to existing file. |
Replication factor | Number of copies of each block that will be stored (on different nodes) in the distributed file system. The default is 1. |
Block size (MB) | The minimum size of a file division. The default is 128 MB. |
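The Quote condition and End of line choices correspond loosely to the quoting constants and line terminator of Python's csv module; the mapping below is approximate and offered only as an illustration, with hypothetical output data.

```python
import csv

# Rough analogues of the Quote condition choices (the mapping is approximate):
QUOTE_CONDITIONS = {
    "Automatic (as needed)": csv.QUOTE_MINIMAL,
    "Always (text only)":    csv.QUOTE_NONNUMERIC,
    "Always (everything)":   csv.QUOTE_ALL,
    "Never":                 csv.QUOTE_NONE,
}

# End of line choices expressed as literal terminators.
END_OF_LINE = {"Cr Lf (Windows)": "\r\n", "Lf (UNIX)": "\n", "Cr (Mac)": "\r"}

with open("output.csv", "w", newline="") as handle:
    writer = csv.writer(
        handle,
        delimiter=",",
        quotechar='"',
        quoting=QUOTE_CONDITIONS["Always (everything)"],
        lineterminator=END_OF_LINE["Cr Lf (Windows)"],
    )
    writer.writerow(["Name", "City"])
    writer.writerow(["Jane Smith", "Boston"])
```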
Configure the CSV Output tool
Select the CSV Output tool.
Go to the Configuration tab on the Properties pane.
Optionally, specify the Compression type.
Specify the Output file.
Optionally, generate a layout file or repository schema: select Layout and choose Save layout (Text), Save layout (Text with type), Save layout (XML), or Save as repository schema.
Optionally, specify Open file time.
Option | Description |
---|---|
Default | Use the site/execution server setting. If you select this, you can optionally select Write empty file if no records are read. A warning will be issued if the tool setting conflicts with the site/execution server setting. |
When project is started | Open output file when the project is run. |
When the first record is read | Open output file when the first record is read. If you select this, you can optionally select Write empty file if no records are read. |
After the last record is read | Output records are cached and not written to the output file until the tool receives the final record. If you select this, you can optionally select Write empty file if no records are read. |
Optionally, select Write empty file if no records are read. This is unavailable if Open file time is When project is started.
Optionally, select Append to existing file to append the new output to the existing output file rather than overwriting it.
Specify Options:
If the upstream connection contains Unicode columns, and you need to represent all characters in the output file, specify a Code page other than the default Latin-1 (ISO 8859-1). The code page you choose will depend on the intended consumer of the output.
Optionally, select Output BOM to write a Byte Order Marker at the start of the output. This is a Unicode character used to signal the endianness (byte order) of a text file or stream.
Specify the field Separator, if other than the default comma.
Specify the Quote character, if other than the default double-quote character.
Specify the Quote condition. Choices are:
Automatic (as needed)
Always (text only)
Always (everything)
Never
Select Quote missing fields to enclose blank fields in the quote character.
Select Write a field name header to write field names to the first line of the CSV file.
Select the End of line sequence to match the operating system that will read the resulting file. Choices are:
Cr Lf (Windows)
Lf (UNIX)
Cr (Mac)
Select Compression type and select gzip to write the output as a gzip-compressed file. You cannot use the Append to existing file option if you enable compression.
Optionally, you can split the output file into smaller, more manageable pieces. Check the Split files box, and then select By size, By record count, or By data.
If you select Split files by size, select Split size and then specify the maximum file size (in megabytes). The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by record count, specify Split count as maximum number of records. The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field. The resulting output files will have the name you specified, augmented by inserting the value of the specified field before the extension. For example, splitting output by ZIP Code produces file names of the form output_file01234.csv.
To generate file names where the entire name is determined by the value of the specified field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. If you do not want the specified field to appear in the output, select Suppress split field on output.
If you select Split files as well as Write a field name header, you can optionally select Write header on split files. If you are using Data Management to read split files back in, check both Write a field name header and Write header on split files, because Data Management reads the header line of each CSV file independently. However, if you intend to concatenate the output files before using them as input to another application, do not check Write header on split files.
Optionally, go to the Execution tab and Enable trigger output, configure reporting options, or set Web service options.
The Split files option is useful when you want to write a file to media (portable disk drive, CD-ROM, DVD) for physical transport to someone else and the raw file is too large for a single disk. Set the maximum file size to equal the capacity of the target media.
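Splitting by data, as described above, can be approximated in Python as follows: the input is assumed to be grouped by a hypothetical ZIP field, each distinct value goes to its own file, and an optional header is written to each part. This is an illustration of the behavior, not the tool's implementation.

```python
import csv
from itertools import groupby

SPLIT_FIELD = "ZIP"                 # hypothetical split field
SUPPRESS_SPLIT_FIELD = False        # analogous to Suppress split field
WRITE_HEADER_ON_SPLIT_FILES = True  # analogous to Write header on split files

with open("input.csv", newline="") as handle:
    reader = csv.DictReader(handle)
    fields = [f for f in reader.fieldnames
              if not (SUPPRESS_SPLIT_FIELD and f == SPLIT_FIELD)]

    # Data must already be grouped by the split field, as the tool requires.
    for value, group in groupby(reader, key=lambda r: r[SPLIT_FIELD]):
        with open(f"output_file{value}.csv", "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
            if WRITE_HEADER_ON_SPLIT_FILES:
                writer.writeheader()
            writer.writerows(group)
```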
You can configure a single CSV Input tool to read a sequence of files with the same layout and format. This is an easy way to reassemble a file that's been split as described above. Since Data Management always reads wildcard input files in alphabetical file name order, they are guaranteed to be reassembled in order.
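Reassembling a set of split files relies on reading them in alphabetical order, much as in this Python sketch with a hypothetical output_file*.csv pattern; each part is assumed to carry its own header line.

```python
import csv
import glob

records = []
for path in sorted(glob.glob("output_file*.csv")):   # alphabetical, like wildcard input
    with open(path, newline="") as handle:
        # Each split file carries its own header line, read independently.
        records.extend(csv.DictReader(handle))

print(len(records), "records reassembled")
```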
Consider using the Extract Layout tool to create "layout" files describing the layout of the CSV file. These layout files are directly readable by Data Management using the Import Layout feature, and can also assist others in reading the files.