CSV Direct/Input/Output
CSV Direct
The CSV Direct tool behaves almost identically to the CSV Input tool except that instead of reading data from a file, it reads from data that is stored in the project itself. Therefore, all of the options defined for the CSV Input tool also apply to the CSV Direct tool, and they are configured identically except for the input source.
CSV Direct tool configuration parameters
The CSV Direct tool has two sets of configuration parameters in addition to the standard execution options:
Configuration
Data | CSV-formatted text. You are limited to 32k of data |
Use repository schema | If selected, configure the field layout using the specified Schema instead of configuring fields directly. |
Schema | If Use repository schema is selected, a schema must be specified. |
Separator | The field delimiter. Can be specified by Character or Code. This is optional and defaults to a comma. |
Quote character | If defined, one or more characters that quote or escape the data. If no quote character is specified, there is no escapement. If one of {[ or < are specified, the corresponding closing bracket is treated as the closing quote. |
Skip lines | If defined, skip leading lines in the file, such as field names contained in a first line. This is optional and defaults to 0. This is not intended for skipping data - use the Last N tool for this. |
Options
Allow newlines in fields | If selected, files with embedded newlines in the fields are accepted as input and any newlines encountered are ignored. |
Trim | If selected, leading/trailing whitespace is removed from input data. |
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
Error handling | Configures how errors in the input data will be handled:
|
Parsing errors | Specify how to handle input containing invalid XML that cannot be processed. The Issue warning option is not available for this type of error. |
Conversion errors | Specify how to handle input data that cannot be correctly converted to its specified data type (such as converting "Jane Smith" to a Date). |
Truncation errors | Specify how to handle input with truncated field values (such as storing "John Smithson" in a Unicode(1) field). |
Missing fields | Specify how to handle input with fields that are missing or contain empty values. |
Extra fields | Specify how to handle input with extra elements that are not defined in the tool's configuration. |
Send errors to Message Viewer | If selected, all error types configured to either Issue warning or Reject record will send warnings to the Message Viewer. Note that the Report warnings option must be selected on the Execution tab. If selected, all error types configured to either Issue warning or Reject record will send warnings to the Message Viewer, even if the Report warnings option on the Execution tab is not selected. |
Send errors to Error Report connector | If selected, creates an E "Error Report" output connector. All error types configured to either Issue warning or Reject record will send detailed error messages to that connector. Attach a downstream Data Viewer or other output tool to review these messages. |
Configure the CSV Direct tool
Select the CSV Direct tool, and then go to the Configuration tab on the Properties pane.
Enter CSV-formatted text directly into the edit box. You are limited to 32k of data; for larger volumes you should use an input file or repository data object.
Configure properties as for a CSV Input tool.
CSV Input
The CSV Input tool reads a delimited text file (including gzip-compressed CSV files with .gzip and .gz extensions), and outputs records from the file to the downstream tools. You can configure the number of fields, and the size and type of each field. To automate field definition, you can specify a layout file or schema containing field names, types, and sizes. The CSV Input tool supports any delimiter (tabs or semicolons, for example), not just commas. In addition to specifying the fields and the delimiter, you can set options to accommodate common variations in this format. Some examples are:
Field names on the first line of the file.
"Short" records that truncate trailing fields.
Missing fields in some records.
Newline characters in some fields.
Text encoding other than ASCII or Latin-1.
Files that do not exactly match the expected schema.
You should always check the Report warnings option on the Execution tab of the CSV Input tool so that you will be advised of short records, truncated fields, and other problems common to the CSV format.
CSV Input tool configuration parameters
The CSV Input tool has three sets of configuration parameters in addition to the standard execution options:
Configuration
Compression type | Compression strategy to use for input file. Options are:
If not otherwise specified, Data Management may attempt to autodetect the compression format upon Analyze. |
Input file | The file containing the records, or a wildcard pattern matching multiple files. |
Use repository schema | If selected, configure the field layout using the specified Schema instead of configuring fields directly. |
Schema | If Use repository schema is selected, a schema must be specified. |
Code page | Name of the encoding of the source data. This is optional and defaults to Latin-1. See About code pages. |
Separator | The field delimiter. May be specified by String or Code. This is optional and defaults to a comma. |
Quote characters | If defined, one or more characters that quote or escape the data. If no quote character is specified, there is no escapement. If one of {[ or < are specified, the corresponding closing bracket is treated as the closing quote. |
Allow newlines in fields | If selected, files with embedded newlines in the fields are accepted as input and any newlines encountered are ignored. |
Skip lines | If defined, skip leading lines in the file, such as field names contained in a first line. This is optional and defaults to 0. |
Trim | If selected, leading/trailing whitespace is removed from input data. |
Flexibility
Enable flexibility | If selected, allows the tool to read files that do not exactly match the schema, provided the file has a header row. |
New field action | If Flexibility is enabled, specifies how fields not present in the schema are handled. Choose Ignore if the file is allowed to have new fields that were not defined in the schema, and to ignore them if found. Choose Error if the input file is not allowed to have fields not defined in the schema. |
Missing | If Flexibility is enabled, indicates the action to take for each field when that field is not present in the CSV file header. Options are:
|
Aliases | If Flexibility is enabled, a string containing a comma-separated list of allowed alternate field names for field. |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, express path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, name of the column to be used for the file name. This is optional and defaults to |
Field size | If Produce file name field is selected, size of the field to be used for the file name. This is optional and defaults to |
Options
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, express path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, name of the column to be used for the file name. This is optional and defaults to |
Field size | If Produce file name field is selected, size of the field to be used for the file name. This is optional and defaults to |
Error handling | Configures how errors in the input data will be handled:
|
Parsing errors | Specify how to handle input containing invalid XML that cannot be processed. The Issue warning option is not available for this type of error. |
Conversion errors | Specify how to handle input data that cannot be correctly converted to its specified data type (such as converting "Jane Smith" to a Date). |
Truncation errors | Specify how to handle input with truncated field values (such as storing "John Smithson" in a Unicode(1) field). |
Missing fields | Specify how to handle input with fields that are missing or contain empty values. |
Extra fields | Specify how to handle input with extra elements that are not defined in the tool's configuration. |
Send errors to Message Viewer | If selected, all error types configured to either Issue warning or Reject record will send warnings to the Message Viewer. Note that the Report warnings option must be selected on the Execution tab. If selected, all error types configured to either Issue warning or Reject record will send warnings to the Message Viewer, even if the Report warnings option on the Execution tab is not selected. |
Send errors to Error Report connector | If selected, creates an E "Error Report" output connector. All error types configured to either Issue warning or Reject record will send detailed error messages to that connector. Attach a downstream Data Viewer or other output tool to review these messages. |
Configure the CSV Input tool
To configure a CSV Input tool, you need the following information about the input file:
The input file name
The field delimiter character
The quote character(s), if any
The list of fields and their types
Data Management provides several ways to specify this information:
If you have... | Do this: |
---|---|
A schema for this file already defined in the repository. | Configure the CSV Input tool using a repository schema. |
A layout file. | Configure the CSV Input tool using a layout file. |
A CSV file with a field-name header. | Configure the CSV Input tool using analysis. |
None of the above. | Configure the CSV Input manually. |
You can use wildcards to configure a single CSV Input tool to read a sequence of files with the same layout and format.
...using a repository schema
To use this method of configuring a CSV Input tool, you must have a CSV schema for the input file already defined in the repository.
To configure the CSV Input tool using a repository schema:
Select the CSV Input tool, and then go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Select Use repository schema and select the schema from the drop-down list.
Examine the field grid and the Sample records grid to verify that the format is correct and the sample data is correctly aligned.
Optionally, select the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...using a layout file
To use this method of configuring a CSV Input tool, you must have a multi-column layout file for the input file.
To configure the CSV Input tool using a layout file:
Select the CSV Input tool, and then go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Specify the field Separator.
Select and then select Layout and choose Load (text), Load (text with types), or Load (XML).
Browse to the layout file that describes the input file, and then select Open.
Examine the fields grid and Sample records grid to verify that the format is correct and the sample data is correctly aligned. If the fields and sample data appear incorrect, choose or create a different layout file appropriate for the input file, or use another method to configure the CSV Input tool.
Optionally, select the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...using analysis
Use this method to configure a CSV Input tool if the input file has a header line, but no appropriate schema or layout file.
If the input file is changed after you configure the CSV Input tool using analysis, you must reanalyze the file.
To configure the CSV Input tool using analysis:
Select the CSV Input tool, and then go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Select Analyze and analyze the file.
Examine the fields. If the field names are of the form
FIELD1
,FIELD2
,FIELD3
but your CSV file has a header, edit the CSV file header to remove spaces and special characters from the field names and then analyze it again. If you cannot get the Analyzer to scan correctly, contact support or enter the field names manually and set Skip Lines to 1.Review the field types in the field grid. If a type is incorrect, select the cell in the Type column, and specify the field's data type and size. You can also use the Change selected types and Change all types commands on the Schemamenu to edit field types. You don't need to set the data type unless you will perform calculations or comparisons using the field, or intend to export the information to a strongly-typed database.
Optionally, select the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...manually
Use this method to configure a CSV Input tool if the other available configuration options do not correctly describe your input data.
To configure the CSV Input tool manually:
Select the CSV Input tool, and then go to the Configuration tab on the Properties pane.
Specify the Compression type.
Specify the Input file.
Analyze the file:
If your file has standard encoding (ASCII or Latin-1), select Analyze and analyze the file. Let the analyzer run as long as you can; the longer it runs, the better the results will be.
If your file contains an encoding other than ASCII or Latin-1, select the correct code page from the Code page list box and then select Analyze and analyze the file.
If you don't know the encoding, select Analyze and analyze the file. Data Management will "guess" the code page for you. If the code page is correctly specified, the sample data will display correctly after analysis.
Examine the fields. They may be of the form
FIELD1
,FIELD2
,FIELD3
. For each field listed:Select the cell in the Field name column, and type the desired name.
Select the cell in the Type column, and specify the field's data type and size, if different than the name assigned by default. Available options depend on the type selected. (See Configuring data types.) If a type is incorrect, select the cell in the Type column, and specify the field's data type and size. You can also use the Change selected types and Change all types commands on the Schemamenu to edit field types. You don't need to set the data type unless you will perform calculations or comparisons using the field, or intend to export the information to a strongly-typed database.
After you make changes, select Refresh sample to see the effect on the sample records displayed.
You can configure multiple fields at once. Hold the Shift key and select the first and last fields to select a range of fields. On each line change the field name to the correct name.
Optionally, go to the Flexibility tab and specify how missing and re-ordered fields are handled.
Optionally, go to the Options tab and configure advanced options:
If you don't want to process the entire file, select Limit records and type the desired number of records to process.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification.
Configure error handling.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...for flexibility
The Flexibility option on the CSV Input tool allows the tool to read files that do not exactly match the schema, provided that the file has a header row.
To configure the CSV Input tool for flexibility:
Go to the Flexibility tab on the Properties pane, and then select Enable flexibility.
Select a New field action to specify how fields not present in the schema are handled.
Choose Ignore if the file is allowed to have new fields that are undefined in the schema, and to ignore them if found.
Choose Error if the input file is not allowed to have fields undefined in the schema.
For each Field displayed in the grid, you may specify the action to take when that field is Missing from the CSV file header.
Choose Null value to set the field value to null for all records.
Choose Default value to sets the field value to the default value for all records.
Choose Fail to issue an error and stop.
For each Field displayed in the grid, you may specify one or more Aliases as a comma-separated list of allowed alternate field names for field.
CSV Output
The CSV Output tool writes data into CSV format. You can specify output file options for maximum portability across applications and operating systems. CSV output does not encode field types. Use this tool when you want to create a portable data file, and the system or application that will read the file doesn't accept a more strongly-typed file format (such as DBF). You may optionally output the data as a gzip-compressed file with a .gz or .gzip extension.
CSV Output tool configuration parameters
The CSV Output tool has one set of configuration parameters in addition to the standard execution options:
Compression type | Optional compression format used for output file. |
Output file | The output file name. |
Open file time | Specifies when the output file will be opened:
|
Write empty file if no records are read | If selected, writes an output file even when no records are read. This is unavailable if Open file time is When project is started. |
Layout | If selected, saves the output as the specified layout file or repository schema. |
Code page | Name of the encoding of the source data. This is optional and defaults to Latin-1. See About code pages. |
Output BOM (preamble) | If selected, a Byte Order Marker will be output for code pages starting with UTF. |
Separator | The field delimiter. Can be specified by String or Code. This is optional and defaults to a comma. |
Quote character | The character used to enclose strings that would otherwise be ambiguous. This is optional and defaults to a double-quote character. |
Quote condition | The conditions under which the quote character is applied. Must be one of:
|
Quote missing fields | If selected, Missing fields have quotes applied according to the specified Quote condition. |
Write a field name header | Field names will be output as the first line of the CSV file. This is optional and enabled by default. |
End of line | The character sequence identifying the line termination. Must be one of:
|
Compression type | If gzip is selected, writes output as a gzip-compressed file. |
Split files | If selected, splits the output file by Size, Record count, or Data. |
Split size (MB) | If Split files by size is selected, specifies the maximum size of the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1024 MB (1Gb). |
Split count | If Split files by record count is selected, specifies the maximum number of records in the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1,000,000. |
Split field | If Split files by data is selected, name of the field to be used to split the data. A separate file will be created for each unique value of the specified field. Data must be grouped by the split field. |
Suppress split field | If selected, the Split field is omitted from files created using Split files by data. |
Treat output file as folder | If selected, the Split field value is used as file name for files created using Split files by data. |
Write header on split files | If selected, writes a field name header to each file created using Split files by size, Split size by record count, or Split files by data. |
Append to existing file | If selected, appends the new output to existing output file rather than overwriting it. If Write a field name header is also selected, the tool writes a field name header only when the output file does not already exist or is empty. The tool does not check to see if the field name header in an existing file is compatible with the fields being written. Note that the Append option is ignored if you specify Split files options after selecting Append to existing file. Append operations are not supported by Amazon S3 and WASB! Do not enable this option if you are writing data to either of these systems. |
Replication factor | Number of copies of each block that will be stored (on different nodes) in the distributed file system. The default is 1. |
Block size (MB) | The minimum size of a file division. The default is 128 MB. |
Configure the CSV Output tool
Select the CSV Output tool, and then go to the Configuration tab on the Properties pane.
Optionally, specify the Compression type.
Specify the Output file.
Optionally, generate a layout file or repository schema: select Layout and choose Save layout (Text), Save layout (Text with type), Save layout (XML), or Save as repository schema.
Optionally, specify Open file time:
Default | Use the site/execution server setting. If you select this, you can optionally select Write empty file if no records are read. A warning will be issued if the tool setting conflicts with the site/execution server setting. |
When project is started | Open output file when the project is run. |
When the first record is read | Open output file when the first record is read. If you select this, you can optionally select Write empty file if no records are read. |
After the last record is read | Output records are cached and not written to the output file until the tool receives the final record. If you select this, you can optionally select Write empty file if no records are read. |
Optionally, select Write empty file if no records are read. This is unavailable if Open file time is When project is started.
Optionally, select Append to existing file to append the new output to the existing output file rather than overwriting it. (Note that append is not supported by Amazon S3 and WASB. Do not enable this option if you are writing data to either of these systems.)
Specify Options:
If the upstream connection contains Unicode columns, and you need to represent all characters in the output file, specify a Code page other than the default Latin-1 (ISO 8859-1). The code page you choose will depend on the intended consumer of the output.
Optionally, select Output BOM to write a Byte Order Marker at the start of the output. This is a Unicode character used to signal the endianness (byte order) of a text file or stream.
Specify the field Separator, if other than the default comma.
Specify the Quote character, if other than the default double-quote character.
Specify the Quote condition. Choices are:
Automatic (as needed)
Always (text only)
Always (everything)
Never
Select Quote Missing fields to enclose blank fields in the quote character.
Select Write a field name header to write field names to the first line of the CSV file.
Select the End of line sequence to match the operating system that will read the resulting file. Choices are:
Cr Lf (Windows
Lf (UNIX)
Cr (Mac)
Select Compression type and select gzip to write the output as a gzip-compressed file. You cannot use the Append to existing file option if you enable compression.
Optionally, you can split the output file into smaller, more manageable pieces. Check the Split files box, and then select By size or By data.
If you select Split files by size, select Split size and then specify the maximum file size (in megabytes). The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by record count, specify Split count as maximum number of records. The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field. The resulting output files will have the name you specified, augmented by inserting the value of the specified field before the extension. For example, splitting output by ZIP Code produces file names of the form
output_file01234.csv
.
To generate file names where the entire name is determined by the value of the specified field, select Treat output file as folder and specify the output directory in the Output file box, using the form:
F:\output_directory
.If you do not want the specified field to appear in the output, select Suppress split field on output.
If you select Split files as well as Write field name header, you can optionally select Write header on split files. If you are using Data Management to read split files back in, check both Write field name header line and Write header on split files, because Data Management reads the header line of each CSV file independently. However, if you intend to concatenate the output files before using them as input to another application, do not check Write header on split files.
Optionally, go to the Execution tab and Enable trigger output, configure reporting options, or set Web service options.
The Split into several files option is useful when you want to write a file to media (portable disk drive, CD-ROM, DVD) for physical transport to someone else and the raw file is too large for a single disk. Set the maximum file size to equal the capacity of the target media.
You can configure a single CSV Input tool to read a sequence of files with the same layout and format. This is an easy way to reassemble a file that's been split as described above. Since Data Management always reads wildcard input files in alphabetical file name order, they are guaranteed to be reassembled in order.
Consider using the Extract Layout tool to create "layout" files describing the layout of the CSV file. These layout files are directly readable by Data Management using the Import Layout feature, and can also assist others in reading the files.