Avro tools
Overview
Avro is a row-oriented data serialization and data exchange framework developed within the Apache Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. It relies on schemas (defined in JSON format) to structure the encoded data. Avro schemas may be embedded in the corresponding data, specified as a schema file, or referenced through a Kafka Schema Registry URL.
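For orientation, here is a minimal round trip in Python using the third-party fastavro library (not part of Data Management; the schema, record values, and file name are illustrative). It shows the pattern the tools below automate: a JSON schema drives serialization, and an .avro container file embeds that schema alongside the compact binary data.

```python
# Minimal Avro round trip with the third-party fastavro library.
# Schema, records, and file name are illustrative.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# Write an Avro container file; the schema is embedded in the header.
with open("customers.avro", "wb") as out:
    writer(out, schema, records)

# Read it back; the embedded schema drives deserialization.
with open("customers.avro", "rb") as src:
    for rec in reader(src):
        print(rec)
```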
Avro Input
The Avro Input tool reads data stored in the Avro format.
Configuration parameters
The Avro Input tool has multiple sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Input from | The source of the data: File (read records from one or more Avro files) or Field (read serialized Avro bytes from a field). |
Input file | If Input from is File, the file containing the records, or a wildcard pattern matching multiple files. |
Input field | If Input from is Field, the field from which the data will be read. |
Schema source | Source of the data schema that defines the Avro input: Embedded in data, Specified, or Kafka schema registry. |
Kafka subject | If Schema source is Kafka schema registry, the Kafka subject name associated with the input. The configured Kafka subject will be used to determine the expected schema of the Avro packets at tool configuration time, and this is the schema that is presented to downstream tools. However, each Avro packet will contain its own subject identifier as well. If a packet's subject does not match the configured subject, Data Management will map the packet's fields onto the configured fields, discarding extra fields and setting missing fields to null. |
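Data Management resolves the schema ID in each packet internally. For reference, a common framing for such packets is the Confluent wire format: one magic byte (0) followed by a 4-byte big-endian schema ID, then the schemaless Avro payload. A minimal sketch of splitting that header, assuming this framing:

```python
import struct

def parse_confluent_header(payload: bytes):
    """Split a Confluent-framed Avro message into (schema_id, avro_bytes).

    Wire format: 1 magic byte (0x00) + 4-byte big-endian schema ID,
    followed by the schemaless Avro-encoded record.
    """
    if len(payload) < 5 or payload[0] != 0:
        raise ValueError("not a Confluent-framed Avro message")
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id, payload[5:]
```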
Options
Parameter | Description |
---|---|
Do not convert Timestamps to local | If selected, DateTime fields are not converted from UTC to the server-local timezone. |
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, expresses the path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, name of the column to be used for the file name. This is optional and defaults to FILENAME. |
Field size | If Produce file name field is selected, size of the field to be used for the file name. This is optional and defaults to 255. |
Avro schema
Parameter | Description |
---|---|
Avro schema | If Schema source is Specified, the schema describing the Avro input data. You can enter the schema directly, or browse for an *.avro or *.avsc file to insert the JSON representation of the referenced Avro schema. |
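An *.avsc file is plain JSON. A hypothetical example of the kind of schema you might enter directly or browse to (the record name and fields are illustrative):

```json
{
  "type": "record",
  "name": "Customer",
  "namespace": "example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "signup",
     "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}],
     "default": null}
  ]
}
```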
Configure the Avro Input tool
The procedure for configuring an Avro Input tool depends on the data source.
If the data source is... | Do this |
---|---|
A file or files | See ...to read files. |
A field or datastream | See ...to read fields. |
...to read files
To configure the Avro Input tool to read files:
Select the Avro Input tool.
Go to the Configuration tab.
Select Input from and choose File.
Specify the Input file or files. You can use wildcards to configure a single Avro Input tool to read a sequence of files with the same schema.
Select the Schema source that describes the input data.
If you have... | Do this |
---|---|
Schema header embedded in input data | Select Embedded in data. |
Serialized Avro bytes with no header | Select Specified, and then go to the Avro schema tab and either enter the schema directly or select and browse to the schema file (*.avsc, *.avro). |
Serialized Avro bytes with a Kafka header containing the schema ID | Select Kafka schema registry. The Kafka Schema Registry URL must be configured on the Resource tab in Site Settings. |
If Schema source is Kafka schema registry, specify the Kafka subject.
Once you specify a file and schema, Data Management displays field definitions and record layout on the Configuration tab. Select Analyze to refresh the display.
Optionally, select the Options tab and configure advanced options:
If Do not convert Timestamps to local is selected, DateTime fields will not be converted from UTC to the server-local timezone.
If you don't want to process the entire file, select Limit records and type the desired number of records to read.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification. This can be useful when reading a wildcarded set of files. Select Output URI path to express the complete file specification as a Uniform Resource Identifier.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
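Outside Data Management, the wildcard and Produce file name field behavior above can be sketched in Python with the third-party fastavro library (the paths and the FILENAME field name are illustrative):

```python
# Read a wildcarded set of Avro files and attach the file name to each
# record, analogous to "Produce file name field". Names are illustrative.
import glob
import os
from fastavro import reader

for path in sorted(glob.glob("input_*.avro")):
    with open(path, "rb") as src:
        for rec in reader(src):
            rec["FILENAME"] = os.path.basename(path)  # or the full path
            print(rec)
```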
...to read fields
To configure the Avro Input tool to read fields:
Select the Avro Input tool.
Go to the Configuration tab on the Properties pane.
Select Input from and choose Field.
Select Commit to display the input connector, and then connect the desired input tool.
Select the Input field from which the data will be read, and then define the input schema.
If you have... | Do this |
---|---|
Serialized Avro bytes with no header | Select Specified, and then go to the Avro schema tab and either enter the schema directly or select and browse to the schema file (*.avsc, *.avro). |
Serialized Avro bytes with a Kafka header containing the schema ID | Select Kafka schema registry. The Kafka Schema Registry URL must be configured on the Resource tab in Site Settings. |
If Schema source is Kafka schema registry, specify the Kafka subject.
Once you specify an input field and schema, Data Management displays field definitions and record layout on the Configuration tab. Select Analyze to refresh the display.
Optionally, select the Options tab and select Do not convert Timestamps to local.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
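For comparison with the Specified schema case above, reading headerless ("schemaless") Avro bytes requires supplying the same schema that produced them. A minimal sketch with the third-party fastavro library (the schema and record are illustrative):

```python
# Headerless Avro bytes carry no schema, so the reader must supply the
# same schema that produced them. Schema and record are illustrative.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [{"name": "id", "type": "long"}],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"id": 42})   # produce headerless bytes
payload = buf.getvalue()

decoded = schemaless_reader(io.BytesIO(payload), schema)
print(decoded)  # {'id': 42}
```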
Avro Output
The Avro Output tool writes data in the Avro format.
Configuration parameters
The Avro Output tool has a single set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Output to | The destination of the data: File or Field. |
Output file | If Output to is File, the file to which the records will be written. |
Output field | If Output to is Field, the field to which the serialized Avro data will be written. |
Open file time | If Output to is File, specifies when the output file will be opened: Default (uses the site/execution server setting), When project is started, When the first record is read, or After the last record is read. |
Include input fields | If Output to is Field, optionally pass through the "extra" input fields, making it easy to pack up an Avro payload in a field, while keeping an identifier field attached for downstream processing. |
Write empty file if no records are read | If selected, writes an output file even when no records are read. This is unavailable if Open file time is When project is started. |
Do not convert DateTime to UTC | If selected, Data Management DateTime fields are not converted from the server-local timezone to UTC before being written. Note that Data Management DateTime values are timezone free, meaningful only when interpreted in the context of an implied timezone. |
Schema output | Format for the data schema that describes the Avro output: Embedded in data, None, or Kafka schema registry. |
Kafka subject | If Schema output is Kafka schema registry, the Kafka subject name associated with the output. |
Save schema to .avsc | If configured, the Avro schema file (*.avsc) describing the output. |
Split files / Split records | If selected, splits the output By size, By record count, or By data. Split files applies when writing to a file; Split records applies when writing to a field. |
Split size (MB) | If Split files by size is selected, specifies the maximum size of the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1024 MB (1 GB). Note that the Avro file format has an internal block structure that may prevent file splits from occurring on files smaller than some minimal size. If Split records by size is selected, specifies Split size as the maximum payload size (in megabytes). |
Split count | If Split files by record count is selected, specifies the maximum number of records in the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1,000,000. If Split records by record count is selected, specifies Split count as maximum number of records in payload. |
Split field | If Split files by data is selected, name of the field to be used to split the data. A separate file or payload will be created for each unique value of the specified field. Data must be grouped by the split field. |
Suppress split field | If selected, the Split field is omitted from files created using Split files by data. |
Treat output file as folder | To generate file names where the entire name is determined by the value of the split field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. |
Compression codec | The compression codec used to compress blocks: None, Deflate, Snappy (the default), LZO, XZ, or BZip2. While the Avro input and output tools support multiple compression types, we recommend Snappy for most purposes, as it is fast and stable. |
Bytes per block | The number of bytes to write before flushing a block, with a range of 1000-10,000,000. Higher numbers tend to produce slightly better performance and compression ratios, at the expense of random-access performance for readers. Defaults to 100,000. |
Replication factor | Number of copies of each block that will be stored (on different nodes) in the distributed file system. The default is 1. |
Block size (MB) | The minimum size of a file division. The default is 128 MB. |
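As a point of comparison for the codec and block settings above, the third-party fastavro library exposes similar knobs when writing a container file; its sync_interval parameter plays a role similar to Bytes per block. This is fastavro's interface, not Data Management's, and the values below are illustrative:

```python
# Writing an Avro container file with an explicit codec and block size.
# Records are buffered and flushed as a compressed block once the buffer
# reaches roughly sync_interval bytes. Values are illustrative.
from fastavro import writer, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [{"name": "n", "type": "long"}],
})

with open("rows.avro", "wb") as out:
    writer(
        out,
        schema,
        ({"n": i} for i in range(1_000_000)),
        codec="snappy",         # requires the python-snappy package
        sync_interval=100_000,  # ~100 KB blocks, as in the default above
    )
```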
Configure the Avro Output tool
The procedure for configuring an Avro Output tool depends on the data target.
If the data target is... | Do this |
---|---|
A file or files | See ...to write files. |
A field or datastream | See ...to write fields. |
...to write files
To configure the Avro Output tool to write files:
Select the Avro Output tool.
Go to the Configuration tab on the Properties pane.
Select Output to and choose File, and then specify the output file.
Optionally, specify Open file time.
Option | Description |
---|---|
Default | Use the site/execution server setting. If you select this, you can optionally select Write empty file if no records are read. A warning will be issued if the tool setting conflicts with the site/execution server setting. |
When project is started | Open output file when the project is run. |
When the first record is read | Open output file when the first record is read. If you select this, you can optionally select Write empty file if no records are read. |
After the last record is read | Output records are cached and not written to the output file until the tool receives the final record. If you select this, you can optionally select Write empty file if no records are read. |
Optionally, select Do not convert DateTime to UTC to retain the server-local timezone format.
Select the Schema output to define the Avro output:
Embedded in data: output is in Avro file format, with schema header.
None: serialized Avro bytes with no header.
Kafka schema registry: output is serialized Avro bytes with a Kafka header containing a schema ID referencing a Kafka schema registry. If you select this Schema output, the Kafka Schema Registry URL must be configured on the Resource tab in Site Settings.
Optionally, select Save schema to .avsc to configure an Avro schema file (*.avsc) describing the output.
Optionally, you can check the Split files box, and then select By size, By record count, or By data to split the output file into smaller, more manageable pieces.
If you select Split files by size, specify Split size as the maximum file size (in megabytes). The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by record count, specify Split count as maximum number of records. The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field. The resulting output files will have the name you specified, augmented by inserting the value of the specified field before the extension. For example, splitting output by ZIP Code produces file names of the form output_file01234.avro (see the sketch following these options).
To generate file names where the entire name is determined by the value of the specified field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. If you do not want the specified field to appear in the output, select Suppress split field.
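A sketch of the split-by-data pattern, using Python's itertools.groupby and the third-party fastavro library (the schema, field values, and file names are illustrative):

```python
# One output file per unique value of a split field. As with the tool,
# the input must already be grouped (here, sorted) by that field.
# Schema, field values, and paths are illustrative.
import itertools
from fastavro import writer, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "zip", "type": "string"},
        {"name": "value", "type": "long"},
    ],
})

rows = [
    {"zip": "01234", "value": 1},
    {"zip": "01234", "value": 2},
    {"zip": "98765", "value": 3},
]

for zip_code, group in itertools.groupby(rows, key=lambda r: r["zip"]):
    with open(f"output_file{zip_code}.avro", "wb") as out:
        writer(out, schema, list(group))
```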
Optionally, select the Compression codec and adjust the Bytes per block.
None: passes through data uncompressed.
Deflate: writes the data block using the deflate algorithm, which results in smaller files but is slower than Snappy.
Snappy: (default) writes the data block using Google's Snappy compression library, which is faster than Deflate.
LZO, XZ, and BZip2: write the data block using those respective data compression protocols.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...to write fields
To configure the Avro Output tool to write fields:
Select the Avro Output tool.
Go to the Configuration tab on the Properties pane.
Select Output to and choose Field, and then enter an Output field name.
Optionally, select Do not convert DateTime to UTC to retain the server-local timezone format.
Select the Schema output to define the Avro output:
Embedded in data: output is in Avro file format, with schema header.
None: serialized Avro bytes with no header.
Kafka schema registry: output is serialized Avro bytes with a Kafka header containing a schema ID referencing a Kafka schema registry. If you select this Schema output, the Kafka Schema Registry URL must be configured on the Resource tab in Site Settings.
Optionally, select Save schema to .avsc to configure an Avro schema file (*.avsc) describing the output.
Optionally, you can select Split records, and then choose By size, By record count, or By data to split the output into smaller, more manageable packets to be sent to a web service or put in a queue.
If you select Split records by size, specify Split size as the maximum payload size (in megabytes).
If you select Split records by record count, specify Split count as maximum number of records.
If you select Split records by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field.
Optionally, select the Compression codec and adjust the Bytes per block.
None: passes through data uncompressed.
Deflate: writes the data block using the deflate algorithm, which results in smaller files but is slower than Snappy.
Snappy: (default) writes the data block using Google's Snappy compression library, which is faster than Deflate.
LZO, XZ, and BZip2: write the data block using those respective data compression protocols.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
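A sketch of the write-to-field pattern, keeping an identifier attached to each packed payload as Include input fields does, using the third-party fastavro library (the schema and field names are illustrative):

```python
# Pack each record into a headerless Avro payload carried in a field,
# keeping an identifier alongside it for downstream routing.
# Schema and field names are illustrative.
import io
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "body", "type": "string"},
    ],
})

events = [{"id": 1, "body": "first"}, {"id": 2, "body": "second"}]

packed = []
for event in events:
    buf = io.BytesIO()
    schemaless_writer(buf, schema, event)
    # Keep the id as a separate field next to the Avro payload.
    packed.append({"id": event["id"], "payload": buf.getvalue()})

for row in packed:
    print(row["id"], len(row["payload"]), "bytes")
```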