Avro tools
Overview
Avro is a row-oriented data serialization and data exchange framework developed within the Apache Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. It relies on schemas (defined in JSON format) to structure the encoded data. Avro schemas may be embedded in the corresponding data, specified as a schema file, or referenced through a Kafka Schema Registry URL.
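For orientation, here is a minimal round trip in Python using the third-party fastavro library (not part of Data Management; the schema, record values, and file name are illustrative). It shows the pattern the tools below automate: a JSON schema drives serialization, and an .avro container file embeds that schema alongside the compact binary data.

```python
# Minimal Avro round trip with the third-party fastavro library.
# Schema, records, and file name are illustrative.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# Write an Avro container file; the schema is embedded in the header.
with open("customers.avro", "wb") as out:
    writer(out, schema, records)

# Read it back; the embedded schema drives deserialization.
with open("customers.avro", "rb") as src:
    for rec in reader(src):
        print(rec)
```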
Avro Input
The Avro Input tool reads data stored in the Avro format.
Configuration parameters
The Avro Input tool has multiple sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Input from | The source of the data: File (read records from one or more Avro files) or Field (read serialized Avro bytes from a field). |
Input file | If Input from is File, the file containing the records, or a wildcard pattern matching multiple files. |
Input field | If Input from is Field, the field from which the data will be read. |
Schema source | Source of the data schema that defines the Avro input: Embedded in data, Specified, or Kafka schema registry. |
Kafka subject | If Schema source is Kafka schema registry, the Kafka subject name associated with the input. The configured Kafka subject will be used to determine the expected schema of the Avro packets at tool configuration time, and this is the schema that is presented to downstream tools. However, each Avro packet will contain its own subject identifier as well. If a packet's subject does not match the configured subject, Data Management will map the packet's fields onto the configured fields, discarding extra fields and setting missing fields to null. |
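Data Management resolves the schema ID in each packet internally. For reference, a common framing for such packets is the Confluent wire format: one magic byte (0) followed by a 4-byte big-endian schema ID, then the schemaless Avro payload. A minimal sketch of splitting that header, assuming this framing:

```python
import struct

def parse_confluent_header(payload: bytes):
    """Split a Confluent-framed Avro message into (schema_id, avro_bytes).

    Wire format: 1 magic byte (0x00) + 4-byte big-endian schema ID,
    followed by the schemaless Avro-encoded record.
    """
    if len(payload) < 5 or payload[0] != 0:
        raise ValueError("not a Confluent-framed Avro message")
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id, payload[5:]
```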
Options
Parameter | Description |
---|---|
Do not convert Timestamps to local | If selected, DateTime fields are not converted from UTC to the server-local timezone. |
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, expresses the path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, name of the column to be used for the file name. This is optional and defaults to FILENAME. |
Field size | If Produce file name field is selected, size of the field to be used for the file name. This is optional and defaults to 255. |
Avro schema
Parameter | Description |
---|---|
Avro schema | If Schema source is Specified, the schema describing the Avro input data. You can enter the schema directly, or browse for an *.avro or *.avsc file to insert the JSON representation of the referenced Avro schema. |
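An *.avsc file is plain JSON. A hypothetical example of the kind of schema you might enter directly or browse to (the record name and fields are illustrative):

```json
{
  "type": "record",
  "name": "Customer",
  "namespace": "example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "signup",
     "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}],
     "default": null}
  ]
}
```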
Configure the Avro Input tool
The procedure for configuring an Avro Input tool depends on the data source.
If the data source is... | Do this |
---|---|
A file or files | See ...to read files. |
A field or datastream | See ...to read fields. |
...to read files
To configure the Avro Input tool to read files:
Select the Avro Input tool.
Go to the Configuration tab.
Select Input from and choose File.
Specify the Input file or files. You can use wildcards to configure a single Avro Input tool to read a sequence of files with the same schema.
Select the Schema source that describes the input data.
If you have... | Do this |
---|---|
Schema header embedded in input data | Select Embedded in data. |
Serialized Avro bytes with no header | Select Specified, and then go to the Avro schema tab and either enter the schema directly or select and browse to the schema file (*.avsc, *.avro). |
Serialized Avro bytes with a Kafka header containing the schema ID | Select Kafka schema registry. The Kafka Schema Registry URL must be configured on the Resource tab in Site Settings. |
If Schema source is Kafka schema registry, specify the Kafka subject.
Once you specify a file and schema, Data Management displays field definitions and record layout on the Configuration tab. Select Analyze to refresh the display.
Optionally, select the Options tab and configure advanced options:
If Do not convert Timestamps to local is selected, DateTime fields will not be converted from UTC to the server-local timezone.
If you don't want to process the entire file, select Limit records and type the desired number of records to read.
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification. This can be useful when reading a wildcarded set of files. Select Output URI path to express the complete file specification as a Uniform Resource Identifier.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
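Outside Data Management, the wildcard and Produce file name field behavior above can be sketched in Python with the third-party fastavro library (the paths and the FILENAME field name are illustrative):

```python
# Read a wildcarded set of Avro files and attach the file name to each
# record, analogous to "Produce file name field". Names are illustrative.
import glob
import os
from fastavro import reader

for path in sorted(glob.glob("input_*.avro")):
    with open(path, "rb") as src:
        for rec in reader(src):
            rec["FILENAME"] = os.path.basename(path)  # or the full path
            print(rec)
```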
...to read fields
To configure the Avro Input tool to read fields:
Select the Avro Input tool.
Go to the Configuration tab on the Properties pane.
Select Input from and choose Field.
Select Commit to display the input connector, and then connect the desired input tool.
Select the Input field from which the data will be read, and then define the input schema.
If you have... | Do this |
---|---|
Serialized Avro bytes with no header | Select Specified, and then go to the Avro schema tab and either enter the schema directly or select and browse to the schema file (*.avsc, *.avro). |
Serialized Avro bytes with a Kafka header containing the schema ID | Select Kafka schema registry. The Kafka Schema Registry URL must be configured on the Resource tab in Site Settings. |
If Schema source is Kafka schema registry, specify the Kafka subject.
Once you specify an input field and schema, Data Management displays field definitions and record layout on the Configuration tab. Select Analyze to refresh the display.
Optionally, select the Options tab and select Do not convert Timestamps to local.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
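For comparison with the Specified schema case above, reading headerless ("schemaless") Avro bytes requires supplying the same schema that produced them. A minimal sketch with the third-party fastavro library (the schema and record are illustrative):

```python
# Headerless Avro bytes carry no schema, so the reader must supply the
# same schema that produced them. Schema and record are illustrative.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [{"name": "id", "type": "long"}],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"id": 42})   # produce headerless bytes
payload = buf.getvalue()

decoded = schemaless_reader(io.BytesIO(payload), schema)
print(decoded)  # {'id': 42}
```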
Avro Output
The Avro Output tool writes data in the Avro format.
Configuration parameters
The Avro Output tool has a single set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Output to | The destination of the data: File or Field. |
Output file | If Output to is File, the file to which the records will be written. |
Output field | If Output to is Field, the field to which the serialized Avro data will be written. |
Open file time | If Output to is File, specifies when the output file will be opened: Default (uses the site/execution server setting), When project is started, When the first record is read, or After the last record is read. |
Include input fields | If Output to is Field, optionally pass through the "extra" input fields, making it easy to pack up an Avro payload in a field, while keeping an identifier field attached for downstream processing. |
Write empty file if no records are read | If selected, writes an output file even when no records are read. This is unavailable if Open file time is When project is started. |
Do not convert DateTime to UTC | If selected, Data Management DateTime fields are not converted from the server-local timezone to UTC before being written. Note that Data Management DateTime values are timezone free, meaningful only when interpreted in the context of an implied timezone. |
Schema output | Format for the data schema that describes the Avro output: Embedded in data, None, or Kafka schema registry. |
Kafka subject | If Schema output is Kafka schema registry, the Kafka subject name associated with the output. |
Save schema to .avsc | If configured, the Avro schema file (*.avsc) describing the output. |
Split files / Split records | If selected, splits the output By size, By record count, or By data. Split files applies when writing to a file; Split records applies when writing to a field. |
Split size (MB) | If Split files by size is selected, specifies the maximum size of the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1024 MB (1 GB). Note that the Avro file format has an internal block structure that may prevent file splits from occurring on files smaller than some minimal size. If Split records by size is selected, specifies Split size as the maximum payload size (in megabytes). |
Split count | If Split files by record count is selected, specifies the maximum number of records in the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1,000,000. If Split records by record count is selected, specifies Split count as maximum number of records in payload. |
Split field | If Split files by data is selected, name of the field to be used to split the data. A separate file or payload will be created for each unique value of the specified field. Data must be grouped by the split field. |
Suppress split field | If selected, the Split field is omitted from files created using Split files by data. |
Treat output file as folder | To generate file names where the entire name is determined by the value of the split field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. |
Compression codec | The compression codec used to compress blocks: None, Deflate, Snappy (the default), LZO, XZ, or BZip2. While the Avro input and output tools support multiple compression types, we recommend Snappy for most purposes, as it is fast and stable. |
Bytes per block | The number of bytes to write before flushing a block, with a range of 1000-10,000,000. Higher numbers tend to produce slightly better performance and compression ratios, at the expense of random-access performance for readers. Defaults to 100,000. |
Replication factor | Number of copies of each block that will be stored (on different nodes) in the distributed file system. The default is 1. |
Block size (MB) | The minimum size of a file division. The default is 128 MB. |
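As a point of comparison for the codec and block settings above, the third-party fastavro library exposes similar knobs when writing a container file; its sync_interval parameter plays a role similar to Bytes per block. This is fastavro's interface, not Data Management's, and the values below are illustrative:

```python
# Writing an Avro container file with an explicit codec and block size.
# Records are buffered and flushed as a compressed block once the buffer
# reaches roughly sync_interval bytes. Values are illustrative.
from fastavro import writer, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [{"name": "n", "type": "long"}],
})

with open("rows.avro", "wb") as out:
    writer(
        out,
        schema,
        ({"n": i} for i in range(1_000_000)),
        codec="snappy",         # requires the python-snappy package
        sync_interval=100_000,  # ~100 KB blocks, as in the default above
    )
```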
Configure the Avro Output tool
The procedure for configuring an Avro Output tool depends on the data target.
If the data target is... | Do this |
---|---|
A file or files | See ...to write files. |
A field or datastream | See ...to write fields. |
...to write files
To configure the Avro Output tool to write files:
Select the Avro Output tool.
Go to the Configuration tab on the Properties pane.
Select Output to and choose File, and then specify the output file.
Optionally, specify Open file time.
Option | Description |
---|---|
Default | Use the site/execution server setting. If you select this, you can optionally select Write empty file if no records are read. A warning will be issued if the tool setting conflicts with the site/execution server setting. |
When project is started | Open output file when the project is run. |
When the first record is read | Open output file when the first record is read. If you select this, you can optionally select Write empty file if no records are read. |
After the last record is read | Output records are cached and not written to the output file until the tool receives the final record. If you select this, you can optionally select Write empty file if no records are read. |
Optionally, select Do not convert DateTime to UTC to retain the server-local timezone format.
Select the Schema output to define the Avro output:
Embedded in data: output is in Avro file format, with schema header.
None: serialized Avro bytes with no header.
Kafka schema registry: output is serialized Avro bytes with a Kafka header containing a schema ID referencing a Kafka schema registry. If you select this Schema output, the Kafka Schema Registry URL must be configured on the Resource tab in Site Settings.
Optionally, select Save schema to .avsc to configure an Avro schema file (*.avsc) describing the output.
Optionally, you can check the Split files box, and then select By size, By record count, or By data to split the output file into smaller, more manageable pieces.
If you select Split files by size, specify Split size as the maximum file size (in megabytes). The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by record count, specify Split count as maximum number of records. The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field. The resulting output files will have the name you specified, augmented by inserting the value of the specified field before the extension. For example, splitting output by ZIP Code produces file names of the form output_file01234.avro (see the sketch following these options).
To generate file names where the entire name is determined by the value of the specified field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. If you do not want the specified field to appear in the output, select Suppress split field.
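A sketch of the split-by-data pattern, using Python's itertools.groupby and the third-party fastavro library (the schema, field values, and file names are illustrative):

```python
# One output file per unique value of a split field. As with the tool,
# the input must already be grouped (here, sorted) by that field.
# Schema, field values, and paths are illustrative.
import itertools
from fastavro import writer, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "zip", "type": "string"},
        {"name": "value", "type": "long"},
    ],
})

rows = [
    {"zip": "01234", "value": 1},
    {"zip": "01234", "value": 2},
    {"zip": "98765", "value": 3},
]

for zip_code, group in itertools.groupby(rows, key=lambda r: r["zip"]):
    with open(f"output_file{zip_code}.avro", "wb") as out:
        writer(out, schema, list(group))
```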
Optionally, select the Compression codec and adjust the Bytes per block.
None: passes through data uncompressed.
Deflate: writes the data block using the deflate algorithm, which results in smaller files but is slower than Snappy.
Snappy: (default) writes the data block using Google's Snappy compression library, which is faster than Deflate.
LZO, XZ, and BZip2: write the data block using those respective data compression protocols.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
...to write fields
To configure the Avro Output tool to write fields:
Select the Avro Output tool.
Go to the Configuration tab on the Properties pane.
Select Output to and choose Field, and then enter an Output field name.
Optionally, select Do not convert DateTime to UTC to retain the server-local timezone format.
Select the Schema output to define the Avro output:
Embedded in data: output is in Avro file format, with schema header.
None: serialized Avro bytes with no header.
Kafka schema registry: output is serialized Avro bytes with a Kafka header containing a schema ID referencing a Kafka schema registry. If you select this Schema output, the Kafka Schema Registry URL must be configured on the Resource tab in Site Settings.
Optionally, select Save schema to .avsc to configure an Avro schema file (*.avsc) describing the output.
Optionally, you can select Split records, and then choose By size, By record count, or By data to split the output into smaller, more manageable packets to be sent to a web service or put in a queue.
If you select Split records by size, specify Split size as the maximum payload size (in megabytes).
If you select Split records by record count, specify Split count as maximum number of records.
If you select Split records by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field.
Optionally, select the Compression codec and adjust the Bytes per block.
None: passes through data uncompressed.
Deflate: writes the data block using the deflate algorithm, which results in smaller files but is slower than Snappy.
Snappy: (default) writes the data block using Google's Snappy compression library, which is faster than Deflate.
LZO, XZ, and BZip2: write the data block using those respective data compression protocols.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
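A sketch of the write-to-field pattern, keeping an identifier attached to each packed payload as Include input fields does, using the third-party fastavro library (the schema and field names are illustrative):

```python
# Pack each record into a headerless Avro payload carried in a field,
# keeping an identifier alongside it for downstream routing.
# Schema and field names are illustrative.
import io
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "body", "type": "string"},
    ],
})

events = [{"id": 1, "body": "first"}, {"id": 2, "body": "second"}]

packed = []
for event in events:
    buf = io.BytesIO()
    schemaless_writer(buf, schema, event)
    # Keep the id as a separate field next to the Avro payload.
    packed.append({"id": event["id"], "payload": buf.getvalue()})

for row in packed:
    print(row["id"], len(row["payload"]), "bytes")
```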