Kafka tools
Overview
Apache Kafka is a distributed streaming platform used for moving real-time streaming data between systems or applications, and building real-time streaming applications that transform or react to the streams of data. Kafka is run as a cluster on one or more servers that can span multiple data centers. The Kafka server cluster stores streams of data records in categories called topics. Each data record consists of a key, a value, and a timestamp.
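To make the record model concrete, here is a minimal sketch using the standard Apache Kafka Java client (not the Data Management tools): it publishes one key/value record and prints the partition, offset, and timestamp the broker assigns. The broker address, topic name, and class name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

/** Sketch of Kafka's record model with the plain Java client: each record
 *  published to a topic is a key/value pair, and the broker stamps it with a
 *  timestamp and an offset. Broker address and topic name are placeholders. */
public class RecordModelSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("example-topic", "customer-42", "{\"status\":\"active\"}");
            RecordMetadata meta = producer.send(record).get();   // wait for the broker's acknowledgment
            System.out.printf("stored at partition %d, offset %d, timestamp %d%n",
                    meta.partition(), meta.offset(), meta.timestamp());
        }
    }
}
```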
Data Management's Kafka Input and Output tools let you move data records into and out of Kafka topics. These tools support a subset of Kafka functionality:
Support for auto-commit events only; no manual commits are allowed.
Support for serialization/deserialization of String, Integer, Long, Short, Bytes, Float, and Double data types.
No support for externally-managed offsets.
No support for transactions.
No support for manual partitioning.
No support for topics with heterogeneous Avro schemas.
Avro payloads with embedded schemas cannot be consumed by the Kafka Input tool because the Avro Input tool cannot parse an embedded schema from field input.
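For context, the auto-commit and data-type points in the list above correspond to ordinary Kafka consumer settings. The sketch below (standard Java client, not the Data Management tools, which manage this internally) shows what they look like at the client level; the broker address, group ID, topic, and class name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch of what "auto-commit only" and the supported data types mean in
 *  plain Kafka consumer configuration. Broker, group, and topic are placeholders. */
public class SupportedFeaturesSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dm-example");
        // Auto-commit only: offsets are committed automatically in the background.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        // The supported data types correspond to Kafka's stock (de)serializers:
        // String, Integer, Long, Short, Bytes, Float, Double.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.DoubleDeserializer");

        try (KafkaConsumer<String, Double> consumer = new KafkaConsumer<>(props)) {
            // Topic subscription is by name only; no manual partition assignment.
            consumer.subscribe(java.util.List.of("example-topic"));
        }
    }
}
```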
Kafka tool shared settings
Data Management's Kafka tools use shared settings, which allows you to define a single set of configuration properties (typically access credentials) to share across multiple tools in your Data Management Site. You can override these settings on a per-tool basis by opening the Shared settings section on the tool's Properties pane, selecting Override shared settings, and specifying values for that specific tool.
To define Kafka tool shared settings
Open the Tools folder under Settings in the repository.
Select the Kafka tab.
Go to the Properties pane.
Configure the tool properties for your environment:
Property | Description |
---|---|
Bootstrap server | A comma-separated list of one or more host-port pairs that are the addresses of the Kafka brokers in the "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself. These may be of the form `host1:port1,host2:port2`. |
Advanced settings | While Data Management only exposes a small subset of configuration options on the Kafka Input and Output tool Property pages, you can optionally define other options by specifying name/value pairs, one per line, in the format `name=value`. |
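As a rough illustration, here is how a bootstrap server list and a few Advanced settings name/value pairs would translate into standard Kafka client configuration. The host names, the example settings, and the helper class are hypothetical; the point is that the names are ordinary Kafka client property names.

```java
import java.io.StringReader;
import java.util.Properties;

/** Sketch: how a bootstrap-server list and "Advanced settings" name/value
 *  pairs map onto ordinary Kafka client properties. Host names and the
 *  example settings are illustrative, not part of the product. */
public class SharedSettingsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Comma-separated host:port pairs for the bootstrap cluster.
        props.put("bootstrap.servers", "broker1.example.com:9092,broker2.example.com:9092");

        // Advanced settings are plain name=value pairs, one per line,
        // using the Kafka client's own configuration names.
        String advanced = String.join("\n",
                "security.protocol=SASL_SSL",
                "sasl.mechanism=PLAIN",
                "request.timeout.ms=30000");
        props.load(new StringReader(advanced));   // merge them into the same Properties

        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```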
To configure default shared tool settings from a Kafka Input or Output tool's Properties pane, open the Shared settings section, and select Edit default settings.
To override Kafka tool shared settings
Select the desired Kafka tool.
Go to the Configuration tab on the Properties pane.
Open the Shared settings section, select Override shared settings, and specify new values for the tool.
Kafka Input
The Kafka Input tool reads events from one or more Kafka topics and outputs them to a single M (Message/Events) connector. These events have the following format.
Field | Description |
---|---|
Key | The partition key of this event, represented as a TextVar. |
Offset | The offset of this event, a sequential ID number that uniquely identifies the record within the partition. |
Value | The event value, represented as a TextVar. |
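For reference, these three fields correspond to the accessors on a consumed Kafka record. A minimal sketch with the standard Java client follows; the broker, group, and topic names are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch: the Key, Offset, and Value fields emitted by the Kafka Input tool
 *  correspond to these accessors on a consumed record. */
public class InputFieldsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dm-example");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(10))) {
                String key = rec.key();      // -> Key field (partition key)
                long offset = rec.offset();  // -> Offset field (unique within the partition)
                String value = rec.value();  // -> Value field (event payload)
                System.out.printf("key=%s offset=%d value=%s%n", key, offset, value);
            }
        }
    }
}
```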
Kafka Input tool configuration parameters
The Kafka Input tool has a single set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Topics | Comma-separated list of topics that events should be read from. |
Batch size | Specifies how many events the consumer will buffer before committing offsets to the topic. Adjust this property to balance performance and robustness. Smaller batch sizes may reduce throughput, while large batch sizes may reduce the project’s durability in the event of an abnormal exit. Note that a very large batch size may exceed available memory. |
Group ID | The name of the consumer group to which this input tool consumer belongs. |
Auto offset | Defines the behavior when an existing offset cannot be found: Latest (begin reading from the topic's most recent offset), Earliest (begin reading from the topic's oldest offset), or None (fail with an error if no existing offset can be found). |
Poll timeout | The duration (in seconds) the consumer should wait for records to become available on the subscribed topics. |
Key deserializer | Key field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Value deserializer | Value field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Event limit | The number of events the tool will read before exiting. |
Time limit | The amount of time (in seconds) the tool will read events before exiting. |
Override shared settings | If selected, uses the Bootstrap server and Advanced settings defined in the tool properties rather than the Kafka tool shared settings defined in the repository. |
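The sketch below shows rough equivalents of these parameters in plain Kafka consumer configuration, using the standard Java client. It is illustrative only; the topic names, group ID, and values are placeholders, not product defaults.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch: rough equivalents of the Kafka Input parameters in plain consumer
 *  configuration. All values shown are placeholders. */
public class InputParametersSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dm-consumers");          // Group ID
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");     // Auto offset: latest | earliest | none
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,             // Key deserializer
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,           // Value deserializer
                "org.apache.kafka.common.serialization.LongDeserializer");

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("orders", "returns"));         // Topics (comma-separated in the tool)
            ConsumerRecords<String, Long> batch =
                    consumer.poll(Duration.ofSeconds(30));                  // Poll timeout
            System.out.println("fetched " + batch.count() + " events");
        }
    }
}
```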
Configure the Kafka Input tool
Before configuring a Kafka tool, you should have a Kafka connection defined in shared settings.
Select the Kafka Input tool.
Select the Configuration tab.
Specify the comma-separated list of Topics that events will be read from.
Optionally, edit Batch size.
Specify Group ID as the name of the consumer group to which this input tool consumer belongs.
Optionally, select Auto offset and choose the tool's behavior when an existing offset can't be found:
Latest: begin reading from the topic's most recent offset.
Earliest: begin reading from the topic's oldest offset.
None: fail with an error if no existing offset can be found.
Optionally, edit Poll timeout (the time in seconds the consumer waits for records to become available on the subscribed topics).
Select Key deserializer and Value deserializer data types.
Optionally, edit Event limit (number of events the tool will read before exiting) and Time limit (amount of time in seconds the tool will read events before exiting).
Optionally, override shared settings.
Optionally, go to the Execution tab, and then set Web service options.
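To clarify what the Event limit and Time limit settings amount to, here is a sketch of a plain consumer loop (standard Java client) that stops after a fixed number of events or a fixed amount of time. The limits, broker, group, and topic names are placeholders; this is not how the tool is implemented internally.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch: what Event limit and Time limit amount to in a plain consumer loop. */
public class ReadLimitsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dm-example");
        props.put("enable.auto.commit", "true");   // the tools use auto-commit only
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long eventLimit = 10_000;                              // Event limit
        long deadline = System.currentTimeMillis() + 60_000;   // Time limit: 60 seconds
        long read = 0;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (read < eventLimit && System.currentTimeMillis() < deadline) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5))) {
                    read++;                                    // process rec.key() / rec.value() here
                    if (read >= eventLimit) break;
                }
            }
        }
        System.out.println("read " + read + " events before exiting");
    }
}
```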
Kafka Output
The Kafka Output tool writes events to a Kafka topic. The tool expects two fields on its single input connector:
A key field, designated by the Key field parameter.
An event field, designated by the Event field parameter.
The tool has two output connectors: Success (S) and Failure (F).
The Success output connector emits records that were successfully sent to the Kafka topic.
Field | Type | Description |
---|---|---|
Key | Specified by the Key serializer property. | Event key |
Value | Specified by the Value serializer property. | Event value |
Offset | Integer(8) | Event’s topic offset |
The Failure output connector emits records that could not be sent to the Kafka topic.
Field | Type | Description |
---|---|---|
Key | Specified by the Key serializer property | Event key |
Value | Specified by the Value serializer property | Event value |
Exception | TextVar | Description of the error that prevented the record from being sent |
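The Success and Failure connectors mirror the two outcomes of a Kafka send. The sketch below, using the standard Java client, shows both paths: a successful send returns the record's topic offset, while a failed send surfaces an exception describing the error. The broker, topic, and key/value contents are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: the two outcomes of a Kafka send, corresponding to the Success
 *  and Failure connectors. All names and values are placeholders. */
public class SuccessFailureSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> rec =
                    new ProducerRecord<>("orders", "customer-42", "{\"qty\":3}");
            producer.send(rec, (metadata, exception) -> {
                if (exception == null) {
                    // Success path: key, value, and the assigned topic offset.
                    System.out.printf("sent key=%s offset=%d%n", rec.key(), metadata.offset());
                } else {
                    // Failure path: key, value, and a description of the error.
                    System.out.printf("failed key=%s error=%s%n", rec.key(), exception.getMessage());
                }
            });
        }   // close() flushes pending sends
    }
}
```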
Kafka Output tool configuration parameters
The Kafka Output tool has one set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Topics | Comma-separated list of topics that events should be written to. |
Batch size | Specifies how many events the producer will buffer before sending them to the topic. Adjust this property to balance performance and robustness. Smaller batch sizes may reduce throughput, while large batch sizes may reduce the project’s durability in the event of an abnormal exit. Note that a very large batch size may exceed available memory. |
Key field | The partition key for the event. If present, Kafka will hash the key and use it to assign the event to a partition. If absent, Kafka will distribute events across all partitions. |
Key serializer | Key field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Value field | The field from which the event will be read. |
Value serializer | Value field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Override shared settings | If selected, uses the Bootstrap server and Advanced settings defined in the tool properties rather than the Kafka tool shared settings defined in the repository. |
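The sketch below illustrates the Key field and serializer behavior at the client level with the standard Java client: a keyed record is hashed to a partition, while a record with a null key is distributed across partitions. The topic, key, and values are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: how the Key field and serializer choices behave at the client level. */
public class KeyPartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",                      // Key serializer: String
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",                    // Value serializer: Double
                "org.apache.kafka.common.serialization.DoubleSerializer");

        try (KafkaProducer<String, Double> producer = new KafkaProducer<>(props)) {
            // Keyed record: all records with key "customer-42" land on the same partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", 19.99));
            // Unkeyed record (null key): Kafka distributes these across partitions.
            producer.send(new ProducerRecord<String, Double>("orders", null, 5.00));
        }
    }
}
```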
Configure the Kafka Output tool
Before configuring a Kafka tool, you should have a Kafka connection defined in shared settings.
Select the Kafka Output tool.
Go to the Configuration tab on the Properties pane.
Specify the comma-separated list of Topics that events will be written to.
Optionally, edit Batch size.
Optionally, specify the Key field, or leave it blank to distribute events across all partitions.
Specify the Key serializer data type: String, Integer, Long, Short, Bytes, Float, or Double.
Specify the Event field.
Specify the Event serializer data type: String, Integer, Long, Short, Bytes, Float, or Double.
Optionally, override shared settings.
Optionally, go to the Execution tab, and then set Web service options.