Kafka tools
Overview
Apache Kafka is a distributed streaming platform used for moving real-time streaming data between systems or applications, and building real-time streaming applications that transform or react to the streams of data. Kafka is run as a cluster on one or more servers that can span multiple data centers. The Kafka server cluster stores streams of data records in categories called topics. Each data record consists of a key, a value, and a timestamp.
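To make the record model concrete, here is a minimal sketch using the standard Apache Kafka Java client (not the Data Management tools): it publishes one key/value record and prints the partition, offset, and timestamp the broker assigns. The broker address, topic name, and class name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

/** Sketch of Kafka's record model with the plain Java client: each record
 *  published to a topic is a key/value pair, and the broker stamps it with a
 *  timestamp and an offset. Broker address and topic name are placeholders. */
public class RecordModelSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("example-topic", "customer-42", "{\"status\":\"active\"}");
            RecordMetadata meta = producer.send(record).get();   // wait for the broker's acknowledgment
            System.out.printf("stored at partition %d, offset %d, timestamp %d%n",
                    meta.partition(), meta.offset(), meta.timestamp());
        }
    }
}
```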
Data Management's Kafka Input and Output tools let you move data records into and out of Kafka topics. These tools support a subset of Kafka functionality:
Support for auto-commit events only; no manual commits are allowed.
Support for serialization/deserialization of String, Integer, Long, Short, Bytes, Float, and Double data types.
No support for externally-managed offsets.
No support for transactions.
No support for manual partitioning.
No support for topics with heterogeneous Avro schemas.
Avro payloads with embedded schemas cannot be consumed by the Kafka Input tool because the Avro Input tool cannot parse an embedded schema from field input.
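For context, the auto-commit and data-type points in the list above correspond to ordinary Kafka consumer settings. The sketch below (standard Java client, not the Data Management tools, which manage this internally) shows what they look like at the client level; the broker address, group ID, topic, and class name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch of what "auto-commit only" and the supported data types mean in
 *  plain Kafka consumer configuration. Broker, group, and topic are placeholders. */
public class SupportedFeaturesSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dm-example");
        // Auto-commit only: offsets are committed automatically in the background.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        // The supported data types correspond to Kafka's stock (de)serializers:
        // String, Integer, Long, Short, Bytes, Float, Double.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.DoubleDeserializer");

        try (KafkaConsumer<String, Double> consumer = new KafkaConsumer<>(props)) {
            // Topic subscription is by name only; no manual partition assignment.
            consumer.subscribe(java.util.List.of("example-topic"));
        }
    }
}
```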
Kafka tool shared settings
Data Management's Kafka tools use shared settings, which allows you to define a single set of configuration properties (typically access credentials) to share across multiple tools in your Data Management Site. You can override these settings on a per-tool basis by opening the Shared settings section on the tool's Properties pane, selecting Override shared settings, and specifying values for that specific tool.
To define Kafka tool shared settings
Open the Tools folder under Settings in the repository.
Select the Kafka tab.
Go to the Properties pane.
Configure the tool properties for your environment:
Property | Description |
---|---|
Bootstrap server | A comma-separated list of one or more host-port pairs that are the addresses of the Kafka brokers in the "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself. These may be of the form `host1:port1,host2:port2`. |
Advanced settings | While Data Management only exposes a small subset of configuration options on the Kafka Input and Output tool Property pages, you can optionally define other options by specifying name/value pairs, one per line, in the format `name=value`. |
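As a rough illustration, here is how a bootstrap server list and a few Advanced settings name/value pairs would translate into standard Kafka client configuration. The host names, the example settings, and the helper class are hypothetical; the point is that the names are ordinary Kafka client property names.

```java
import java.io.StringReader;
import java.util.Properties;

/** Sketch: how a bootstrap-server list and "Advanced settings" name/value
 *  pairs map onto ordinary Kafka client properties. Host names and the
 *  example settings are illustrative, not part of the product. */
public class SharedSettingsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Comma-separated host:port pairs for the bootstrap cluster.
        props.put("bootstrap.servers", "broker1.example.com:9092,broker2.example.com:9092");

        // Advanced settings are plain name=value pairs, one per line,
        // using the Kafka client's own configuration names.
        String advanced = String.join("\n",
                "security.protocol=SASL_SSL",
                "sasl.mechanism=PLAIN",
                "request.timeout.ms=30000");
        props.load(new StringReader(advanced));   // merge them into the same Properties

        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```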
To configure default shared tool settings from a Kafka Input or Output tool's Properties pane, open the Shared settings section, and select Edit default settings.
To override Kafka tool shared settings
Select the desired Kafka tool.
Go to the Configuration tab on the Properties pane.
Open the Shared settings section, select Override shared settings, and specify new values for the tool.
Kafka Input
The Kafka Input tool reads events from one or more Kafka topics and outputs them to a single M (Message/Events) connector. These events have the following format.
Field | Description |
---|---|
Key | The partition key of this event, represented as a TextVar. |
Offset | The offset of this event, a sequential ID number that uniquely identifies the record within the partition. |
Value | The event value, represented as a TextVar. |
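For reference, these three fields correspond to the accessors on a consumed Kafka record. A minimal sketch with the standard Java client follows; the broker, group, and topic names are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch: the Key, Offset, and Value fields emitted by the Kafka Input tool
 *  correspond to these accessors on a consumed record. */
public class InputFieldsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dm-example");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(10))) {
                String key = rec.key();      // -> Key field (partition key)
                long offset = rec.offset();  // -> Offset field (unique within the partition)
                String value = rec.value();  // -> Value field (event payload)
                System.out.printf("key=%s offset=%d value=%s%n", key, offset, value);
            }
        }
    }
}
```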
Kafka Input tool configuration parameters
The Kafka Input tool has a single set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Topics | Comma-separated list of topics that events should be read from. |
Batch size | Specifies how many events the consumer will buffer before committing offsets to the topic. Adjust this property to balance performance and robustness. Smaller batch sizes may reduce throughput, while large batch sizes may reduce the project’s durability in the event of an abnormal exit. Note that a very large batch size may exceed available memory. |
Group ID | The name of the consumer group to which this input tool consumer belongs. |
Auto offset | Defines the behavior when an existing offset cannot be found: Latest (begin reading from the topic's most recent offset), Earliest (begin reading from the topic's oldest offset), or None (fail with an error if no existing offset can be found). |
Poll timeout | The duration (in seconds) the consumer should wait for records to become available on the subscribed topics. |
Key deserializer | Key field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Value deserializer | Value field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Event limit | The number of events the tool will read before exiting. |
Time limit | The amount of time (in seconds) the tool will read events before exiting. |
Override shared settings | If selected, uses the Bootstrap server and Advanced settings defined in the tool properties rather than the Kafka tool shared settings defined in the repository. |
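The sketch below shows rough equivalents of these parameters in plain Kafka consumer configuration, using the standard Java client. It is illustrative only; the topic names, group ID, and values are placeholders, not product defaults.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch: rough equivalents of the Kafka Input parameters in plain consumer
 *  configuration. All values shown are placeholders. */
public class InputParametersSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dm-consumers");          // Group ID
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");     // Auto offset: latest | earliest | none
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,             // Key deserializer
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,           // Value deserializer
                "org.apache.kafka.common.serialization.LongDeserializer");

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("orders", "returns"));         // Topics (comma-separated in the tool)
            ConsumerRecords<String, Long> batch =
                    consumer.poll(Duration.ofSeconds(30));                  // Poll timeout
            System.out.println("fetched " + batch.count() + " events");
        }
    }
}
```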
Configure the Kafka Input tool
Before configuring a Kafka tool, you should have a Kafka connection defined in shared settings.
Select the Kafka Input tool.
Select the Configuration tab.
Specify the comma-separated list of Topics that events will be read from.
Optionally, edit Batch size.
Specify Group ID as the name of the consumer group to which this input tool consumer belongs.
Optionally, select Auto offset and choose the tool's behavior when an existing offset can't be found:
Latest: begin reading from the topic's most recent offset.
Earliest: begin reading from the topic's oldest offset.
None: fail with an error if no existing offset can be found.
Optionally, edit Poll timeout (the time in seconds the consumer waits for records to become available on the subscribed topics).
Select Key deserializer and Value deserializer data types.
Optionally, edit Event limit (number of events the tool will read before exiting) and Time limit (amount of time in seconds the tool will read events before exiting).
Optionally, override shared settings.
Optionally, go to the Execution tab, and then set Web service options.
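To clarify what the Event limit and Time limit settings amount to, here is a sketch of a plain consumer loop (standard Java client) that stops after a fixed number of events or a fixed amount of time. The limits, broker, group, and topic names are placeholders; this is not how the tool is implemented internally.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Sketch: what Event limit and Time limit amount to in a plain consumer loop. */
public class ReadLimitsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dm-example");
        props.put("enable.auto.commit", "true");   // the tools use auto-commit only
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long eventLimit = 10_000;                              // Event limit
        long deadline = System.currentTimeMillis() + 60_000;   // Time limit: 60 seconds
        long read = 0;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (read < eventLimit && System.currentTimeMillis() < deadline) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5))) {
                    read++;                                    // process rec.key() / rec.value() here
                    if (read >= eventLimit) break;
                }
            }
        }
        System.out.println("read " + read + " events before exiting");
    }
}
```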
Kafka Output
The Kafka Output tool writes events to a Kafka topic. The tool expects two fields on its single input connector:
A key field, designated by the Key field parameter.
An event field, designated by the Event field parameter.
The tool has two output connectors: Success (S) and Failure (F).
The Success output connector emits records that were successfully sent to the Kafka topic.
Field | Type | Description |
---|---|---|
Key | Specified by the Key serializer property. | Event key |
Value | Specified by the Value serializer property. | Event value |
Offset | Integer(8) | Event’s topic offset |
The Failure output connector emits records that could not be sent to the Kafka topic.
Field | Type | Description |
---|---|---|
Key | Specified by the Key serializer property | Event key |
Value | Specified by the Value serializer property | Event value |
Exception | TextVar | Description of the error that prevented the record from being sent |
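The Success and Failure connectors mirror the two outcomes of a Kafka send. The sketch below, using the standard Java client, shows both paths: a successful send returns the record's topic offset, while a failed send surfaces an exception describing the error. The broker, topic, and key/value contents are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: the two outcomes of a Kafka send, corresponding to the Success
 *  and Failure connectors. All names and values are placeholders. */
public class SuccessFailureSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> rec =
                    new ProducerRecord<>("orders", "customer-42", "{\"qty\":3}");
            producer.send(rec, (metadata, exception) -> {
                if (exception == null) {
                    // Success path: key, value, and the assigned topic offset.
                    System.out.printf("sent key=%s offset=%d%n", rec.key(), metadata.offset());
                } else {
                    // Failure path: key, value, and a description of the error.
                    System.out.printf("failed key=%s error=%s%n", rec.key(), exception.getMessage());
                }
            });
        }   // close() flushes pending sends
    }
}
```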
Kafka Output tool configuration parameters
The Kafka Output tool has one set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Topics | Comma-separated list of topics that events should be written to. |
Batch size | Specifies how many events the producer will buffer before sending them to the topic. Adjust this property to balance performance and robustness. Smaller batch sizes may reduce throughput, while large batch sizes may reduce the project’s durability in the event of an abnormal exit. Note that a very large batch size may exceed available memory. |
Key field | The partition key for the event. If present, Kafka will hash the key and use it to assign the event to a partition. If absent, Kafka will distribute events across all partitions. |
Key serializer | Key field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Value field | The field from which the event will be read. |
Value serializer | Value field data type. One of String, Integer, Long, Short, Bytes, Float, or Double. |
Override shared settings | If selected, uses the Bootstrap server and Advanced settings defined in the tool properties rather than the Kafka tool shared settings defined in the repository. |
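The sketch below illustrates the Key field and serializer behavior at the client level with the standard Java client: a keyed record is hashed to a partition, while a record with a null key is distributed across partitions. The topic, key, and values are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: how the Key field and serializer choices behave at the client level. */
public class KeyPartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",                      // Key serializer: String
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",                    // Value serializer: Double
                "org.apache.kafka.common.serialization.DoubleSerializer");

        try (KafkaProducer<String, Double> producer = new KafkaProducer<>(props)) {
            // Keyed record: all records with key "customer-42" land on the same partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", 19.99));
            // Unkeyed record (null key): Kafka distributes these across partitions.
            producer.send(new ProducerRecord<String, Double>("orders", null, 5.00));
        }
    }
}
```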
Configure the Kafka Output tool
Before configuring a Kafka tool, you should have a Kafka connection defined in shared settings.
Select the Kafka Output tool.
Go to the Configuration tab on the Properties pane.
Specify the comma-separated list of Topics that events will be written to.
Optionally, edit Batch size.
Optionally, specify the Key field, or leave it blank to distribute events across all partitions.
Specify the Key serializer data type: String, Integer, Long, Short, Bytes, Float, or Double.
Specify the Event field.
Specify the Event serializer data type: String, Integer, Long, Short, Bytes, Float, or Double.
Optionally, override shared settings.
Optionally, go to the Execution tab, and then set Web service options.