Kafka Event Stream Modeling
Last updated 19 October 2016
Apache Kafka on Heroku is an extremely powerful tool for creating modern application architectures, and for dealing with high-throughput event streams. Moving to a world of streaming event data, though, is not as simple as switching out the relational database that your ORM interacts with. To get the most out of streaming data, you must tune both your data and your Kafka configuration to support your product’s logic and needs.
Core Apache Kafka concepts
As covered in the Apache Kafka Dev Center article, there are a number of core concepts that are critical for understanding and tuning Apache Kafka on Heroku. The critical concepts for this discussion are topics and partitions.
Topics are the primary channel- or stream-like construct in Kafka, representing a type of event, much like a table would represent a type of record in a relational data store.
Topics are composed of some number of partitions. Each partition contains a discrete subset of the events (or messages, in Kafka parlance) belonging to a given topic.
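The topic/partition relationship can be pictured as a set of independent, append-only logs. The following is a toy model for illustration only (plain Python data structures, not the Kafka API): offsets are assigned per partition, and no ordering exists across partitions.

```python
from collections import defaultdict

# Toy model of one topic: each partition is an independent, append-only log.
# Offsets are per-partition; Kafka assigns no global order across partitions.
topic = defaultdict(list)  # partition number -> ordered list of messages

def append(partition, message):
    """Append a message to one partition and return its offset."""
    topic[partition].append(message)
    return len(topic[partition]) - 1

append(0, "signup:alice")  # offset 0 in partition 0
append(0, "login:alice")   # offset 1 in partition 0
append(1, "signup:bob")    # offset 0 in partition 1, unordered relative to partition 0
```

Note that "signup:bob" has offset 0 even though it was appended last: offsets only describe position within a single partition's log.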
Tuning the number and usage of these partitions is central to adapting Kafka to your product, and to balancing ordering, parallelism, and resilience concerns.
Considerations to balance
The following are the key properties to balance when evaluating the partition structure for use with a given topic.
Message ordering
Messages within a given partition are strictly ordered, but this ordering is not guaranteed across partitions.
Consumer group parallelism
A consumer group can have as many parallel consumers of a topic as there are partitions in the topic.
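This ceiling on parallelism can be sketched as a partition assignment. The round-robin scheme below is a simplified illustration in the spirit of Kafka's assignors, not the actual assignment protocol: each partition goes to exactly one consumer in the group, so any consumers beyond the partition count sit idle.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions across a consumer group.
    Each partition is owned by exactly one consumer, so consumers
    beyond the partition count receive nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions, six consumers: at most four consumers can do work.
assignment = assign_partitions([0, 1, 2, 3], ["c0", "c1", "c2", "c3", "c4", "c5"])
```

Here `c4` and `c5` receive empty assignments, which is why scaling a consumer group past the partition count buys no additional throughput.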
Failure recovery
A high partition count can increase resource utilization and the time to recover or re-elect leaders when brokers come back from failure.
Custom partition functions
Producers may choose arbitrary logic for sending messages to the partitions within a topic, using basic hashing for even distribution, or specific logic to maintain ordering and throughput semantics for a given product’s needs.
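Both styles can be sketched as plain functions from a message to a partition number. This is an illustration, not Kafka's implementation: Kafka's default partitioner hashes the key bytes with murmur2, while md5 is used here only to keep the sketch dependency-free, and the audit-routing rule is a hypothetical example of product-specific logic.

```python
import hashlib

def hashed_partition(key, num_partitions):
    """Even distribution by key hashing (stand-in for Kafka's default
    murmur2-based partitioner; md5 used here for a dependency-free sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def custom_partition(event_type, key, num_partitions):
    """Hypothetical product-specific logic: reserve partition 0 for audit
    events, and hash everything else over the remaining partitions."""
    if event_type == "audit":
        return 0
    return 1 + hashed_partition(key, num_partitions - 1)
```

Because the hash of a given key is deterministic, all messages with that key land on the same partition and therefore remain ordered relative to one another.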
These considerations, while not exhaustive, provide a strong basis for designing your topic's partition structure.
Modeling to support your product logic
If strict ordering of your events is not paramount but you require very high parallelism for throughput, choose a partition count high enough to support a scaled-out consumer group, but low enough not to impose undue burden on the cluster.
If strict ordering is important in your product’s logic, it is important to be clear about the domain under which that ordering matters. For instance, is ordering required globally, across all changes to all state? Or is ordering only required for changes related to a given user or account? Over what time period does ordering matter? It is often reasonable to build a compound key from the attributes that matter for ordering, and to consistently hash messages to partitions based on those keys. For instance, partitioning by the combination of user_id and session_id would provide strict ordering of events within a given user’s session, but would not maintain ordering across sessions or users.
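That compound-key approach can be sketched as follows. The field names `user_id` and `session_id` are illustrative, and md5 stands in for a production hash such as murmur2: the point is only that one deterministic function maps every event of a session to the same partition.

```python
import hashlib

def session_partition(user_id, session_id, num_partitions):
    """Compound-key partitioning sketch: every event for one
    (user_id, session_id) pair hashes to the same partition, so events
    within a session stay strictly ordered. Different sessions may land
    on different partitions and carry no ordering guarantee between them."""
    key = f"{user_id}:{session_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

A producer using a scheme like this gets per-session ordering at whatever parallelism the partition count allows, which is usually the right trade when global ordering is not required.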
The following resources from the broader Kafka community may be useful in optimizing how your partitions are modeled for your application’s needs.
- How to choose the number of topics/partitions in a Kafka cluster? by Confluent
- Message delivery semantics in the core Apache Kafka documentation