Kafka Event Stream Modeling
Last updated November 30, 2022
Apache Kafka on Heroku is a powerful tool for creating modern application architectures, and for dealing with high-throughput event streams. Moving to a world of streaming event data, though, isn’t as simple as switching out the relational database that your ORM interacts with. To get the most out of streaming data, you must tune both your data and your Kafka configuration to support your product’s logic and needs.
Core Apache Kafka concepts
As covered in the Apache Kafka on Heroku article, a number of core concepts are critical to understanding and tuning Apache Kafka on Heroku. The two most important for this discussion are topics and partitions.
Topics are the primary channel- or stream-like construct in Kafka, representing a type of event, much like a table would represent a type of record in a relational data store.
Topics consist of some number of partitions. Each partition contains a discrete subset of the events (or messages, in Kafka parlance) belonging to a given topic.
Modifying the number and usage of these partitions is important to tuning your Kafka for your product, and for balancing ordering, parallelism, and resilience concerns.
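For instance, a topic's partition count is set when the topic is created. On Heroku this is typically done through the Heroku CLI, but as a minimal sketch of the underlying operation, here's how a topic with an explicit partition count might be created using the standard Apache Kafka AdminClient (the topic name, partition count, replication factor, and broker address are all hypothetical):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address; in a Heroku app this would come from config vars.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical "page-views" topic with 8 partitions, replicated 3 ways.
            NewTopic topic = new NewTopic("page-views", 8, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```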
Considerations to balance
The following are the key properties to balance when evaluating the partition structure for use with a given topic.
Message ordering
Messages within a given partition are strictly ordered, but this ordering isn't guaranteed across partitions.
Consumer group parallelism
A consumer group can have as many parallel consumers of a topic as there are partitions in the topic.
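As a concrete sketch of that relationship (reusing the hypothetical page-views topic from above): every process that subscribes with the same group.id joins one consumer group, and Kafka divides the topic's partitions among the group's members. With 8 partitions, up to 8 such processes consume in parallel; a 9th would sit idle.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Every process sharing this group.id splits the topic's partitions among itself.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```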
Resource utilization
High partition counts increase resource utilization and lengthen the time needed to recover or re-elect leaders when brokers come back from failure.
Custom partition functions
Producers can apply arbitrary logic when assigning messages to a topic's partitions: basic hashing for even distribution, or custom logic that preserves the ordering and throughput semantics a given product needs.
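As an illustration of that flexibility, the following sketch implements Kafka's Partitioner interface so that all keys sharing a hypothetical prefix (the portion before the first ":") land on the same partition, regardless of the rest of the key:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Sketch: route messages by key prefix so that all keys sharing a prefix
// map to the same partition. Assumes string keys of the form "prefix:rest".
public class PrefixPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        String prefix = ((String) key).split(":", 2)[0];
        // murmur2 is the same hash Kafka's default partitioner uses for keyed messages.
        return Utils.toPositive(Utils.murmur2(prefix.getBytes(StandardCharsets.UTF_8)))
            % numPartitions;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

A producer would opt into this logic by setting partitioner.class (ProducerConfig.PARTITIONER_CLASS_CONFIG) to this class's name. The trade-off to keep in mind is that skewed prefixes produce skewed partitions.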
These attributes, while not an exhaustive list, provide a strong basis for designing your topic's partition structure.
Modeling to support your product logic
If strict ordering of your events isn't paramount but you need very high parallelism for throughput, choose a partition count high enough to feed a scaled-out consumer group, yet low enough not to impose undue burden on the cluster.
If strict ordering is important in your product's logic, it's important to be clear about the domain under which that ordering matters. For instance, is ordering required globally, across all changes to all state? Or is ordering only required for changes related to a given user or account? Over what time period does ordering matter? It's often reasonable to build a compound key based on the attributes that matter for ordering, and to consistently hash messages to partitions based on those keys. For instance, partitioning by the combination of user_id and session_id would provide strict ordering of events within a given user's session, but wouldn't maintain ordering across sessions or users.
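A minimal producer sketch of that compound-key approach (the topic name, identifiers, and broker address are hypothetical): concatenating user_id and session_id into the message key lets Kafka's default partitioner handle the consistent hashing.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SessionEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42";      // hypothetical identifiers
            String sessionId = "session-7";
            // The compound key means the default partitioner hashes user_id and
            // session_id together, so every event for this session lands on the
            // same partition and stays strictly ordered.
            String key = userId + ":" + sessionId;
            producer.send(new ProducerRecord<>("session-events", key, "{\"action\":\"click\"}"));
        }
    }
}
```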
Further reading
The following are excellent resources from the broader Kafka community that can be useful in optimizing how your partitions are modeled for your application's needs.
- How to choose the number of topics/partitions in a Kafka cluster? by Confluent
- Message delivery semantics in the core Apache Kafka documentation
- Best Practices for Running Kafka Connectors on Heroku