Kafka Event Stream Modeling
Last updated 19 October 2016
Apache Kafka on Heroku is an extremely powerful tool for creating modern application architectures, and for dealing with high-throughput event streams. Moving to a world of streaming event data, though, is not as simple as switching out the relational database that your ORM interacts with. To get the most out of streaming data, you must tune both your data and your Kafka configuration to support your product’s logic and needs.
Core Apache Kafka concepts
As covered in the Apache Kafka Dev Center article, there are a number of core concepts that are critical for understanding and tuning Apache Kafka on Heroku. The critical concepts for this discussion are topics and partitions.
Topics are the primary channel- or stream-like construct in Kafka, representing a type of event, much like a table would represent a type of record in a relational data store.
Topics are composed of some number of partitions. Each partition contains a discrete subset of the events (or messages, in Kafka parlance) belonging to a given topic.
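The topic/partition relationship can be pictured as a set of independent, append-only logs. The following is a toy model for illustration only (plain Python data structures, not the Kafka API): offsets are assigned per partition, and no ordering exists across partitions.

```python
from collections import defaultdict

# Toy model of one topic: each partition is an independent, append-only log.
# Offsets are per-partition; Kafka assigns no global order across partitions.
topic = defaultdict(list)  # partition number -> ordered list of messages

def append(partition, message):
    """Append a message to one partition and return its offset."""
    topic[partition].append(message)
    return len(topic[partition]) - 1

append(0, "signup:alice")  # offset 0 in partition 0
append(0, "login:alice")   # offset 1 in partition 0
append(1, "signup:bob")    # offset 0 in partition 1, unordered relative to partition 0
```

Note that "signup:bob" has offset 0 even though it was appended last: offsets only describe position within a single partition's log.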
Tuning the number and usage of these partitions is central to adapting Kafka to your product, and to balancing ordering, parallelism, and resilience concerns.
Considerations to balance
The following are the key properties to balance when evaluating the partition structure for use with a given topic.
Message ordering
Messages within a given partition are strictly ordered, but this ordering is not guaranteed across partitions.
Consumer group parallelism
A consumer group can have as many parallel consumers of a topic as there are partitions in the topic.
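This ceiling on parallelism can be sketched as a partition assignment. The round-robin scheme below is a simplified illustration in the spirit of Kafka's assignors, not the actual assignment protocol: each partition goes to exactly one consumer in the group, so any consumers beyond the partition count sit idle.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions across a consumer group.
    Each partition is owned by exactly one consumer, so consumers
    beyond the partition count receive nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions, six consumers: at most four consumers can do work.
assignment = assign_partitions([0, 1, 2, 3], ["c0", "c1", "c2", "c3", "c4", "c5"])
```

Here `c4` and `c5` receive empty assignments, which is why scaling a consumer group past the partition count buys no additional throughput.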
Failure recovery
A high partition count can increase resource utilization and the time to recover or re-elect leaders when brokers come back from failure.
Custom partition functions
Producers may choose arbitrary logic for sending messages to the partitions within a topic, using basic hashing for even distribution, or specific logic to maintain ordering and throughput semantics for a given product’s needs.
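Both styles can be sketched as plain functions from a message to a partition number. This is an illustration, not Kafka's implementation: Kafka's default partitioner hashes the key bytes with murmur2, while md5 is used here only to keep the sketch dependency-free, and the audit-routing rule is a hypothetical example of product-specific logic.

```python
import hashlib

def hashed_partition(key, num_partitions):
    """Even distribution by key hashing (stand-in for Kafka's default
    murmur2-based partitioner; md5 used here for a dependency-free sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def custom_partition(event_type, key, num_partitions):
    """Hypothetical product-specific logic: reserve partition 0 for audit
    events, and hash everything else over the remaining partitions."""
    if event_type == "audit":
        return 0
    return 1 + hashed_partition(key, num_partitions - 1)
```

Because the hash of a given key is deterministic, all messages with that key land on the same partition and therefore remain ordered relative to one another.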
These considerations, while not exhaustive, provide a strong basis for designing your topic's partition structure.
Modeling to support your product logic
If strict ordering of your events is not paramount but you require very high parallelism for throughput, choose a partition count high enough to support a scaled-out consumer group, but low enough not to impose undue burden on the cluster.
If strict ordering is important in your product’s logic, it is important to be clear about the domain under which that ordering matters. For instance, is ordering required globally, across all changes to all state? Or is ordering only required for changes related to a given user or account? Over what time period does ordering matter? It is often reasonable to build a compound key from the attributes that matter for ordering, and to consistently hash messages to partitions based on those keys. For instance, partitioning by the combination of user_id and session_id would provide strict ordering of events within a given user’s session, but would not maintain ordering across sessions or users.
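That compound-key approach can be sketched as follows. The field names `user_id` and `session_id` are illustrative, and md5 stands in for a production hash such as murmur2: the point is only that one deterministic function maps every event of a session to the same partition.

```python
import hashlib

def session_partition(user_id, session_id, num_partitions):
    """Compound-key partitioning sketch: every event for one
    (user_id, session_id) pair hashes to the same partition, so events
    within a session stay strictly ordered. Different sessions may land
    on different partitions and carry no ordering guarantee between them."""
    key = f"{user_id}:{session_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

A producer using a scheme like this gets per-session ordering at whatever parallelism the partition count allows, which is usually the right trade when global ordering is not required.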
The following resources from the broader Kafka community may be useful in optimizing how your partitions are modeled for your application’s needs.
- How to choose the number of topics/partitions in a Kafka cluster? by Confluent
- Message delivery semantics in the core Apache Kafka documentation