Robust Usage of Apache Kafka on Heroku
Last updated December 12, 2022
Apache Kafka is a distributed system designed to be reliable as a whole, with extensive replication and careful attention to high availability and failure handling. Even so, applications that use Kafka must account for several practical realities to remain robust:
- Partition leaders can change
- Kafka is “at least once”
- Consumers must be careful about resetting offsets
- Client objects must be reused through the whole application lifecycle
- Be careful when using low-level Kafka client APIs
Apache Kafka on Heroku routinely replaces instances for many reasons: handling instance failure, upgrading underlying software, upgrading Kafka itself, and so on. An application that isn’t robust to these failures sees errors, and potentially breaks in unexpected ways during these events.
1. Partition Leader Elections
Failure scenario: Kafka handles the failure of an individual broker by electing new “leader” brokers for each partition that resided on the failed broker. This process typically completes within 10 to 20 seconds of the broker failure, if not faster, after which full functionality is restored. After the broker has recovered, Apache Kafka on Heroku migrates partition leadership back to that broker.
Potential impact to your application: Dropped messages, duplicate messages, increased error rates.
How to prevent this scenario: Most properly configured, higher-level Kafka client libraries are robust in the face of partition leader elections. However, you must ensure that producers have retries configured, so that while leadership is being transferred your application retries on errors and sees only increased latency.
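As a rough illustration, producer settings along the following lines keep a producer retrying through a leader election. The keys are standard Apache Kafka producer configuration names, but the values are illustrative assumptions rather than Heroku recommendations, and `producer_config` and the broker address are hypothetical. Some client libraries spell these options differently, so check your client’s documentation before copying them.

```python
def producer_config(bootstrap_servers: str) -> dict:
    """Return illustrative producer settings that tolerate leader elections."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # Wait for all in-sync replicas to acknowledge each write.
        "acks": "all",
        # Retry transient errors, such as a leader change, instead of failing.
        "retries": 100,
        # Pause between attempts so a new leader has time to be elected.
        # 100 retries x 500 ms covers a 10 to 20 second election with margin.
        "retry.backoff.ms": 500,
        # Upper bound on total time spent on one message, retries included.
        "delivery.timeout.ms": 120000,
    }


# Hypothetical broker address, used only for illustration.
config = producer_config("kafka.example.com:9092")
```

The resulting dictionary would be passed to whichever producer constructor your client library provides.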
If you’re using lower-level Kafka APIs, it’s important to handle partition leadership changes. The basic mechanism is to check for any NotLeaderForPartition errors and re-poll Kafka’s metadata endpoints to fetch the new leader details.
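The check-and-refresh mechanism can be sketched roughly as follows. This is not a real client API: `NotLeaderForPartition` is a hypothetical stand-in for whatever error your client raises, and `send` and `refresh_leader` are hypothetical callables you would implement on top of your low-level client.

```python
import time


class NotLeaderForPartition(Exception):
    """Hypothetical stand-in for a client's 'not leader for partition' error."""


def send_with_leader_refresh(send, refresh_leader, partition, message,
                             max_attempts=5, backoff_seconds=0.5):
    """Send to the partition leader, refreshing metadata when leadership moves.

    send(leader, partition, message) and refresh_leader(partition) are
    hypothetical callables supplied by the application.
    """
    leader = refresh_leader(partition)
    for attempt in range(1, max_attempts + 1):
        try:
            return send(leader, partition, message)
        except NotLeaderForPartition:
            if attempt == max_attempts:
                # Give up and surface the error to the caller.
                raise
            # Give the election time to finish, then look up the new leader.
            time.sleep(backoff_seconds)
            leader = refresh_leader(partition)
```

Bounding the number of attempts keeps a prolonged outage visible to the caller instead of hiding it in an endless loop.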
2. Kafka Is “At Least Once”
Failure scenario: Your application process restarts, or Kafka itself suffers a failure. In the presence of instance failure, Kafka’s underlying replication prefers duplicating messages over dropping them.
Potential impact to your application: Duplicate messages when you don’t expect them.
How to handle this scenario: When building a system on top of Kafka, ensure that your consumers can handle duplicates of any individual message. There are two common remediations:
- Idempotency: Make it so that receiving the same message twice in your consumer doesn’t change anything. For example, if your consumer writes to a SQL database, using UPSERT likely lets you safely receive duplicate messages.
- Duplicate message tracking: Insert a unique identifier in each message, track which identifiers each consumer has seen, and ignore any message whose identifier has already been seen.
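Both remediations can be sketched with Python’s built-in `sqlite3` module standing in for your database. The `orders` and `seen_messages` tables and the function names are hypothetical, and the UPSERT syntax shown is SQLite’s; other databases spell it differently.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with the two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("CREATE TABLE seen_messages (message_id TEXT PRIMARY KEY)")
    return conn


def apply_order_update(conn: sqlite3.Connection, order_id: str, status: str) -> None:
    """Idempotency: an UPSERT leaves the same row however often it runs."""
    conn.execute(
        "INSERT INTO orders (id, status) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (order_id, status),
    )
    conn.commit()


def first_delivery(conn: sqlite3.Connection, message_id: str) -> bool:
    """Duplicate tracking: record the ID and report whether it was new."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO seen_messages (message_id) VALUES (?)",
        (message_id,),
    )
    conn.commit()
    # rowcount is 1 when the ID was inserted and 0 when it already existed.
    return cursor.rowcount == 1
```

A consumer would call `first_delivery` with the message’s identifier and skip processing when it returns `False`. Storing seen identifiers in the same database as the results lets both writes share one transaction if you need that guarantee.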
3. Consumers Must Be Careful About Resetting Offsets
Some consumers have mechanisms for automatically resetting offsets on some Kafka errors. These mechanisms are typically not safe in production, and it’s recommended you handle these cases manually.
Failure scenario: Your consumer receives an “offset out of range” error from Kafka and decides to reset its offset somewhere.
Potential impact to your application: Your application can start replaying from either the first retained offset or the last. In the first case it ends up far behind; in the second it misses messages.
How to handle this scenario: Don’t write application code that automatically resets which offset your consumer is tracking, and don’t enable any consumer configuration options that do so. Prefer paging a human being.
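One way to structure a consumer loop along these lines is sketched below. `OffsetOutOfRange` is a hypothetical stand-in for your client’s error type, and `poll`, `handle`, and `page_oncall` are hypothetical callables. For reference, the Apache Kafka Java consumer raises an exception instead of resetting when `auto.offset.reset` is set to `none`; other clients may name or implement this option differently.

```python
class OffsetOutOfRange(Exception):
    """Hypothetical stand-in for a client's 'offset out of range' error."""


def run_consumer(poll, handle, page_oncall):
    """Process messages from poll(); never reset offsets automatically.

    poll() yields messages, handle(message) processes one message, and
    page_oncall(text) alerts a human. All three are hypothetical callables.
    """
    try:
        for message in poll():
            handle(message)
    except OffsetOutOfRange as error:
        # Deliberately no seek to the earliest or latest offset here.
        page_oncall(f"Consumer stopped: {error}. Manual offset decision needed.")
        # Re-raise so the process stops instead of silently continuing.
        raise
```

Stopping the consumer and alerting leaves the choice between replaying and skipping to someone who knows what the application can tolerate.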
4. Reuse Client Objects
Failure scenario: When an individual broker fails, client applications that don’t reuse client objects see issues.
Potential impact to your application: If you create a new client, for example, for each HTTP request that comes in, your application sees increased latency and errors talking to Kafka during broker failure. This is because client libraries do notable work on first connecting to the cluster. Bringing up new clients on each request causes that work to happen again, potentially against the failed broker.
How to handle this scenario: Always keep Kafka client objects long-lived in your application. Construct producer and consumer objects early on in the application lifecycle and reuse them.
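A minimal sketch of this pattern follows. The `App` class, the `make_producer` factory, the `events` topic, and the producer’s `send(topic, payload)` method are all hypothetical; substitute your client library’s producer and your web framework’s startup hook.

```python
class App:
    """Hold one producer for the whole application lifecycle."""

    def __init__(self, make_producer):
        # Built once at startup, so the connection and metadata work
        # happens a single time rather than on every request.
        self.producer = make_producer()

    def handle_request(self, payload: bytes) -> None:
        # Every request reuses the same long-lived producer object.
        self.producer.send("events", payload)
```

Passing the factory in, rather than constructing the producer inside `handle_request`, also makes it easy to substitute a fake producer in tests.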
5. Low-Level Kafka Client APIs
Failure scenario: Using “low-level” Kafka APIs, where you, for example, send individual requests to brokers, is complex in the face of failure.
Potential impact to your application: Errors, potential data loss, potential downtime.
How to prevent this scenario: Avoid, as much as possible, using low-level Kafka client APIs. Anything that involves specifying individual brokers is hard to get right in the face of failure. If you do use low-level APIs, make sure that you test your code in the face of Kafka-level failure.
Likewise, avoid any kind of logic that requires coordinated fetches from all of the partitions in a topic. Consumers must be robust to any individual partition undergoing temporary failure, as the leader for any given partition can be undergoing re-election or rebalance.
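The idea of tolerating a temporarily unavailable partition can be sketched as follows. `PartitionUnavailable` and the `fetch` callable are hypothetical; a higher-level consumer normally handles this for you, so treat this as an illustration of the logic rather than something to build on a low-level API.

```python
class PartitionUnavailable(Exception):
    """Hypothetical error for a partition that is mid-election or rebalancing."""


def fetch_available(fetch, partitions):
    """Fetch from each partition independently and skip any that are failing.

    fetch(partition) is a hypothetical callable returning a list of messages.
    Returns the batches that succeeded and the partitions to retry later.
    """
    batches = {}
    unavailable = []
    for partition in partitions:
        try:
            batches[partition] = fetch(partition)
        except PartitionUnavailable:
            # One failing partition must not block progress on the others.
            unavailable.append(partition)
    return batches, unavailable
```

Processing continues with whatever partitions responded, and the skipped ones are retried on a later pass once their leader is available again.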
Lastly: Test Failure in Staging Environments
Apache Kafka on Heroku has a command, heroku kafka:fail, that allows you to cause an instance failure on one of your brokers. Using this command is the best way to check whether your application can handle Kafka-level failures, and you can do so at any time in a staging environment. The syntax is like this:
heroku kafka:fail KAFKA_URL --app sushi
Here, sushi is the name of your app.
Don’t use this command lightly! Failing brokers in production never improves stability; it dramatically decreases the stability of the cluster.
Likewise, you can test failure in a development environment by spinning up multiple brokers, and killing the process for one of them.