Robust Usage of Apache Kafka on Heroku
Last updated May 16, 2017
Apache Kafka is designed to be very reliable as a whole: it is a distributed system with heavy replication and careful thought around high availability and failure handling. There are, however, several practical realities you must account for when developing applications that use Kafka, to ensure they remain robust:
- Partition leaders can change
- Kafka is “at least once”
- Consumers should be careful about resetting offsets
- Client objects should be reused through the whole application lifecycle
- Be careful when using low-level Kafka client APIs
Apache Kafka on Heroku routinely replaces instances for many reasons: handling instance failure, upgrading underlying software, upgrading Kafka itself, and so on. An application that isn’t robust to these failures will see errors, and will potentially break in unexpected ways during these events.
1. Partition Leader Elections
Failure Scenario: Kafka handles the failure of an individual broker by electing new “leader” brokers for each partition that resided on the failed broker. This typically happens within 10-20 seconds of the broker failure (if not faster), after which full functionality is restored. Once the failed broker has recovered, Apache Kafka on Heroku migrates partition leadership back to it.
Potential impact to your application: Dropped messages, duplicate messages, increased error rates.
How to prevent this: Most properly configured higher-level Kafka client libraries are robust in the face of partition leader elections. However, you must ensure that your producers have retries configured, so that while leadership is being transferred, your application only sees increased latency and retries on errors.
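As a sketch, with the kafka-python client (an assumption; other client libraries expose equivalent settings), producer retries might be configured like this. The broker address and values are illustrative, not prescriptive:

```python
# Illustrative producer settings for tolerating leader elections.
# The broker address is hypothetical; tune the values for your workload.
producer_config = {
    "bootstrap_servers": ["broker-1.example.com:9096"],
    "retries": 10,             # retry sends while leadership transfers
    "retry_backoff_ms": 200,   # pause between retries
    "acks": "all",             # wait for all in-sync replicas to acknowledge
}

# from kafka import KafkaProducer
# producer = KafkaProducer(**producer_config)  # connects to the cluster
```

With retries enabled like this, a leader election surfaces to your application as added latency rather than as failed sends.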
If you’re using lower-level Kafka APIs, it’s important to handle partition leadership changes yourself. The basic mechanism for doing that is checking for any NotLeaderForPartition errors, and re-polling Kafka’s metadata endpoints to fetch the new leader details.
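That check-and-refresh loop can be sketched as below. This is a hedged illustration only: send and refresh_metadata are placeholders for your low-level client’s calls, and NotLeaderError stands in for whatever NotLeaderForPartition error your library raises.

```python
class NotLeaderError(Exception):
    """Stand-in for the client library's NotLeaderForPartition error."""

def send_with_leader_retry(send, refresh_metadata, attempts=5):
    # Try the low-level request; on a leadership error, re-poll the
    # metadata endpoints to learn the new leader before retrying.
    for _ in range(attempts):
        try:
            return send()
        except NotLeaderError:
            refresh_metadata()  # fetch the new leader details
    raise RuntimeError("partition leadership did not settle in time")
```

The bounded attempt count matters: leadership normally settles within seconds, so an unbounded retry loop would only mask a deeper cluster problem.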
2. Kafka is “at least once”
Failure Scenario: Your application process restarts, or Kafka itself suffers a failure. In the presence of instance failure, Kafka’s underlying replication prefers duplicating messages over dropping them.
Potential impact to your application: Duplicate messages when you don’t expect them.
How to handle this: When building a system on top of Kafka, ensure that your consumers can handle duplicates of any individual message. There are two common remediations:
Idempotency: Make it so that receiving the same message twice in your consumer doesn’t change anything. For example, if your consumer is writing to a SQL database, use of UPSERT will likely allow you to safely receive duplicate messages.
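A minimal sketch of the idempotent-write approach, using SQLite for illustration (the table and message shape are assumptions). INSERT OR REPLACE is SQLite’s upsert form; other databases spell it INSERT ... ON CONFLICT or similar:

```python
import sqlite3

# Processing the same message twice leaves the database unchanged.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def handle(message):
    # UPSERT: insert the row, or overwrite it if the id already exists
    db.execute("INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)",
               (message["id"], message["email"]))

msg = {"id": 1, "email": "user@example.com"}
handle(msg)
handle(msg)  # redelivered duplicate: no error, no extra row
count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```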
Duplicate message tracking: Insert some unique identifier in each message, then track which identifiers each consumer has seen, and ignore any identifiers that are duplicated.
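The tracking approach can be sketched as follows. The message shape is hypothetical, and a production version would need a durable, bounded store for the seen identifiers rather than an in-memory set:

```python
# Each message carries a unique id; the consumer skips ids it has
# already processed.
seen_ids = set()
processed = []

def consume(message):
    if message["id"] in seen_ids:
        return  # duplicate delivery: ignore
    seen_ids.add(message["id"])
    processed.append(message["payload"])

consume({"id": "a1", "payload": "first"})
consume({"id": "a1", "payload": "first"})  # redelivered duplicate
consume({"id": "b2", "payload": "second"})
```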
3. Consumers should be careful about resetting offsets
Some consumers have mechanisms for automatically resetting offsets on some Kafka errors. These mechanisms are typically not safe in production, and it’s recommended you handle these cases manually.
Failure Scenario: Your consumer receives an “offset out of range error” from Kafka and decides to reset its offset somewhere.
Potential impact to your application: Your application can start replaying either from the first retained offset or from the last, leaving it either very far behind or missing messages.
How to handle this: Don’t write application code that automatically resets which offset your consumer is tracking. Don’t enable any consumer configuration options that do so either; prefer paging a human being instead.
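As one hedged illustration, assuming the kafka-python client: its auto_offset_reset option accepts "earliest" and "latest", and per its documentation any other value causes the OffsetOutOfRange error to be raised to your code rather than silently resetting, which is where you can alert a human. (The Java client has a comparable auto.offset.reset=none setting.) The broker address and group name below are hypothetical:

```python
# Illustrative consumer settings that surface offset errors instead of
# silently resetting the consumer's position.
consumer_config = {
    "bootstrap_servers": ["broker-1.example.com:9096"],
    "group_id": "example-group",
    # Any value other than "earliest"/"latest" makes kafka-python raise
    # OffsetOutOfRangeError to the caller rather than resetting.
    "auto_offset_reset": "none",
}

# from kafka import KafkaConsumer
# consumer = KafkaConsumer("example-topic", **consumer_config)
```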
4. Reuse client objects
Failure Scenario: When an individual broker fails for some reason, client applications that don’t reuse client objects will see issues.
Potential impact to your application: If you create a new client per HTTP request, for example, your application will see increased latency and errors talking to Kafka during a broker failure. This is because client libraries do significant work on first connecting to the cluster, and bringing up a new client on each request causes that work to happen again, potentially against the failed broker.
How to handle this: Keep Kafka client objects long-running in your application. Construct producer and consumer objects early in the application lifecycle and reuse them.
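One common shape for this is a cached factory, sketched below. StubProducer is a stand-in for a real producer object (which does the expensive metadata and connection work at construction time); real code would return e.g. a configured Kafka producer from the factory instead:

```python
from functools import lru_cache

class StubProducer:
    """Stand-in for a real Kafka producer, which is costly to construct."""

@lru_cache(maxsize=None)
def get_producer():
    # Constructed once, on first use; every later caller gets the
    # same long-lived instance.
    return StubProducer()
```

Request handlers then call get_producer() instead of constructing a client themselves, so the connection work happens once per process rather than once per request.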
5. Low Level Kafka Client APIs
Failure Scenario: Using “low-level” Kafka APIs, e.g. sending individual requests to specific brokers, is quite complex to get right in the face of failure.
Potential Impact to your application: Errors, potential data loss, potential downtime.
How to prevent this: Avoid using low-level Kafka client APIs as much as possible. Anything that involves addressing individual brokers is very hard to get right in the face of failure. If you do use low-level APIs, make sure you test your code against Kafka-level failure.
Likewise, avoid any kind of logic that requires coordinated fetches from all of the partitions in a topic. Consumers should be robust to any individual partition undergoing temporary failure, as the leader for any given partition may be undergoing re-election or re-balance.
Lastly: Test failure in staging environments
Apache Kafka on Heroku has a command, heroku kafka:fail, that allows you to cause an instance failure on one of your brokers. Using this command is the best way to check whether your application can handle Kafka-level failures, and you can do so at any time in a staging environment. The syntax is:
heroku kafka:fail KAFKA_URL --app sushi
Here, sushi is the name of your app.
This command should not be used lightly! Failing brokers in production is never going to improve stability, and may dramatically decrease the stability of the cluster.
Likewise, you can test failure in a development environment by spinning up multiple brokers, and killing the process for one of them.