Robust Usage of Apache Kafka on Heroku


Last updated 16 May 2017

Table of Contents

  • 1. Partition Leader Elections
  • 2. Kafka is “at least once”
  • 3. Consumers should be careful about resetting offsets
  • 4. Reuse client objects
  • 5. Low Level Kafka Client APIs
  • Lastly: Test failure in staging environments

Apache Kafka is designed to be very reliable as a whole: it is a distributed system with extensive replication and careful thought around high availability and failure handling. There are, however, several practical realities you must account for when developing applications that use Kafka, to ensure they remain robust:

  1. Partition leaders can change
  2. Kafka is “at least once”
  3. Consumers should be careful about resetting offsets
  4. Client objects should be reused through the whole application lifecycle
  5. Be careful when using low-level Kafka client APIs

Apache Kafka on Heroku routinely replaces instances for many reasons: handling instance failure, upgrading underlying software, upgrading Kafka itself, and so on. An application that isn’t robust to these failures will see errors, and will potentially break in unexpected ways during these events.

1. Partition Leader Elections

Failure Scenario: Kafka handles the failure of an individual broker by electing new “leader” brokers for each partition that resided on the failed broker. This typically happens within 10-20 seconds of the broker failure (often faster), after which full functionality is restored. Once the broker has recovered, Apache Kafka on Heroku re-migrates partition leadership back to that broker.

Potential impact to your application: Dropped messages, duplicate messages, increased error rates.

How to prevent this: Most properly configured, higher-level Kafka client libraries are robust in the face of partition leader elections. However, you must ensure that your producers have retries configured, so that while leadership is being transferred your application only sees increased latency and retries on errors rather than failing outright.
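As an illustration only, with the standard Java producer client a retry-friendly configuration might look like the following sketch. The broker addresses are placeholders, and the SSL settings that Heroku Kafka connections require are omitted for brevity.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducer {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        // Placeholder broker list; a real Heroku Kafka app would derive this from
        // the KAFKA_URL config var and add the required SSL settings.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9096,broker-2:9096");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Retry transient errors (such as "not leader for partition" during an election)
        // instead of surfacing them to the application immediately.
        props.put(ProducerConfig.RETRIES_CONFIG, 5);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);
        // Require acknowledgement from all in-sync replicas so an acked write
        // survives a leader change.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        return new KafkaProducer<>(props);
    }
}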

If you’re using lower-level Kafka APIs, it’s important to handle partition leadership changes. The basic mechanism for doing so is to check for NotLeaderForPartition errors and re-poll Kafka’s metadata endpoints to fetch the new leader’s details.

2. Kafka is “at least once”

Failure Scenario: Your application process restarts, or Kafka itself suffers a failure. In the presence of instance failure, Kafka’s underlying replication prefers duplicating messages over dropping them.

Potential impact to your application: Duplicate messages when you didn’t expect them.

How to handle this: When building a system on top of Kafka, ensure that your consumers can handle duplicates of any individual message. There are two common remediations:

Idempotency: Design your consumer so that receiving the same message twice doesn’t change anything. For example, if your consumer is writing to a SQL database, using UPSERT will likely allow you to receive duplicate messages safely.
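As a rough sketch (assuming Heroku Postgres 9.5 or later and a hypothetical accounts table with a unique account_id), an idempotent write from a Java consumer could use an UPSERT so that re-processing a message leaves the database in the same state:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical example: applying the same message twice produces the same row,
// so duplicate deliveries are harmless.
public class IdempotentWriter {
    private final Connection db;

    public IdempotentWriter(Connection db) {
        this.db = db;
    }

    public void apply(String accountId, long balance) throws SQLException {
        String sql = "INSERT INTO accounts (account_id, balance) VALUES (?, ?) "
                   + "ON CONFLICT (account_id) DO UPDATE SET balance = EXCLUDED.balance";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setString(1, accountId);
            stmt.setLong(2, balance);
            stmt.executeUpdate();
        }
    }
}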

Duplicate message tracking: Insert a unique identifier into each message, track which identifiers each consumer has already seen, and ignore any message whose identifier has already been processed.
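A minimal sketch of that idea, assuming your messages carry a unique messageId field (an assumption about your message format); the in-memory set shown here would be a durable, shared store in a real system:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupingConsumer {
    // In production this set would live in a durable store shared across consumers,
    // not in process memory.
    private final Set<String> seenIds = ConcurrentHashMap.newKeySet();

    public void handle(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return; // already processed this message; ignore the duplicate
        }
        process(payload);
    }

    private void process(String payload) {
        // application-specific handling goes here
    }
}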

3. Consumers should be careful about resetting offsets

Some consumers have mechanisms for automatically resetting offsets on certain Kafka errors. These mechanisms are typically not safe in production, and it’s recommended that you handle these cases manually.

Failure Scenario: Your consumer receives an “offset out of range” error from Kafka and decides to reset its offset automatically.

Potential impact to your application: Your application either starts replaying from the earliest retained offset or skips ahead to the latest, leaving it either very far behind or missing messages.

How to handle this: Don’t write application code that automatically resets which offset your consumer is tracking. Don’t enable any consumer configuration options that do so either; prefer paging a human being.
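For example, with the standard Java consumer, setting auto.offset.reset to none makes the client raise an exception on a missing or out-of-range offset instead of silently resetting, so the failure can be escalated to a person. In this sketch the group ID and broker addresses are placeholders:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StrictOffsetConsumer {
    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9096,broker-2:9096"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-consumer-group");                // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // "none" makes poll() throw (e.g. NoOffsetForPartitionException or
        // OffsetOutOfRangeException) rather than silently jumping to the earliest
        // or latest offset, so a human can decide how to recover.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");

        return new KafkaConsumer<>(props);
    }
}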

4. Reuse client objects

Failure Scenario: When an individual broker fails, client applications that don’t reuse client objects will see issues.

Potential impact to your application: If you create a new client for each unit of work, e.g. per incoming HTTP request, your application will see increased latency and errors talking to Kafka during a broker failure. This is because client libraries do a notable amount of work when first connecting to the cluster, and bringing up a new client on each request repeats that work, potentially against the failed broker.

How to handle this: Always use long-lived Kafka client objects in your application. Construct producer and consumer objects early in the application lifecycle and reuse them.
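As a sketch, building one producer at startup and sharing it across request handlers avoids paying the connection cost per request. RetryingProducer is the hypothetical factory from the earlier example:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaClients {
    // One long-lived producer for the whole process, created at startup.
    private static final KafkaProducer<String, String> PRODUCER = RetryingProducer.build();

    // Request handlers call this instead of constructing their own clients.
    public static void publish(String topic, String key, String value) {
        PRODUCER.send(new ProducerRecord<>(topic, key, value));
    }
}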

5. Low Level Kafka Client APIs

Failure Scenario: Using “low-level” Kafka APIs, for example sending individual requests to specific brokers, is quite complex to handle correctly in the face of failure.

Potential Impact to your application: Errors, potential data loss, potential downtime.

How to prevent this: Avoid, as much as possible, using low-level Kafka client APIs. Anything that involves specifying individual brokers is very hard to get right in the face of failure. If you do use low-level APIs, make sure that you test your code against Kafka-level failures.

Likewise, avoid any kind of logic that requires coordinated fetches from all of the partitions in a topic. Consumers should be robust to any individual partition undergoing temporary failure, as the leader for any given partition may be undergoing re-election or re-balance.

Lastly: Test failure in staging environments

Apache Kafka on Heroku has a command heroku kafka:fail that allows you to cause an instance failure on one of your brokers. Using this command is the best way to check whether your application can handle Kafka-level failures, and you can do so at any time in a staging environment. The syntax is like this:

heroku kafka:fail KAFKA_URL --app sushi

Where sushi is the name of your app.

This command should not be used lightly! Failing brokers in production is never going to improve stability, and may dramatically decrease the stability of the cluster.

Likewise, you can test failure in a development environment by spinning up multiple brokers, and killing the process for one of them.
