Best Practices for Heroku's Streaming Data Connectors
Last updated March 11, 2021
This article describes several notes around best practices for operating your streaming data connector. This integration is supported by Apache Kafka Connect and Debezium, so you will see references to these components when discussing configuration options.
When Things Go Wrong (Debezium)
Points to Note
- Debezium only works with UTF-8 databases. If you are using a different encoding, you will need to change it to UTF-8 before enabling PG2K. If you change to another encoding after the connector is created, the connector will stop processing events.
- When a new connector is created, it connects to a Postgres database and reads a consistent snapshot of all of the schemas. This could be an expensive operation and cause load on both the source Postgres database and the target Kafka cluster. You should take care to choose tables and columns accordingly. More details available in the documentation.
- The “before” data in CDC events will only be populated for the parts of the table that are part of the
REPLICA IDENTITY. By default, in most situations, this means that only the primary key will be reflected in the “before” data. More details available in the documentation.
- Making changes to primary keys must be coordinated carefully to avoid issues with schema information. More details available in the documentation.
- CDC events are designed to be at least once delivery. You will need to build and configure your Kafka consumers to handle redundant CDC events gracefully.
UPDATEevents, unchanged TOASTed values will have a placeholder
__debezium_unavailable_value. If you do not account for this, you might end up using the placeholder as if it were the real value. Find more details in the documentation.
- Database rows with very large amounts of data cannot be produced into Kafka messages if they exceed the maximum message size for the topic (default: 1 MB). These changes will be logged (internally), but the event will not be produced (silently).
- Kafka topics are created with more than 1 partition. As a result, change events do not have global total ordering when being consumed. Change events for a specific row are totally ordered.
- If a connector stops replicating, the replication slot that tracks the connector’s progress will prevent WAL from being deleted. If the Postgres database runs out of disk for WAL, the database will stop entirely. Find more details in the documentation.
- Failures can occur for many reasons, leading the connector to stop. Some include a network partition, an AWS issue, a bug in Kafka Connect or Debezium, or an issue with our control plane. It is important that you monitor your database’s replication slots for these conditions.
- In certain failure modes, we may need to remove the replication slot to preserve the operational stability of the database. When conditions have cleared and the replication slot is created again, the connector will create a new replication slot and begin publishing events from that point of recovery. Change events that were not streamed to Kafka before the database went down will be lost.
- Connectors should not be left in a “paused” state for more than a few hours. Paused connectors will prevent WAL from being deleted, which can put the primary database at risk. It is better to destroy the connector than to leave it paused for a long period.
- Change events that occur while a connector is paused are not guaranteed to make it into Kafka. In the event of a failover (due to a system failure or a scheduled maintenance), change events after the connector was paused will be lost.
- Deleting a table will tombstone the corresponding Kafka topic, but not destroy it.
- If you deprovision a connector, we will not deprovision the associated Kafka topics. If you want to remove this data, then you must delete these topics yourself using the CLI:
heroku kafka:topics:destroy <topic_name> --app <app_name>. Find more details in the documentation.