Best Practices for Heroku's streaming data connectors
Last updated 17 July 2020
This is a Beta feature. Any use of Beta Services is subject to the terms in your Master Subscription Agreement and the Beta Services terms. These terms include provisions that the following types of sensitive Personal Data (including images, sounds or other information containing or revealing such sensitive data) may not be submitted to Data Science Programs, Non-GA Service and Non-GA Software: government-issued identification numbers; financial information (such as credit or debit card numbers, any related security codes or passwords, and bank account numbers); racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, information concerning health or sex life; information related to an individual’s physical or mental health; and information related to the provision or payment of health care.
This article collects best practices for operating your streaming data connector. The integration is built on Apache Kafka Connect and Debezium, so this article refers to these components when discussing configuration options.
When Things Go Wrong (Debezium)
Points to Note
- Debezium only works with UTF-8 databases. If you are using a different encoding, you will need to change it to UTF-8 before enabling PG2K. If you change to another encoding after the connector is created, the connector will stop processing events.
- When a new connector is created, it connects to a Postgres database and reads a consistent snapshot of all of the schemas. This can be an expensive operation that puts load on both the source Postgres database and the target Kafka cluster. Choose the tables and columns you capture with this cost in mind. Find more details in the documentation.
- The “before” data in CDC events is only populated for the columns that are part of the table's REPLICA IDENTITY. By default, in most situations, this means that only the primary key is reflected in the “before” data. Find more details in the documentation.
- Changes to primary keys must be coordinated carefully to avoid issues with schema information. Find more details in the documentation.
- CDC events are delivered with at-least-once semantics, so the same event can arrive more than once. Build and configure your Kafka consumers to handle redundant CDC events gracefully.
- In UPDATE events, unchanged TOASTed values are replaced by the placeholder __debezium_unavailable_value. If you do not account for this, you might end up using the placeholder as if it were the real value. Find more details in the documentation.
- Database rows with very large amounts of data cannot be produced into Kafka messages if they exceed the maximum message size for the topic (default: 1 MB). These changes are logged internally, but the event is not produced and your consumers receive no indication that it was dropped.
- If a connector stops replicating, the replication slot that tracks the connector’s progress will prevent WAL from being deleted. If the Postgres database runs out of disk for WAL, the database will stop entirely. Find more details in the documentation.
- A connector can stop for many reasons, including a network partition, an AWS issue, a bug in Kafka Connect or Debezium, or an issue with our control plane. Monitor your database’s replication slots so that you notice a stalled connector before retained WAL fills the disk.
- In certain failure modes, we may need to remove the replication slot to preserve the operational stability of the database. When conditions have cleared and the replication slot is created again, the connector will need to create a new snapshot because it will have lost track of the last change event it sent. This will result in additional load, as well as a large number of duplicate CDC events.
- Deleting a table will tombstone the corresponding Kafka topic, but not destroy it.
- If you deprovision a connector, we will not deprovision the associated Kafka topics. If you want to remove this data, you must delete these topics yourself using the CLI: heroku kafka:topics:destroy <topic_name> --app <app_name>. Find more details in the documentation.
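
Two of the points above affect consumer code directly: events can be delivered more than once, and unchanged TOASTed columns in UPDATE events carry the placeholder __debezium_unavailable_value. The Python sketch below shows one way a consumer might handle both. It rests on several assumptions: events have already been deserialized into dictionaries shaped like Debezium's change event envelope (op, before, after, source.lsn), the primary key column is named id, and source.lsn increases for successive changes to the same row. RowStore is a hypothetical in-memory stand-in for your real target store and is not part of any Heroku or Debezium API.

```python
from typing import Any, Dict

# Placeholder named in the list above for unchanged TOASTed values.
TOAST_PLACEHOLDER = "__debezium_unavailable_value"


class RowStore:
    """Hypothetical in-memory target store, used only for illustration."""

    def __init__(self) -> None:
        self.rows: Dict[Any, Dict[str, Any]] = {}
        self.last_lsn: Dict[Any, int] = {}

    def apply(self, event: Dict[str, Any], key_column: str = "id") -> bool:
        """Apply one CDC event. Return False if it was skipped as a duplicate."""
        lsn = event["source"]["lsn"]
        # Delete events carry the key in "before"; other events carry it in "after".
        image = event["before"] if event["op"] == "d" else event["after"]
        key = image[key_column]

        # Assumption: an event at or below the last applied LSN for this key
        # is a redelivery and can be ignored.
        if lsn <= self.last_lsn.get(key, -1):
            return False

        if event["op"] == "d":
            self.rows.pop(key, None)
        else:
            current = self.rows.get(key, {})
            # Keep the stored value wherever the event only holds the placeholder.
            # If no earlier value is known, the column ends up as None.
            self.rows[key] = {
                column: current.get(column) if value == TOAST_PLACEHOLDER else value
                for column, value in event["after"].items()
            }

        self.last_lsn[key] = lsn
        return True
```

Because apply reports whether an event was skipped, a consumer can count skipped duplicates, which is useful after a connector has re-created its snapshot. A production consumer would persist both the rows and the last applied LSN in the same transaction so that a crash cannot separate them.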
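
The advice above to monitor replication slots can be sketched as a small check. The SQL relies on the standard pg_replication_slots view and the pg_wal_lsn_diff function available in PostgreSQL 10 and later. The Python names and the 5 GiB threshold are illustrative assumptions, and the rows are expected to come from running the query with whichever Postgres driver you already use.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Query to run against the source database; one row per replication slot.
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


class SlotStatus(NamedTuple):
    slot_name: str
    active: bool
    retained_bytes: Optional[int]  # None when the slot has no restart_lsn yet


def find_problem_slots(
    rows: Iterable[Tuple[str, bool, Optional[int]]],
    max_retained_bytes: int = 5 * 1024**3,  # assumed threshold; tune to your disk size
) -> List[str]:
    """Return a human-readable warning for each slot that looks unhealthy."""
    problems: List[str] = []
    for row in rows:
        slot = SlotStatus(*row)
        if not slot.active:
            problems.append(f"{slot.slot_name}: inactive")
        if slot.retained_bytes is not None and slot.retained_bytes > max_retained_bytes:
            problems.append(
                f"{slot.slot_name}: retaining {slot.retained_bytes} bytes of WAL"
            )
    return problems
```

An inactive slot or steadily growing retained WAL both suggest that the connector has stopped consuming changes, which is the condition that can eventually fill the database's disk.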