High Availability on Heroku Postgres
Last updated November 21, 2022
All primary databases on Premium, Private, and Shield tier plans come with the High Availability (HA) feature. This database cluster and management system is designed to increase database availability in the face of hardware or software failure that would otherwise lead to longer downtime. When a primary database with this feature fails, it’s automatically replaced by a replica database called a standby.
Follower databases on Premium, Private and Shield tier plans don’t have a hidden standby until they unfollow their leader database. If high availability is required on follower databases, set up multiple followers.
The failed database instance is then destroyed, and a new standby is built.
When this happens, it’s possible for a small, but bounded, amount of recently committed data to be lost.
The values of your HEROKU_POSTGRES_*_URL config vars can change when a failover event happens. If you’re connecting to this database from outside of Heroku, make sure you set your credentials correctly.
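Because the URL can change, it’s generally safer to read it from the environment each time a connection is opened than to hard-code it. The following is a minimal sketch of that idea, not an official client: the helper name `connection_params` and the choice of `DATABASE_URL` as the variable are assumptions for illustration. From outside Heroku, the current value can be fetched with `heroku config:get DATABASE_URL --app <your-app>` and exported before the code runs.

```python
import os
from urllib.parse import urlparse


def connection_params(env_var="DATABASE_URL", environ=None):
    """Parse a Postgres URL from the environment into its components.

    Hypothetical helper: call it every time a connection is opened so that
    a URL changed by a failover is picked up without a code change.
    """
    environ = os.environ if environ is None else environ
    url = urlparse(environ[env_var])
    return {
        "host": url.hostname,
        "port": url.port or 5432,  # fall back to the Postgres default port
        "user": url.username,
        "password": url.password,
        "dbname": url.path.lstrip("/"),
    }
```

The returned dictionary can be passed to whichever Postgres driver you use; the keys shown here are illustrative and may need renaming for a specific driver.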
Like followers, HA standbys are physically located in a different availability zone (AZ) to protect against AZ-wide failures.
The standby node is hidden from your application. If you need followers for horizontal read scaling or reporting, create a new Standard Tier follower database of your primary.
To prevent problems commonly seen with hair-trigger failover systems, we run a suite of checks to ensure that failover is the appropriate response. These checks run every few seconds and consist of establishing connections to the underlying host using the SSH protocol. When only the PostgreSQL process becomes unavailable, a failover is unnecessary: the process is restarted instead, which results in an even shorter downtime period.
After our systems initially detect a problem, we confirm that the database is truly unavailable by running several checks for two minutes across multiple network locations. This prevents transient problems from triggering a failover.
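The confirmation step can be pictured as a loop that declares the database down only if every probe keeps failing for the whole window. This is an illustrative sketch and not Heroku’s implementation: the function `confirmed_down`, the probe callables, and the 5-second interval are all assumptions; only the two-minute window and the use of multiple locations come from the text.

```python
import time


def confirmed_down(probes, window_seconds=120, interval_seconds=5,
                   clock=time.monotonic, sleep=time.sleep):
    """Return True only if all probes fail for the entire window.

    `probes` is a list of zero-argument callables (one per hypothetical
    network location) that return True when the database is reachable.
    A single success anywhere is treated as a transient problem.
    """
    deadline = clock() + window_seconds
    while clock() < deadline:
        if any(probe() for probe in probes):
            return False  # reachable from somewhere: do not fail over
        sleep(interval_seconds)
    return True
```

Passing `clock` and `sleep` as parameters keeps the sketch testable without waiting two real minutes.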
Like followers, standbys are kept up to date asynchronously. This means that data can be committed on the primary database but not yet on the standby. To minimize data loss, we take two important steps:
- We don’t attempt the failover if the standby is more than 10 segments behind. This means the maximum possible loss is 160 MB or 10 minutes, whichever is less.
- If any of the 10 segments were successfully archived through continuous protection, but not applied during the two minute confirmation period, we make sure they’re applied before bringing the standby out of read-only mode.
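The 160 MB figure follows from the segment limit: 10 segments at 16 MB each. The sketch below works through that arithmetic and the eligibility rule. The 16 MB segment size is inferred from the numbers in the text (it matches PostgreSQL’s default write-ahead log segment size), and the function names are hypothetical.

```python
WAL_SEGMENT_MB = 16        # assumed: implied by 160 MB / 10 segments
MAX_SEGMENTS_BEHIND = 10   # threshold stated in the text


def max_loss_mb(segments=MAX_SEGMENTS_BEHIND, segment_mb=WAL_SEGMENT_MB):
    """Upper bound on unreplicated data, in MB, at the failover threshold."""
    return segments * segment_mb


def failover_allowed(segments_behind):
    """Failover is skipped when the standby is more than 10 segments behind."""
    return segments_behind <= MAX_SEGMENTS_BEHIND
```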
Typically, little committed data is lost.
Out of memory conditions and exhausting concurrent connections aren’t treated as failover conditions. These conditions are caused by application behaviour, and would be likely to persist across failovers.
After a successful failover, there are a few things to keep in mind:
- The URL for the database will have changed, and your app will automatically restart with the new credentials.
- The new database’s cache will be cold, so your application’s performance may be degraded for a short period of time. This will fix itself through normal usage.
- A new standby is automatically created, and HA procedures cannot be performed until it becomes available and meets our failover conditions.
- Any Postgres sequences being used, such as those for integer primary keys, may see a gap after a failover event due to the way sequences are replicated in Postgres itself.
- Standard followers of your primary database are destroyed and recreated when the failover event happens. Followers on Premium, Private and Shield plans are repointed to the correct database. If the repoint fails, the follower is destroyed and recreated.
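Clients that aren’t restarted automatically, such as those running outside Heroku, may want to retry with a fresh URL when a connection fails. The sketch below shows one way to do that with exponential backoff. Everything in it is an assumption for illustration: `connect` stands in for your driver’s connect function, `get_url` for however you look up the current URL, and `ConnectionError` for whichever exception your driver actually raises.

```python
import time


def connect_with_retry(connect, get_url, attempts=5, base_delay=1.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Try to connect, re-reading the URL before every attempt.

    Re-reading matters because the URL may have changed after a failover.
    Delays double on each retry: base_delay, 2 * base_delay, and so on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect(get_url())
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:  # no point sleeping after the last try
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Since the new database starts with a cold cache, a modest number of attempts with growing delays is usually kinder to it than a tight retry loop.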
You can check the status of HA for your database by running heroku pg:info. In normal situations, it shows HA Status: Available. After unfollowing or after a failover event, it shows HA Status: Temporarily Unavailable while the standby is being rebuilt. It can also show ‘Temporarily Unavailable’ when the standby is more than 10 segments behind, as failover will not be attempted at that time.
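For monitoring, the HA status can be pulled out of the command’s output with a small script. The sketch below assumes the output contains a line of the form `HA Status: <value>`, as quoted above; the exact layout of the rest of the output isn’t relied on. The function names are hypothetical, and `fetch_pg_info` requires the Heroku CLI to be installed and logged in.

```python
import subprocess


def fetch_pg_info(app):
    """Run `heroku pg:info` for an app and return its text output."""
    result = subprocess.run(
        ["heroku", "pg:info", "--app", app],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ha_status(pg_info_output):
    """Return the value of the 'HA Status' line, or None if absent."""
    for line in pg_info_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "HA Status":
            return value.strip()
    return None
```

A check such as `ha_status(fetch_pg_info("your-app")) == "Available"` could then feed an alert when the standby is being rebuilt or has fallen too far behind.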