Efficient Use of PostgreSQL Indexes
Last updated December 28, 2022
There are many types of indexes in Postgres, as well as different ways to use them. In this article, we give an overview of the types of indexes available, and explain different ways of using and maintaining the most common index type: B-trees.
An index is a way to efficiently retrieve a relatively small number of rows from a table. It’s only useful when the condition for retrieving rows — the WHERE clause — is selective, so that only a small fraction of the table matches. B-tree indexes are also useful for avoiding sorting.
Postgres supports many different index types:
- B-Tree is the default index type you get when you run CREATE INDEX. Virtually all databases have some B-tree indexes. B-trees attempt to remain balanced, with the amount of data in each branch of the tree being roughly the same, so the number of levels that must be traversed to find any row stays in the same ballpark. B-tree indexes handle equality and range queries efficiently, work against all data types, and can also be used to retrieve NULL values. B-trees are designed to work well with caching, even when only partially cached.
- Hash indexes in versions before Postgres 10 are only useful for equality comparisons, and you generally don’t want to use them there: they aren’t write-ahead logged, so they aren’t crash safe, must be manually rebuilt after crashes, and aren’t replicated to followers, while offering little advantage over a B-tree. In Postgres 10 and above, hash indexes are write-ahead logged and replicated to followers.
- Generalized Inverted Indexes (GIN) are useful when an index must map many values to one row, whereas B-tree indexes are optimized for the case where a row has a single key value. GINs are good for indexing array values as well as for implementing full-text search.
- Generalized Search Tree (GiST) indexes allow you to build general balanced tree structures, and can be used for operations beyond equality and range comparisons. They’re used to index the geometric data types, as well as full-text search.
This article is about how to get the most out of default B-Tree indexes. For examples of GIN and GiST index usage, refer to the contrib packages.
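As a brief illustration before moving on, here is what a GIN index for full-text search might look like. The documents table and its body column are hypothetical; to_tsvector, to_tsquery, and the @@ operator are Postgres’s built-in full-text search machinery:

```sql
-- Full-text search over a text column on a hypothetical table.
-- to_tsvector converts the text into a list of searchable tokens;
-- the GIN index maps each token to the rows containing it.
CREATE INDEX documents_body_search_idx
    ON documents
    USING GIN (to_tsvector('english', body));

-- A query can then use the index via the @@ match operator:
SELECT id
FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & index');
```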
Why Is My Query Not Using an Index?
There are many reasons why the Postgres planner can choose not to use an index. Most of the time, the planner chooses correctly, even if it isn’t obvious why. It’s okay if the same query uses an index scan on some occasions but not others: the number of rows retrieved from the table can vary based on the particular constant values the query uses. So, for example, it’s correct for the query planner to use an index for the query select * from foo where bar = 1, and yet not use one for the query select * from foo where bar = 2, if there happen to be far more rows with a “bar” value of 2. In that case, a sequential scan is most likely much faster than an index scan, so the query planner has correctly judged that the cost of performing the query that way is lower.
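To see which plan the planner actually picked for a given query, you can use EXPLAIN (the foo table here is the hypothetical one from the example above):

```sql
-- EXPLAIN shows the chosen plan without running the query;
-- EXPLAIN ANALYZE runs the query and reports actual row counts and timings.
EXPLAIN SELECT * FROM foo WHERE bar = 1;
EXPLAIN ANALYZE SELECT * FROM foo WHERE bar = 2;

-- Look for "Index Scan" vs. "Seq Scan" in the output to see
-- whether the index was used for each constant value.
```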
Partial Indexes

A partial index covers just a subset of a table’s data: it’s an index with a WHERE clause. The idea is to increase the efficiency of the index by reducing its size. A smaller index takes less storage, is easier to maintain, and is faster to scan.
For example, suppose you allow users to flag articles on your site, which in turn sets the flagged boolean to true. You then process flagged articles in batches. You can create an index like so:
CREATE INDEX articles_flagged_created_at_index ON articles(created_at) WHERE flagged IS TRUE;
This index remains fairly small, and can also be used alongside other indexes for the more complex queries that require it.
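A query over the flagged subset can then take advantage of this small index; a sketch, reusing the articles table from the example:

```sql
-- The WHERE clause matches the index's predicate, so the partial
-- index applies, and the rows come back already ordered by created_at.
SELECT id, created_at
FROM articles
WHERE flagged IS TRUE
ORDER BY created_at
LIMIT 100;
```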
Expression Indexes

Expression indexes are useful for queries that match on some function or transformation of your data. Postgres allows you to index the result of that function so that searches become as efficient as searching by raw data values. For example, suppose users sign in with their email addresses, and you want authentication to be case insensitive. You can store the email address as entered, but search with WHERE lower(email) = '<lowercased-email>'. The only way to use an index for such a query is with an expression index like so:
CREATE INDEX users_lower_email ON users(lower(email));
Another common example is finding rows for a given date when timestamps are stored in a datetime field. An index like CREATE INDEX articles_day ON articles ( date(published_at) ) can be used by a query containing WHERE date(articles.published_at) = date('2011-03-07').
Unique Indexes

A unique index guarantees that the table doesn’t have more than one row with the same value. It’s advantageous to create unique indexes for two reasons: data integrity and performance. Lookups on a unique index are very fast.
In terms of data integrity, a validates_uniqueness_of validation on an ActiveModel class doesn’t really guarantee uniqueness, because concurrent users can still create duplicate records. To truly guarantee uniqueness, always create the constraint at the database level, either with a unique index or a unique constraint.
There’s little distinction between unique indexes and unique constraints. Unique indexes can be thought of as the lower-level mechanism: expression indexes and partial indexes can’t be created as unique constraints, but they can be created as unique indexes; even partial unique indexes on expressions are possible.
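For instance, combining the two ideas yields a partial unique index on an expression. A sketch, where the users table and its deleted_at column are assumed for illustration:

```sql
-- Enforce case-insensitive email uniqueness, but only among
-- rows that haven't been soft-deleted.
CREATE UNIQUE INDEX users_active_lower_email_unique
    ON users (lower(email))
    WHERE deleted_at IS NULL;
```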
Multi-Column Indexes

While Postgres can create multi-column indexes, it’s important to understand when it makes sense to do so. The Postgres query planner can combine multiple single-column indexes in a multi-column query by performing a bitmap index scan. In general, you can create an index on every column that appears in query conditions, and in most cases Postgres will use them. So, make sure to benchmark and justify the creation of a multi-column index before you create one. As always, indexes come with a cost, and a multi-column index can only optimize queries that reference its leading columns, while multiple single-column indexes provide performance improvements to a larger number of queries.
However, there are cases where a multi-column index clearly makes sense. An index on columns (a, b) can be used by queries containing WHERE a = x AND b = y, or by queries using WHERE a = x only, but isn’t used by a query using WHERE b = y. So if this matches the query patterns of your application, the multi-column index approach is worth considering. Also note that in this case, creating an additional index on a alone would be redundant.
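A sketch of this trade-off, using a hypothetical table t with columns a and b:

```sql
CREATE INDEX t_a_b_idx ON t (a, b);

SELECT * FROM t WHERE a = 1 AND b = 2;  -- can use t_a_b_idx
SELECT * FROM t WHERE a = 1;            -- can use t_a_b_idx (leading column)
SELECT * FROM t WHERE b = 2;            -- can't use t_a_b_idx; would need
                                        -- a separate index on b
```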
B-trees and Sorting
B-Tree index entries are sorted in ascending order by default. In some cases, it makes sense to supply a different sort order for an index. Take the case where you’re showing a paginated list of articles, sorted with the most recently published first. We have a published_at column on our articles table, and for unpublished articles, the published_at value is NULL.
In this case, we can create an index like so:
CREATE INDEX articles_published_at_index ON articles(published_at DESC NULLS LAST);
In Postgres 9.2 and above, it’s worth noting that queries aren’t always required to go to the table, provided the index contains everything the query needs (that is, no unindexed columns are of interest). This feature is called an “index-only scan”.
Since we’re querying the table in sorted order by published_at and limiting the result, we get some benefit out of creating an index with the same order. Postgres finds the rows it needs from the index in the correct order and then goes to the data blocks to retrieve the data. If the index order didn’t match, there’s a good chance that Postgres would read the data blocks sequentially and sort the results.
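The paginated query that benefits from the index above might look like:

```sql
-- The index on (published_at DESC NULLS LAST) matches this ORDER BY,
-- so Postgres can read the first 10 entries straight off the index
-- instead of sorting the whole table.
SELECT id, title, published_at
FROM articles
ORDER BY published_at DESC NULLS LAST
LIMIT 10;
```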
With a single-column index, this technique is mostly relevant when you require “nulls last” behavior; otherwise the desired order is already available, since an index can be scanned in either direction. It becomes more relevant with a multi-column index when a query requests a mixed sort order, like a ASC, b DESC.
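For a mixed sort order, the per-column directions are declared in the index itself; a sketch on a hypothetical table t:

```sql
-- Matches ORDER BY a ASC, b DESC without an extra sort step.
CREATE INDEX t_a_asc_b_desc_idx ON t (a ASC, b DESC);

SELECT * FROM t ORDER BY a ASC, b DESC LIMIT 20;
```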
Managing and Maintaining Indexes
Indexes in Postgres don’t hold all row data. Even when an index is used in a query and matching rows were found, Postgres goes to disk to fetch the row data. Additionally, row visibility information (discussed in the MVCC article) isn’t stored in the index either, so Postgres must also go to disk to fetch that information.
With that in mind, you can see how in some cases using an index doesn’t really make sense. An index must be selective enough to reduce the number of disk lookups for it to be worth it. For example, a primary key lookup on a large enough table makes good use of an index: instead of sequentially scanning the table matching the query conditions, Postgres finds the targeted rows in the index, and then fetches them from disk selectively. For very small tables — a cities lookup table, for example — an index can be undesirable, even if you search by city name. In that case, Postgres can decide to ignore the index in favor of a sequential scan; Postgres performs a sequential scan on any query that hits a significant portion of a table. If you do have an index on such a column, it’s a dead index that’s never used, and indexes aren’t free: they come at a cost in terms of storage and maintenance.
For more on running production, staging, and other environments for your Heroku application, see our Managing Multiple Environments article.
When tuning a query and deciding which indexes make the most sense, be sure to use a database as similar as possible to the one that exists, or will exist, in production. Whether an index is used or not depends on a number of factors, including the Postgres server configuration, the data in the table, the index, and the query. For instance, trying to make a query use an index on your development machine with a small subset of “test data” is frustrating: Postgres determines that the dataset is so small that it’s not worth the overhead of reading through the index and then fetching the data from disk. Random I/O is much slower than sequential I/O, so the cost of a sequential scan can be lower than that of the random I/O introduced by reading the index and selectively fetching the data from disk. Perform index tuning on production, or on a staging environment that is as close to production as possible. On the Heroku Postgres database platform it’s easy to copy your production database to a different environment.
When you’re ready to apply an index on your production database, keep in mind that creating an index locks the table against writes; for large tables that can mean your site is down for hours. Fortunately, Postgres allows you to use CREATE INDEX CONCURRENTLY, which takes much longer to build but doesn’t require a lock that blocks writes. Ordinary CREATE INDEX commands require a lock that blocks writes but not reads.
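For example, the partial index from earlier could be added without blocking writes like so (note that CREATE INDEX CONCURRENTLY can’t run inside a transaction block):

```sql
-- Builds the index without taking a lock that blocks writes;
-- the build takes longer, but the table stays fully available.
CREATE INDEX CONCURRENTLY articles_flagged_created_at_index
    ON articles (created_at)
    WHERE flagged IS TRUE;
```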
Finally, indexes become fragmented and unoptimized over time, especially if the rows in the table are often updated or deleted. In those cases, you may need to perform a REINDEX, leaving you with a balanced and optimized index. However, be cautious about reindexing large indexes, as a plain REINDEX takes a write lock on the parent table (in Postgres 12 and above, REINDEX CONCURRENTLY avoids this). One strategy to achieve the same result on a live site is to concurrently build a second index on the same table and columns but with a different name, and then drop the original index and rename the new one. This procedure, while much longer, doesn’t require any long-running locks on the live tables.
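A sketch of that build-and-swap strategy, reusing the articles index from earlier:

```sql
-- 1. Build a fresh copy of the index under a temporary name,
--    without blocking writes.
CREATE INDEX CONCURRENTLY articles_published_at_index_new
    ON articles (published_at DESC NULLS LAST);

-- 2. Drop the bloated original and move the new index into place.
DROP INDEX articles_published_at_index;
ALTER INDEX articles_published_at_index_new
    RENAME TO articles_published_at_index;
```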
Postgres provides a lot of flexibility when it comes to creating B-tree indexes that are optimized to your specific use cases, as well as options for managing the ever-growing database behind your applications. These tips help you keep your database healthy, and your queries snappy.