What is an index
An index is a data structure that speeds up searching and data retrieval. It can be created on one or more columns of a table, and PostgreSQL stores the values of those columns, typically in a specific order, in a separate structure that is optimized for fast lookup. However, since this additional structure has to maintain some invariants on every write (for instance, the order of values), data modification may become slower.
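As a minimal sketch of the idea, assuming a hypothetical `orders` table (the table, column, and index names are illustrative only), single-column and multi-column indexes could be created like this:

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    total       numeric(12, 2) NOT NULL
);

-- Index on a single column (B-tree is the default index type).
CREATE INDEX orders_customer_id_idx ON orders (customer_id);

-- Index on several columns; the column order determines which
-- filters and sorts can make use of it.
CREATE INDEX orders_customer_created_idx ON orders (customer_id, created_at);
```

Every INSERT, UPDATE, or DELETE on `orders` now has to maintain both indexes in addition to the table itself, which is where the write overhead comes from.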
There are several types of indexes available in PostgreSQL, including B-tree, hash, GiST, SP-GiST, GIN, and BRIN. The type of index you choose will depend on your specific needs and the data you are working with. For instance, the B-tree index stores ordered values in a B-tree, which is a kind of balanced tree. Since the values are ordered, searching is fast; insertion, however, may be slower, because the order has to be preserved. Other index types apply different optimization techniques to address other specific use cases.
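The index type is selected with the `USING` clause of `CREATE INDEX`. The sketch below assumes a hypothetical `events` table; the names are made up, and which index type actually pays off depends on the queries you run.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE events (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    session_id  uuid NOT NULL,
    happened_at timestamptz NOT NULL,
    tags        text[] NOT NULL DEFAULT '{}'
);

-- B-tree: equality, range conditions, and ORDER BY on happened_at.
CREATE INDEX events_happened_at_btree ON events USING btree (happened_at);

-- Hash: equality comparisons only.
CREATE INDEX events_session_hash ON events USING hash (session_id);

-- GIN: looks up rows by the individual elements of the array.
CREATE INDEX events_tags_gin ON events USING gin (tags);

-- BRIN: a very small index that suits large tables whose physical row
-- order correlates with the indexed column (e.g. append-only logs).
CREATE INDEX events_happened_at_brin ON events USING brin (happened_at);
```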
When you run a query that filters or sorts on a column with an index, PostgreSQL can use the index to quickly locate the rows that match your search criteria. This can significantly improve the performance of your queries, particularly if you are working with large datasets or if you are running complex queries that would otherwise require a full table scan. However, not all search queries can be improved. For example, an index storing values in a sorted order can easily locate strings starting with a particular character, but will be of no help for queries that look for a specific suffix.
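The prefix versus suffix difference can be observed with `EXPLAIN`. This is a sketch on a hypothetical `customers` table; on an empty or very small table the planner may prefer a sequential scan in both cases, so try it with realistic data volumes.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE customers (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email text NOT NULL
);

-- text_pattern_ops lets a B-tree serve LIKE prefix patterns even when
-- the database does not use the C locale.
CREATE INDEX customers_email_prefix_idx
    ON customers (email text_pattern_ops);

-- Prefix search: can be answered as a range scan over the sorted values.
EXPLAIN SELECT * FROM customers WHERE email LIKE 'anna%';

-- Suffix search: the sort order does not help here.
EXPLAIN SELECT * FROM customers WHERE email LIKE '%@example.com';

-- One possible workaround (an assumption, not something the text above
-- prescribes): index the reversed string, so the suffix becomes a prefix.
CREATE INDEX customers_email_suffix_idx
    ON customers (reverse(email) text_pattern_ops);

EXPLAIN SELECT * FROM customers
WHERE reverse(email) LIKE reverse('@example.com') || '%';
```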
Overall, indexes can be a powerful tool for improving the performance of your queries and making it faster and easier to retrieve data from your PostgreSQL database. However, they are not free and need to be configured with actual usage patterns in mind, and the performance should be verified later on.
What makes the PostgreSQL JSONB data type powerful
PostgreSQL 9.4 introduced the JSONB type, which stores JSON documents in a decomposed binary format. That format lets you store documents efficiently, much like NoSQL databases do, without reparsing them whenever you want to access a field. It also gives you the flexibility to store data of any shape, without defining the schema in advance.
The plain JSON type, by contrast, stores the input as text and has to reparse it on every access. JSONB avoids the reparsing, but without an index a query still has to read every document to find the matching ones. It is therefore especially important to configure proper indexes on heavily used JSONB columns.
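To make this concrete, here is a small sketch of storing and querying JSONB. The `documents` table and the sample payloads are invented for illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE documents (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data jsonb NOT NULL
);

-- Rows may have completely different shapes.
INSERT INTO documents (data) VALUES
    ('{"type": "invoice", "paid": true, "customer": {"name": "Anna"}}'),
    ('{"type": "note", "text": "no fixed schema required"}');

-- "->" returns a jsonb value, "->>" returns text.
SELECT data -> 'customer' ->> 'name' AS customer_name
FROM documents
WHERE data ->> 'type' = 'invoice';

-- "@>" tests containment: does the document include this fragment?
SELECT id FROM documents WHERE data @> '{"paid": true}';
```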
How to use indexes with JSONB
For a JSONB column, you have a few options for indexing:
- GIN index: This indexes the contents of JSONB documents (keys and values) and supports fast containment and key-existence searches on that data. GIN is the index type that ships with operator classes for JSONB; GiST, often mentioned alongside it, has no built-in operator class for this type.
- B-tree index: A B-tree index is a general-purpose index. Created on a whole JSONB column, it only helps with comparing complete documents, which is rarely what you want. It is a good choice as an expression index on a single extracted field, where it supports equality, range conditions, and sorting on that field.
- Hash index: Hash indexes are fast for exact matches, but they do not support range queries or sorting. They are rarely a good choice for indexing JSONB data.
The best choice of index for JSONB data will depend on your specific needs and the types of searches you need to perform on your data. If you filter or sort on a small number of known fields, a B-tree index on those extracted fields may be a good choice. If you primarily need to search for specific keys or values anywhere within the JSONB documents, a GIN index may be a better choice.
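The options above could look as follows. This is a sketch on a hypothetical `documents` table with a `data` column; the field names `type` and `paid` are assumptions made for the example.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS documents (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data jsonb NOT NULL
);

-- GIN with the default operator class (jsonb_ops): supports containment
-- (@>) and the key-existence operators (?, ?|, ?&).
CREATE INDEX documents_data_gin ON documents USING gin (data);

-- GIN with jsonb_path_ops: usually smaller, supports containment (@>)
-- but not the key-existence operators.
CREATE INDEX documents_data_path_gin
    ON documents USING gin (data jsonb_path_ops);

-- B-tree expression index on one extracted field.
CREATE INDEX documents_type_btree ON documents ((data ->> 'type'));

-- Containment query: a candidate for either GIN index.
SELECT id FROM documents
WHERE data @> '{"type": "invoice", "paid": true}';

-- The expression matches the B-tree index definition exactly,
-- so this is a candidate for the expression index.
SELECT id FROM documents WHERE data ->> 'type' = 'invoice';
```

Note that a condition written as `data ->> 'paid' = 'true'` is not a containment test, so neither GIN index above can serve it; rewriting it with `@>` or adding a matching expression index is needed.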
What is a GIN/GiST index
There are two special index types for complex data types like arrays, JSON, and hstore (key-value pairs). These are GiST (Generalized Search Tree) and GIN (Generalized Inverted Index). They are similar in the sense that both of them improve the performance of data search and data retrieval.
However, there are some key differences between GiST and GIN indexes:
- Search capabilities: GiST is a framework for balanced tree indexes and supports a wide range of search operations, including spatial search, range overlap, nearest-neighbor search, and full-text search. GIN is an inverted index: it maps each element (an array item, a JSONB key or value, a word) to the rows that contain it, which makes it well suited to fast search and retrieval of specific values within complex data types.
- Index size and maintenance: For data that both can index, such as full-text search documents, GIN indexes are usually larger and slower to build and update, because every element gets its own entry. GiST indexes for such data are lossy: they store a compact summary, so candidate rows have to be rechecked against the table.
- Performance: In general, GIN indexes are faster than GiST indexes for searching for specific values within complex data types, including full-text search. GiST tends to be the better fit for spatial and range data, for nearest-neighbor queries, and for data that changes frequently.
The choice between GiST and GIN indexes will depend on your specific needs and the types of searches you need to perform on your data. If you work with spatial or range data, or need nearest-neighbor searches, GiST indexes may be a good choice. If you primarily need to search for specific values within complex data types such as arrays, JSONB, or text documents, GIN indexes may be a better choice.
GIN indexes can be a good choice for indexing complex data types and improving the performance of searches and queries on those data. However, as with any index type, there are trade-offs to consider. GIN indexes can be slower to build and update than other index types, and they may not be the best choice for every situation.
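As an illustration of where each index type fits, the sketch below uses a hypothetical `bookings` table: a GiST index serves range-overlap queries, and a GIN index serves full-text search. All names and literal values are assumptions.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE bookings (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    room   text NOT NULL,
    during tstzrange NOT NULL,
    notes  text NOT NULL DEFAULT ''
);

-- GiST: supports range operators such as overlap (&&).
CREATE INDEX bookings_during_gist ON bookings USING gist (during);

SELECT id FROM bookings
WHERE during && tstzrange('2024-01-01', '2024-01-08');

-- GIN: inverted index over the words of the notes column.
CREATE INDEX bookings_notes_fts_gin
    ON bookings USING gin (to_tsvector('english', notes));

-- The query has to repeat the indexed expression to be able to use it.
SELECT id FROM bookings
WHERE to_tsvector('english', notes) @@ to_tsquery('english', 'projector');
```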
GIN case study
We examined a case of filtering a view based on several fields of a JSONB column covered by a GIN index. We wanted to perform the filtering with a “simple” boolean check on flags stored inside the JSON.
To our surprise, we found that the database loaded 10 GB of data to perform the operation. This happened because JSONB is stored as whole documents in a binary format: despite the GIN index, the database had to fetch all the data to evaluate the check.
We decided to change the way we load the data. Instead of indexing the whole JSONB column, we added dedicated columns to hold the flags from the JSON that we were going to filter on, and we created B-tree indexes on these columns. We then applied partitioning and moved the raw JSON to another table, so that it is read only when needed.
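The resulting layout might be sketched as follows. This is not the actual schema from the case study: the table names, the flag columns, and the choice of `created_at` as the partition key are all assumptions made for illustration.

```sql
-- Narrow, partitioned table with the flags promoted to real columns.
-- (Partitioning by created_at is an assumption; the partition key has
-- to be part of the primary key.)
CREATE TABLE items (
    id          bigint NOT NULL,
    created_at  timestamptz NOT NULL,
    is_active   boolean NOT NULL,
    is_archived boolean NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE items_2024 PARTITION OF items
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- B-tree index on the flag columns used for filtering.
CREATE INDEX items_flags_idx ON items (is_active, is_archived);

-- Raw JSON lives in a separate table and is read only when needed.
CREATE TABLE item_payloads (
    item_id bigint PRIMARY KEY,
    payload jsonb NOT NULL
);

-- Filtering touches only the narrow table and its index.
SELECT id FROM items WHERE is_active AND NOT is_archived;

-- The full document is fetched by key for the rows that need it.
SELECT payload FROM item_payloads WHERE item_id = 42;
```

The trade-off is that the flag columns have to be kept in sync with the JSON when data is written.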
Lesson learned and summary
Indexes can make your queries run faster. However, when creating indexes, you should think about how the application will scale. You need to know your query structure, load your tables with dummy data, and check what changes you can make for better performance.
PostgreSQL has many different index types to choose from, and knowing when to use which one is crucial. The advantage of using indexes rather than a separate technology such as Elasticsearch is that you keep full control over how you search, and you may lower your application's development and maintenance costs.
However, indexes need to be verified. We should check the actual usage of an index (if filtering criteria can utilize the index), and whether the index improves the performance overall (including the increased cost of data modification).
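Both checks can be done with built-in tools. The sketch below assumes a hypothetical `documents` table with a JSONB `data` column; replace the table and query with your own.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS documents (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data jsonb NOT NULL
);

-- Shows the plan that was actually chosen, with timings and the number
-- of buffers read. Note that ANALYZE really executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM documents WHERE data @> '{"paid": true}';

-- Shows how often each index on the table has been scanned and how much
-- space it takes. Indexes with idx_scan = 0 over a representative period
-- cost writes and storage without helping any query.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'documents'
ORDER BY idx_scan;
```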
Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value.
PostgreSQL offers powerful insights into what’s going on inside the engine. You can use Metis to easily integrate observability solutions into your platform.