Data analysis often comes down to comparing many objects with each other. This doesn’t pose many challenges when the objects are naturally ordered or can be measured with some metric. For instance, analyzing financial data is straightforward because all properties are numerical and can be easily manipulated.
However, things get harder once we try to analyze data that is not inherently numerical. Any modern data analysis faces this problem. There is no straightforward way to compare movies, songs, actors, PDF documents, images, or videos. Therefore, finding similarities is much harder, as we don’t have a simple method for finding neighbors or clustering objects.
To be able to perform such a comparison, we need to transform the objects into their numerical representation. The new representation is a vector (think of a list) of numbers representing various traits of the object. For instance, traits for a movie could include its star rating, duration in minutes, number of actors, or number of songs used in the movie. Once we build such a vector (called embedding) for each movie in our database, we can easily compare these vectors. However, doing that efficiently depends on the representation of the vectors and the technology stack we use.
Pgvector is an open-source extension for the PostgreSQL database that tries to address the problems we discussed and make the comparison and processing of vectors easy, fast, and standardized across technologies. By using pgvector, we can turn a regular PostgreSQL instance into a database capable of processing vectors. Pgvector implements multiple algorithms, supports indexes, allows for tuning the performance of the operations, and does all of that directly in PostgreSQL. This gives us the ability to keep both the business data and the embeddings together, work on them in one environment, and monitor the performance accordingly. We can effectively turn PostgreSQL into a modern open-source vector database.
In the rest of this post we are going to look at pgvector’s features, how it works, how it can be used to process embeddings, what benefits it brings, and how to get started. We’ll also analyze pgvector performance, database observability focused on vectors, and various pgvector applications.
Understanding Vector Databases and Vector Similarity Search
Vector databases, also known as vectorized databases or similarity search databases, are specialized databases designed to efficiently store, manage, and retrieve high-dimensional vector data. Unlike traditional relational databases that focus on structured data with rows and columns, vector databases are tailored to handle data represented as vectors or embeddings in a high-dimensional space. For the purposes of machine learning, vectors often represent embeddings.
Vectors in this context are mathematical representations of data points in a multi-dimensional space. They are often used to encode complex data such as images, audio, text, and other types of unstructured or semi-structured data. Each dimension of the vector corresponds to a feature or attribute of the data point, and the values in the vector capture the importance or presence of those features. Technically, raw vectors live in a high-dimensional space, and we often reduce them to smaller vectors in a low-dimensional space called embeddings. However, we use the terms vector and embedding interchangeably in this document.
The most important operation we want to perform on vectors is comparison and similarity checking. We want to capture vectors representing all the objects we consider (like all movies or videos), and compare some vector to all the vectors in the database to find the most similar ones. This clearly shows why a traditional database is not suitable for such an operation. In the SQL world, we typically think in terms of rows and columns, and we compare rows based only on a subset of columns to look for an exact match. However, in a vector database we include all the values inside the vectors and look for the distance between them.
Vector databases typically incorporate specialized indexing techniques and algorithms that optimize the search process for high-dimensional data. Examples of such algorithms include:
- KD-Tree: A data structure that partitions the data space into regions for efficient nearest neighbor searches.
- Random Projection: A technique that reduces dimensionality while preserving distances between points.
- Locality-Sensitive Hashing (LSH): A method that hashes similar points to the same bucket with high probability.
- Graph-Based Indexing: Constructing a graph where each node represents a data point and edges capture proximity relationships.
Vector databases have gained prominence with the rise of machine learning applications and the need to efficiently work with high-dimensional data. They enable fast and accurate retrieval of similar data points, contributing to improved user experiences and insights across various domains.
Pgvector combines the best of two worlds. It uses the SQL database under the hood, so it supports ACID, point-in-time recovery, partitioning, scalability, and other SQL features. It also supports exact and approximate nearest neighbor search, L2 distance, inner product, and cosine distance. Therefore, we can use pgvector along with our business data in the same modern database.
Key Features of pgvector
Pgvector is an extension for PostgreSQL. This makes it much easier to use than any other database because we don’t need to spin up a new solution. We can just take it off the shelf and integrate it with the database we already have. Additionally, we don’t need to migrate any data between the SQL database and the vector database. They can all live together in the same space, which makes the integration much easier and more efficient. This is especially important if you have ever built an ETL process backing a machine learning solution. Now we don’t need to migrate data between incompatible data sources.
PostgreSQL is a well-known and mature database. It supports ACID, role-based and row-level security, backups, partitioning, sharding, auditing, and much more. Therefore, we can use all the features required by the enterprise. This makes pgvector enterprise-ready immediately. We can use existing solutions and integrations, and only extend them with pgvector to store embeddings.
PostgreSQL supports all the features we expect from the SQL database. It supports joins, subqueries, window functions, stored procedures, triggers, and much more. Pgvector can be used with these easily.
Pgvector provides a new column type and new operators for the engine. The column type is used to store vectors of as many as 2000 dimensions. We can then use the new operators to perform calculations on the vectors. We can perform element-wise addition, subtraction, or multiplication. We can calculate the Euclidean, taxicab, or cosine distance of two vectors. We can calculate the inner product or the Euclidean norm. We can calculate the sum or average of vectors. Basically, we can run all the basic operations on vectors.
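As a quick sketch of what these operators look like in practice (the literal vectors here are arbitrary, and the `items` table in the last query is a hypothetical table with an `embedding` column):

```sql
-- Element-wise arithmetic on vectors
SELECT '[1,2,3]'::vector + '[4,5,6]'::vector;   -- [5,7,9]
SELECT '[1,2,3]'::vector - '[4,5,6]'::vector;   -- [-3,-3,-3]

-- Distances: <-> is Euclidean (L2), <#> is negative inner product,
-- <=> is cosine distance
SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector;
SELECT '[1,2,3]'::vector <=> '[4,5,6]'::vector;

-- Euclidean norm and aggregates
SELECT vector_norm('[3,4]'::vector);            -- 5
SELECT avg(embedding) FROM items;
```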
Pgvector also supports indexing, so we can improve the performance of the operations easily. There are two types of indexes: Inverted File (IVFFlat) and Hierarchical Navigable Small World (HNSW).
The IVFFlat index divides vectors into lists. The engine takes a sample of vectors in the database, clusters all the other vectors based on their distance to the selected centroids, and then stores the result. When performing a search, pgvector chooses the lists that are closest to the query vector, and then searches only those lists. Since IVFFlat uses a training step, it requires some data to be present in the database already when building the index. We need to specify the number of lists when building the index, so it’s best to create the index after we fill the table with data.
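Assuming a hypothetical `items` table with an `embedding` column, building and tuning an IVFFlat index looks roughly like this (`lists = 100` is just an example value; it should be tuned to the size of your dataset):

```sql
-- Build the index after the table is populated; the engine clusters
-- the vectors into 100 lists using L2 distance
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- At query time, control how many lists are scanned
-- (a speed vs. recall trade-off)
SET ivfflat.probes = 10;
```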
An HNSW index is based on a multilayer graph. It doesn’t require any training step (unlike IVFFlat), so the index can be constructed even on an empty database. HNSW build time is slower than IVFFlat and uses more memory, but it provides better query performance afterwards. It works by creating a graph of vectors based on an idea similar to a skip list. Each node of the graph is connected to some distant vectors and some close vectors. We enter the graph from a known entry point, and then follow it on a greedy basis until we can’t move any closer to the vector we’re looking for. It’s like starting in a big city, taking a flight to some remote capital to get as close as possible, and then taking a local train to finally reach the destination.
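An HNSW index on the same hypothetical `items` table can be created as follows (the parameter values shown are pgvector’s defaults and are tunable):

```sql
-- m controls the number of connections per graph node,
-- ef_construction the size of the candidate list used while building
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

-- Size of the dynamic candidate list used during search
-- (higher values improve recall at the cost of speed)
SET hnsw.ef_search = 40;
```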
Over time, there will be more algorithms and solutions for calculating vector similarity in PostgreSQL. However, even today these features make pgvector a very attractive choice for a vector database. We can load embeddings using well-known tools, provide high availability with the enterprise-ready workflows we already have, and apply the optimization strategies we already know.
Benefits of Using pgvector
The biggest advantage of using pgvector over a dedicated vector database is the ability to keep all the data in the same database, and using the same workflows that we currently use in production. We don’t need to migrate to other solutions, deploy new databases, and build new operational procedures. That makes the adoption of pgvector much easier, as we can make pgvector production-ready immediately.
Speed and precision are important. Pgvector is relatively new, so there are not many benchmarks presenting its current performance, and things change very fast. An initial benchmark by NirantK indicated that pgvector is both slower and less precise than Qdrant. However, another benchmark, by Supabase, indicates that pgvector can easily be scaled to match or even outperform Qdrant. Yet another benchmark, by Jonathan Katz, shows great improvements from proper indexing.
Benchmarks show that we can apply the optimization strategies we already know and use daily. Thanks to that, pgvector becomes much more predictable than other database engines. This is a must-have for any system running in production, because we can’t just fly blind when things break in production. We need well-known solutions that we can apply immediately to get improvements. There are plenty of materials on PostgreSQL performance tuning, and they apply to pgvector as well.
Another important point is that pgvector can be used basically anywhere. We don’t need to install new databases, learn new syntax, or even use new libraries in our applications. We can just take what we have, add one extension, and have the full power of vector databases. This changes the landscape of vector databases and open source significantly, because they turn from a “new approach” into a commodity, especially since PostgreSQL is hosted by multiple providers, both in the cloud and on-premise.
Since pgvector is open source, we can read its internals, tune them to our needs, and deploy it even in the most restricted contexts. We can use it in personal projects, big enterprise applications, and government initiatives that require handling critical data. We can also get support from the community for free, without acquiring expensive licenses to start using it.
Real-world Use Cases
Vector databases are crucial for machine learning scenarios. The landscape of AI changes quickly. A couple of years ago, we had to implement solutions ourselves. Later, cloud providers started introducing bigger building blocks that we could compose to build enterprise-ready applications. Now we can take pretrained models, deploy them locally, and use them within minutes, especially with the rise of models such as ChatGPT and LLaMA.
Instead of training a custom model, we can take a pretrained one, provide it with embeddings of our local knowledge, and then run the inference. For instance, we can take a large language model, extend it with documents describing the internals of our company, and then deploy a chatbot this way. In order to extend the pretrained model with our data, we need to be able to run a vector search on the embeddings of our documents. This is where pgvector can be used.
A real-world example would look as follows:
- Collect the documents you would like to use to power the chatbot
- Calculate embeddings for these documents
- Store the embeddings in PostgreSQL with pgvector
- Deploy an OpenAI model to handle questions from users
- Use pgvector to find the documents closest to the embedding of the question asked by the user
- Supply these documents as context to the OpenAI model
- Let the pretrained model generate the answer
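The retrieval step in the middle of this flow is a single query. Assuming a hypothetical `documents` table and a `:question_embedding` placeholder standing in for the embedding computed from the user’s question, it might look like:

```sql
-- Hypothetical schema: documents(id, content, embedding vector(1536))
-- :question_embedding is the embedding of the user's question
SELECT id, content
FROM documents
ORDER BY embedding <=> :question_embedding
LIMIT 5;
```

The five returned documents are then pasted into the prompt as context before the model generates its answer.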
Another pgvector use case would be a search engine for our internal knowledge database. We can turn each wiki page into an embedding, store them in the database, and then search efficiently based on the input. This would include the context of the query we ask. It would also provide an understanding of how things evolved over time, since pretrained models can understand the dependencies between documents and solutions.
Vector search is a foundation of many other operations. Finding documents, images, videos, classifying objects into categories, grouping objects together, translating between languages, and many more. Since this is just the beginning of the AI era, we’ll see many more applications coming in every single day.
Getting Started with pgvector
Pgvector is an extension to PostgreSQL, so using it is no different from using any other extension.
We start by cloning the extension repository. We then need to compile and install it, which comes down to the following:
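Assuming a local PostgreSQL installation with its development headers installed and `pg_config` on the PATH, the steps are roughly:

```shell
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
make install  # may require sudo
```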
Pgvector is experimentally supported on Windows as well, so you can use it anywhere.
Once you have the extension installed, you need to create it in the database:
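In each database where you want to use vectors, run:

```sql
CREATE EXTENSION vector;
```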
If the extension is not loaded automatically, you may need to tune shared_preload_libraries and restart the database. Tread carefully when doing that in your production database.
Once you have the extension installed, you can create a table with a column for storing the embeddings:
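For example, a table for three-dimensional embeddings could look like this (`items` is a hypothetical name, and the dimensionality is up to you):

```sql
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
```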
You can then insert vectors:
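Vectors are written as bracketed literals; assuming an `items` table with a `vector(3)` column named `embedding`:

```sql
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
```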
Finally, you can start querying the data by using new operators provided by the extension:
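Continuing with the same hypothetical `items` table, a nearest-neighbor query orders rows by their distance to the query vector:

```sql
-- Five nearest neighbors by Euclidean distance (<->)
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

-- Cosine distance (<=>) and negative inner product (<#>) work the same way
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 1;
```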
When doing that in production, make sure that you don’t take your database down during peak hours. Plan ahead and prepare for the installation carefully to avoid issues.
After installation, make sure that you keep the extension up to date. It is evolving very quickly, and new features and improvements are coming every day. Unfortunately, there are no automated solutions for keeping your extensions up to date yet, so you need to plan that manually.
If you use a PostgreSQL database in a cloud, make sure that you can install the extension there. For instance, AWS RDS introduced support for pgvector recently. This makes experimentation much easier, as you don’t even need to compile the extension on your own.
Vector databases powering the new machine learning solutions will become a commodity in the future. Thanks to pgvector, we don’t need to deploy a new database to start using the full power of vector search. We can install pgvector in the existing instance of our PostgreSQL server, no matter if we run in the cloud or on-premise.
Pgvector is evolving very quickly. New features are coming every day, and we can already see that it matches the performance and precision of other vector databases. Since pgvector is open source, we can deploy it even in the most critical environments. We think that it will become a common part of the engineering ecosystem in the future, just like other cloud components that we use every day.
We can use pgvector to power machine learning solutions in our companies right now. Deploying a chatbot or a search engine aware of all the internals of our companies has never been easier.
What is pgvector, and how does it revolutionize similarity search on PostgreSQL?
Pgvector is an extension for PostgreSQL that provides vector database capabilities. It adds a new column type for storing embeddings and implements operators for running vector search.
How do vector-based similarity searches differ from traditional methods in databases?
Vector search algorithms compare vectors based on their distance. It’s not a regular comparison of a column within a table, but rather a way of ordering things by their similarity and choosing the most similar ones.
What is the underlying technology and mechanics behind pgvector's functionality?
Pgvector supports two types of indexes: IVFFlat and HNSW. It also supports exact and approximate nearest neighbor search, with distance measures such as L2 distance, inner product, and cosine distance.
How does pgvector impact the efficiency and speed of searching large datasets?
Pgvector can be configured to balance performance and precision. By using different indexes or different indexing parameters, we can scale the database easily.
Are there any use cases or industries that particularly benefit from pgvector in PostgreSQL?
Machine learning is the most prominent domain for vector search. Chatbots, search engines, and knowledge mapping greatly benefit from vector search.
What are the requirements for implementing and using pgvector in a PostgreSQL database?
Pgvector can be used with any PostgreSQL instance, whether in the cloud or on-premise, locally or in Docker.
Can users with limited technical expertise utilize pgvector effectively, or does it require specialized knowledge?
Pgvector makes vector search easy and affordable for any user of an SQL database. Thanks to using the same mechanisms and the same SQL language, any user can start playing with vector databases right away.