Applications and Principles of pgvector
Introduction
As large language models like ChatGPT become increasingly prevalent, vector databases, touted as a potent way to extend ChatGPT's knowledge base, have been thrust into the spotlight. In particular, pgvector, a plugin for PostgreSQL, equips the system with vector storage and query capabilities, placing PostgreSQL among the ranks of powerful vector databases. In this article, we introduce how pgvector is used and the principles behind its implementation, hoping to give readers an intuitive understanding of vector databases.
Key Takeaway: As a plugin for PostgreSQL, pgvector offers vector storage and querying capabilities, turning PostgreSQL into a powerful vector database.
Embedding
Before delving into the plugin, let's first discuss a mathematical concept known as "embedding." It is a data processing technique often employed to map intricate data structures, such as text, images, and audio, onto lower-dimensional vectors (referred to as "embedding vectors" in the following sections). Each dimension of the embedding vector corresponds to a latent feature of the original data, making this mapping akin to "embedding" the characteristics of the complex original data structure into the vector. The nature of these embedding vectors allows us to gauge the similarity between complex data structures by calculating the distance between vectors, using metrics like the L2 distance, cosine distance, etc.
Pgvector equips PostgreSQL with the capability to persist these embedding vectors, while also introducing an index for expedited "similarity vector lookup." This lookup refers to identifying the Top N vectors that are closest to a given vector from a pool of persisted embedding vectors. This feature is pivotal in various real-world scenarios, including recommendation systems, facial recognition, large language models, and more. Consequently, the acceleration provided by pgvector holds tangible significance.
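To make "similarity vector lookup" concrete, here is a minimal Python sketch (plain NumPy, with made-up three-dimensional vectors standing in for real embedding vectors) that ranks a handful of stored vectors by cosine distance to a query vector and returns the Top N; pgvector performs this kind of lookup inside the database.
import numpy as np

# A tiny, hypothetical "store" of embedding vectors (one row per document/image/...).
stored = np.array([
    [0.12, 0.80, 0.31],
    [0.90, 0.05, 0.10],
    [0.15, 0.75, 0.40],
])
query = np.array([0.10, 0.82, 0.35])

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Brute-force Top-N lookup: compute the distance to every stored vector, then sort.
distances = [cosine_distance(v, query) for v in stored]
top_n = np.argsort(distances)[:2]   # indices of the 2 nearest vectors
print(top_n, [distances[i] for i in top_n])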

Use Cases
Before diving deep into pgvector's mechanics, let's first explore two practical applications to offer a clearer understanding of its potential.
Our first example showcases the synergy between pgvector and ChatGPT in creating an intelligent customer service chatbot for a company.
As we know, ChatGPT is a versatile model, trained on millions of documents, granting it the ability to comprehend human language and answer queries based on the information it received during training. However, its knowledge base is not updated in real-time. So, if you pose questions about the latest news or a product it hasn’t encountered before, it might provide an inaccurate answer. Considering ChatGPT's stellar linguistic understanding, a feasible approach would be to feed relevant documents or data as context for ChatGPT’s responses, instantly transforming it into an "expert" on the given product. But two challenges arise: First, there's a size limit to the context that ChatGPT can handle. If the document or data surpasses this threshold, ChatGPT might falter due to missing information. Second, ChatGPT operates on a token-based pricing system, so continuously feeding it with full documents can be uneconomical. Therefore, it becomes essential to selectively pick relevant documents as context based on user queries. The following flowchart illustrates a comprehensive solution:

Pre-processing Workflow:
Start by segmenting the product documentation into a series of sub-documents based on topics.
Use ChatGPT's text embedding model to embed each sub-document, obtaining the corresponding embedding vectors.
Persist both the sub-documents and their respective embedding vectors in a PostgreSQL database equipped with the pgvector plugin.
User Query Processing Workflow:
Utilize ChatGPT's text embedding model to embed the user's query, producing the query's embedding vectors.
With the query's embedding vectors, perform a similarity vector search in the PostgreSQL database to identify the most closely matched sub-document.
Construct a prompt combining the sub-document, the user's query, and any additional role-guidance information, and then submit it to ChatGPT.
Relay ChatGPT's response to the user.
Following the steps above, we can establish an economically viable intelligent customer service bot! In this example, the PostgreSQL instance with the pgvector plugin manages the persistence of the segmented sub-documents and their embedding vectors. It offers rapid similarity vector search capabilities, supplying ChatGPT with the most relevant context.
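As a rough illustration of the two workflows above, here is a hedged Python sketch using psycopg2; the table name product_docs, the 1536-dimension column, and the embed() and chat() helpers (stand-ins for calls to ChatGPT's embedding and chat APIs) are assumptions for illustration, not the actual implementation.
import psycopg2

def embed(text):
    """Placeholder: call the text embedding model and return its embedding vector (a list of floats)."""
    raise NotImplementedError

def chat(prompt):
    """Placeholder: send the assembled prompt to ChatGPT and return its reply."""
    raise NotImplementedError

conn = psycopg2.connect("dbname=kb")
cur = conn.cursor()

# Pre-processing: persist each sub-document together with its embedding vector.
cur.execute("""CREATE TABLE IF NOT EXISTS product_docs (
                   id bigserial PRIMARY KEY,
                   content text,
                   embedding vector(1536));""")
for chunk in ["sub-document 1 ...", "sub-document 2 ..."]:
    # Pass the vector in its text form, e.g. '[0.1, 0.2, ...]'.
    cur.execute("INSERT INTO product_docs (content, embedding) VALUES (%s, %s::vector);",
                (chunk, str(embed(chunk))))
conn.commit()

# Query processing: embed the question, fetch the closest sub-document, and build the prompt.
question = "How do I reset the device?"
cur.execute("SELECT content FROM product_docs ORDER BY embedding <=> %s::vector LIMIT 1;",
            (str(embed(question)),))
context = cur.fetchone()[0]
print(chat(f"Answer using only this document:\n{context}\n\nQuestion: {question}"))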
The second example is a fascinating mini-application, complete with code for readers to experiment with on their own. The project can be found at [link].
This example uses the deep learning model CLIP, released by OpenAI in 2021. A distinctive feature of the CLIP model is its ability to embed both images and text. More intriguingly, the embedding vectors it generates for both modalities share the same embedding space: if a text description corresponds to an image, their respective embedding vectors will be close to each other. In this case study, the CLIP model powers an application in which, given a sentence, the most relevant footage from surveillance videos can be located. The following figure illustrates the structure of this application:

This application performs frame sampling on a surveillance video, capturing a series of frames or snapshots. These snapshots are then embedded using the image encoder provided by the CLIP model, yielding corresponding embedding vectors. We then persist these frames (or their frame numbers) and embedding vectors in PostgreSQL. When a user wishes to search within the surveillance video—for instance, to locate a missing child—they can describe the child's features in text. The application utilizes the text encoder provided by the CLIP model to embed this description, generating an embedding vector for the text. This vector is used to perform a similarity vector search in the database, producing the image that best matches the description. As can be observed, the logic of this application mirrors that of the earlier-discussed intelligent customer service bot. This underscores that similarity vector search has a broad range of use cases.
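For readers curious about the embedding side, here is a hedged sketch (not the project's actual code) of producing CLIP embeddings with the Hugging Face transformers library. The checkpoint name and the frame filename are assumptions, and the embedding dimension depends on the checkpoint chosen (openai/clip-vit-base-patch32 produces 512-dimensional vectors, so the vector column would need to match whatever model is actually used).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a sampled video frame with CLIP's image encoder.
frame = Image.open("frame_0001.jpg")           # a hypothetical snapshot from the video
image_inputs = processor(images=frame, return_tensors="pt")
image_vec = model.get_image_features(**image_inputs)[0].detach().tolist()

# Embed a textual description with CLIP's text encoder; both vectors share the same space,
# so their cosine distance reflects how well the text matches the image.
text_inputs = processor(text=["a small child wearing a red coat"],
                        return_tensors="pt", padding=True)
text_vec = model.get_text_features(**text_inputs)[0].detach().tolist()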
Next, let's focus on how this project leverages the functionalities provided by the pgvector plugin.
Firstly, a table named video_features is created in the code. This table is used by the project to persist the surveillance snapshots (though in actual implementation, it may only record the frame numbers) and their embedding vectors. The feature column is designated to store the embedding vectors, which are of vector type with a dimension of 1024. This vector type is a new data type introduced to PostgreSQL by the pgvector plugin.
cur.execute(
    """CREATE TABLE IF NOT EXISTS video_features (
           id bigserial PRIMARY KEY,
           image bytea,
           feature vector(1024));"""
)
As shown in the code below, an index is created on the feature column of the video_features table to accelerate similarity vector searches. The index type is ivfflat, one of the new index types introduced to PostgreSQL by the pgvector plugin. The operator class specified is vector_cosine_ops, indicating that cosine distance is used to measure the dissimilarity between vectors. But what is lists? Let's keep that a mystery for now; we will delve deeper into the principles of pgvector later.
cur.execute(
    """CREATE INDEX ON video_features
       USING ivfflat (feature vector_cosine_ops) WITH (lists = 100);"""
)
Following that, we have the code that performs a similarity vector search in the PostgreSQL database using text embedding vectors. The operator <=>, introduced by the pgvector plugin, calculates the cosine distance between two vectors. Thus, the SQL in the code sorts the surveillance snapshots based on their cosine distance to the provided text embedding vectors, retrieving the top 5 snapshots that best match the text description. Without the support of an index, this SQL query would have to traverse all the embedding vectors, which would be highly inefficient with a large amount of data. However, with the help of the previously created ivfflat index, the database can now swiftly respond to user queries.
cur.execute(
    """SELECT image FROM video_features
       ORDER BY feature <=> %s::vector LIMIT 5;""",
    (np.array(search_feat.detach()).tolist(),),
)
Implementation Principles
Through the two examples presented earlier, readers should now have an intuitive understanding of pgvector. In essence, the pgvector plugin offers three primary functionalities:
• Introduces a new data type called vector;
• Introduces operators to calculate distances between vectors;
• Introduces indexes to speed up similarity vector searches.
In the following sections, we will further explore how pgvector implements these functionalities, providing readers with a deeper appreciation of the extensibility of PostgreSQL databases.
Data Type
As a highly extensible database, PostgreSQL allows users to define their own data types, operators, functions, indexes, and even optimizers and executors. For custom data types, if you can simply combine existing types to get the desired new type, the provided create type syntax in PostgreSQL can create it in one step. For more complex data types, like the vector type introduced by pgvector, it can also be created in just a few simple steps:
1. Define the Memory Structure and the Textual and Binary Representations of the Vector.
First, regarding the memory structure: for a fixed-length structure, you can define it however you like as long as the member variables aren't pointers. For a variable-length structure like vector, you need to follow PostgreSQL's memory layout conventions for variable-length (varlena) data types to ensure compatibility. Specifically, the first field of the structure must be a one-byte or four-byte header encoding the structure's length, compression status, etc. Hence, the first field of Vector is of type int32, and the subsequent fields hold the vector's own data. PostgreSQL provides many macros for handling variable-length data types (e.g., setting and getting the length).
typedef struct Vector
{
    int32 vl_len_;   /* varlena header (do not touch directly!) */
    int16 dim;       /* number of dimensions */
    int16 unused;
    float x[FLEXIBLE_ARRAY_MEMBER];
} Vector;
Applications typically interact with databases either through text-based SQL or through the binary format of the extended (prepare/execute) protocol. Therefore, every data type must also have corresponding textual and binary representations. The text representation of a vector is a series of floating-point numbers enclosed in square brackets, such as "[0.1221, 1.312, -3.31]". The binary format is a byte sequence that stores, in order, the binary data of the vector structure's dim, unused, and x fields.
2. Define Functions to Convert Between the Vector's Memory Structure and Its Textual and Binary Representations.
3. Define Functions to Parse Type Modifiers (Typmod).
PostgreSQL allows types to take modifiers, which provide additional information that further constrains the type, such as the "12" in vector(12), which specifies the vector's dimension. PostgreSQL doesn't restrict what the modifier looks like, but it requires the modifier to be encoded as a single int stored in the system catalogs, so developers need to define corresponding functions to parse these type modifiers and encode them as integer values.
4. Officially Define the Vector Type Using the create type Statement.
The create type statement specifies not only the new type's name but also the conversion functions defined in step 2, the typmod parsing function from step 3, and other relevant details such as the storage strategy. With this information, PostgreSQL knows how to read, write, and represent values of the vector type.
CREATE TYPE vector (
    INPUT = vector_in,            -- convert from text to vector
    OUTPUT = vector_out,          -- convert vector to text format
    RECEIVE = vector_recv,        -- convert from binary to vector
    SEND = vector_send,           -- convert vector to binary format
    TYPMOD_IN = vector_typmod_in, -- parse the type modifier, e.g., the "12" in vector(12)
    STORAGE = extended            -- stored inline in the main table when short; otherwise stored externally
);
5. Define Type Conversion Functions.
After the preceding steps, the vector type has been successfully created. However, one problem remains: PostgreSQL doesn't fully understand the true meaning of the type and its dimensions. If users insert data with dimensions that don't match the column definition, PostgreSQL mechanically creates a memory structure and persists it directly to table files without checking if dimension information is correct. For example, the following SQL won't throw an error when executed:
CREATE TABLE t (a vector(2));
INSERT INTO t VALUES ('[0.12, -0.213, 0.782]');
To solve this issue, you need to define a conversion (cast) function from the vector type to itself. When PostgreSQL receives data carrying a type modifier, it cannot be sure whether the data actually conforms to that modifier's constraints, so it looks for and calls a conversion function for that type, passing in the target type's modifier to check or adjust the data accordingly. Therefore, pgvector also defines such a conversion function for the vector type. For more details on create cast, refer to the official PostgreSQL documentation.
CREATE CAST (vector AS vector)
WITH FUNCTION vector(vector, integer, boolean) AS IMPLICIT;
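As a quick illustration (a sketch, not taken from pgvector's own tests), once this cast is in place the earlier dimension mismatch is rejected at insert time; the connection string and the paraphrased error text are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS t (a vector(2));")
try:
    # Three dimensions into a vector(2) column: the cast now checks the typmod.
    cur.execute("INSERT INTO t VALUES ('[0.12, -0.213, 0.782]');")
except psycopg2.Error as e:
    print(e)   # e.g. an error along the lines of "expected 2 dimensions, not 3"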
Operators
To facilitate more sophisticated operations on data of the vector type beyond basic reading and writing, it's necessary to define functions or operators associated with the vector type. In the PostgreSQL database, creating custom functions or operators is straightforward using the create function/operator syntax. Detailed information can be found on the official PostgreSQL website. The pgvector extension defines a series of functions and operators for the vector type. Among them, three functions and operators are related to vector distances:
| Distance Type | Operator | Function |
|---|---|---|
| L2 / Euclidean distance (length of the difference between the two vectors) | <-> | l2_distance |
| Negative inner product (negative of the inner product of the two vectors) | <#> | vector_negative_inner_product |
| Cosine distance (one minus the cosine of the angle between the two vectors) | <=> | cosine_distance |
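As a small worked example of the three distances (a hand computation for illustration, not pgvector code), take a = [1, 2] and b = [2, 0]:
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.0])

l2 = np.linalg.norm(a - b)          # <-> : sqrt((1-2)^2 + (2-0)^2) ≈ 2.236
neg_inner = -np.dot(a, b)           # <#> : -(1*2 + 2*0) = -2
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # <=> : ≈ 0.553
print(l2, neg_inner, cosine)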
With these operators and functions defined, one can execute SQL queries to perform similarity vector searches. For example:
-- Using the operator:
SELECT * FROM video_features ORDER BY feature <=> '[0.212, 0.312, 0.312]' LIMIT 5;
-- Using the function:
SELECT * FROM video_features ORDER BY cosine_distance(feature, '[0.212, 0.312, 0.312]') LIMIT 5;
Indexes
When executing the SQL for similarity vector searches described in the previous section, all of the data must be traversed. With a large dataset, performance becomes a significant issue. So, how can execution performance be improved? Given that embedding vectors usually have high dimensionality, it's challenging to implement similarity vector search algorithms that are both exact and efficient. Consequently, the prevalent search algorithms in the industry are "approximate", striking a balance between accuracy and speed. Currently, pgvector implements two approximate similarity vector search algorithms: the ivfflat algorithm and the hnsw algorithm, with the corresponding indexes named ivfflat and hnsw. In the following, we will focus on the implementation of the ivfflat algorithm.
ivfflat Algorithm
Core ideas behind the ivfflat algorithm are:
Randomly sample n*50 examples from the dataset, where n is the number of clusters.
Use the k-means algorithm to cluster the samples, obtaining a centroid for each cluster.
Based on the distance to each centroid, divide the entire dataset into n large clusters.
When searching for similarity vectors, first calculate the distance between the input vector and the centroid of each cluster. Then, only search within the cluster that is closest to the input vector.

By doing so, there's no need to scan all data, significantly enhancing search performance. So, how is the number of clusters determined? This is decided by the lists attribute when creating the index. For instance, in the SQL statement shown, specifying "with (lists=100)" indicates 100 clusters.
CREATE INDEX ON <table name> USING ivfflat (<column name> <operator class>) WITH (lists = 100);
It's important to note that a higher lists value can significantly shrink the search space and improve query speed, but it may also reduce recall, because vectors that should match may fall into clusters that are never scanned; in addition, the overhead of comparing against the centroids grows proportionally. If you're unsure about the optimal lists value, here is a recommendation: for datasets of up to one million rows, set lists to rows / 1000 (so that each cluster holds roughly 1,000 vectors); for larger datasets, set lists to the square root of the row count.
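A trivial helper reflecting this rule of thumb (an illustration, not part of pgvector):
import math

def recommended_lists(row_count):
    # Rule of thumb: rows / 1000 clusters up to one million rows, sqrt(rows) beyond that.
    if row_count <= 1_000_000:
        return max(row_count // 1000, 1)
    return int(math.sqrt(row_count))

print(recommended_lists(500_000))    # 500
print(recommended_lists(4_000_000))  # 2000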
Why is ivfflat considered an approximate similarity vector search algorithm? Consider the diagram below, where the vector being queried is represented by a red dot. Ideally, the most similar vector is one located in the light green region near the cluster boundary. However, because the red dot falls in the light purple cluster, the search only examines vectors within that cluster, and the truly nearest vector is missed.

To mitigate this issue, the ivfflat algorithm has a runtime parameter called probes, which controls the number of clusters to search. For example, setting probes to 2 would search the two clusters nearest to the input vector.
set ivfflat.probes = 2;
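To make the algorithm above concrete, here is a minimal NumPy sketch of ivfflat-style indexing and search; it uses L2 distance and small made-up data, and it illustrates the idea rather than pgvector's actual C implementation. The lists and probes variables play the same roles as the parameters discussed above.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 16))   # pretend these are the stored embedding vectors
lists, probes = 10, 2                # same meaning as the index and search parameters

# Build phase: sample the data, run a few k-means iterations to obtain centroids,
# then assign every stored vector to its nearest centroid (one cluster per centroid).
sample = data[rng.choice(len(data), size=lists * 50, replace=False)]
centroids = sample[rng.choice(len(sample), size=lists, replace=False)]
for _ in range(10):
    assign = np.argmin(np.linalg.norm(sample[:, None] - centroids[None], axis=2), axis=1)
    for c in range(lists):
        members = sample[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
cluster_of = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)

# Search phase: find the `probes` nearest centroids, then scan only those clusters.
def search(query, top_n=5):
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:probes]
    candidates = np.where(np.isin(cluster_of, nearest))[0]
    order = np.argsort(np.linalg.norm(data[candidates] - query, axis=1))
    return candidates[order[:top_n]]

print(search(rng.normal(size=16)))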
Physical Structure of the ivfflat Index
After understanding the principles of the ivfflat algorithm, let's look at how the ivfflat index is physically organized. As illustrated in the diagram below, the ivfflat index is divided into two levels, each of which is a singly linked list of physical pages. The first level stores the centroid vector of each cluster, and each centroid carries a "pointer" to the second level. The second level stores all of the vectors belonging to each cluster.

Considerations for ivfflat
When using the ivfflat index, consider the following points:
• When indexing an empty table, centroids for clusters are created randomly. Given that randomly assigned centroids can affect search accuracy, it is recommended to create the ivfflat index after data ingestion is complete.
• Once the centroids of clusters are determined, they won't change, even if the subsequent data undergoes drastic changes. If the distribution of data changes, it's recommended to use the reindex command to recreate the index.
• The ivfflat index can only be used to optimize sorting. In PostgreSQL, indexes can be categorized as supporting "search" or "order". For instance, the common btree index falls under the "search" category, used to quickly find tuples that meet certain filtering conditions. ivfflat belongs to the "order" category, specifically for ordering by expressions of the form "indexed column OP constant", where OP is one of the three distance operators mentioned earlier. Even though pgvector offers functions equivalent to these operators, using those functions in an order by clause will not make use of the ivfflat index, even if the function names read more clearly.
• The returned result set might be smaller than expected. Because the ivfflat index only scans the clusters closest to the input vector (as controlled by probes), even if more matching rows exist outside those clusters, only data from the scanned clusters is returned. This deviates from what standard SQL semantics would suggest. Since similarity vector searches typically only care about the few most similar vectors, this rarely poses a serious problem, but one needs to be aware of the behavior or the results may be surprising.
Conclusion
This article presented two examples of using pgvector and introduced the principles and implementation details behind it. We hope readers have found it valuable. Due to space constraints, some details were not covered, such as the mechanisms PostgreSQL provides for defining new index types and pgvector's other index type, hnsw. We hope to analyze these in greater depth in a future article.
Q&A
Q1: Is there a plan for heterogeneous computing on vectors, like using GPUs for accelerated vector computations?
PostgreSQL is a highly extensible database. If there's a huge computational demand related to vectors, it's entirely possible to customize specific operators to offload these calculations to the GPU. Implementing this in PostgreSQL is not complex.
Q2: How are the centroids for ivf selected?
The centroids are determined by sampling the original data and then running the k-means clustering algorithm. If the amount of existing data is less than the number of clusters, random centroid vectors are generated.
Q3: If the number of vectors in each cluster region is large, the search could be slow. How do you address this?
With the ivfflat index, you can adjust the lists and probes parameters to reduce the number of vectors that are searched. Alternatively, you can try the other index type in pgvector, hnsw, which typically examines far fewer vectors than an ivfflat index.
Q4: After creating the index, do you advise against continuing to insert data?
It's not that you should stop inserting data after creating the index; rather, the recommendation is to rebuild the index if the data distribution changes significantly.
