Faiss: a dense-vector retrieval framework

Posted by mvleus on Thu, 24 Feb 2022 07:16:21 +0100

Faiss is a library that provides efficient similarity search and clustering of dense vectors. The following walks through the demo from the official website.

# 1. First, build training data and test data
import numpy as np

d = 64  # dimension
nb = 100000  # database size
nq = 10000  # nb of queries
np.random.seed(1024)  # make reproducible
xb = np.random.random((nb, d)).astype("float32")  # database vectors, shape (nb, d) = (100000, 64)
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

# 2. Create an index. faiss builds an index and preprocesses the vectors to make queries efficient. faiss provides many index types; here we choose the simplest, IndexFlatL2, which does brute-force search with L2 distance.
# When creating an index you must specify the vector dimension d. Most indexes also need to be trained; IndexFlatL2 skips this step.
import faiss

index = faiss.IndexFlatL2(d)  # build the index
print(index.is_trained)

# 3. After the index is created (and trained, if necessary), you can call the add and search methods. add inserts the database vectors into the index; search finds the most similar vectors for each query.
index.add(xb)  # add vectors to the index
print(index.ntotal)

Pass in the query vectors to find similar vectors:

k = 4  # we want to see the 4 nearest neighbors
D, I = index.search(xq, k)  # actual search: D holds the distances to the k nearest neighbors, I holds their indices (IDs) in the database

print(I[:5])  # neighbors of the 5 first queries
print(D[-5:])  # distances for the 5 last queries

Speed up search

If too many vectors are stored, brute-force search with IndexFlatL2 becomes slow. IndexIVFFlat (inverted file) accelerates search: it uses k-means to build cluster centers, then at query time finds the nearest cluster centers and compares the query only against the vectors in those clusters to get similar vectors.
When creating an IndexIVFFlat you must pass another index as the quantizer, which is used to compute the distance (or similarity) to the cluster centers, and you must train the index before calling add.
Parameter introduction:
faiss.METRIC_L2: faiss defines two similarity metrics, faiss.METRIC_L2 (Euclidean distance) and faiss.METRIC_INNER_PRODUCT (vector inner product)
nlist: number of cluster centers
k: number of most similar vectors to return
index.nprobe: number of cluster centers to visit during search; the default is 1

nlist = 100  # Number of cluster centers
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist,
                           faiss.METRIC_L2)  # here we specify METRIC_L2,by default it performs inner-product search
assert not index.is_trained
index.train(xb)
assert index.is_trained

index.add(xb)  # add may be a bit slower as well
D, I = index.search(xq, k)  # actual search
print(I[-5:])  # neighbors of the 5 last queries

index.nprobe = 10  # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:])

Reducing the memory footprint: product quantization

The indexes IndexFlatL2 and IndexIVFFlat seen above store all vectors in memory. To scale to larger datasets, faiss provides a compression scheme based on the product quantizer that encodes each vector into a specified number of bytes. The stored vectors are compressed, so the distances computed at query time are approximate.

nlist = 100
m = 8
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same

index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 specifies that each sub_vector is encoded as 8 bits

index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k)  # sanity check
print(I)
print(D)
index.nprobe = 10  # make comparable with experiment above
D, I = index.search(xq, k)
print(I[-5:])
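The memory saving is easy to quantify: a flat index stores each vector as d float32 values, while IndexIVFPQ stores only m code bytes per vector (ignoring the small overhead of the coarse quantizer and codebooks). With the demo's parameters:

```python
d = 64        # vector dimension
m = 8         # number of PQ sub-vectors
nb = 100_000  # database size

flat_bytes = nb * d * 4  # float32 = 4 bytes per component
pq_bytes = nb * m * 1    # each sub-vector encoded in 8 bits = 1 byte

print(flat_bytes // (1024 * 1024), "MiB for the raw vectors")  # 24 MiB
print(pq_bytes // 1024, "KiB for the PQ codes")                # 781 KiB
print(flat_bytes // pq_bytes, "x compression")                 # 32 x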
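```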

Simplified index representation

As seen with IndexIVFFlat and IndexIVFPQ above, constructing these indexes requires providing another index (the quantizer) first. faiss also provides preprocessing methods such as PCA and LSH, and these are often combined, which makes index construction cumbersome. To simplify this, faiss can build an index from a string expression via index_factory. For example, the following expression reproduces the IndexIVFPQ instance created above:

index = faiss.index_factory(d, "IVF100,PQ8")


Topics: Algorithms, Machine Learning