Faiss is a library that provides efficient similarity search and clustering for dense vectors. The following walks through the demo provided on the official website.
    # 1. First, build the database vectors and the query vectors
    import numpy as np

    d = 64                         # dimension
    nb = 100000                    # database size
    nq = 10000                     # number of queries
    np.random.seed(1024)           # make reproducible
    # xb holds the database vectors, shape (100000, 64);
    # xq holds the query vectors, shape (10000, 64)
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.
    xq = np.random.random((nq, d)).astype('float32')
    xq[:, 0] += np.arange(nq) / 1000.

    # 2. Create an index. Faiss builds an index that preprocesses the vectors to
    # make queries efficient, and it provides a variety of index types. Here we
    # choose the simplest one, IndexFlatL2, which does a brute-force search on
    # L2 distance. When creating an index, you must specify the dimension d of
    # the vectors. Most indexes also need to be trained; IndexFlatL2 skips this step.
    import faiss

    index = faiss.IndexFlatL2(d)   # build the index
    print(index.is_trained)

    # 3. After the index is created and trained (if necessary), you can call the
    # add and search methods. add stores the database vectors in the index, and
    # search finds the vectors most similar to each query.
    index.add(xb)                  # add vectors to the index
    print(index.ntotal)

    # Pass in the query vectors to find similar vectors
    k = 4                          # we want to see 4 nearest neighbors
    # Actual search: returns the k nearest neighbors of each query vector.
    # D holds the distances to those neighbors and I holds their IDs
    # (their positions in xb).
    D, I = index.search(xq, k)
    print(I[:5])                   # neighbors of the first 5 queries
    print(D[-5:])                  # distances for the last 5 queries
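To make concrete what a flat L2 index computes, here is a minimal NumPy sketch of the same brute-force search. It is only an illustration and not how Faiss is implemented; the function name brute_force_l2 and the small random arrays are assumptions made for this example. It returns squared L2 distances, which is also what IndexFlatL2 reports in D.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Exhaustive k-nearest-neighbor search on squared L2 distance."""
    # Pairwise differences, shape (nq, nb, d); only practical for small inputs
    diff = xq[:, None, :] - xb[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)             # shape (nq, nb)
    I = np.argsort(sq_dist, axis=1)[:, :k]         # IDs of the k closest vectors
    D = np.take_along_axis(sq_dist, I, axis=1)     # their squared distances
    return D, I

rng = np.random.default_rng(0)
xb_small = rng.random((1000, 64)).astype('float32')   # toy database (assumed size)
D_small, I_small = brute_force_l2(xb_small, xb_small[:5], k=4)
print(I_small[:, 0])   # each query is itself a database vector, so it finds itself first
print(D_small[:, 0])   # and the distance to itself is zero
```

Every query is compared with every stored vector, so the cost grows linearly with the database size. That is the motivation for the faster indexes discussed next.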
Speed up search
If there are too many vectors to store, brute-force search with IndexFlatL2 becomes very slow. The IndexIVFFlat (inverted file) index speeds up the search. It uses k-means to build cluster centers; a query first finds the nearest cluster center and is then compared against all vectors in that cluster to get the similar vectors.
When creating an IndexIVFFlat, you need to specify another index as the quantizer, which computes the distance or similarity to the cluster centers. The index must be trained before the add method is called.
Parameter introduction:
faiss.METRIC_L2: Faiss defines two similarity measures, faiss.METRIC_L2 and faiss.METRIC_INNER_PRODUCT. The first is the Euclidean distance, the second is the vector inner product
nlist: number of cluster centers
k: number of most similar vectors to return
index.nprobe: number of clusters searched per query. The default is 1
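The roles of nlist, nprobe and k can be sketched in plain NumPy. This is a toy illustration under simplifying assumptions (a naive k-means, small random data, one query at a time), not Faiss's implementation; the helper names sq_dists, kmeans, build_ivf and ivf_search are made up for this example.

```python
import numpy as np

def sq_dists(a, b):
    """Squared L2 distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def kmeans(x, n_clusters, n_iter=10, seed=0):
    """Naive k-means: random initial centers, then a few assign/update rounds."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(n_iter):
        assign = sq_dists(x, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members):               # keep the old center if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def build_ivf(xb, nlist):
    """Train nlist cluster centers, then file every vector ID under its nearest center."""
    centroids = kmeans(xb, nlist)
    assign = sq_dists(xb, centroids).argmin(axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists

def ivf_search(q, xb, centroids, lists, k, nprobe=1):
    """Compare q only with the vectors stored in its nprobe nearest clusters."""
    probe = np.argsort(sq_dists(q[None, :], centroids)[0])[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])     # candidate vector IDs
    d = sq_dists(q[None, :], xb[cand])[0]
    order = np.argsort(d)[:k]
    return d[order], cand[order]

rng = np.random.default_rng(0)
xb_toy = rng.random((2000, 16))                          # toy database (assumed size)
centroids, lists = build_ivf(xb_toy, nlist=10)
q = xb_toy[0]
D1, I1 = ivf_search(q, xb_toy, centroids, lists, k=4, nprobe=1)        # one cluster only
D_all, I_all = ivf_search(q, xb_toy, centroids, lists, k=4, nprobe=10) # all clusters
print(I1, I_all)
```

With nprobe equal to nlist every cluster is visited, so the result matches a brute-force search. Smaller values of nprobe scan fewer vectors and may miss neighbors that fall in other clusters.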
    nlist = 100                        # number of cluster centers
    k = 4
    quantizer = faiss.IndexFlatL2(d)   # the other index
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)  # explicitly request L2 distance
    assert not index.is_trained
    index.train(xb)
    assert index.is_trained

    index.add(xb)                      # add may be a bit slower as well
    D, I = index.search(xq, k)         # actual search
    print(I[-5:])                      # neighbors of the 5 last queries
    index.nprobe = 10                  # default nprobe is 1, try a few more
    D, I = index.search(xq, k)
    print(I[-5:])                      # neighbors of the 5 last queries
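One way to judge the effect of nprobe is to compare the approximate results with exact ones, for example the I array returned by IndexFlatL2 for the same queries. The helper below is a small sketch of such a comparison; the name recall_at_k and the tiny hand-written arrays are assumptions for illustration and are not part of Faiss.

```python
import numpy as np

def recall_at_k(I_approx, I_exact):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    hits = sum(len(np.intersect1d(a, e)) for a, e in zip(I_approx, I_exact))
    return hits / I_exact.size

# Two queries with k = 4; the first approximate row misses one true neighbor
I_exact_demo = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
I_approx_demo = np.array([[0, 1, 2, 9], [4, 5, 6, 7]])
print(recall_at_k(I_approx_demo, I_exact_demo))   # 7 of 8 neighbors found: 0.875
```

Calling it on the I arrays obtained with different nprobe values shows how much accuracy each setting buys.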
Method to reduce memory: compress vectors with a product quantizer
The IndexFlatL2 and IndexIVFFlat indexes we saw above both store all vectors in memory. To handle large amounts of data, Faiss provides a compression scheme based on a product quantizer that encodes each vector into a specified number of bytes. The stored vectors are then compressed, so the distances returned by a query are also approximate.
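The idea behind the product quantizer can be sketched in NumPy: split each vector into m sub-vectors, learn a small codebook per sub-vector, and store only the index of the nearest codebook entry. This is a toy version under simplifying assumptions (naive k-means, 2000 random training vectors) and not Faiss's implementation; the names train_pq, pq_encode and pq_decode are made up for this example.

```python
import numpy as np

def sq_dists(a, b):
    """Squared L2 distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def train_pq(x, m, nbits, n_iter=5, seed=0):
    """Learn one codebook of 2**nbits centroids for each of the m sub-vectors."""
    assert nbits <= 8, "codes are stored as uint8 in this sketch"
    rng = np.random.default_rng(seed)
    n, d = x.shape
    dsub, ksub = d // m, 2 ** nbits
    codebooks = np.empty((m, ksub, dsub), dtype=x.dtype)
    for j in range(m):
        sub = x[:, j * dsub:(j + 1) * dsub]
        cent = sub[rng.choice(n, ksub, replace=False)]
        for _ in range(n_iter):                       # naive k-means rounds
            assign = sq_dists(sub, cent).argmin(axis=1)
            for c in range(ksub):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(axis=0)
        codebooks[j] = cent
    return codebooks

def pq_encode(x, codebooks):
    """Replace each sub-vector by the ID of its nearest centroid (one byte each)."""
    m, ksub, dsub = codebooks.shape
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        sub = x[:, j * dsub:(j + 1) * dsub]
        codes[:, j] = sq_dists(sub, codebooks[j]).argmin(axis=1)
    return codes

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    m = codebooks.shape[0]
    return np.hstack([codebooks[j][codes[:, j]] for j in range(m)])

rng = np.random.default_rng(0)
x = rng.random((2000, 64)).astype(np.float32)         # toy data (assumed size)
codebooks = train_pq(x, m=8, nbits=8)
codes = pq_encode(x, codebooks)
approx = pq_decode(codes, codebooks)
print(x.nbytes // len(x), "bytes per raw vector,", codes.nbytes // len(codes), "bytes per code")
print("mean squared reconstruction error:", ((x - approx) ** 2).sum(axis=1).mean())
```

With d = 64 float32 values a raw vector takes 256 bytes, while m = 8 one-byte codes take 8 bytes. The decoded vectors only approximate the originals, which is why distances computed on compressed vectors are approximate too.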
    nlist = 100
    m = 8                              # number of sub-vectors per vector
    k = 4
    quantizer = faiss.IndexFlatL2(d)   # this remains the same
    # the final 8 specifies that each sub-vector is encoded as 8 bits
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
    index.train(xb)
    index.add(xb)
    D, I = index.search(xb[:5], k)     # sanity check
    print(I)
    print(D)
    index.nprobe = 10                  # make comparable with experiment above
    D, I = index.search(xq, k)
    print(I[-5:])
Simplified index representation
As IndexIVFFlat and IndexIVFPQ above show, constructing them requires another index to be provided first. Faiss also provides PCA, LSH and other methods, and these are sometimes used in combination, which makes the index more cumbersome to construct. Faiss therefore provides a way to construct an index from a string expression. For example, the following expression creates the same kind of IndexIVFPQ as above:
index = faiss.index_factory(d, "IVF100,PQ8")