In some cases, the index needs to be preprocessed or postprocessed.
ID mapping
By default, Faiss assigns a sequential id (0, 1, 2, ...) to each added vector, but you can also assign your own id to every vector.
Some index types provide an add_with_ids method that associates a 64-bit id with each vector and returns those ids at search time.
# Import faiss
import sys
import faiss
import numpy as np

# Generate data and ids
d = 512
n_data = 2000
data = np.random.rand(n_data, d).astype('float32')
ids = np.arange(100000, 102000)  # the ids are 6-digit integers
print(ids, len(ids))
[100000 100001 100002 ... 101997 101998 101999] 2000
nlist = 10  # partition the dataset into 10 Voronoi cells
quantizer = faiss.IndexFlatL2(d)  # Euclidean distance
# quantizer = faiss.IndexFlatIP(d)  # inner product
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(data)
index.add_with_ids(data, ids)
dis, ind = index.search(data[:5], 5)  # search for the neighbors of the first five vectors
print(ind)  # the returned ids are the ones we specified
[[100000 100563 101646 100741 100421]
 [100001 100727 100786 100269 101902]
 [100002 100800 100362 100835 101783]
 [100003 101986 101340 100803 101233]
 [100004 100902 101084 101562 101006]]
However, some index types do not support add_with_ids. In that case, the index has to be wrapped with another index that maps the default sequential ids to the specified ids; this is what the IndexIDMap class does.
The specified ids must be integers, not strings.
IndexFlatL2 does not support add_with_ids, so the following statement raises an error:
index = faiss.IndexFlatL2(data.shape[1])
index.add_with_ids(data, ids)  # error
add_with_ids not implemented for this type of index
IndexIDMap supports add_with_ids:
index = faiss.IndexFlatL2(data.shape[1])
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(data, ids)  # index2 keeps a table mapping the internal ids of index to the specified ids
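As a quick check (continuing from the cell above), searching through the wrapper returns the ids we assigned rather than the default sequential ones:

dis, ind = index2.search(data[:3], 5)
print(ind)  # the ids fall in the 100000-101999 range defined earlier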
Data transformation
Sometimes vectors need to be transformed before indexing. The transformation classes all inherit from VectorTransform and map an input vector to an output vector. Available transforms include:
- Random rotation, class RandomRotationMatrix, rebalances the components of a vector; it is typically applied before IndexPQ or IndexLSH (see the sketch after this list);
- PCA, class PCAMatrix, for dimensionality reduction;
- Dimension remapping, class RemapDimensionsTransform, which can increase or decrease the vector dimension.
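A random rotation is applied through IndexPreTransform in the same way as the PCA example below. The following is a minimal sketch, assuming 512-dimensional data and a 256-bit IndexLSH (both arbitrary choices):

d = 512
data = np.random.rand(n_data, d).astype('float32')

rrot = faiss.RandomRotationMatrix(d, d)  # d_in, d_out: rotate without changing the dimension
index_lsh = faiss.IndexLSH(d, 256)       # 256-bit binary codes
index = faiss.IndexPreTransform(rrot, index_lsh)

index.train(data)  # also initializes the random rotation
index.add(data)    # the rotation is applied before encoding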
PCA dimensionality reduction (through IndexPreTransform)
The input vectors are 2048-dimensional and must be compressed to 16 bytes (PCA down to 256 dimensions, then PQ with 16 sub-quantizers of 8 bits each).
# Generate data
data = np.random.rand(n_data, 2048).astype('float32')

# the IndexIVFPQ will work in 256D, not 2048D
coarse_quantizer = faiss.IndexFlatL2(256)
sub_index = faiss.IndexIVFPQ(coarse_quantizer, 256, 16, 16, 8)

# PCA 2048 -> 256, with a random rotation after the reduction (fourth argument)
pca_matrix = faiss.PCAMatrix(2048, 256, 0, True)

# the wrapping index
index = faiss.IndexPreTransform(pca_matrix, sub_index)

# training also trains the PCA
index.train(data)  # the training data must be 2048-dimensional
# the PCA is applied before addition
index.add(data)
Increasing the dimension
Sometimes it is useful to increase the vector dimension d, typically so that:
- d is a multiple of 4, which is favorable for the distance computations;
- d is a multiple of M (the number of PQ sub-quantizers).
d = 512
M = 8  # M is the number of sub-spaces the dimensions are split into
d2 = int((d + M - 1) / M) * M  # round d up to a multiple of M
print(d2)

remapper = faiss.RemapDimensionsTransform(d, d2, True)
index_pq = faiss.IndexPQ(d2, M, 8)
index = faiss.IndexPreTransform(remapper, index_pq)
# data can be trained and added afterwards
512
Reorder search results
When querying vectors, the results can be re-ranked using the exact distances.
In the following example, the search first retrieves 4 * 10 candidates, computes their exact distances, and then returns the 10 best. Note that IndexRefineFlat stores the full vectors, so its memory overhead should not be underestimated.
data = np.random.rand(n_data, d).astype('float32')
nbits_per_index = 4
q = faiss.IndexPQ(d, M, nbits_per_index)
rq = faiss.IndexRefineFlat(q)
rq.train(data)
rq.add(data)
rq.k_factor = 4  # retrieve k_factor * k candidates before re-ranking
dis, ind = rq.search(data[:5], 10)
print(ind)
[[   0 1747 1124  120 1625  129  345 1848 1833 1431]
 [   1  614  522 1578 1662 1813  737 1479  181  919]
 [   2 1182 1372 1901  871  523 1807   74  685  335]
 [   3 1130 1127 1426  181 1479 1064 1525 1113  931]
 [   4  696  944  217 1359 1987 1518 1880  755  490]]
Combining the results from multiple indexes
When the dataset is spread across several indexes, a query has to be run against each of them, and IndexShards can be used to combine the results. The same applies when the indexes live on different GPUs.
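A minimal sketch of this pattern with two CPU shards (the shard count and the even split of the data are arbitrary choices for illustration):

d = 512
n_data = 2000
data = np.random.rand(n_data, d).astype('float32')

# one index per shard, each holding half of the data
shard1 = faiss.IndexFlatL2(d)
shard1.add(data[:n_data // 2])
shard2 = faiss.IndexFlatL2(d)
shard2.add(data[n_data // 2:])

# IndexShards sends every query to all shards and merges their results
index = faiss.IndexShards(d)
index.add_shard(shard1)
index.add_shard(shard2)

dis, ind = index.search(data[:5], 5)
print(ind)  # with successive_ids (the default), each shard's ids are offset so they do not collide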