In some cases, the index needs to be preprocessed or postprocessed.
ID mapping
By default, Faiss assigns a sequential id (0, 1, 2, ...) to each added vector, but you can also assign your own id to every vector.
Some index types provide an add_with_ids method that associates a 64-bit id with each vector and returns those ids at search time.
# Import faiss
import sys
import faiss
import numpy as np

# Generate data and ids
d = 512
n_data = 2000
data = np.random.rand(n_data, d).astype('float32')
ids = np.arange(100000, 102000)  # the ids are 6-digit integers
print(ids, len(ids))
[100000 100001 100002 ... 101997 101998 101999] 2000
nlist = 10  # partition the dataset into 10 Voronoi cells
quantizer = faiss.IndexFlatL2(d)  # Euclidean distance
# quantizer = faiss.IndexFlatIP(d)  # inner product
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(data)
index.add_with_ids(data, ids)
dis, ind = index.search(data[:5], 5)  # search for the neighbors of the first five vectors
print(ind)  # the returned ids are the ones we specified
[[100000 100563 101646 100741 100421]
 [100001 100727 100786 100269 101902]
 [100002 100800 100362 100835 101783]
 [100003 101986 101340 100803 101233]
 [100004 100902 101084 101562 101006]]
However, some index types do not support add_with_ids. In that case, the index has to be wrapped with another index that maps the default sequential ids to the specified ids; this is what the IndexIDMap class does.
The specified ids must be integers, not strings.
IndexFlatL2 does not support add_with_ids, so the following statement raises an error:
index = faiss.IndexFlatL2(data.shape[1])
index.add_with_ids(data, ids)  # error
add_with_ids not implemented for this type of index
IndexIDMap supports add_with_ids:
index = faiss.IndexFlatL2(data.shape[1])
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(data, ids)  # index2 keeps a table mapping the internal ids of index to the specified ids
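As a quick check (continuing from the cell above), searching through the wrapper returns the ids we assigned rather than the default sequential ones:

dis, ind = index2.search(data[:3], 5)
print(ind)  # the ids fall in the 100000-101999 range defined earlier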
Data transformation
Sometimes vectors need to be transformed before indexing. The transformation classes all inherit from VectorTransform and map an input vector to an output vector. Available transforms include:
- Random rotation, class RandomRotationMatrix, rebalances the components of a vector; it is typically applied before IndexPQ or IndexLSH (see the sketch after this list);
- PCA, class PCAMatrix, for dimensionality reduction;
- Dimension remapping, class RemapDimensionsTransform, which can increase or decrease the vector dimension.
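A random rotation is applied through IndexPreTransform in the same way as the PCA example below. The following is a minimal sketch, assuming 512-dimensional data and a 256-bit IndexLSH (both arbitrary choices):

d = 512
data = np.random.rand(n_data, d).astype('float32')

rrot = faiss.RandomRotationMatrix(d, d)  # d_in, d_out: rotate without changing the dimension
index_lsh = faiss.IndexLSH(d, 256)       # 256-bit binary codes
index = faiss.IndexPreTransform(rrot, index_lsh)

index.train(data)  # also initializes the random rotation
index.add(data)    # the rotation is applied before encoding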
PCA dimensionality reduction (through IndexPreTransform)
The input vectors are 2048-dimensional and must be compressed to 16 bytes (PCA down to 256 dimensions, then PQ with 16 sub-quantizers of 8 bits each).
# Generate data
data = np.random.rand(n_data, 2048).astype('float32')

# the IndexIVFPQ will work in 256D, not 2048D
coarse_quantizer = faiss.IndexFlatL2(256)
sub_index = faiss.IndexIVFPQ(coarse_quantizer, 256, 16, 16, 8)

# PCA 2048 -> 256, with a random rotation after the reduction (fourth argument)
pca_matrix = faiss.PCAMatrix(2048, 256, 0, True)

# the wrapping index
index = faiss.IndexPreTransform(pca_matrix, sub_index)

# training also trains the PCA
index.train(data)  # the training data must be 2048-dimensional
# the PCA is applied before addition
index.add(data)
Increasing the dimension
Sometimes it is useful to increase the vector dimension d, typically so that:
- d is a multiple of 4, which is favorable for the distance computations;
- d is a multiple of M (the number of PQ sub-quantizers).
d = 512
M = 8  # M is the number of sub-spaces the dimensions are split into
d2 = int((d + M - 1) / M) * M  # round d up to a multiple of M
print(d2)

remapper = faiss.RemapDimensionsTransform(d, d2, True)
index_pq = faiss.IndexPQ(d2, M, 8)
index = faiss.IndexPreTransform(remapper, index_pq)
# data can be trained and added afterwards
512
Reorder search results
When querying vectors, the results can be re-ranked using the exact distances.
In the following example, the search first retrieves 4 * 10 candidates, computes their exact distances, and then returns the 10 best. Note that IndexRefineFlat stores the full vectors, so its memory overhead should not be underestimated.
data = np.random.rand(n_data, d).astype('float32')
nbits_per_index = 4
q = faiss.IndexPQ(d, M, nbits_per_index)
rq = faiss.IndexRefineFlat(q)
rq.train(data)
rq.add(data)
rq.k_factor = 4  # retrieve k_factor * k candidates before re-ranking
dis, ind = rq.search(data[:5], 10)
print(ind)
[[   0 1747 1124  120 1625  129  345 1848 1833 1431]
 [   1  614  522 1578 1662 1813  737 1479  181  919]
 [   2 1182 1372 1901  871  523 1807   74  685  335]
 [   3 1130 1127 1426  181 1479 1064 1525 1113  931]
 [   4  696  944  217 1359 1987 1518 1880  755  490]]
Combining the results from multiple indexes
When the dataset is spread across several indexes, a query has to be run against each of them, and IndexShards can be used to combine the results. The same applies when the indexes live on different GPUs.
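A minimal sketch of this pattern with two CPU shards (the shard count and the even split of the data are arbitrary choices for illustration):

d = 512
n_data = 2000
data = np.random.rand(n_data, d).astype('float32')

# one index per shard, each holding half of the data
shard1 = faiss.IndexFlatL2(d)
shard1.add(data[:n_data // 2])
shard2 = faiss.IndexFlatL2(d)
shard2.add(data[n_data // 2:])

# IndexShards sends every query to all shards and merges their results
index = faiss.IndexShards(d)
index.add_shard(shard1)
index.add_shard(shard2)

dis, ind = index.search(data[:5], 5)
print(ind)  # with successive_ids (the default), each shard's ids are offset so they do not collide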