Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing memory efficiency and cost-effectiveness

The rise of large language models (LLMs) and generative AI has ushered in a new era of natural language processing capabilities. Vector databases have emerged as a crucial component in this landscape, acting as external databases that can efficiently index, store, and retrieve embeddings generated by LLMs. However, as the scale and complexity of LLMs continue to grow, vector database workloads have also increased significantly. Ingesting and querying billions of vectors can strain computational resources, leading to higher memory requirements and increased operational costs. Faiss scalar quantization enables you to store vector embeddings with lower precision, which reduces memory consumption and, consequently, lowers costs.

Why use Faiss scalar quantization?

When you index vectors in OpenSearch 2.13 or later, you can configure your k-NN index to apply scalar quantization. Scalar quantization converts each dimension of a vector from a 32-bit floating-point (fp32) representation to a 16-bit floating-point (fp16) representation. Using the Faiss scalar quantizer (SQfp16), which is integrated into the k-NN plugin, saves about 50% of the memory with minimal reduction in recall (see Benchmarking results). When used with SIMD optimization, SQfp16 quantization can also significantly reduce search latencies and improve indexing throughput.

How to use Faiss scalar quantization

To use Faiss scalar quantization, set the k-NN vector field’s method.parameters.encoder.name to sq when creating a k-NN index:

PUT /test-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "encoder": {
              "name": "sq",
              "parameters": {
                "type": "fp16",
                "clip": true
              }
            },
            "ef_construction": 256,
            "m": 8
          }
        }
      }
    }
  }
}

For more information about the SQ parameters, see the k-NN documentation.

The fp16 encoder converts 32-bit vectors into their 16-bit counterparts. For this encoder type, the vector values must be in the range [-65504.0, 65504.0].

The clip parameter above specifies how to handle out-of-range values:

By default, clip is false, and any vectors containing out-of-range values are rejected.
When clip is set to true, out-of-range vector values are rounded up or down so that they are in the supported range. For example, if the original 32-bit vector is [65510.82, -65504.1], the vector will be indexed as the 16-bit vector [65504.0, -65504.0].

Note: We recommend setting clip to true only if very few elements lie outside of the supported range. Rounding the values may cause a drop in recall.
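
The two clip behaviors described above can be sketched on the client side in Python, for example, to pre-check vectors before ingestion. This is an illustrative sketch only, not the k-NN plugin's implementation: the function names validate_fp16_range, clip_to_fp16_range, and to_fp16 are hypothetical, and the round trip through the half-precision format (e) of Python's struct module is used here only to approximate the precision loss of fp16 storage.

```python
import struct

FP16_MAX = 65504.0  # largest finite fp16 value; supported range is [-FP16_MAX, FP16_MAX]


def validate_fp16_range(vector):
    """Mimic clip=false: reject a vector that contains any out-of-range value."""
    for i, value in enumerate(vector):
        if not -FP16_MAX <= value <= FP16_MAX:
            raise ValueError(f"dimension {i} value {value} is outside the fp16 range")
    return vector


def clip_to_fp16_range(vector):
    """Mimic clip=true: move out-of-range values to the nearest supported bound."""
    return [max(-FP16_MAX, min(FP16_MAX, value)) for value in vector]


def to_fp16(vector):
    """Round each in-range value to the nearest fp16 value (returned as a Python float)."""
    return [struct.unpack("<e", struct.pack("<e", value))[0] for value in vector]


if __name__ == "__main__":
    clipped = clip_to_fp16_range([65510.82, -65504.1])
    print(clipped)            # values moved to the range bounds
    print(to_fp16([55.82]))   # shows the precision lost when storing 16 bits
```

Running to_fp16 on a few in-range values from your own dataset gives a rough sense of how much precision each dimension loses before you commit to the encoder.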

During ingestion, make sure each dimension of the vector is within the supported range ([-65504.0, 65504.0]):

PUT test-index/_doc/1
{
  "my_vector1": [-65504.0, 65503.845, 55.82, -65300.456, 34.67, -1278.23, 90.62, 8.36]
}

During querying, there is no range limitation for the query vector:

GET test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [265436.876, -120906.256, 99.84, 89.45, 100000.45, 9.23, -70.17, 6.93],
        "k": 2
      }
    }
  }
}

HNSW memory estimation with fp16

The memory required for HNSW is estimated to be 1.1 * (2 * dimension + 8 * M) bytes/vector.

As an example, assume that you have 1 million vectors with a dimension of 256 and M of 16. The memory requirement can be estimated as follows:

1.1 * (2 * 256 + 8 * 16) * 1,000,000 ~= 0.656 GB
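
The estimate above can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated in this post; the function names are hypothetical, the bytes_per_dimension and num_replicas parameters follow the fp32 and replica variants of the formula given in the Appendix, and the GB figure is obtained by dividing bytes by 1024^3, which is the conversion that yields 0.656.

```python
def hnsw_memory_bytes(num_vectors, dimension, m, bytes_per_dimension=2, num_replicas=0):
    """Estimate HNSW memory as 1.1 * (bytes_per_dimension * dimension + 8 * m) bytes/vector."""
    per_vector = 1.1 * (bytes_per_dimension * dimension + 8 * m)
    return per_vector * num_vectors * (1 + num_replicas)


def to_gb(num_bytes):
    """Convert bytes to GB using 1 GB = 1024^3 bytes, matching the figures in this post."""
    return num_bytes / 1024**3


if __name__ == "__main__":
    fp16 = hnsw_memory_bytes(1_000_000, 256, 16)                         # SQfp16: 2 bytes/dimension
    fp32 = hnsw_memory_bytes(1_000_000, 256, 16, bytes_per_dimension=4)  # fp32: 4 bytes/dimension
    print(f"fp16: {to_gb(fp16):.3f} GB, fp32: {to_gb(fp32):.3f} GB")
```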

For more information about memory estimation for scalar quantization with the inverted file (IVF) algorithm, refer to this documentation.

Benchmarking results

We ran benchmarking tests on some popular datasets using our opensearch-benchmark tool to compare the indexing performance, search performance, and search result quality of Faiss scalar quantization (FP16) against those of Faiss with float vectors without any encoding (FP32). All tests were performed with SIMD (Single Instruction, Multiple Data) enabled on x86 architecture with AVX2 optimization.

Note: Without SIMD optimization (AVX2 or NEON) or with AVX2 disabled (on x86 architecture), the quantization process introduces additional overhead, which leads to an increase in latency. For information about processors that support AVX2, see CPUs with AVX2. In an AWS environment, all community Amazon Machine Images (AMIs) with HVM support AVX2 optimization for the x86 architecture.
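
On a Linux host, one way to check for these instruction sets is to look at the CPU flags reported in /proc/cpuinfo. The following is a minimal sketch under that assumption. The helper name simd_support is hypothetical, and it only inspects CPU flags (avx2 on x86; asimd or neon on Arm); it does not tell you whether the k-NN plugin actually loaded a SIMD-optimized library.

```python
import os


def simd_support(cpuinfo_text):
    """Return which SIMD flags relevant to SQfp16 appear in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 lists CPU flags under "flags"; Arm lists them under "Features".
        if key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return {
        "avx2": "avx2" in flags,
        "neon": "asimd" in flags or "neon" in flags,
    }


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # available on Linux only
    if os.path.exists(path):
        with open(path) as f:
            print(simd_support(f.read()))
```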

Benchmarking results using small workloads

We ran the following tests on a single-node cluster without any replicas.

Configuration

m	ef_construction	ef_search	replica
16	100	100	0

The dataset and other configuration details are listed in the following table.

Dataset ID	Dataset	Vector dimension	Data size	Number of queries	Training data range	Query data range	Space type	Primary shards	Indexing clients
Dataset 1	gist-960-euclidean	960	1,000,000	1,000	[ 0.0, 1.48 ]	[ 0.0, 0.729 ]	L2	8	16
Dataset 2	mnist-784-euclidean	784	60,000	10,000	[ 0.0, 255.0 ]	[ 0.0, 255.0 ]	L2	1	2
Dataset 3	cohere-wiki-simple-embeddings-768	768	475,858	10,000	[ -4.1561704, 5.5478516 ]	[ -4.065383, 5.4902344 ]	L2	4	8
Dataset 4	cohere-ip-1m	768	1,000,000	10,000	[ -4.1073565, 5.504557 ]	[ -4.109505, 5.4809895 ]	innerproduct	8	16
Dataset 5	sift-128-euclidean	128	1,000,000	10,000	[ 0.0, 218.0 ]	[ 0.0, 184.0 ]	L2	8	16

Recall and memory results

Dataset ID	Faiss hnsw recall@100	Faiss hnsw-sqfp16 recall@100	Faiss hnsw memory estimate (GB)	Faiss hnsw-sqfp16 memory estimate (GB)	Faiss hnsw memory usage (GB)	Faiss hnsw-sqfp16 memory usage (GB)	% reduction in memory
Dataset 1	0.91	0.91	4.07	2.10	3.72	1.93	48
Dataset 2	0.99	0.99	0.20	0.10	0.18	0.10	44
Dataset 3	0.95	0.95	1.56	0.81	1.43	0.75	48
Dataset 4	0.94	0.94	3.28	1.70	3.00	1.57	48
Dataset 5	0.99	0.99	0.66	0.39	0.62	0.38	39

Indexing and query results

Dataset ID	Faiss hnsw mean throughput (docs/sec)	Faiss hnsw-sqfp16 mean throughput (docs/sec)	Faiss hnsw p90 (ms)	Faiss hnsw-sqfp16 p90 (ms)	Faiss hnsw p99 (ms)	Faiss hnsw-sqfp16 p99 (ms)
Dataset 1	4681	4696	4.97	5.08	5.54	5.50
Dataset 2	4271	4580	2.01	2.06	2.16	2.21
Dataset 3	4690	4698	3.35	3.33	3.58	3.57
Dataset 4	6044	6129	4.61	4.81	5.16	5.37
Dataset 5	115499	102060	2.73	2.68	2.96	2.89

Analysis

When comparing the benchmarking results, note that:

The recall obtained using Faiss HNSW SQfp16 matches that of Faiss HNSW (with a negligible difference).
Using SQfp16, there is a significant reduction in memory usage of up to 48%, with a slight reduction in disk usage. These results indicate that a larger vector dimension leads to greater memory reduction.
When using SQfp16, the performance metrics are similar to those of fp32 vectors.
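
The link between vector dimension and memory reduction follows from the memory estimation formulas: SQfp16 halves only the per-dimension storage (2 bytes instead of 4), while the 8 * M bytes of graph overhead per vector stay the same. The following sketch assumes the estimation formulas used elsewhere in this post and m = 16 (the value used in these tests); the helper name estimated_reduction_pct is hypothetical.

```python
def estimated_reduction_pct(dimension, m=16):
    """Estimated % memory reduction of SQfp16 (2 bytes/dim) relative to fp32 (4 bytes/dim).

    The 1.1 overhead factor and the number of vectors cancel out of the ratio.
    """
    fp16_bytes = 2 * dimension + 8 * m
    fp32_bytes = 4 * dimension + 8 * m
    return 100 * (1 - fp16_bytes / fp32_bytes)


if __name__ == "__main__":
    benchmark_dimensions = [
        ("sift-128-euclidean", 128),
        ("cohere (768 dimensions)", 768),
        ("mnist-784-euclidean", 784),
        ("gist-960-euclidean", 960),
    ]
    for name, dimension in benchmark_dimensions:
        print(f"{name}: {estimated_reduction_pct(dimension):.1f}%")
```

For 128 dimensions, the estimate is 40%, and for 768 to 960 dimensions it is about 48%, which is broadly in line with the measured reductions in the preceding table. The reduction approaches 50% only as the dimension grows large relative to M.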

Benchmarking results using large workloads

To compare performance metrics and memory savings, we ran tests on the large-scale Laion 100M dataset with 768 dimensions, using both Faiss HNSW SQfp16 and Faiss HNSW.

Configuration

	Faiss HNSW SQfp16	Faiss HNSW
OpenSearch version	2.13	2.13
Engine	faiss	faiss
Vector dimension	768	768
Ingest vectors	100M	100M
Test vectors	1k	1k
Primary shards	36	36
Replica shards	0	0
Data nodes	4	8
Data node instance type	r5.4xlarge	r5.4xlarge
Cluster manager nodes	3	3
Cluster manager node instance type	c5.xlarge	c5.xlarge
Indexing clients	9	9
Query clients	1	1
Force merge segments	1	1
Client instance	r5.16xlarge	r5.16xlarge

Faiss HNSW SQfp16 requires 4 data nodes, half the number needed for Faiss HNSW (8). This is consistent with SQfp16 reducing memory requirements by approximately 50%. For more information about estimating the required memory and number of data nodes, see the Appendix.

Config ID	Optimization strategy	m	ef_construction	ef_search
hnsw1	Default configuration	16	100	100
hnsw2	Balance between latency, memory, and recall	16	128	128
hnsw3	Optimize for recall	16	256	256

Recall and memory results

Config ID	hnsw-recall@1000	hnsw-sqfp16-recall@1000	hnsw memory usage (GB)	hnsw-sqfp16 memory usage (GB)	% reduction in memory
hnsw1	0.94	0.94	300.28	157.23	47.64
hnsw2	0.96	0.96	300.28	157.23	47.64
hnsw3	0.98	0.98	300.28	157.23	47.64

Indexing and query results

Config ID	hnsw mean throughput (docs/sec)	hnsw-sqfp16 mean throughput (docs/sec)	hnsw p90 (ms)	hnsw-sqfp16 p90 (ms)	hnsw p99 (ms)	hnsw-sqfp16 p99 (ms)
hnsw1	7544	7657	14.02	16.99	19.18	20.83
hnsw2	7063	7219	14.21	17.44	18.86	21.80
hnsw3	6004	5848	16.14	20.85	17.65	24.73

Analysis

For k=1000, the recall is identical for both Faiss HNSW and Faiss HNSW with SQfp16.
Faiss HNSW with SQfp16 requires approximately half the memory of Faiss HNSW (as measured by the required number of data nodes). Based on the k-NN stats API metrics, using SQfp16 reduced memory usage by 47.64%.
In most instances, SQfp16 demonstrated better indexing throughput than fp32 vectors.

Conclusion

Faiss SQfp16 scalar quantization is a powerful technique that provides significant memory savings while maintaining high recall performance similar to full-precision vectors. Converting vectors to a 16-bit floating-point representation can reduce memory requirements by up to 50%. When combined with SIMD optimization, SQfp16 scalar quantization also enhances indexing throughput and reduces search latency, leading to better overall performance. This method strikes an excellent balance between memory efficiency and accuracy, making it a valuable tool for large-scale similarity search applications.

Future scope

To achieve even greater memory efficiency, we plan to introduce int8 quantization support using a Faiss scalar quantizer and a Lucene scalar quantizer. This technique will enable a 75% reduction in memory requirements, or 4x compression, compared to full-precision vectors, and we expect a minimal reduction in recall. The quantizers will accept fp32 vectors as input, perform online training, and quantize the data into byte-sized vectors, eliminating the need for external quantization or extra training steps.

Furthermore, we aim to release binary vector support, enabling an unprecedented 32x compression rate. This approach will further reduce memory consumption. Moreover, we plan to incorporate AVX-512 optimization, which will contribute to further reducing search latency.

Our ongoing analysis and tuning of OpenSearch lets you address large-scale similarity search while minimizing resource requirements and maximizing cost-effectiveness.

Appendix: Memory and data node requirement estimation

Here are estimates of the amount of memory and the number of data nodes needed for the large workload benchmarking test (100M vectors with 768 dimensions):

// Faiss HNSW SQfp16 memory estimation
1.1 * (2 * dimension + 8 * M) * num_of_vectors * (1 + num_of_replicas) bytes

Let M = 16 and num_of_replicas = 0

1.1 * (2 * 768 + 8 * 16) * 100,000,000 * (1 + 0) bytes ~= 170.47 GB, rounded up to 171 GB

An r5.4xlarge instance has 128 GB of memory, of which 32 GB is used for the JVM.
Assume that the circuit breaker limit is 0.5.

Total available memory = (data node instance memory - JVM memory) * circuit breaker limit
Total available memory = (128 - 32) * 0.5 = 48 GB

Number of data nodes = 171 / 48 ~= 3.56, rounded up to 4

// Faiss HNSW memory estimation
1.1 * (4 * dimension + 8 * M) * num_of_vectors * (1 + num_of_replicas) bytes

Let M = 16 and num_of_replicas = 0

1.1 * (4 * 768 + 8 * 16) * 100,000,000 * (1 + 0) bytes ~= 327.83 GB, rounded up to 328 GB

An r5.4xlarge instance has 128 GB of memory, of which 32 GB is used for the JVM.
Assume that the circuit breaker limit is 0.5.

Total available memory = (data node instance memory - JVM memory) * circuit breaker limit
Total available memory = (128 - 32) * 0.5 = 48 GB

Number of data nodes = 328 / 48 ~= 6.83, rounded up to 7, plus 1 (for stability) = 8
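
The node-count arithmetic in this appendix can be written as a short Python sketch. The function names are hypothetical, the instance figures are the r5.4xlarge values stated above, and the extra_nodes parameter models the one additional node that this post adds for stability in the fp32 case.

```python
import math


def available_memory_gb(instance_memory_gb, jvm_memory_gb, circuit_breaker_limit):
    """Memory on one data node that is available after the JVM and circuit breaker limit."""
    return (instance_memory_gb - jvm_memory_gb) * circuit_breaker_limit


def data_nodes_needed(required_memory_gb, per_node_gb, extra_nodes=0):
    """Round up to whole nodes, then add any spare nodes kept for stability."""
    return math.ceil(required_memory_gb / per_node_gb) + extra_nodes


if __name__ == "__main__":
    per_node = available_memory_gb(128, 32, 0.5)  # r5.4xlarge values from this appendix
    print("SQfp16 data nodes:", data_nodes_needed(171, per_node))
    print("fp32 data nodes:", data_nodes_needed(328, per_node, extra_nodes=1))
```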

References

Benchmarking datasets
Cohere/wikipedia-22-12-simple-embeddings
Laion
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2210.08402
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., & Jégou, H. (2024). The Faiss library. arXiv. https://arxiv.org/abs/2401.08281

Authors

Naveen Tatikonda

Naveen Tatikonda is a software engineer at AWS working on the OpenSearch Project and Amazon OpenSearch Service. His interests include distributed systems and vector search. He is an active contributor to various plugins, such as k-NN and GeoSpatial.
Vamshi Vijay Nakkirtha

Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various plugins, like k-NN, GeoSpatial, and dashboard-maps.
Tal Wagner

Tal Wagner is a senior applied scientist at AWS working on the OpenSearch Project and Amazon OpenSearch Service. He obtained his PhD in Computer Science at CSAIL, MIT in 2020. His interests include designing algorithms for massive datasets and large-scale machine learning.