In OpenSearch 3.1, we introduced memory-optimized search to enable vector search in memory-constrained environments. However, 16-bit floating point (FP16) vector processing remained a performance bottleneck. Over the next two releases, we progressively optimized FP16 distance calculations—first with SIMD in OpenSearch 3.4, then with bulk SIMD in OpenSearch 3.5, achieving up to 310% throughput improvement and dramatically reducing latency. This blog post presents the details of our optimization journey and the techniques that made these performance gains possible.
Optimizing FP16 performance
We improved FP16 performance through a series of optimizations: introducing memory-optimized search in OpenSearch 3.1, implementing SIMD distance calculations in 3.4, and adding bulk SIMD processing in 3.5.
OpenSearch 3.1: Memory-optimized search
In OpenSearch 3.1, we introduced memory-optimized search, enabling the use of Faiss indexes in environments with tight memory constraints, where available memory is smaller than the index size. This was achieved by combining Lucene’s search algorithm with a Faiss index. Thanks to Lucene’s early termination optimization, almost all vector types—except FP16—showed improved search QPS in multi-segment scenarios when the index was fully loaded in memory.
FP16 presented a bigger challenge. Because the JVM lacks native FP16 support, FP16 vectors had to be converted to FP32 before distance calculations could be performed. This conversion was done in Java software, meaning that even if the CPU could handle FP16-to-FP32 conversion in hardware, the JVM relied on a software-based conversion instead.
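To see why this conversion is expensive, here is a minimal sketch of what a software FP16-to-FP32 conversion involves. This is an illustrative standalone function, not OpenSearch's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch of a software FP16 -> FP32 conversion: the kind of bit
// manipulation that must run per element when no hardware conversion
// instruction is used.
// FP16 layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
float fp16ToFp32(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1F) {                      // infinity or NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {                  // normal number: rebias exponent
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    } else if (mant == 0) {                 // signed zero
        bits = sign;
    } else {                                // subnormal: renormalize mantissa
        int shift = 0;
        while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
        mant &= 0x3FFu;
        bits = sign | ((113 - shift) << 23) | (mant << 13);
    }

    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Running this branching logic for every dimension of every candidate vector is what made the Java code path so costly.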
This became a major performance bottleneck: searches using FP16 were nearly twice as slow compared to the default implementation.
OpenSearch 3.4: SIMD FP16 distance calculation
In OpenSearch 3.4, we addressed the FP16 performance limitation by intercepting the distance calculation and delegating it to C++ SIMD code. From an implementation perspective, we leveraged the optimized SIMD routines already provided by the Faiss library, which kept the change simple.
Faiss uses SIMD registers to convert multiple FP16 values to FP32 and then performs operations on them simultaneously. This approach applies SIMD between a query and a single vector, significantly speeding up distance computations compared to the software-based calculations used in OpenSearch 3.1.
The following diagram illustrates the inner product computation using SIMD. It processes four dimensions of a vector simultaneously using a loop-unrolling technique, which optimizes and accelerates the computations.
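The same four-way pattern can be modeled in plain scalar C++. In this sketch, each of the four independent partial sums stands in for one lane of a SIMD register, and both inputs are assumed to be already converted to FP32:

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of the unrolled inner product described above. The four
// independent accumulators stand in for SIMD lanes; a real implementation
// would replace each group of four multiply-adds with a single SIMD FMA
// instruction operating on a whole register at once.
float innerProductUnrolled(const float* q, const float* v, size_t dim) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= dim; i += 4) {   // process 4 dimensions per iteration
        s0 += q[i]     * v[i];
        s1 += q[i + 1] * v[i + 1];
        s2 += q[i + 2] * v[i + 2];
        s3 += q[i + 3] * v[i + 3];
    }
    for (; i < dim; ++i) s0 += q[i] * v[i];  // leftover tail
    return s0 + s1 + s2 + s3;                // horizontal reduction
}
```

Keeping the accumulators independent also breaks the dependency chain between iterations, which helps the CPU pipeline the multiply-adds.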
OpenSearch 3.5: Bulk SIMD FP16 distance calculation
While the Faiss SIMD approach in OpenSearch 3.4 was already efficient, it only applied SIMD between a query and a single vector at a time. This meant that the same portion of the query vector had to be reloaded into the register for every vector comparison. We improved this by reusing loaded query values across multiple vectors whenever possible. For example, consider a 768-dimensional vector: when the first eight FP32 values are loaded into a SIMD register, they can be applied to multiple vectors simultaneously, rather than reloading them for each vector comparison. This approach is faster because performing operations in bulk between registers is much quicker than repeatedly loading values and processing them individually.
In OpenSearch 3.5, we introduced Bulk SIMD FP16 distance calculation. The key insight was that if the candidate vectors to evaluate are already known, distance calculations can be performed in bulk rather than comparing the query with each vector individually.
This is the core idea behind Bulk SIMD: we load the corresponding float values from multiple vectors into registers and compute distances, accumulating results all at once. By leveraging multiple registers simultaneously, many operations can be performed in parallel, resulting in significantly faster performance.
To illustrate how this works in practice, let’s examine the inner product calculation.
Inner product example
The following diagram illustrates how bulk SIMD calculates the inner product in parallel across multiple vectors.
Bulk SIMD processes multiple vector elements simultaneously rather than one by one. For example, the CPU can load four elements from a query vector and four elements from a data vector into a SIMD register, then compute their distance in parallel. On wider SIMD architectures (e.g., AVX2 or AVX-512), even more elements can be processed per instruction.
Because the computation occurs entirely in registers and the data is accessed sequentially:
- The L1 cache hit rate is high
- The CPU’s hardware prefetcher can automatically load upcoming elements
- Memory latency is effectively hidden
Thus, bulk SIMD improves throughput by combining parallel computation with efficient, cache-friendly memory access.
The following pseudocode presents the bulk SIMD approach:
```
// We know the query and 4 candidate vectors.
uint8_t* Query_Vector <- Prepare Query Vector
uint8_t* Vector1 <- Get Vector1's Pointer
uint8_t* Vector2 <- Get Vector2's Pointer
uint8_t* Vector3 <- Get Vector3's Pointer
uint8_t* Vector4 <- Get Vector4's Pointer

// Registers for accumulation.
// FMA stands for fused multiply-add: FMA(a, b, c) = a * b + c.
FP32_Register fmaSum1
FP32_Register fmaSum2
FP32_Register fmaSum3
FP32_Register fmaSum4

// For all values, perform the bulk SIMD inner product.
// In this example, the dimension is assumed to be a multiple of 8 for simplicity.
for (int i = 0; i < Dimension; i += 8) {
    // Load 8 FP32 query values into a register.
    FP32_Register queryFloats <- Query_Vector[i:i+8]

    // Load 8 FP16 values from each candidate vector into registers.
    FP16_Register vec1Float16s <- Vector1[i:i+8]
    FP16_Register vec2Float16s <- Vector2[i:i+8]
    FP16_Register vec3Float16s <- Vector3[i:i+8]
    FP16_Register vec4Float16s <- Vector4[i:i+8]

    // Convert the FP16 values to FP32.
    FP32_Register vec1Float32s <- ConvertToFP32(vec1Float16s)
    FP32_Register vec2Float32s <- ConvertToFP32(vec2Float16s)
    FP32_Register vec3Float32s <- ConvertToFP32(vec3Float16s)
    FP32_Register vec4Float32s <- ConvertToFP32(vec4Float16s)

    // Inner product: SIMD FMA, accumulate = accumulate + q[i] * v[i].
    fmaSum1 = SIMD_FMA(fmaSum1, queryFloats, vec1Float32s)
    fmaSum2 = SIMD_FMA(fmaSum2, queryFloats, vec2Float32s)
    fmaSum3 = SIMD_FMA(fmaSum3, queryFloats, vec3Float32s)
    fmaSum4 = SIMD_FMA(fmaSum4, queryFloats, vec4Float32s)
}

// Reduce each accumulator register to a final score.
SCORE[0] = SUM(fmaSum1)
SCORE[1] = SUM(fmaSum2)
SCORE[2] = SUM(fmaSum3)
SCORE[3] = SUM(fmaSum4)
```
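A scalar C++ translation of this idea makes the query-reuse pattern concrete. This is an illustrative sketch, not the actual implementation (which uses SIMD registers and lives in the k-NN repo); the minimal FP16 decoder below handles only normal and zero values, with subnormals flushed to zero:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Minimal FP16 decoder for illustration only: handles normal numbers and
// zeros; subnormals are flushed to zero, and infinities/NaNs are ignored.
static float fp16BitsToFp32(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = (exp == 0)
        ? sign                                               // zero
        : sign | ((exp - 15 + 127) << 23) | (mant << 13);    // normal
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Scalar model of the bulk approach: each query value is loaded once
// and reused across four candidate vectors before moving on, instead of
// being reloaded for every vector comparison.
void bulkInnerProduct(const float* query, const uint16_t* const vecs[4],
                      size_t dim, float scores[4]) {
    float sum[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < dim; ++i) {
        float q = query[i];           // loaded once...
        for (int v = 0; v < 4; ++v)   // ...reused for all four vectors
            sum[v] += q * fp16BitsToFp32(vecs[v][i]);
    }
    for (int v = 0; v < 4; ++v) scores[v] = sum[v];
}
```

In the real SIMD version, the inner four-way loop becomes four FMA instructions on registers, so the amortized cost per vector drops substantially.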
For information about the ARM Neon implementation, see the k-NN repo.
Performance benchmarks
The following sections present performance benchmark results.
Benchmark environment
- Dataset: Cohere-10M, 768 dimensions
- Nodes: r7i.4xlarge, r7g.4xlarge
- Shards: 3, with 1 replica
- Index type: FP16
- Number of segments: 80
Benchmark results
The following graph presents the benchmarking results.
The following table provides detailed throughput and latency metrics for each version and CPU architecture.
| Version | CPU architecture | Max throughput (req/s) | Average latency (ms) | p90 latency (ms) | p99 latency (ms) |
|---|---|---|---|---|---|
| 3.1 | r7i | 398.87 | 209.66 | 300 | 330 |
| 3.1 | r7g | 495.49 | 168.64 | 235 | 253 |
| 3.4 | r7i | 1025.45 | 81.42 | 124 | 136 |
| 3.4 | r7g | 1112.13 | 75.09 | 111 | 120 |
| 3.5 | r7i | 1303.76 | 63.99 | 95 | 105 |
| 3.5 | r7g | 1477.88 | 56.42 | 82 | 91 |
Upgrading from OpenSearch 3.1 to 3.4 yielded roughly 2.2-2.6x higher QPS and cut average latency by more than half, reducing p99 latency from 253-330 ms to 120-136 ms. Moving from 3.4 to 3.5 added a further ~30% boost in throughput, bringing p99 latency down to as low as 91 ms.
Overall, comparing OpenSearch 3.1 to 3.5 shows the total performance evolution: throughput increased by up to 310%, while average latency dropped to roughly a third of its original value. Bulk SIMD transformed the system from a baseline handling roughly 400-500 req/s into a high-performance engine capable of nearly 1,500 req/s with p99 latency at or near 100 ms.
What’s next?
From an implementation perspective, this optimization can also be applied to byte and FP32 indexes, which are our primary targets. We are actively exploring opportunities to further optimize their performance.
For binary indexes, however, bulk SIMD does not provide any performance improvement. We believe this is because the XOR operation is already heavily optimized in the JVM. Should opportunities for SIMD optimization in binary indexes emerge, we will evaluate them.
Try it out
Ready to experience these performance improvements? Upgrade to OpenSearch 3.5 and enable memory-optimized search with FP16 vector indexes to take advantage of bulk SIMD optimizations. We’d love to hear about your results and use cases on the OpenSearch forum. Your feedback helps us continue improving vector search performance for the community.