Boosting hybrid query performance in OpenSearch 2.15

Tue, Jul 02, 2024 · Martin Gaievski, Varun Jain, Vamshi Vijay Nakkirtha, Stavros Macrakis

Since its introduction in OpenSearch 2.10, hybrid search has become popular among users looking to enhance the relevance of their semantic search results. By combining full-text search and semantic search, hybrid queries deliver superior results for various applications, including e-commerce, document search, log analytics, and data exploration. However, managing large datasets and complex queries can sometimes lead to performance issues.

With each new release, OpenSearch has implemented numerous enhancements to improve the performance of hybrid search at scale. In version 2.15, these enhancements led to hybrid query performance improving by up to 70% compared to version 2.13.

Improvements in OpenSearch 2.15

The development team made improvements by analyzing the code and optimizing performance bottlenecks. We focused on the following areas:

  • Conditional scoring logic: Previously, the core logic for collecting scores during a query was fixed, so computations were performed regardless of whether they were needed. This often led to unnecessary calculations, especially when scoring computations were redundant for specific query types or plugins. In OpenSearch 2.15, we made the scoring logic conditional, allowing certain computations to be skipped if they are not needed by the plugin currently in use. This optimization reduces computational overhead, accelerates query processing, and improves resource utilization.

    The performance improvements resulting from this change are substantial. Our benchmarks show a 20% increase in query processing speed for some use cases. For more information, see the following GitHub issues:

  • Replacement of inefficient constructs: We analyzed the performance of version 2.13 and found that the Java Streams API, while convenient, introduced unnecessary overhead in high-performance scenarios. This was particularly evident in areas with intensive data processing requirements. In version 2.15, we replaced Java Streams constructs with more efficient alternatives, such as for loops and optimized data handling techniques. This change resulted in a performance gain of up to 25% for specific data processing tasks, helping OpenSearch handle larger datasets and more complex queries more efficiently. For more information, see the following GitHub issue:
  • Elimination of unnecessary calculations: We found that certain expensive calculations, such as computing hash codes for query objects, are unnecessary. By removing these calculations, resources are allocated more efficiently, speeding up hybrid queries. This change has improved query processing speed by 20%. For more information, see the following GitHub issue:
  • Optimized data structures: We improved the use of priority queues, which are used for some sorting operations. Query hit objects are now allocated through lazy initialization instead of up front, and the lowest-scoring element is removed when the queue reaches full capacity. Our benchmarks show that this optimization resulted in a performance gain of up to 10% in query processing speed for specific data processing tasks. For more information, see the following GitHub issue:
  • Reducing repetitive calculations: To mitigate redundant internal calculations, we have implemented value caching and reuse strategies, reducing the overall computational overhead within the system. By optimizing the handling of repetitive calculations and promoting value reuse, we have sped up the system by 5%. For more information, see the following GitHub issues:
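
To make two of these ideas more concrete, the following Java sketch shows a bounded top-k score queue that allocates its heap lazily and evicts the lowest score once it is full, alongside a stream-based and a loop-based version of the same aggregation. This is an illustration only and is not taken from the OpenSearch codebase: the class name TopKScoresSketch and all of its methods are hypothetical, and the real collectors work on document hits rather than bare scores.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Illustrative sketch only; all names are hypothetical, not OpenSearch classes. */
final class TopKScoresSketch {
    private final int capacity;
    // Allocated on first use, so a query that collects no hits never pays for the heap.
    private PriorityQueue<Double> minHeap;

    TopKScoresSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Offers a score; once the queue is full, the lowest score makes way for a higher one. */
    void offer(double score) {
        if (minHeap == null) {
            // Natural ordering keeps the smallest score at the head of the queue.
            minHeap = new PriorityQueue<>(capacity);
        }
        if (minHeap.size() < capacity) {
            minHeap.add(score);
        } else if (score > minHeap.peek()) {
            minHeap.poll(); // remove the current lowest score
            minHeap.add(score);
        }
    }

    /** Returns the retained scores from highest to lowest without modifying the queue. */
    double[] topScoresDescending() {
        if (minHeap == null) {
            return new double[0];
        }
        PriorityQueue<Double> copy = new PriorityQueue<>(minHeap);
        double[] out = new double[copy.size()];
        // poll() yields ascending scores, so fill the array from the back.
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = copy.poll();
        }
        return out;
    }

    /** Stream-based aggregation: concise, but it sets up a stream pipeline on every call. */
    static double sumWithStream(double[] scores) {
        return Arrays.stream(scores).sum();
    }

    /** Loop-based equivalent: the kind of construct that can replace a stream on a hot path. */
    static double sumWithLoop(double[] scores) {
        double total = 0.0;
        for (double score : scores) {
            total += score;
        }
        return total;
    }
}
```

With a capacity of 2, offering the scores 0.3, 0.9, 0.1, and 0.5 leaves 0.9 and 0.5 in the queue. Whether a loop actually beats a stream depends on the workload and the JVM, so a change like this is worth measuring before and after.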

Benchmark results

Our benchmark results show a performance improvement of up to 70% for large datasets (over 10M documents) in hybrid queries with OpenSearch 2.15 compared to version 2.13. These benchmarks were conducted using a new OpenSearch Benchmark workload created specifically for evaluating semantic-search use cases. The following table summarizes the benchmark results.

Number of documents   Number of hybrid   OpenSearch 2.13, ms    OpenSearch 2.15, ms    Performance
retrieved             sub-queries        p50    p90    p99      p50    p90    p99      improvement, %
1.6K                  1                  75     77     78       75     76     76       1
1.6M                  1                  224    240    245      109    114    119      52
10M                   1                  729    841    868      237    257    264      70
15M                   3                  1224   1300   1367     294    330    343      75

Average boost in 2.15 vs. 2.13: 49%
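
The improvement column can be read as a relative reduction in latency. The table does not say which percentile the column is derived from, so the Java sketch below simply assumes the usual definition, (before - after) / before, and applies it to a single pair of latencies. The class and method names are hypothetical.

```java
/** Illustrative sketch only; the formula is an assumed definition of "improvement". */
final class LatencyImprovementSketch {

    /** Relative latency reduction, as a percentage of the original latency. */
    static double reductionPercent(double beforeMs, double afterMs) {
        if (beforeMs <= 0) {
            throw new IllegalArgumentException("beforeMs must be positive");
        }
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }
}
```

For example, reductionPercent(841, 257) on the p90 latencies of the 10M-document row gives roughly 69.4, in line with the 70% reported for that row.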

Planned improvements

We will continue to analyze OpenSearch performance metrics and identify opportunities for further enhancements. Future improvements may include:

  • Advanced optimization techniques for hybrid queries: Iterating over batches of documents rather than individual ones to further reduce latency and enhance performance. This technique aims to streamline hybrid query processing by minimizing the computational overhead associated with handling large volumes of data.

  • Algorithmic refinements: Refining existing algorithms and introducing new ones better suited for hybrid search. This includes optimizing ranking and scoring mechanisms to ensure more accurate and faster results.

Ongoing initiatives to provide continuous performance insights and improvements include the following:

  • Nightly benchmark runs: Starting with version 2.15, we will publish the results of hybrid query nightly benchmark runs so you can track performance changes between versions. These results will be available on the OpenSearch Performance Benchmarks page.

  • Enhanced benchmark workloads: We’ll expand benchmark workloads with extensions in order to gather metrics for vector search queries, in addition to text search queries.

These enhancements will make OpenSearch more powerful and efficient.


References

  1. [META] Improve Hybrid query latency
  2. OpenSearch Benchmark workload for semantic search
  3. OpenSearch Performance Benchmarks
  4. Improve search relevance with hybrid search, generally available in OpenSearch 2.10
  5. Hybrid query