In OpenSearch 3.0, we introduced semantic highlighting — an AI-powered feature that intelligently identifies relevant passages in search results based on meaning rather than exact keyword matches.

OpenSearch 3.3 introduces batch processing for externally hosted semantic highlighting models, reducing machine learning (ML) inference calls from N to 1 per search query. Our benchmarks demonstrate 100–1,300% performance improvements, depending on document length and result set size.

Try our demo now: Experience semantic highlighting on the OpenSearch ML Playground.

What’s new: Batch processing for remote models

In OpenSearch 3.0, semantic highlighting processes each search result individually, making one ML inference call per result. For queries returning many results, this sequential approach can add latency that grows with the result set size. OpenSearch 3.3 introduces a new approach: collecting all search results and sending them to the model in a single batched ML inference call, reducing overhead latency and improving GPU utilization.
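
As a purely illustrative sketch of the call pattern (the SageMaker runtime URL format is real, but the payload descriptions are placeholders rather than the highlighter model's actual request contract), the change looks like this.

Without batch processing, one inference call is made per search hit:

POST https://runtime.sagemaker.<region>.amazonaws.com/endpoints/<endpoint-name>/invocations
<payload containing the query and the text of a single hit>
... repeated once for each of the N hits ...

With batch processing, a single inference call carries all of the hits:

POST https://runtime.sagemaker.<region>.amazonaws.com/endpoints/<endpoint-name>/invocations
<payload containing the query and the texts of all N hits>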

Batch processing currently applies to remote semantic highlighting models only (those deployed on Amazon SageMaker or other external endpoints).

Using batch semantic highlighting in search requests

To get started with batch semantic highlighting in your searches, follow these steps. For a complete setup, see the semantic highlighting tutorial.

Step 1: Configure your remote model

To use batch processing, you'll need a model that supports batch inference, deployed on an external endpoint. Here's how to integrate with an Amazon SageMaker endpoint hosted on AWS.
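
The following is a minimal sketch of that integration, assuming you already have a SageMaker endpoint serving a semantic highlighting model: create an ML Commons connector that points at the endpoint, and then register it as a remote model. The connector name, region, credentials, endpoint URL, and request body template shown here are placeholders to adapt to your deployment; refer to the semantic highlighting tutorial for the exact connector configuration.

POST /_plugins/_ml/connectors/_create
{
  "name": "Amazon SageMaker semantic highlighting connector (example)",
  "description": "Connector to a remotely hosted semantic highlighting model",
  "version": "1",
  "protocol": "aws_sigv4",
  "credential": {
    "access_key": "<your-access-key>",
    "secret_key": "<your-secret-key>",
    "session_token": "<your-session-token>"
  },
  "parameters": {
    "region": "<your-region>",
    "service_name": "sagemaker"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://runtime.sagemaker.<your-region>.amazonaws.com/endpoints/<your-endpoint-name>/invocations",
      "headers": {
        "content-type": "application/json"
      },
      "request_body": "<request body template matching your endpoint's input schema>"
    }
  ]
}

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "semantic-highlighter-remote",
  "function_name": "remote",
  "description": "Remote semantic highlighting model",
  "connector_id": "<connector-id-from-the-previous-response>"
}

The model ID returned by the registration call is the value to use for <your-semantic-highlighting-model-id> in Step 3.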

Step 2: Enable system-generated pipelines

Add the following cluster setting to enable OpenSearch to automatically create the system-default batch semantic highlighting pipeline for processing search responses:

PUT /_cluster/settings
{
  "persistent": {
    "search.pipeline.enabled_system_generated_factories": ["semantic-highlighter"]
  }
}
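
To confirm that the setting was applied, you can retrieve the persistent cluster settings; the response should list semantic-highlighter under search.pipeline.enabled_system_generated_factories:

GET /_cluster/settings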

Step 3: Add the batch flag to your query

Include batch_inference: true in your search request to enable batch semantic highlighting. The following example uses a neural query:

POST /neural-search-index/_search
{
  "size": 10,
  "query": {
    "neural": {
      "embedding": {
        "query_text": "treatments for neurodegenerative diseases",
        "model_id": "<your-text-embedding-model-id>",
        "k": 10
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "type": "semantic"
      }
    },
    "options": {
      "model_id": "<your-semantic-highlighting-model-id>",
      "batch_inference": true
    }
  }
}

Your queries will now use batch processing automatically.
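
For reference, the following is an abbreviated, illustrative response shape (field values are placeholders): each returned hit contains a highlight section in which the model-selected passage from the text field is wrapped in <em> tags by default.

{
  "hits": {
    "hits": [
      {
        "_source": {
          "text": "<full document text>"
        },
        "highlight": {
          "text": [
            "... <em>the passage the model identified as most relevant to the query</em> ..."
          ]
        }
      }
    ]
  }
}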

Performance benchmarks

We evaluated the performance impact of batch processing for semantic highlighting on the MultiSpanQA dataset. The test environment was configured as follows.

Component | Configuration
OpenSearch cluster | Version 3.3.0 deployed on AWS (us-east-2)
Data nodes | 3 × r6g.2xlarge (8 vCPUs, 64 GB memory each)
Manager nodes | 3 × c6g.xlarge (4 vCPUs, 8 GB memory each)
Semantic highlighting model | opensearch-semantic-highlighter-v1 deployed remotely on an Amazon SageMaker endpoint (single-GPU ml.g5.xlarge)
Embedding model | sentence-transformers/all-MiniLM-L6-v2 deployed within the OpenSearch cluster
Benchmark client | ARM64, 16 cores, 61 GB RAM
Test configuration | 10 warmup iterations, 50 test iterations, 3 shards, 0 replicas

We tested two document sets with different document lengths.

Dataset | Document length | Mean tokens | P50 tokens | P90 tokens | Max tokens
MultiSpanQA | Long documents | ~303 | ~278 | ~513 | ~1,672
MultiSpanQA-Short | Short documents | ~79 | ~70 | ~113 | ~213

Latency

We measured the latency overhead of semantic highlighting by comparing semantic search with highlighting enabled and with highlighting disabled. The baseline semantic search latency is approximately 20–25 ms across all configurations. The following table shows the highlighting overhead only (values exclude the baseline search time). All latency measurements are server-side took times reported in OpenSearch responses.

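In the following tables, the improvement percentages correspond to the relative speedup: for latency, (latency without batch processing − latency with batch processing) / latency with batch processing; for example, the first row's P50 improvement is (209 − 123) / 123 ≈ 70%. Throughput improvements later in this post are computed analogously, relative to the non-batched throughput.
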
k value | Search clients | Document length | P50 without batch processing (ms) | P50 with batch processing (ms) | P50 improvement | P90 without batch processing (ms) | P90 with batch processing (ms) | P90 improvement
10 | 1 | Long | 209 | 123 | 70% | 262 | 179 | 46%
10 | 4 | Long | 378 | 171 | 121% | 487 | 302 | 61%
10 | 8 | Long | 699 | 309 | 126% | 955 | 624 | 53%
10 | 1 | Short | 175 | 55 | 218% | 217 | 59 | 268%
10 | 4 | Short | 327 | 62 | 427% | 445 | 120 | 271%
10 | 8 | Short | 610 | 101 | 504% | 860 | 227 | 279%
50 | 1 | Long | 867 | 633 | 37% | 999 | 717 | 39%
50 | 4 | Long | 1,937 | 912 | 112% | 2,248 | 1,685 | 33%
50 | 8 | Long | 3,638 | 1,474 | 147% | 4,355 | 3,107 | 40%
50 | 1 | Short | 760 | 82 | 827% | 828 | 205 | 304%
50 | 4 | Short | 1,666 | 193 | 763% | 1,971 | 362 | 445%
50 | 8 | Short | 3,162 | 219 | 1,344% | 3,704 | 729 | 408%

The benchmarks demonstrate that batch processing substantially reduces the semantic highlighting overhead. For short documents with k=50 and 8 clients, batch processing reduces P50 highlighting latency from 3,162 ms to 219 ms, a 1,344% improvement. The P90 latency also improves significantly (408%), demonstrating consistent performance benefits. The semantic search baseline (~25 ms) remains constant, so these improvements translate directly into faster end-to-end response times.

Key findings:

  • k=10: Moderate to significant improvement (70–504% for P50, 46–279% for P90).
  • k=50: Dramatic improvement (37–1,344% for P50, 33–445% for P90).
  • Short documents benefit more: Up to 1,344% faster (P50) compared to 147% for long documents at k=50.
    • Why the difference? Long documents can exceed the model's 512-token limit, requiring multiple chunked inference runs even with batch processing. Short documents can be processed in a single pass, maximizing the benefit of batching.
  • P50 shows larger gains: Median latency improves more than tail latency, but both benefit significantly.

Throughput

To understand how batch processing affects the system’s capacity to handle concurrent requests, we also measured the throughput (mean number of operations per second). The results (presented in the following table) show consistent improvements across all configurations.

k value | Search clients | Doc length | Without batch (ops/s) | With batch (ops/s) | Improvement
10 | 1 | Long | 4.23 | 6.29 | 49%
10 | 4 | Long | 9.18 | 17.9 | 95%
10 | 8 | Long | 10.02 | 21.27 | 112%
10 | 1 | Short | 4.83 | 11.59 | 140%
10 | 4 | Short | 10.47 | 37.79 | 261%
10 | 8 | Short | 12.03 | 48.33 | 302%
50 | 1 | Long | 1.11 | 1.49 | 34%
50 | 4 | Long | 1.99 | 3.74 | 88%
50 | 8 | Long | 2.12 | 4.28 | 102%
50 | 1 | Short | 1.27 | 4.3 | 239%
50 | 4 | Short | 2.27 | 11.55 | 409%
50 | 8 | Short | 2.43 | 14.33 | 490%

Throughput improvements demonstrate that batch processing not only reduces individual query latency but also increases the overall system capacity, allowing you to serve more concurrent users with the same infrastructure.

Conclusion

Batch processing in OpenSearch 3.3 brings significant performance improvements to semantic highlighting for remote models. By reducing the number of ML inference calls from N to 1 per search, we’ve delivered:

  • Faster response times and higher query throughput when highlighting multiple search results.
  • More efficient use of remote model resources.
  • Backward-compatible queries (existing queries work as is).

Try batch processing for semantic highlighting and share your feedback on the OpenSearch forum.

Author

  • Junqiu Lei is a software development engineer at Amazon Web Services working on vector search (k-NN) and map visualizations for the OpenSearch Project.
