In OpenSearch 3.0, we introduced semantic highlighting — an AI-powered feature that intelligently identifies relevant passages in search results based on meaning rather than exact keyword matches.
OpenSearch 3.3 introduces batch processing for externally hosted semantic highlighting models, reducing machine learning (ML) inference calls from N to 1 per search query. Our benchmarks demonstrate performance improvements ranging from roughly 35% to over 1,300%, depending on document length and result set size.
Try our demo now: Experience semantic highlighting on the OpenSearch ML Playground, presented in the following image.
What’s new: Batch processing for remote models
In OpenSearch 3.0, semantic highlighting processes each search result individually, making one ML inference call per result. For queries returning many results, this sequential approach can add latency that grows with result set size. OpenSearch 3.3 introduces a new approach (shown in the following diagram): collecting all search results and sending them in a single batched ML inference call, reducing overhead latency and improving GPU utilization.
Batch processing currently applies to remote semantic highlighting models only (those deployed on Amazon SageMaker or other external endpoints).
Using batch semantic highlighting in search requests
To get started with batch semantic highlighting in your searches, follow these steps. For a complete setup, see the semantic highlighting tutorial.
Step 1: Configure your remote model
To use batch processing, you need a remote model that supports batch inference, deployed on an external endpoint. Here's how to integrate with an Amazon SageMaker endpoint:
- Create the necessary Amazon SageMaker model endpoint resources. For more information, see the README guide.
- Deploy the model endpoint to OpenSearch, as outlined in the sketch following this list. For more information, see the Amazon SageMaker blueprint guide.
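The exact connector body depends on the blueprint for your endpoint, but the following sketch shows the general shape of the two registration calls. All angle-bracketed values are placeholders, and the region, headers, and request body template are assumptions; replace them with the values from the Amazon SageMaker blueprint guide:

```json
POST /_plugins/_ml/connectors/_create
{
  "name": "Semantic highlighter connector (example)",
  "description": "Connector to a remote semantic highlighting model hosted on Amazon SageMaker",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "<your-aws-region>",
    "service_name": "sagemaker"
  },
  "credential": {
    "access_key": "<your-access-key>",
    "secret_key": "<your-secret-key>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://runtime.sagemaker.<your-aws-region>.amazonaws.com/endpoints/<your-endpoint-name>/invocations",
      "headers": { "content-type": "application/json" },
      "request_body": "<request body template from the blueprint>"
    }
  ]
}
```

```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "semantic-highlighter-remote",
  "function_name": "remote",
  "description": "Remote semantic highlighting model",
  "connector_id": "<connector-id-returned-by-the-previous-call>"
}
```

The register call returns a model ID; use that ID as the semantic highlighting model ID in the search requests shown in Step 3.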
Step 2: Enable system-generated pipelines
Add the following cluster setting to enable OpenSearch to automatically create the system-default batch semantic highlighting pipeline for processing search responses:
```json
PUT /_cluster/settings
{
  "persistent": {
    "search.pipeline.enabled_system_generated_factories": ["semantic-highlighter"]
  }
}
```
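To confirm that the setting was applied, you can read the cluster settings back. This is a standard cluster settings call, not anything specific to semantic highlighting; the persistent section of the response should list semantic-highlighter under search.pipeline.enabled_system_generated_factories:

```json
GET /_cluster/settings?flat_settings=true
```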
Step 3: Add the batch flag to your query
Include batch_inference: true in your search request to enable batch semantic highlighting. The following example uses a neural query:
```json
POST /neural-search-index/_search
{
  "size": 10,
  "query": {
    "neural": {
      "embedding": {
        "query_text": "treatments for neurodegenerative diseases",
        "model_id": "<your-text-embedding-model-id>",
        "k": 10
      }
    }
  },
  "highlight": {
    "fields": {
      "text": { "type": "semantic" }
    },
    "options": {
      "model_id": "<your-semantic-highlighting-model-id>",
      "batch_inference": true
    }
  }
}
```
Your queries will now use batch processing automatically.
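For reference, a truncated response might look similar to the following. The document text, score, and highlighted span are purely illustrative; the relevant sentences selected by the model are wrapped in the default <em> highlight tags:

```json
{
  "took": 42,
  "hits": {
    "total": { "value": 10, "relation": "eq" },
    "hits": [
      {
        "_index": "neural-search-index",
        "_id": "1",
        "_score": 0.74,
        "_source": {
          "text": "Alzheimer's disease is a progressive neurodegenerative disorder. Current treatments include cholinesterase inhibitors and memantine."
        },
        "highlight": {
          "text": [
            "Alzheimer's disease is a progressive neurodegenerative disorder. <em>Current treatments include cholinesterase inhibitors and memantine.</em>"
          ]
        }
      }
    ]
  }
}
```

The response format is the same with or without batch_inference; batching only changes how many inference calls OpenSearch makes to the remote model behind the scenes.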
Performance benchmarks
We evaluated the performance impact of batch processing for semantic highlighting on the MultiSpanQA dataset. The test environment was configured as follows.
| Component | Configuration |
|---|---|
| OpenSearch cluster | Version 3.3.0 deployed on AWS (us-east-2) |
| Data nodes | 3 × r6g.2xlarge (8 vCPUs, 64 GB memory each) |
| Manager nodes | 3 × c6g.xlarge (4 vCPUs, 8 GB memory each) |
| Semantic highlighting model | opensearch-semantic-highlighter-v1 deployed remotely on an Amazon SageMaker endpoint backed by a single GPU-based ml.g5.xlarge instance |
| Embedding model | sentence-transformers/all-MiniLM-L6-v2 deployed within the OpenSearch cluster |
| Benchmark client | ARM64, 16 cores, 61 GB RAM |
| Test configuration | 10 warmup iterations, 50 test iterations, 3 shards, 0 replicas |
We tested two document sets with different document lengths.
| Dataset | Document length | Mean tokens | P50 tokens | P90 tokens | Max tokens |
|---|---|---|---|---|---|
| MultiSpanQA | Long documents | ~303 | ~278 | ~513 | ~1,672 |
| MultiSpanQA-Short | Short documents | ~79 | ~70 | ~113 | ~213 |
Latency
We measured the latency overhead of semantic highlighting by comparing semantic search latency with highlighting enabled and with it disabled. The baseline semantic search latency is approximately 20–25 ms across all configurations. The following table shows the highlighting overhead only (values exclude the baseline search time). All latency measurements are server-side took times reported in OpenSearch responses.
| k value | Search clients | Document length | P50 without batch processing (ms) | P50 with batch processing (ms) | P50 improvement | P90 without batch processing (ms) | P90 with batch processing (ms) | P90 improvement |
|---|---|---|---|---|---|---|---|---|
| 10 | 1 | Long | 209 | 123 | 70% | 262 | 179 | 46% |
| 10 | 4 | Long | 378 | 171 | 121% | 487 | 302 | 61% |
| 10 | 8 | Long | 699 | 309 | 126% | 955 | 624 | 53% |
| 10 | 1 | Short | 175 | 55 | 218% | 217 | 59 | 268% |
| 10 | 4 | Short | 327 | 62 | 427% | 445 | 120 | 271% |
| 10 | 8 | Short | 610 | 101 | 504% | 860 | 227 | 279% |
| 50 | 1 | Long | 867 | 633 | 37% | 999 | 717 | 39% |
| 50 | 4 | Long | 1,937 | 912 | 112% | 2,248 | 1,685 | 33% |
| 50 | 8 | Long | 3,638 | 1,474 | 147% | 4,355 | 3,107 | 40% |
| 50 | 1 | Short | 760 | 82 | 827% | 828 | 205 | 304% |
| 50 | 4 | Short | 1,666 | 193 | 763% | 1,971 | 362 | 445% |
| 50 | 8 | Short | 3,162 | 219 | 1,344% | 3,704 | 729 | 408% |
These benchmarks demonstrate that batch processing substantially reduces semantic highlighting overhead. For short documents with k=50 and 8 clients, batch processing reduces highlighting latency from 3,162 ms to just 219 ms (P50), a 1,344% improvement. P90 latency also improves (408%), demonstrating consistent performance benefits. Because the semantic search baseline (~25 ms) remains constant, these improvements translate directly into faster end-to-end response times.
Key findings:
- k=10: Moderate to significant improvement (70–504% for P50, 46–279% for P90).
- k=50: Dramatic improvement (37–1,344% for P50, 33–445% for P90).
- Short documents benefit more: Up to 1,344% faster (P50) compared to 147% for long documents at k=50.
- Why the difference? Long documents can exceed the model's 512-token limit, requiring multiple chunked inference runs even with batch processing. Short documents can be processed in a single pass, maximizing the benefit of batching.
- P50 shows larger gains: Median latency improves more than tail latency, but both benefit significantly.
Throughput
To understand how batch processing affects the system’s capacity to handle concurrent requests, we also measured the throughput (mean number of operations per second). The results (presented in the following table) show consistent improvements across all configurations.
| k value | Search clients | Document length | Without batch processing (ops/s) | With batch processing (ops/s) | Improvement |
|---|---|---|---|---|---|
| 10 | 1 | Long | 4.23 | 6.29 | 49% |
| 10 | 4 | Long | 9.18 | 17.9 | 95% |
| 10 | 8 | Long | 10.02 | 21.27 | 112% |
| 10 | 1 | Short | 4.83 | 11.59 | 140% |
| 10 | 4 | Short | 10.47 | 37.79 | 261% |
| 10 | 8 | Short | 12.03 | 48.33 | 302% |
| 50 | 1 | Long | 1.11 | 1.49 | 34% |
| 50 | 4 | Long | 1.99 | 3.74 | 88% |
| 50 | 8 | Long | 2.12 | 4.28 | 102% |
| 50 | 1 | Short | 1.27 | 4.3 | 239% |
| 50 | 4 | Short | 2.27 | 11.55 | 409% |
| 50 | 8 | Short | 2.43 | 14.33 | 490% |
Throughput improvements demonstrate that batch processing not only reduces individual query latency but also increases the overall system capacity, allowing you to serve more concurrent users with the same infrastructure.
Conclusion
Batch processing in OpenSearch 3.3 brings significant performance improvements to semantic highlighting for remote models. By reducing the number of ML inference calls from N to 1 per search, we’ve delivered:
- Faster response times and higher query throughput when highlighting multiple search results.
- More efficient use of remote model resources.
- Backward-compatible queries (existing queries work as is).
Try batch processing for semantic highlighting and share your feedback on the OpenSearch forum.