Improving document retrieval with sparse semantic encoders

Tue, Dec 05, 2023 · Zhichao Geng, Xinyuan Lu, Dagney Braun, Charlie Yang, Fanit Kolchina

OpenSearch 2.11 introduced neural sparse search—a new efficient method of semantic retrieval. In this blog post, you’ll learn about using sparse encoders for semantic search. You’ll find that neural sparse search reduces costs, performs faster, and improves search relevance. We’re excited to share benchmarking results and show how neural sparse search outperforms other search methods. You can even try it out by building your own search engine in just five steps. To skip straight to the results, see Benchmarking results.

What are dense and sparse vector embeddings?

When you use a transformer-based encoder, such as BERT, to generate traditional dense vector embeddings, the encoder translates each word into a vector. Collectively, these vectors make up a semantic vector space. In this space, the closer the vectors are, the more similar the words are in meaning.

In sparse encoding, the encoder uses the text to create a list of tokens that have similar semantic meaning, each with an associated weight. The model vocabulary (WordPiece) contains the most commonly used words along with various tense endings (for example, -ed and -ing) and suffixes (for example, -ate and -ion). You can think of the vocabulary as a semantic space where each document is a sparse vector.

The following images show example results of dense and sparse encoding.

Left: Dense vector semantic space. Right: Sparse vector semantic space.
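
As a rough illustration of the difference, the following sketch represents one document both ways and scores a query against the sparse form. All token weights and dense values are invented for the example; they are not output from a real encoder, and the helper name sparse_dot is hypothetical.

```python
# Illustrative values only: these numbers are made up, not real encoder output.

# Dense embedding: a fixed-length list in which every dimension holds a value.
dense_doc = [0.12, -0.48, 0.33, 0.07]

# Sparse embedding: only tokens with a nonzero weight are stored, keyed by
# vocabulary entry (including expansion tokens such as "planet").
sparse_doc = {"hello": 1.4, "world": 1.1, "greeting": 0.6, "planet": 0.3}
sparse_query = {"hello": 1.0, "earth": 0.8, "planet": 0.5}


def sparse_dot(query: dict, doc: dict) -> float:
    """Score a document as the dot product over tokens shared with the query."""
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)


score = sparse_dot(sparse_query, sparse_doc)
print(round(score, 2))
```

Only the tokens shared by the query and the document ("hello" and "planet" here) contribute to the score, which is what makes the representation compatible with term-based indexes.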

Sparse encoders use more efficient data structures

In dense encoding, documents are represented as high-dimensional vectors. To search these documents, you need to use a k-NN index as an underlying data structure. In contrast, sparse search can use a native Lucene index because sparse encodings are similar to term vectors used by keyword-based matching.

Compared to k-NN indexes, sparse embeddings have the following cost-reducing advantages:

  1. Much smaller index size
  2. Reduced runtime RAM cost
  3. Lower computational cost

For a detailed comparison, see Table III.
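
To make the connection to term-based indexes more concrete, here is a toy inverted index over sparse encodings. It is a simplified sketch, not how Lucene stores rank features internally; the documents, weights, and the search helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sparse encodings (token -> weight) for three documents.
docs = {
    "doc1": {"hello": 1.4, "world": 1.1},
    "doc2": {"world": 0.9, "map": 1.2},
    "doc3": {"cooking": 1.5, "recipe": 1.3},
}

# Build postings lists: token -> list of (doc_id, weight).
inverted_index = defaultdict(list)
for doc_id, encoding in docs.items():
    for token, weight in encoding.items():
        inverted_index[token].append((doc_id, weight))


def search(query: dict) -> list:
    """Score only the documents that share at least one token with the query."""
    scores = defaultdict(float)
    for token, query_weight in query.items():
        for doc_id, doc_weight in inverted_index.get(token, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = search({"hello": 1.0, "world": 1.0})
print(results)
```

Documents that share no tokens with the query (doc3 in this example) are never visited, which hints at why a postings-based structure can be cheaper than comparing the query against every dense vector.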

Sparse encoders perform better on unfamiliar datasets

In our previous blog post, we mentioned that searching with dense embeddings presents challenges when encoders encounter unfamiliar content. When an encoder trained on one dataset is used on a different dataset, the encoder often produces unpredictable embeddings, resulting in poor search result relevance.

Often, BM25 performs better than dense encoders on BEIR datasets that incorporate strong domain knowledge. In these cases, sparse encoders can fall back on keyword-based matching, ensuring that their search results are no worse than those produced by BM25. For a comparison of search result relevance benchmarks, see Table I.

Among sparse encoders, document-only encoders are the most efficient

You can run a neural sparse search in two modes: bi-encoder and document-only.

In bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, documents are still passed through deep encoders, but search queries are only tokenized. In this mode, document encoders are trained to learn more synonym associations in order to increase recall. Because document-only mode eliminates the online inference phase, it saves computational resources and significantly reduces latency. For benchmarks, compare the Neural sparse document-only column with the other columns in Table II.
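
The difference between the two modes on the query side can be sketched as follows. This is a simplified illustration: the whitespace tokenizer, the uniform weight of 1.0 per query token, and the expansion table standing in for model inference are all assumptions made for the example, and the function names are hypothetical.

```python
def tokenize_query(text: str) -> dict:
    """Document-only mode (simplified): split the query and weight each token 1.0."""
    return {token: 1.0 for token in text.lower().split()}


def encode_query(text: str) -> dict:
    """Bi-encoder mode stand-in: a real deployment would run model inference here.
    The expansion table below is invented for illustration."""
    expansions = {"car": {"car": 1.6, "vehicle": 0.9, "automobile": 0.7}}
    weights = {}
    for token in text.lower().split():
        for expanded, weight in expansions.get(token, {token: 1.0}).items():
            weights[expanded] = max(weights.get(expanded, 0.0), weight)
    return weights


def score(query: dict, doc: dict) -> float:
    """Dot product between query token weights and document token weights."""
    return sum(weight * doc.get(token, 0.0) for token, weight in query.items())


# Hypothetical document encoding: the document encoder already added the
# synonym token "car" at ingestion time, so a tokenized query can still match.
doc = {"automobile": 1.3, "car": 0.8, "dealer": 1.1}

doc_only_score = score(tokenize_query("car dealer"), doc)
bi_encoder_score = score(encode_query("car dealer"), doc)
print(doc_only_score, bi_encoder_score)
```

In the document-only path, no model runs at query time, so the expansion work has to be done once per document during ingestion.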

Neural sparse search outperforms other search methods in benchmarking tests

For benchmarking, we used a cluster containing 3 r5.8xlarge data nodes and 1 r5.12xlarge leader/machine learning (ML) node. We measured search relevance for all evaluated search methods in terms of NDCG@10. Additionally, we compared the runtime speed and the resource cost of each method.
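
For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@k can be computed. It assumes linear gain and a log2 rank discount, which is one common formulation; the exact evaluation tooling used for the benchmarks is not stated here, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list, all_relevances: list, k: int = 10) -> float:
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the judged documents for one hypothetical query.
judged = [3, 2, 1, 0]

perfect = ndcg_at_k([3, 2, 1, 0], judged)    # results returned in ideal order
reversed_ = ndcg_at_k([0, 1, 2, 3], judged)  # results returned in worst order
print(perfect, reversed_)
```

A value of 1 means the top 10 results are in the best possible order for that query; the table values are averages of this per-query score over each dataset.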

Here are the key takeaways:

  • On average, both neural sparse search modes provide higher relevance on the BEIR and Amazon ESCI datasets than BM25, dense, and hybrid search (see Table I).
  • Without online inference, the search latency of document-only mode is comparable to BM25.
  • Sparse encoding results in a much smaller index size than dense encoding. A document-only sparse encoder generates an index that is 10.4% of the size of a dense encoding index. For a bi-encoder, the index size is 7.2% of the size of a dense encoding index.
  • Dense encoding uses k-NN retrieval and incurs a 7.9% increase in RAM cost at search time. Neural sparse search uses a native Lucene index, so the RAM cost does not increase at search time.
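
The percentages in these takeaways can be reproduced from the Table III figures. The short calculation below is a sketch of that arithmetic; in particular, reading the 7.9% RAM figure as the dense runtime RAM delta divided by the dense RAM usage is an assumption about how it was derived.

```python
# Figures copied from Table III (in GB).
dense_index_gb = 65.4
bi_encoder_index_gb = 4.7
doc_only_index_gb = 6.8
dense_ram_gb = 675.36
dense_runtime_ram_delta_gb = 53.34

# Sparse index size as a percentage of the dense index size.
doc_only_ratio = round(100 * doc_only_index_gb / dense_index_gb, 1)
bi_encoder_ratio = round(100 * bi_encoder_index_gb / dense_index_gb, 1)

# Assumed derivation of the search-time RAM increase for dense encoding.
dense_ram_increase = round(100 * dense_runtime_ram_delta_gb / dense_ram_gb, 1)

print(doc_only_ratio, bi_encoder_ratio, dense_ram_increase)
```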

Benchmarking results

The benchmarking results are presented in the following tables.

Table I. Relevance comparison on BEIR benchmark and Amazon ESCI in terms of NDCG@10 and rank

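| Dataset | BM25 NDCG | BM25 rank | Dense (with TAS-B model) NDCG | Dense rank | Hybrid (Dense + BM25) NDCG | Hybrid rank | Neural sparse search bi-encoder NDCG | Bi-encoder rank | Neural sparse search document-only NDCG | Document-only rank |
|---|---|---|---|---|---|---|---|---|---|---|
| Trec-Covid | 0.688 | 4 | 0.481 | 5 | 0.698 | 3 | 0.771 | 1 | 0.707 | 2 |
| NFCorpus | 0.327 | 4 | 0.319 | 5 | 0.335 | 3 | 0.36 | 1 | 0.352 | 2 |
| NQ | 0.326 | 5 | 0.463 | 3 | 0.418 | 4 | 0.553 | 1 | 0.521 | 2 |
| HotpotQA | 0.602 | 4 | 0.579 | 5 | 0.636 | 3 | 0.697 | 1 | 0.677 | 2 |
| FiQA | 0.254 | 5 | 0.3 | 4 | 0.322 | 3 | 0.376 | 1 | 0.344 | 2 |
| ArguAna | 0.472 | 2 | 0.427 | 4 | 0.378 | 5 | 0.508 | 1 | 0.461 | 3 |
| Touche | 0.347 | 1 | 0.162 | 5 | 0.313 | 2 | 0.278 | 4 | 0.294 | 3 |
| DBPedia | 0.287 | 5 | 0.383 | 4 | 0.387 | 3 | 0.447 | 1 | 0.412 | 2 |
| SciDocs | 0.165 | 2 | 0.149 | 5 | 0.174 | 1 | 0.164 | 3 | 0.154 | 4 |
| FEVER | 0.649 | 5 | 0.697 | 4 | 0.77 | 2 | 0.821 | 1 | 0.743 | 3 |
| Climate FEVER | 0.186 | 5 | 0.228 | 3 | 0.251 | 2 | 0.263 | 1 | 0.202 | 4 |
| SciFact | 0.69 | 3 | 0.643 | 5 | 0.672 | 4 | 0.723 | 1 | 0.716 | 2 |
| Quora | 0.789 | 4 | 0.835 | 3 | 0.864 | 1 | 0.856 | 2 | 0.788 | 5 |
| Amazon ESCI | 0.081 | 3 | 0.071 | 5 | 0.086 | 2 | 0.077 | 4 | 0.095 | 1 |
| Average | 0.419 | 3.71 | 0.41 | 4.29 | 0.45 | 2.71 | 0.492 | 1.64 | 0.462 | 2.64 |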

* For more information about Benchmarking Information Retrieval (BEIR), see the BEIR GitHub page.

Table II. Speed comparison in terms of latency and throughput

| Metric | BM25 | Dense (with TAS-B model) | Neural sparse search bi-encoder | Neural sparse search document-only |
|---|---|---|---|---|
| P50 latency (ms) | 8 | 56.6 | 176.3 | 10.2 |
| P90 latency (ms) | 12.4 | 71.12 | 267.3 | 15.2 |
| P99 latency (ms) | 18.9 | 86.8 | 383.5 | 22 |
| Max throughput (op/s) | 2215.8 | 318.5 | 107.4 | 1797.9 |
| Mean throughput (op/s) | 2214.6 | 298.2 | 106.3 | 1790.2 |

* We tested latency on a subset of MS MARCO v2 containing 1M documents in total. To obtain latency data, we used 20 clients to loop search requests.

Table III. Resource consumption comparison

| Metric | BM25 | Dense (with TAS-B model) | Neural sparse search bi-encoder | Neural sparse search document-only |
|---|---|---|---|---|
| Index size | 1 GB | 65.4 GB | 4.7 GB | 6.8 GB |
| RAM usage | 480.74 GB | 675.36 GB | 480.64 GB | 494.25 GB |
| Runtime RAM delta | +0.01 GB | +53.34 GB | +0.06 GB | +0.03 GB |

* We performed this experiment using the full MS MARCO v2 dataset, containing 8.8M passages. For all methods, we excluded the _source fields and force merged the index before measuring index size. We set the heap size of the OpenSearch JVM to half of the node RAM, so an empty OpenSearch cluster still consumed close to 480 GB of memory.

Build your search engine in five steps

Follow these steps to build your search engine:

  1. Prerequisites: For this simple setup, update the following cluster settings:

     PUT /_cluster/settings
     {
         "transient": {
             "plugins.ml_commons.allow_registering_model_via_url": true,
             "plugins.ml_commons.only_run_on_ml_node": false,
             "plugins.ml_commons.native_memory_threshold": 99
         }
     }

    For more information about ML-related cluster settings, see ML Commons cluster settings.

  2. Deploy encoders: The ML Commons plugin supports deploying pretrained models using a URL. For this example, you’ll deploy the opensearch-neural-sparse-encoding encoder:

     POST /_plugins/_ml/models/_register?deploy=true
     {
         "name": "opensearch-neural-sparse-encoding",
         "version": "1.0.0",
         "description": "opensearch-neural-sparse-encoding",
         "model_format": "TORCH_SCRIPT",
         "function_name": "SPARSE_ENCODING",
         "model_content_hash_value": "d1ebaa26615090bdb0195a62b180afd2a8524c68c5d406a11ad787267f515ea8",
         "url": ""
     }

    OpenSearch responds with a task_id:

     {
         "task_id": "<task_id>",
         "status": "CREATED"
     }

    Use the task_id to check the status of the task:

     GET /_plugins/_ml/tasks/<task_id>

    Once the task is complete, the task state changes to COMPLETED and OpenSearch returns the model_id for the deployed model:

     {
         "model_id": "<model_id>",
         "task_type": "REGISTER_MODEL",
         "function_name": "SPARSE_TOKENIZE",
         "state": "COMPLETED",
         "worker_node": [
             "<node_id>"
         ],
         "create_time": 1701390988405,
         "last_update_time": 1701390993724,
         "is_async": true
     }

  3. Set up ingestion: In OpenSearch, a sparse_encoding ingest processor encodes documents into sparse vectors before indexing them. Create an ingest pipeline as follows:

     PUT /_ingest/pipeline/neural-sparse-pipeline
     {
         "description": "An example neural sparse encoding pipeline",
         "processors": [
             {
                 "sparse_encoding": {
                     "model_id": "<model_id>",
                     "field_map": {
                         "passage_text": "passage_embedding"
                     }
                 }
             }
         ]
     }

  4. Set up index mapping: Neural search uses the rank_features field type to store token weights when documents are indexed. The index will use the ingest pipeline you created to generate text embeddings. Create the index as follows:

     PUT /my-neural-sparse-index
     {
         "settings": {
             "default_pipeline": "neural-sparse-pipeline"
         },
         "mappings": {
             "properties": {
                 "passage_embedding": {
                     "type": "rank_features"
                 },
                 "passage_text": {
                     "type": "text"
                 }
             }
         }
     }

  5. Ingest documents using the ingest pipeline: After creating the index, you can ingest documents into it. When you index a text field, the ingest processor converts text into a vector embedding and stores it in the passage_embedding field specified in the processor:

     POST /my-neural-sparse-index/_doc
     {
         "passage_text": "Hello world"
     }

Try your engine with a query clause

Congratulations! You’ve now created your own semantic search engine based on sparse encoders. To try a sample query, invoke the _search endpoint using the neural_sparse query:

 GET /my-neural-sparse-index/_search/
 {
     "query": {
         "neural_sparse": {
             "passage_embedding": {
                 "query_text": "Hello world a b",
                 "model_id": "<model_id>",
                 "max_token_score": 2.0
             }
         }
     }
 }

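If you build queries programmatically, the same request body can be assembled from its parts. The helper below is a hypothetical convenience function, not part of any OpenSearch client library; it only produces the JSON body shown above.

```python
import json
from typing import Optional


def neural_sparse_query(field: str, query_text: str, model_id: str,
                        max_token_score: Optional[float] = None) -> dict:
    """Build the body of a neural_sparse search request for one field."""
    clause = {"query_text": query_text, "model_id": model_id}
    if max_token_score is not None:
        clause["max_token_score"] = max_token_score
    return {"query": {"neural_sparse": {field: clause}}}


body = neural_sparse_query(
    field="passage_embedding",
    query_text="Hello world a b",
    model_id="<model_id>",  # placeholder: use the ID returned when you deployed the model
    max_token_score=2.0,
)
print(json.dumps(body, indent=4))
```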
Neural sparse query parameters

The neural_sparse query supports two parameters:

  • model_id (String): The ID of the model that is used to generate tokens and weights from the query text. A sparse encoding model will expand the tokens from query text, while the tokenizer model will only tokenize the query text itself.
  • max_token_score (Float): An extra parameter required for performance optimization. Just like a match query, a neural_sparse query is transformed into a Lucene BooleanQuery that combines term-level subqueries using disjunction. The difference is that a neural_sparse query uses FeatureQuery instead of TermQuery to match the terms. Lucene employs the Weak AND (WAND) algorithm for dynamic pruning, which skips non-competitive tokens based on their score upper bounds. However, FeatureQuery uses Float.MAX_VALUE as the score upper bound, which makes the WAND optimization ineffective. The max_token_score parameter resets the score upper bound for each token in a query in a way that remains consistent with the original FeatureQuery. Thus, setting the value to 3.5 for the bi-encoder model and to 2 for the document-only model can accelerate search without precision loss. After OpenSearch is upgraded to Lucene version 9.8, this parameter will be deprecated.
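
To see why a finite score upper bound matters, consider the following simplified illustration of upper-bound pruning. It is not Lucene's WAND implementation: it only shows that a candidate can be skipped when the best score it could possibly reach is no higher than the score already needed to enter the top results. Treating each token's upper bound as the query weight multiplied by max_token_score is an assumption made for the example, and all names and values are hypothetical.

```python
import sys


def term_upper_bounds(query_weights: dict, max_token_score: float) -> dict:
    """Upper bound on each query token's score contribution (simplified)."""
    return {token: weight * max_token_score for token, weight in query_weights.items()}


def can_skip(matching_terms: list, upper_bounds: dict, threshold: float) -> bool:
    """A candidate is non-competitive if its best possible score cannot beat the threshold."""
    best_possible = sum(upper_bounds[token] for token in matching_terms)
    return best_possible <= threshold


query_weights = {"hello": 1.0, "world": 0.5}
threshold = 2.5  # score of the weakest result currently in the top k

bounded = term_upper_bounds(query_weights, 2.0)
unbounded = term_upper_bounds(query_weights, sys.float_info.max)

# A candidate that matches only "world":
skip_with_bound = can_skip(["world"], bounded, threshold)
skip_without_bound = can_skip(["world"], unbounded, threshold)
print(skip_with_bound, skip_without_bound)
```

With the finite bound, the candidate that matches only "world" can be skipped. With an effectively unbounded maximum, no candidate can ever be ruled out, so every match has to be scored.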

Selecting a model

OpenSearch provides several pretrained encoder models that you can use out of the box without fine-tuning. For a list of sparse encoding models provided by OpenSearch, see Sparse encoding models. We have also released the models on the Hugging Face model hub.

Use the following recommendations to select a sparse encoding model:

  • For bi-encoder mode, we recommend using the opensearch-neural-sparse-encoding-v1 pretrained model. For this model, both online search and offline ingestion share the same model file.

  • For document-only mode, we recommend using the opensearch-neural-sparse-encoding-doc-v1 pretrained model for ingestion and the opensearch-neural-sparse-tokenizer-v1 model at search time to implement online query tokenization. The tokenizer model does not employ model inference and only translates the query into tokens.
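
These recommendations can be captured in a small lookup table, for example when a deployment script needs to choose models by mode. The structure and the models_for helper are hypothetical; the model names and max_token_score values are the ones given in this post.

```python
# Recommended models and max_token_score values per mode, as stated in this post.
RECOMMENDED_MODELS = {
    "bi-encoder": {
        "ingest_model": "opensearch-neural-sparse-encoding-v1",
        "search_model": "opensearch-neural-sparse-encoding-v1",
        "max_token_score": 3.5,
    },
    "document-only": {
        "ingest_model": "opensearch-neural-sparse-encoding-doc-v1",
        "search_model": "opensearch-neural-sparse-tokenizer-v1",
        "max_token_score": 2.0,
    },
}


def models_for(mode: str) -> dict:
    """Return the recommended configuration for a mode, or raise on an unknown mode."""
    try:
        return RECOMMENDED_MODELS[mode]
    except KeyError:
        raise ValueError(f"Unknown mode: {mode!r}") from None


config = models_for("document-only")
print(config["ingest_model"], config["search_model"])
```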

Next steps