Neural sparse search combines the semantic understanding of neural models with the efficiency of sparse vector representations. This technique has proven effective for semantic retrieval while retaining the advantages of traditional lexical search, such as results that are easier to explain through direct token matching. As these sparse embeddings have gained popularity, however, the resulting growth in index sizes has introduced scalability challenges.
Traditional search methods suffer from increasing query latency as collections grow, and the resulting drop in query throughput can significantly impact the user experience. In OpenSearch 2.15, we took an important first step toward addressing this challenge by introducing the two-phase search processor. By dynamically pruning tokens with negligible weights, this approach reduces the computational load while preserving search relevance.
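To illustrate the idea behind this kind of pruning, here is a minimal Python sketch (not the processor's actual implementation): candidates are first ranked using only the high-weight query tokens, and a short list is then rescored with the full query. The prune ratio and rescore window values are purely illustrative.

```python
# Conceptual illustration of two-phase scoring over a sparse query; this is not
# OpenSearch's internal implementation, and the prune_ratio and rescore_window
# values are arbitrary examples.
from typing import Dict, List


def two_phase_search(
    query: Dict[str, float],                 # token -> weight
    documents: Dict[str, Dict[str, float]],  # doc ID -> sparse embedding
    prune_ratio: float = 0.4,
    rescore_window: int = 100,
    k: int = 10,
) -> List[str]:
    def score(q: Dict[str, float], doc: Dict[str, float]) -> float:
        return sum(w * doc.get(t, 0.0) for t, w in q.items())

    # Phase 1: keep only tokens whose weight is a meaningful fraction of the
    # largest query weight, and rank all documents with this cheaper query.
    max_weight = max(query.values())
    high = {t: w for t, w in query.items() if w >= prune_ratio * max_weight}
    candidates = sorted(documents, key=lambda d: score(high, documents[d]), reverse=True)

    # Phase 2: rescore only a short candidate list with the full query.
    shortlist = candidates[:rescore_window]
    return sorted(shortlist, key=lambda d: score(query, documents[d]), reverse=True)[:k]
```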
With this same goal of efficient, high-quality retrieval, we recently introduced sparse approximate nearest neighbor (ANN) search based on SEISMIC (Spilled Clustering of Inverted Lists with Summaries for Maximum Inner Product Search), an approximate retrieval algorithm for sparse vectors. Today, we're excited to announce that the SEISMIC algorithm is available in OpenSearch 3.3. It achieves lower query latency than traditional BM25 while preserving the semantic understanding of neural sparse models, representing a fundamental shift in the search performance landscape.
The SEISMIC algorithm
A SEISMIC index consists of two components: an inverted index and a forward index. The forward index maps each document ID to its sparse embedding, where each embedding component represents a token ID and its corresponding weight. The inverted index, on the other hand, applies multiple pruning techniques to optimize efficiency and reduce storage:
- Clustered posting list: For each term in the inverted index, the SEISMIC algorithm sorts documents by token weights, retains only the top documents, and applies clustering in order to group similar documents.
- Sparse vector summarization: Each cluster maintains a summary vector containing only the highest-weighted tokens, enabling efficient pruning during query time.
- Query-time pruning: During query execution, the SEISMIC algorithm employs token-level and cluster-level pruning to dramatically reduce the number of documents that need scoring, as illustrated in the sketch following this list.
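To make these structures concrete, the following Python sketch shows clustered posting lists, summary vectors, and both pruning steps in miniature. It is an illustration under simplifying assumptions: the class names, the summary construction, and the way heap_factor scales the pruning bound are choices made for this example, not the engine's implementation.

```python
# Simplified, illustrative sketch of a SEISMIC-style index and query-time pruning.
import heapq
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # token ID -> weight


class Cluster:
    """A group of similar documents from one posting list, plus a summary vector."""

    def __init__(self, doc_ids: List[int], forward_index: Dict[int, SparseVector],
                 summary_size: int = 16):
        self.doc_ids = doc_ids
        merged: Dict[int, float] = {}
        for doc_id in doc_ids:
            for token, weight in forward_index[doc_id].items():
                merged[token] = max(merged.get(token, 0.0), weight)
        # Keep only the highest-weighted tokens as the cluster summary.
        self.summary: SparseVector = dict(
            sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:summary_size]
        )


def dot(a: SparseVector, b: SparseVector) -> float:
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def search(query: SparseVector,
           inverted_index: Dict[int, List[Cluster]],  # token ID -> clustered posting list
           forward_index: Dict[int, SparseVector],
           k: int = 10, top_n: int = 10, heap_factor: float = 1.0) -> List[Tuple[float, int]]:
    # Token-level pruning: visit only the top_n highest-weighted query tokens.
    top_tokens = sorted(query.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc ID)
    visited = set()
    for token, _ in top_tokens:
        for cluster in inverted_index.get(token, []):
            # Cluster-level pruning: the query-summary inner product bounds the best
            # score inside the cluster, so clusters that cannot improve the current
            # top k are skipped (scaling by heap_factor is a simplification here).
            if len(heap) == k and dot(query, cluster.summary) * heap_factor <= heap[0][0]:
                continue
            for doc_id in cluster.doc_ids:
                if doc_id in visited:
                    continue
                visited.add(doc_id)
                score = dot(query, forward_index[doc_id])
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```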
The data organization in the forward and inverted indexes is presented in the following image.
Try it out
To try SEISMIC-based ANN sparse search, follow these steps.
Step 1: Create an index
Create a sparse index by setting index.sparse to true and defining a sparse_vector field in the index mapping:
```json
PUT sparse-vector-index
{
  "settings": {
    "index": {
      "sparse": true
    }
  },
  "mappings": {
    "properties": {
      "sparse_embedding": {
        "type": "sparse_vector",
        "method": {
          "name": "seismic",
          "parameters": {
            "approximate_threshold": 1
          }
        }
      }
    }
  }
}
```
Step 2: Ingest data into the index
Ingest three documents containing sparse_vector fields into your index:
```json
PUT sparse-vector-index/_doc/1
{
  "sparse_embedding": {
    "1000": 0.1
  }
}

PUT sparse-vector-index/_doc/2
{
  "sparse_embedding": {
    "2000": 0.2
  }
}

PUT sparse-vector-index/_doc/3
{
  "sparse_embedding": {
    "3000": 0.3
  }
}
```
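At larger scales, you would typically ingest with the Bulk API rather than one document per request. The following is a minimal sketch using the opensearch-py client; the host, document IDs, and embeddings are placeholders.

```python
# Minimal bulk-ingestion sketch with opensearch-py; host, credentials, and the
# example embeddings are placeholders.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


def actions(docs):
    # docs: iterable of (doc ID, {token ID: weight}) pairs
    for doc_id, embedding in docs:
        yield {
            "_index": "sparse-vector-index",
            "_id": doc_id,
            "_source": {"sparse_embedding": embedding},
        }


sample = [("1", {"1000": 0.1}), ("2", {"2000": 0.2}), ("3", {"3000": 0.3})]
helpers.bulk(client, actions(sample))
```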
Step 3: Search the index
You can query the sparse index by providing either raw vectors or natural language using a neural sparse query.
Query using a raw vector
To query using a raw vector, provide the query_tokens parameter:
```json
GET sparse-vector-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_tokens": {
          "1000": 5.5
        },
        "method_parameters": {
          "heap_factor": 1.0,
          "top_n": 10,
          "k": 10
        }
      }
    }
  }
}
```
Query using natural language
To query using natural language, provide the query_text and model_id parameters:
```json
GET sparse-vector-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_text": "<input text>",
        "model_id": "<model ID>",
        "method_parameters": {
          "k": 10,
          "top_n": 10,
          "heap_factor": 1.0
        }
      }
    }
  }
}
```
Benchmark experiments: Comparing SEISMIC and traditional approaches
We conducted a billion-scale benchmark to compare the performance of the SEISMIC algorithm and traditional search methods, including BM25, neural sparse search, and two-phase search.
Experimental setup
- Corpus set: C4 dataset from Dolma. After preprocessing, the dataset was chunked into 1,285,526,507 (approximately 1.29 billion) documents.
- Query set: MS MARCO v1 dev set with 6,980 queries.
- Sparse embedding models: We used the doc-only mode with two models:
  - Corpus encoding: amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v3-gte
  - Query encoding: amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1
- OpenSearch cluster: running version 3.3, with the following nodes:
  - Cluster manager nodes: 3 m7g.4xlarge instances
  - Data nodes: 15 r7g.12xlarge instances
Benchmarking results
For our billion-scale benchmark, we evenly split the dataset into 10 partitions. After each partition was ingested, we ran a force merge to build a new SEISMIC segment. This approach resulted in 10 SEISMIC segments per data node, with each segment containing approximately 8.5 million documents.
Following the Big ANN Benchmarks methodology, we focused on query performance at approximately 90% recall@10. We considered two experimental setups: single-threaded and multithreaded. For the single-threaded setup, metrics were collected using a Python script, and latency was measured using the took time returned by OpenSearch queries. For the multithreaded setup, throughput was measured using opensearch-benchmark with a total of four threads. The benchmarking results are presented in the following table.
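For reference, single-threaded took-time measurements can be collected with a simple loop like the following sketch, which uses the opensearch-py client. The host, index name, and query construction are placeholders rather than the exact script used for this benchmark.

```python
# Sketch of single-threaded latency collection using the per-query "took" time;
# hosts, index name, and queries are placeholders.
import numpy as np
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


def measure_latency(queries, index="sparse-vector-index"):
    took_ms = []
    for query_tokens in queries:  # each query is a {token ID: weight} dict
        body = {
            "query": {
                "neural_sparse": {
                    "sparse_embedding": {
                        "query_tokens": query_tokens,
                        "method_parameters": {"k": 10, "top_n": 10, "heap_factor": 1.0},
                    }
                }
            }
        }
        response = client.search(index=index, body=body)
        took_ms.append(response["took"])  # server-side query time in milliseconds
    return {
        "avg": float(np.mean(took_ms)),
        "p50": float(np.percentile(took_ms, 50)),
        "p90": float(np.percentile(took_ms, 90)),
        "p99": float(np.percentile(took_ms, 99)),
    }
```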
Table I: Comparison between neural sparse, BM25, and SEISMIC queries
| Setup | Metric | Neural sparse | Neural sparse two-phase | BM25 | SEISMIC |
|---|---|---|---|---|---|
| | Recall@10 (%) | 100 | 90.483 | N/A | 90.209 |
| Single-threaded | Average latency (ms) | 125.12 | 45.62 | 41.52 | 11.77 |
| | P50 latency (ms) | 109 | 34 | 28 | 11 |
| | P90 latency (ms) | 226 | 100 | 90 | 16 |
| | P99 latency (ms) | 397.21 | 200.21 | 200.21 | 27 |
| | P99.9 latency (ms) | 551.15 | 296.53 | 346.06 | 50.02 |
| Multithreaded | Mean throughput (op/s) | 26.35 | 82.05 | 85.86 | 158.7 |
At roughly 90% recall, the SEISMIC algorithm achieved an average query latency of just 11.77 ms, about 3.5x faster than BM25 (41.52 ms) and more than 10x faster than standard neural sparse search (125.12 ms). In the multithreaded setup, SEISMIC also showed a clear throughput advantage, handling 158.7 operations per second, nearly double BM25's 85.86 op/s, while maintaining recall comparable to the two-phase approach.
In total, the force merge time was 2 hours, 58 minutes, and 30 seconds. On average, memory consumption was approximately 53 GB per data node for storing SEISMIC data.
Best practices
Based on our benchmarking results, here are some best practices for working with the SEISMIC algorithm:
- Recommended segment size: Set the approximate threshold to 5M documents and force merge segments to between 5M and 10M documents for best performance (see the example following this list).
- Memory planning: Plan for approximately 1 GB of memory per 1M documents when selecting appropriate instance types and sizing the cluster.
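As a sketch of how these recommendations could be applied with the opensearch-py client, the following example creates the index with an approximate threshold of 5M documents and force merges after ingestion. The index name, host, and max_num_segments value are illustrative; choose the segment count so that each merged segment ends up in the 5M-10M document range.

```python
# Sketch of the best-practice settings with opensearch-py; names and values are
# illustrative, not a definitive configuration.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="sparse-vector-index",
    body={
        "settings": {"index": {"sparse": True}},
        "mappings": {
            "properties": {
                "sparse_embedding": {
                    "type": "sparse_vector",
                    "method": {
                        "name": "seismic",
                        # Recommended approximate threshold of 5M documents.
                        "parameters": {"approximate_threshold": 5000000},
                    },
                }
            }
        },
    },
)

# ... ingest documents ...

# Force merge so that each resulting segment holds roughly 5M-10M documents.
client.indices.forcemerge(index="sparse-vector-index", max_num_segments=10)
```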
These results demonstrate that the SEISMIC algorithm delivers unprecedented performance for billion-scale search applications, outperforming even traditional BM25 while maintaining the semantic understanding of neural sparse models.
Conclusion
In OpenSearch 3.3, we’re introducing an approximate retrieval algorithm for sparse vectors. This algorithm delivers faster query times than BM25 while maintaining the semantic understanding of neural sparse models. For billion-scale search applications, it removes key scalability barriers by dramatically reducing query latency, allowing you to achieve better performance with fewer nodes.
The capability to search across billions of documents with latency of less than 12 ms changes what’s possible in information retrieval. We’re excited to see you use this technology to build the next generation of semantic search applications. As always, we’d love to hear your feedback—join the conversation and share your experiences on the OpenSearch forum.