
Backing up your OpenSearch indexes via the snapshot process is vital for disaster recovery. Snapshots allow teams to restore indexed data, cluster configuration, and state if something goes wrong. Teams also rely on snapshots and restore processes when migrating data between clusters (for example, between development and production) or during OpenSearch version upgrades. At Aiven, engineers helped a customer discover why their snapshots were failing. The culprit? A single space character.

TL;DR

A single space in a KNN field name broke OpenSearch snapshot backups. Aiven tracked it down and helped get it fixed.

  • Snapshot file validation rejected filenames with spaces, but the KNN plugin allowed spaces in field names that become part of filenames
  • The bug was hard to trace due to verbose error messages and inconsistent behavior across storage backends
  • Aiven engineers isolated the issue through methodical experimentation, narrowing it from a generic storage problem to a KNN-specific naming conflict
  • A bug was raised upstream and fixed in OpenSearch v2.17
  • Aiven added upgrade guardrails to block affected customers from hitting the issue in the meantime

Chasing the error

Bugs like this can be hard to trace. Snapshots take time to create when there’s a lot of data in an index. OpenSearch can be configured to log errors, but it’s not always obvious what an error message actually means in practice. In this case, the error message produced was verbose (note this is a recreation of the message, not our actual customer’s details):

Jul 11 13:02:01 os-knn-test-1 opensearch[87]: [2024-07-11T13:02:01,536][WARN ][o.o.r.b.BlobStoreRepository] [os-knn-test-1] [opensearch-20240711t125920832386z-frequent/IjyRjHELRT-9SQYOUCq5nQ] failed to delete shard data for shard [knn-index][0] org.opensearch.OpenSearchParseException: missing or invalid physical file name [_0_2011_my vector.hnswc]| at org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.fromXContent(BlobStoreIndexShardSnapshot.java:344) ~[opensearch-2.15.0.jar:2.15.0]| at org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshots.fromXContent(BlobStoreIndexShardSnapshots.java:298) ~[opensearch-2.15.0.jar:2.15.0]| at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.deserialize(ChecksumBlobStoreFormat.java:144) ~[opensearch-2.15.0.jar:2.15.0]| at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:119) ~[opensearch-2.15.0.jar:2.15.0]| at org.opensearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3587) ~[opensearch-2.15.0.jar:2.15.0]| at org.opensearch.repositories.blobstore.BlobStoreRepository$3.doRun(BlobStoreRepository.java:1388) [opensearch-2.15.0.jar:2.15.0]| at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:941) [opensearch-2.15.0.jar:2.15.0]| at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.15.0.jar:2.15.0]| at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]| at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]| at java.base/java.lang.Thread.run(Thread.java:840) [?:?]|

At first glance, the underlying storage layer appears to be failing because it can’t delete a file, and the issue relates to the filename itself. It’s unclear whether the file has disappeared, changed names, or if something else is happening. It could be a network issue, or something related to the storage system itself.
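When a warning is this verbose, it can help to pull out just the parts that identify what failed. The following is a minimal sketch, not part of Aiven's tooling: the regular expression is a hypothetical pattern written only for the message format shown above, and the log line is a shortened copy of that recreated message with the stack trace omitted.

```python
import re

# Shortened copy of the recreated warning shown above (stack trace omitted).
LOG_LINE = (
    "[2024-07-11T13:02:01,536][WARN ][o.o.r.b.BlobStoreRepository] [os-knn-test-1] "
    "[opensearch-20240711t125920832386z-frequent/IjyRjHELRT-9SQYOUCq5nQ] "
    "failed to delete shard data for shard [knn-index][0] "
    "org.opensearch.OpenSearchParseException: "
    "missing or invalid physical file name [_0_2011_my vector.hnswc]"
)

# Hypothetical pattern: it only targets this one message format.
PATTERN = re.compile(
    r"for shard \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\].*?"
    r"physical file name \[(?P<filename>[^\]]+)\]"
)


def summarize_snapshot_warning(line):
    """Return the index, shard number, and file name from the warning, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "index": match.group("index"),
        "shard": int(match.group("shard")),
        "filename": match.group("filename"),
    }


print(summarize_snapshot_warning(LOG_LINE))
```

Reduced to three fields, the message is easier to reason about: one index, one shard, and one filename that the repository code refuses to parse.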

Digging in with experimentation

The first experiment was to change where the backups were being made by switching the Amazon S3 volume. This briefly seemed to work. The first backup succeeded, but then the problem reappeared.

The next step was to examine which file was actually being snapshotted. The shard identifier [knn-index][0] in the error points to an index named knn-index, so the file is part of the structure that allows OpenSearch to provide efficient vector search using K-Nearest Neighbours. Something about this file prevents it from being deleted. After a few more experiments, the behavior proved consistent. This isn’t a generic issue with the OpenSearch storage backend. It’s specific to the KNN index.

Naming our backups

Time for a flash of inspiration. Why might it be hard to delete this file? The filename contains something unusual: a space character between “my” and “vector.” This is perfectly allowable on Linux, Windows, and other modern operating systems, although others say, “spaces in filenames aren’t any problem, except when they are.” Maybe this is the issue?

It turns out that the snapshot repository’s file validation rejected filenames containing spaces, while the KNN plugin happily allowed spaces in field names that get embedded into segment filenames. The next question was how to fix the problem. Because OpenSearch is open source, a bug was raised upstream. Others in the community soon confirmed that they had encountered the same problem.
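The mismatch between the two components can be illustrated with a small sketch. It is not the OpenSearch source code. The set of rejected characters is an assumption for illustration (only the space matters here), and the helper `knn_segment_filename` is a hypothetical naming scheme inferred from the filename in the log message.

```python
# Assumption: characters the snapshot side treats as invalid in a file name.
# The space is the one that matters here; the rest of the set is illustrative.
INVALID_FILENAME_CHARS = set('\\/*?"<>| ,')


def is_valid_snapshot_filename(name):
    """Mimic a repository-side check: reject empty names and any listed character."""
    return bool(name) and not any(ch in INVALID_FILENAME_CHARS for ch in name)


def knn_segment_filename(segment, version, field, extension="hnswc"):
    """Hypothetical naming scheme inferred from the log: _0_2011_my vector.hnswc."""
    return f"{segment}_{version}_{field}.{extension}"


for field in ("my_vector", "my vector"):
    filename = knn_segment_filename("_0", "2011", field)
    print(f"{filename!r}: valid={is_valid_snapshot_filename(filename)}")
```

The field name with an underscore produces a filename that passes the check, while the one with a space does not. Neither component is wrong in isolation; the failure only appears where a permissive field name meets a strict filename check.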

A fix and a workaround

Even once a bug has been raised, an open source project offers no guarantee of when the issue will be fixed. The OpenSearch team proved responsive, and the fix eventually landed in OpenSearch v2.17. In the meantime, Aiven advised customers to avoid spaces in KNN field names. Because the problem often appeared when upgrading to OpenSearch v2.x, Aiven added a guardrail to block these upgrades when a filename contained spaces.
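A pre-upgrade check along these lines can be sketched in a few lines. This is not Aiven's actual guardrail. It assumes the index mappings have already been fetched and have the shape returned by the get mapping API (index name, then `mappings`, then `properties`), and the function name and sample data are hypothetical.

```python
def knn_fields_with_whitespace(mappings_by_index):
    """Return (index, field path) pairs for knn_vector fields whose path contains whitespace."""
    found = []

    def walk(index, properties, prefix):
        for name, spec in properties.items():
            path = f"{prefix}.{name}" if prefix else name
            # Check the full dotted path so a space in a parent object is caught too.
            if spec.get("type") == "knn_vector" and any(ch.isspace() for ch in path):
                found.append((index, path))
            # Object and nested fields carry their own "properties" block.
            walk(index, spec.get("properties", {}), path)

    for index, body in mappings_by_index.items():
        walk(index, body.get("mappings", {}).get("properties", {}), "")
    return found


# Hypothetical sample in the shape of a get mapping response.
sample = {
    "knn-index": {"mappings": {"properties": {
        "my vector": {"type": "knn_vector", "dimension": 2},
        "title": {"type": "text"},
    }}},
    "clean-index": {"mappings": {"properties": {
        "embedding": {"type": "knn_vector", "dimension": 2},
    }}},
}

offenders = knn_fields_with_whitespace(sample)
if offenders:
    print("Upgrade blocked, KNN fields with spaces:", offenders)
```

Checking mappings rather than files on disk means the problem can be flagged before any snapshot is attempted, which is the point of a guardrail.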

Not every team has the time or knowledge to dive deeply into the open source code they depend on. Aiven provides a managed OpenSearch service that is built for high reliability and designed to let customers concentrate on their own offering without worrying about the underlying search engine. Deep experience, an experimental approach, and the occasional flash of inspiration make Aiven a community resource and expert OpenSearch solution provider.

Authors

  • Alexie Atavin is a Staff Software Engineer at Aiven, where he works within the metrics and logs team. His focus is on the continuous improvement of Aiven's OpenSearch, Redis, and Grafana offerings, driving automation code that powers these managed services.

  • Charlie is a leading figure in the search industry, known for an honest, neutral and pragmatic viewpoint; he has held multiple roles including senior consultant, strategic advisor, project manager, sales & marketing director, conference organiser & speaker, trainer, writer & mentor. He is deeply connected with the business & technology of website and enterprise search engines. His past experience in software engineering gives him a highly informed perspective on search technology, with a particular focus on open source platforms such as Lucene, Apache Solr, Elasticsearch, OpenSearch & Vespa. More recently he has helped several companies use traditional & modern AI techniques to supercharge search: here are some recommendations from the team at Moonpig, whom he helped in 2025.
