
Error log: SnapshotFailedException – The backup that broke

November 21, 2025

You attempt to take a snapshot, but it fails. When you check the status, you see this error, or a similar one, in your OpenSearch logs:

{
  "error": {
    "root_cause": [
      {
        "type": "snapshot_failed_exception",
        "reason": "[my_s3_repo:my_snapshot_01] snapshot failed"
      }
    ],
    "type": "snapshot_failed_exception",
    "reason": "[my_s3_repo:my_snapshot_01] snapshot failed",
    "caused_by": {
      "type": "repository_exception",
      "reason": "[my_s3_repo] could not write file [snapshot-my_snapshot_01.dat]"
      // the nested "caused_by" will have the specific reason, e.g., "Access Denied"
    }
  },
  "status": 500
}
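The useful signal is buried in the nested caused_by chain, which can be several levels deep. As a minimal sketch (plain Python, no client library; the helper name root_reason is ours, not an OpenSearch API), here is one way to walk the chain and surface the most specific reason string:

```python
def root_reason(error_body):
    """Walk the nested "caused_by" chain of an OpenSearch error
    payload and return the deepest (most specific) reason string."""
    cause = error_body.get("error", {})
    reason = cause.get("reason")
    while "caused_by" in cause:
        cause = cause["caused_by"]
        reason = cause.get("reason", reason)
    return reason

# The payload from the example above (comments removed, since the
# real API response is strict JSON).
payload = {
    "error": {
        "type": "snapshot_failed_exception",
        "reason": "[my_s3_repo:my_snapshot_01] snapshot failed",
        "caused_by": {
            "type": "repository_exception",
            "reason": "[my_s3_repo] could not write file [snapshot-my_snapshot_01.dat]",
        },
    },
    "status": 500,
}

print(root_reason(payload))
# → [my_s3_repo] could not write file [snapshot-my_snapshot_01.dat]
```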

Why… is this happening? This is a general error indicating that your snapshot started but could not be completed. This is different from RepositoryMissingException (where the location was never found). This error means OpenSearch could talk to the repository, but something went wrong during the data transfer.

The true problem is always inside the caused_by or reason string:

  1. Permission denied (most common): The OpenSearch nodes have permission to register the repository (e.g., ListBucket in S3) but lack permission to write data (e.g., PutObject in S3). This is a common IAM role issue.
  2. A node went offline: A data node holding a primary shard for an index in your snapshot went offline during the snapshot process. The cluster manager node can’t copy data from a missing node, so it fails the snapshot.
  3. Shard read failure: A shard being snapshotted is corrupted or unreadable.
  4. Repository is read-only: The repository itself (e.g., a shared file system) has been mounted as read-only.
  5. Disk space (for fs type): The shared file system repository is full.
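The five causes above usually show up as recognizable phrases in the caused_by reason. As an illustration only (the patterns below are a rough heuristic we made up, not an exhaustive or official mapping), a triage helper might look like:

```python
def classify_failure(reason):
    """Map a "caused_by" reason string to the likely failure class.
    Patterns are illustrative heuristics, not an exhaustive list."""
    r = reason.lower()
    if "access denied" in r or "403" in r:
        return "permission denied"          # cause 1: IAM / cloud permissions
    if "node" in r and ("left" in r or "disconnected" in r):
        return "node went offline"          # cause 2: data node dropped out
    if "corrupt" in r:
        return "shard read failure"         # cause 3: unreadable shard
    if "read-only" in r or "read only" in r:
        return "repository is read-only"    # cause 4: read-only mount
    if "no space left" in r:
        return "disk full"                  # cause 5: fs repository is full
    return "unknown"

print(classify_failure("Access Denied (Service: Amazon S3; Status Code: 403)"))
# → permission denied
```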

Best Practice:

  1. Examine the caused_by: This is the most critical step. If it says “Access Denied,” “403 Forbidden,” or similar, the problem is 100% your cloud permissions (e.g., S3 IAM role, Azure permissions).
  2. Check repository permissions: For S3, ensure your role grants s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket on the bucket and all objects within it.
  3. Check node logs: The full stack trace for the failure will be in the opensearch.log file, often on the cluster manager node (which coordinates the snapshot) or the data node that failed to send its shard.
  4. Check cluster health: Never run a snapshot if your cluster is red or yellow. A snapshot can only back up active shards. Run GET /_cluster/health first.
  5. Try POST /_snapshot/my_repo/_verify: This command runs a quick check to confirm that every node can write to the repository. If it fails, that is a clear sign of a permission or network issue.
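Steps 4 and 5 make a good automated pre-flight check before any scheduled snapshot. A minimal sketch, operating on the response body of GET /_cluster/health (the function name preflight_ok is ours; how you fetch the health document is up to your client):

```python
def preflight_ok(health):
    """Given a parsed /_cluster/health response body, decide whether
    it is safe to start a snapshot. Per the advice above, require a
    green cluster with no unassigned shards; yellow or red means some
    shards are not fully active and the snapshot may fail."""
    return (
        health.get("status") == "green"
        and health.get("unassigned_shards", 0) == 0
    )

print(preflight_ok({"status": "green", "unassigned_shards": 0}))   # True
print(preflight_ok({"status": "yellow", "unassigned_shards": 2}))  # False
```

If this check fails, fix cluster health first; if it passes but the snapshot still errors out, run the _verify call and inspect the caused_by as described above.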

What else can I do? Snapshot failures are often related to tricky cloud permissions. If you’re stuck on an IAM policy or a file system mount, the OpenSearch community can help debug. For direct support, contact us in the OpenSearch Slack, in the #general channel.
