corpora

The corpora element contains all the document corpora used by the workload. To reuse a corpus in another workload, copy its definition into that workload's corpora element.

Example

The following example defines a single corpus called movies with 11658903 documents and 1544799789 uncompressed bytes. You can fetch both values from the command line:

  "corpora": [
    {
      "name": "movies",
      "documents": [
        {
          "source-file": "movies-documents.json",
          "document-count": 11658903,
          "uncompressed-bytes": 1544799789
        }
      ]
    }
  ]
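
Both `document-count` and `uncompressed-bytes` can be measured from the source file itself. The following is a minimal Python sketch, not part of OpenSearch Benchmark. It assumes the source file holds exactly one JSON document per line with no action or metadata lines, so that counting non-empty lines yields the document count. The helper name `corpus_stats` is hypothetical.

```python
import os


def corpus_stats(path):
    """Return (document_count, uncompressed_bytes) for a newline-delimited JSON file.

    Assumption: one document per line and no action/metadata lines.
    """
    document_count = 0
    with open(path, "rb") as source:
        for line in source:
            # Skip blank lines so a trailing newline does not inflate the count.
            if line.strip():
                document_count += 1
    # The size on disk of the decompressed file is the uncompressed byte count.
    return document_count, os.path.getsize(path)
```

For example, `corpus_stats("movies-documents.json")` would return the two numbers to place in the corpus definition, provided the file matches the assumptions above.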

Configuration options

Use the following options with corpora.

Parameter	Required	Type	Description
`name`	Yes	String	The name of the document corpus. Because OpenSearch Benchmark uses this name in its directories, use only lowercase names without white spaces.
`documents`	Yes	JSON array	An array of document files.
`meta`	No	String	A mapping of key-value pairs with additional metadata for a corpus.
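
The required options and the naming rule above can be checked before a workload is run. The following Python sketch is illustrative only and is not how OpenSearch Benchmark validates workloads. The function name `corpus_problems` is hypothetical, and "lowercase names without white spaces" is interpreted here as "no uppercase letters and no whitespace characters".

```python
import re


def corpus_problems(corpus):
    """Return a list of human-readable problems found in one corpus definition."""
    problems = []
    name = corpus.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name is required and must be a string")
    elif name != name.lower() or re.search(r"\s", name):
        # Assumption: the name becomes part of a directory path, so it is
        # restricted to lowercase characters without white space.
        problems.append("name must be lowercase and contain no white space")
    if not isinstance(corpus.get("documents"), list):
        problems.append("documents is required and must be an array")
    return problems
```

An empty list means that the corpus passed these checks. A definition such as `{"name": "My Movies", "documents": []}` would be reported because of its name.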

Each entry in the documents array consists of the following options.

Parameter	Required	Type	Description
`source-file`	Yes	String	The name of the file containing the corresponding documents for the workload. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base-url`, use a compressed file format: `.zip`, `.bz2`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must contain exactly one JSON file with the same name.
`document-count`	Yes	Integer	The number of documents in the `source-file`. OpenSearch Benchmark uses this number to determine which client indexes which part of the document corpus: each of the N clients receives one Nth of the corpus. When using a source that contains documents with a parent/child relationship, specify the number of parent documents.
`base-url`	No	String	An http(s), Amazon Simple Storage Service (Amazon S3), or Google Cloud Storage URL that points to the root path where OpenSearch Benchmark can obtain the corresponding source file.
`source-format`	No	String	Defines the format OpenSearch Benchmark uses to interpret the data file specified in `source-file`. Only `bulk` is supported.
`compressed-bytes`	No	Integer	The size, in bytes, of the compressed source file, indicating how much data OpenSearch Benchmark downloads.
`uncompressed-bytes`	No	Integer	The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs.
`target-index`	No	String	Defines the name of the index that the `bulk` operation should target. OpenSearch Benchmark automatically derives this value when only one index is defined in the `indices` element. The value of `target-index` is ignored when the `includes-action-and-meta-data` setting is `true`.
`target-type`	No	String	Defines the document type of the target index in `bulk` operations. OpenSearch Benchmark automatically derives this value when only one index is defined in the `indices` element and the index has only one type. The value of `target-type` is ignored when the `includes-action-and-meta-data` setting is `true`.
`includes-action-and-meta-data`	No	Boolean	When set to `true`, indicates that the document file already contains the bulk action and metadata line for each document. When `false`, indicates that the document file contains only documents. Default is `false`.
`meta`	No	String	A mapping of key-value pairs with additional metadata for a corpus.
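
The description of `document-count` states that each of the N clients receives one Nth of the corpus. The following Python sketch shows one way such a split could be computed. It is an assumption made for illustration, not OpenSearch Benchmark's actual partitioning code, and the function name `client_ranges` is hypothetical.

```python
def client_ranges(document_count, clients):
    """Split [0, document_count) into `clients` contiguous, near-equal ranges.

    Each range is a (start, end) pair of document offsets, with `end` exclusive.
    """
    if clients < 1:
        raise ValueError("clients must be at least 1")
    base, extra = divmod(document_count, clients)
    ranges = []
    start = 0
    for client in range(clients):
        # The first `extra` clients take one additional document each.
        size = base + (1 if client < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

With this sketch, `client_ranges(10, 3)` returns `[(0, 4), (4, 7), (7, 10)]`. Because the offsets are derived from `document-count`, an inaccurate count would shift every boundary, which is why the value should match the source file.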
