N-gram tokenizer
The `ngram` tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. This tokenizer is particularly useful for partial word matching and autocomplete functionality because it generates substrings (character n-grams) of the original input text.
Example usage
The following example request creates a new index named `my_index` and configures an analyzer with an `ngram` tokenizer:
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
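An analyzer only takes effect once a field is mapped to it. As a minimal sketch, assuming a hypothetical `products` index with a `title` field (both names are illustrative and not part of the original example), the same analysis settings can be paired with a mapping so that the field is broken into n-grams at index time:

PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_ngram_analyzer"
      }
    }
  }
}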
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "OpenSearch"
}
The response contains the generated tokens:
{
  "tokens": [
    {"token": "Ope","start_offset": 0,"end_offset": 3,"type": "word","position": 0},
    {"token": "Open","start_offset": 0,"end_offset": 4,"type": "word","position": 1},
    {"token": "pen","start_offset": 1,"end_offset": 4,"type": "word","position": 2},
    {"token": "penS","start_offset": 1,"end_offset": 5,"type": "word","position": 3},
    {"token": "enS","start_offset": 2,"end_offset": 5,"type": "word","position": 4},
    {"token": "enSe","start_offset": 2,"end_offset": 6,"type": "word","position": 5},
    {"token": "nSe","start_offset": 3,"end_offset": 6,"type": "word","position": 6},
    {"token": "nSea","start_offset": 3,"end_offset": 7,"type": "word","position": 7},
    {"token": "Sea","start_offset": 4,"end_offset": 7,"type": "word","position": 8},
    {"token": "Sear","start_offset": 4,"end_offset": 8,"type": "word","position": 9},
    {"token": "ear","start_offset": 5,"end_offset": 8,"type": "word","position": 10},
    {"token": "earc","start_offset": 5,"end_offset": 9,"type": "word","position": 11},
    {"token": "arc","start_offset": 6,"end_offset": 9,"type": "word","position": 12},
    {"token": "arch","start_offset": 6,"end_offset": 10,"type": "word","position": 13},
    {"token": "rch","start_offset": 7,"end_offset": 10,"type": "word","position": 14}
  ]
}
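The overlapping grams are what enable partial matching: a query analyzed with the same analyzer only needs to share a gram with the indexed text. As a sketch that assumes the hypothetical `products` index mapped earlier (the document ID and query text are likewise illustrative), the following requests index a document and retrieve it with the partial term `Sear`:

PUT /products/_doc/1
{
  "title": "OpenSearch"
}

GET /products/_search
{
  "query": {
    "match": {
      "title": "Sear"
    }
  }
}

Because the custom analyzer defines no `lowercase` token filter, grams are matched case sensitively; adding a `lowercase` filter to the analyzer is a common refinement for autocomplete use cases.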
Parameters
The `ngram` tokenizer can be configured with the following parameters.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
`min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`. |
`max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`. |
`token_chars` | Optional | List of strings | The character classes to be included in tokenization. Valid values are `letter`, `digit`, `whitespace`, `punctuation`, `symbol`, and `custom` (which requires the `custom_token_chars` parameter). Default is an empty list (`[]`), which retains all characters. |
`custom_token_chars` | Optional | String | Custom characters to be included in the tokens. |
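The `custom` character class is illustrated in the following sketch (the index name and the chosen characters are assumptions, not part of the parameter reference above); it keeps letters, digits, hyphens, and underscores inside tokens:

PUT /my_custom_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_custom_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "-_"
        }
      }
    }
  }
}

With this configuration, a value such as `user_id-42` produces grams that span the underscore and hyphen instead of being split at those characters.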
Maximum difference between `min_gram` and `max_gram`

The maximum difference between `min_gram` and `max_gram` is configured using the index-level `index.max_ngram_diff` setting and defaults to `1`.
The following example request creates an index with a custom `index.max_ngram_diff` setting:
PUT /my-index
{
  "settings": {
    "index.max_ngram_diff": 2,
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
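To verify the wider gram range, you can reuse the `_analyze` request pattern shown earlier (the sample text is arbitrary):

POST /my-index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Search"
}

This request returns grams of lengths 3 through 5, such as `Sea`, `Sear`, and `Searc`. Without the raised `index.max_ngram_diff` setting, a tokenizer with `min_gram: 3` and `max_gram: 5` would be rejected, because the default permits a difference of at most `1`.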