N-gram tokenizer

The ngram tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. Because it generates substrings (character n-grams) of the original input text, this tokenizer is particularly useful for partial word matching and autocomplete functionality. For example, with min_gram set to 2 and max_gram set to 3, the word fox produces the tokens fo, fox, and ox.

Example usage

The following example request creates a new index named my_index and configures an analyzer with an ngram tokenizer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
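
To use the analyzer for search, you can apply it to a field in the index mappings. The following request is a minimal sketch: the title field name is an illustrative assumption and is not part of the original example:

PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_ngram_analyzer"
    }
  }
}

In practice, you might also set a separate search_analyzer (for example, standard) on such a field so that query text is not itself split into n-grams.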

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "OpenSearch"
}

The response contains the generated tokens:

{
  "tokens": [
    {"token": "Sea","start_offset": 0,"end_offset": 3,"type": "word","position": 0},
    {"token": "Sear","start_offset": 0,"end_offset": 4,"type": "word","position": 1},
    {"token": "ear","start_offset": 1,"end_offset": 4,"type": "word","position": 2},
    {"token": "earc","start_offset": 1,"end_offset": 5,"type": "word","position": 3},
    {"token": "arc","start_offset": 2,"end_offset": 5,"type": "word","position": 4},
    {"token": "arch","start_offset": 2,"end_offset": 6,"type": "word","position": 5},
    {"token": "rch","start_offset": 3,"end_offset": 6,"type": "word","position": 6}
  ]
}
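
Because every 3- and 4-character substring is indexed as a separate token, a query containing only a fragment of a word can still match. The following search is an illustrative sketch, assuming a field indexed with my_ngram_analyzer (such as the hypothetical title field sketched earlier), and is not part of the original example:

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "ear"
    }
  }
}

The query term ear matches the ear token generated from Search at index time, so documents containing Search are returned.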

Parameters

The ngram tokenizer can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
--------- | ----------------- | --------- | -----------
min_gram | Optional | Integer | The minimum length of the n-grams. Default is 1.
max_gram | Optional | Integer | The maximum length of the n-grams. Default is 2.
token_chars | Optional | List of strings | The character classes to be included in tokenization. Valid values are letter, digit, whitespace, punctuation, symbol, and custom (which also requires the custom_token_chars parameter). Default is an empty list ([]), which retains all the characters. See the example following this table.
custom_token_chars | Optional | String | Custom characters to be included in the tokens.
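
The following request sketch illustrates the custom character class together with custom_token_chars. The index name, tokenizer and analyzer names, and the choice of extra characters (hyphen and underscore) are illustrative assumptions, not part of the original example:

PUT /my_custom_chars_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_custom_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "custom"],
          "custom_token_chars": "-_"
        }
      },
      "analyzer": {
        "my_custom_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_custom_ngram_tokenizer"
        }
      }
    }
  }
}

With this configuration, hyphens and underscores are treated like letters, so a value such as re-use is kept as a single character stream rather than being split at the hyphen before n-grams are generated.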

Maximum difference between min_gram and max_gram

The maximum difference between min_gram and max_gram is configured using the index-level index.max_ngram_diff setting and defaults to 1. If your tokenizer uses a larger difference, increase this setting; otherwise, OpenSearch rejects the tokenizer configuration.

The following example request creates an index with a custom index.max_ngram_diff setting:

PUT /my-index
{
  "settings": {
    "index.max_ngram_diff": 2, 
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
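
To verify the wider n-gram range, you can analyze a sample string with the new analyzer, in the same way as in the earlier example. The sample text below is an illustrative assumption, and the response is omitted:

POST /my-index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Search"
}

Because min_gram is 3 and max_gram is 5, the output now contains 3-, 4-, and 5-character substrings of the input.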
