N-gram tokenizer
The `ngram` tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. This tokenizer is particularly useful for partial word matching and autocomplete functionality because it generates substrings (character n-grams) of the original input text.
Example usage
The following example request creates a new index named `my_index` and configures an analyzer with an `ngram` tokenizer:
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
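An analyzer only takes effect once a field is mapped to it. As a minimal sketch, assuming a hypothetical `products` index with a `title` field (both names are illustrative and not part of the original example), the same analysis settings can be paired with a mapping so that the field is broken into n-grams at index time:

PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_ngram_analyzer"
      }
    }
  }
}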
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "OpenSearch"
}
The response contains the generated tokens:
{
  "tokens": [
    {"token": "Ope","start_offset": 0,"end_offset": 3,"type": "word","position": 0},
    {"token": "Open","start_offset": 0,"end_offset": 4,"type": "word","position": 1},
    {"token": "pen","start_offset": 1,"end_offset": 4,"type": "word","position": 2},
    {"token": "penS","start_offset": 1,"end_offset": 5,"type": "word","position": 3},
    {"token": "enS","start_offset": 2,"end_offset": 5,"type": "word","position": 4},
    {"token": "enSe","start_offset": 2,"end_offset": 6,"type": "word","position": 5},
    {"token": "nSe","start_offset": 3,"end_offset": 6,"type": "word","position": 6},
    {"token": "nSea","start_offset": 3,"end_offset": 7,"type": "word","position": 7},
    {"token": "Sea","start_offset": 4,"end_offset": 7,"type": "word","position": 8},
    {"token": "Sear","start_offset": 4,"end_offset": 8,"type": "word","position": 9},
    {"token": "ear","start_offset": 5,"end_offset": 8,"type": "word","position": 10},
    {"token": "earc","start_offset": 5,"end_offset": 9,"type": "word","position": 11},
    {"token": "arc","start_offset": 6,"end_offset": 9,"type": "word","position": 12},
    {"token": "arch","start_offset": 6,"end_offset": 10,"type": "word","position": 13},
    {"token": "rch","start_offset": 7,"end_offset": 10,"type": "word","position": 14}
  ]
}
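The overlapping grams are what enable partial matching: a query analyzed with the same analyzer only needs to share a gram with the indexed text. As a sketch that assumes the hypothetical `products` index mapped earlier (the document ID and query text are likewise illustrative), the following requests index a document and retrieve it with the partial term `Sear`:

PUT /products/_doc/1
{
  "title": "OpenSearch"
}

GET /products/_search
{
  "query": {
    "match": {
      "title": "Sear"
    }
  }
}

Because the custom analyzer defines no `lowercase` token filter, grams are matched case sensitively; adding a `lowercase` filter to the analyzer is a common refinement for autocomplete use cases.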
Parameters
The `ngram` tokenizer can be configured with the following parameters.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
`min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`. |
`max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`. |
`token_chars` | Optional | List of strings | The character classes to be included in tokenization. Valid values are `letter`, `digit`, `whitespace`, `punctuation`, `symbol`, and `custom` (which requires the `custom_token_chars` parameter). Default is an empty list (`[]`), which retains all characters. |
`custom_token_chars` | Optional | String | Custom characters to be included in the tokens. |
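The `custom` character class is illustrated in the following sketch (the index name and the chosen characters are assumptions, not part of the parameter reference above); it keeps letters, digits, hyphens, and underscores inside tokens:

PUT /my_custom_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_custom_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "-_"
        }
      }
    }
  }
}

With this configuration, a value such as `user_id-42` produces grams that span the underscore and hyphen instead of being split at those characters.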
Maximum difference between `min_gram` and `max_gram`

The maximum difference between `min_gram` and `max_gram` is configured using the index-level `index.max_ngram_diff` setting and defaults to `1`.
The following example request creates an index with a custom `index.max_ngram_diff` setting:
PUT /my-index
{
  "settings": {
    "index.max_ngram_diff": 2,
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
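To verify the wider gram range, you can reuse the `_analyze` request pattern shown earlier (the sample text is arbitrary):

POST /my-index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Search"
}

This request returns grams of lengths 3 through 5, such as `Sea`, `Sear`, and `Searc`. Without the raised `index.max_ngram_diff` setting, a tokenizer with `min_gram: 3` and `max_gram: 5` would be rejected, because the default permits a difference of at most `1`.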