N-gram tokenizer
The `ngram` tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. It is particularly useful for partial word matching and autocomplete functionality because it generates substrings (character n-grams) of the original input text.
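As a quick illustration, you can pass the built-in `ngram` tokenizer directly to the `_analyze` API. The following minimal sketch uses the tokenizer's defaults (`min_gram: 1`, `max_gram: 2`), so it emits every one- and two-character substring of the sample text, such as `S`, `Se`, `e`, and `ea`:

```json
POST /_analyze
{
  "tokenizer": "ngram",
  "text": "Search"
}
```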
Example usage
The following example request creates a new index named `my_index` and configures an analyzer with an `ngram` tokenizer:
```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
```
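To use the analyzer at index or search time, apply it to a field mapping. The following sketch assumes a hypothetical `title` field and adds it to the index created above:

```json
PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_ngram_analyzer"
    }
  }
}
```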
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
```json
POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Search"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {"token": "Sea", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0},
    {"token": "Sear", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1},
    {"token": "ear", "start_offset": 1, "end_offset": 4, "type": "word", "position": 2},
    {"token": "earc", "start_offset": 1, "end_offset": 5, "type": "word", "position": 3},
    {"token": "arc", "start_offset": 2, "end_offset": 5, "type": "word", "position": 4},
    {"token": "arch", "start_offset": 2, "end_offset": 6, "type": "word", "position": 5},
    {"token": "rch", "start_offset": 3, "end_offset": 6, "type": "word", "position": 6}
  ]
}
```
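Because a `match` query analyzes its query text with the same analyzer by default, the query only needs to share one n-gram with the indexed text in order to match. Assuming the hypothetical `title` field mapped earlier, a partial-match search might look like the following sketch:

```json
GET /my_index/_search
{
  "query": {
    "match": {
      "title": "Sear"
    }
  }
}
```

Note that the `ngram` tokenizer does not lowercase its output, so for case-insensitive partial matching you would typically add a `lowercase` token filter to the analyzer.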
Parameters
The `ngram` tokenizer can be configured with the following parameters.
Parameter | Required/Optional | Data type | Description
---|---|---|---
`min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`.
`max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`.
`token_chars` | Optional | List of strings | The character classes to be included in tokenization. Valid values are `letter`, `digit`, `whitespace`, `punctuation`, `symbol`, and `custom` (which requires the `custom_token_chars` parameter). Default is an empty list (`[]`), which retains all characters.
`custom_token_chars` | Optional | String | Custom characters to be included in the tokens (see the example following this table).
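For example, to treat hyphens and underscores as token characters so that they appear inside the generated n-grams, you can combine the `custom` character class with `custom_token_chars`. The index name and characters below are illustrative:

```json
PUT /my_product_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_custom_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "-_"
        }
      }
    }
  }
}
```

With this configuration, a value such as `opensearch-3_0` is treated as a single run of characters, so n-grams can span the hyphen and underscore instead of breaking at them.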
Maximum difference between `min_gram` and `max_gram`
The maximum difference between `min_gram` and `max_gram` is configured using the index-level `index.max_ngram_diff` setting and defaults to `1`.
The following example request creates an index with a custom `index.max_ngram_diff` setting:
```json
PUT /my-index
{
  "settings": {
    "index.max_ngram_diff": 2,
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}
```
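To verify the wider range, you can analyze a sample string against the new index. With `min_gram: 3` and `max_gram: 5`, the output also includes five-character grams such as `Searc` and `earch`:

```json
POST /my-index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Search"
}
```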