# Edge n-gram tokenizer

The `edge_ngram` tokenizer generates partial word tokens, or edge n-grams, starting from the beginning of each word. It splits the text based on specified characters and produces tokens within a defined minimum and maximum length range. This tokenizer is particularly useful for implementing search-as-you-type functionality.
Edge n-grams are ideal for autocomplete searches where the order of the words may vary, such as when searching for product names or addresses. For more information, see Autocomplete. However, for text with a fixed order, like movie or song titles, the completion suggester may be more accurate.
By default, the `edge_ngram` tokenizer produces tokens with a minimum length of `1` and a maximum length of `2`. For example, when analyzing the text `OpenSearch`, the default configuration produces the `O` and `Op` n-grams. These short n-grams often match too many irrelevant terms, so you'll typically need to configure the tokenizer with longer n-gram lengths.
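You can observe the default behavior directly, without creating an index, by passing the built-in `edge_ngram` tokenizer to the `_analyze` API. The following request is a minimal sketch; given the defaults described above, it should return only the `O` and `Op` tokens:

```json
POST /_analyze
{
  "tokenizer": "edge_ngram",
  "text": "OpenSearch"
}
```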
## Example usage
The following example request creates a new index named `edge_n_gram_index` and configures an analyzer with an `edge_ngram` tokenizer. The tokenizer produces tokens 3–6 characters in length, considering only letters to be valid token characters:
```json
PUT /edge_n_gram_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "my_custom_tokenizer"
        }
      },
      "tokenizer": {
        "my_custom_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 6,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  }
}
```
## Generated tokens
Use the following request to examine the tokens generated using the analyzer:
```json
POST /edge_n_gram_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Code 42 rocks!"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "Cod",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "Code",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "roc",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "rock",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 3
    },
    {
      "token": "rocks",
      "start_offset": 8,
      "end_offset": 13,
      "type": "word",
      "position": 4
    }
  ]
}
```
## Parameters
| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `min_gram` | Optional | Integer | The minimum token length. Default is `1`. |
| `max_gram` | Optional | Integer | The maximum token length. Default is `2`. |
| `custom_token_chars` | Optional | String | Defines custom characters to be treated as part of a token (for example, `+-_`). |
| `token_chars` | Optional | Array of strings | Defines character classes to include in tokens. Tokens are split on characters not included in these classes. Default includes all characters. Available classes include:<br>- `letter`: Alphabetic characters (for example, `a`, `ç`, or `京`)<br>- `digit`: Numeric characters (for example, `3` or `7`)<br>- `punctuation`: Punctuation symbols (for example, `!` or `?`)<br>- `symbol`: Other symbols (for example, `$` or `√`)<br>- `whitespace`: Space or newline characters<br>- `custom`: Custom characters specified in the `custom_token_chars` setting |
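As an illustration of combining `token_chars` with `custom_token_chars`, the following sketch (the index and tokenizer names are made up for this example) treats the hyphen as a valid token character so that product codes like `XL-200` are not split at the hyphen:

```json
PUT /product_code_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "product_code_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "-"
        }
      }
    }
  }
}
```

With this configuration, analyzing `XL-200` produces edge n-grams such as `XL`, `XL-`, and `XL-2` rather than splitting the text into `XL` and `200`.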
## max_gram parameter limitations
The `max_gram` parameter sets the maximum length of tokens generated by the tokenizer. When a search query exceeds this length, it may fail to match any terms in the index.

For example, if `max_gram` is set to `4`, the query `explore` would be tokenized as `expl` during indexing. As a result, a search for the full term `explore` will not match the indexed token `expl`.
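You can reproduce this behavior by defining the tokenizer inline in an `_analyze` request. In the following sketch, `min_gram` and `max_gram` are set explicitly for illustration; the longest token returned for `explore` is `expl`:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 4
  },
  "text": "explore"
}
```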
To address this limitation, you can apply a `truncate` token filter to shorten search terms to the maximum token length. However, this approach presents trade-offs. Truncating `explore` to `expl` might lead to matches with unrelated terms like `explosion` or `explicit`, reducing search precision.
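The following sketch shows what that workaround might look like; the index, filter, and analyzer names are made up for this example. The `truncate` filter's `length` is set to match the tokenizer's `max_gram` so that query terms are shortened to a length that can exist in the index:

```json
PUT /truncated_search_index
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_to_max_gram": {
          "type": "truncate",
          "length": 4
        }
      },
      "analyzer": {
        "truncating_search_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "truncate_to_max_gram"
          ]
        }
      }
    }
  }
}
```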
We recommend carefully balancing the `max_gram` value to ensure efficient tokenization while minimizing irrelevant matches. If precision is critical, consider alternative strategies, such as adjusting query analyzers or fine-tuning filters.
## Best practices
We recommend using the `edge_ngram` tokenizer only at indexing time in order to ensure that partial word tokens are stored. At search time, a basic analyzer should be used to match all query terms.
## Configuring search-as-you-type functionality
To implement search-as-you-type functionality, use the `edge_ngram` tokenizer during indexing and an analyzer that performs minimal processing at search time. The following example demonstrates this approach.
Create an index with an `edge_ngram` tokenizer:
```json
PUT /my-autocomplete-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```
Index a document containing a `title` field and refresh the index:
```json
PUT my-autocomplete-index/_doc/1?refresh
{
  "title": "Laptop Pro"
}
```
This configuration ensures that the `edge_ngram` tokenizer breaks terms like "Laptop" into tokens such as `La`, `Lap`, and `Lapt`, allowing partial matches during search. At search time, the `lowercase` tokenizer keeps query terms whole while lowercasing them, so queries match the lowercase n-grams produced by the `lowercase` filter at indexing time.
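To check which n-grams are actually stored, you can run the `autocomplete` analyzer against the sample title. This request should return lowercase tokens such as `la`, `lap`, and `lapt` for "Laptop" and `pr` and `pro` for "Pro":

```json
POST /my-autocomplete-index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Laptop Pro"
}
```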
Searches for `laptop Pr` or `lap pr` now retrieve the relevant document based on partial matches:
```json
GET my-autocomplete-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "lap pr",
        "operator": "and"
      }
    }
  }
}
```
For more information, see Search as you type.