Creating a custom analyzer
To create a custom analyzer, specify a combination of the following components:
- Character filters (zero or more)
- Tokenizer (one)
- Token filters (zero or more)
Configuration
The following parameters can be used to configure a custom analyzer.
| Parameter | Required/Optional | Description |
| --- | --- | --- |
| `type` | Optional | The analyzer type. Default is `custom`. You can also specify a prebuilt analyzer using this parameter. |
| `tokenizer` | Required | A tokenizer to be included in the analyzer. |
| `char_filter` | Optional | A list of character filters to be included in the analyzer. |
| `filter` | Optional | A list of token filters to be included in the analyzer. |
| `position_increment_gap` | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see Position increment gap. Default is `100`. |
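To illustrate where each parameter belongs, the following sketch combines them in a single index definition. The index and analyzer names (`parameter_overview_index`, `my_custom_analyzer`) are placeholders, and the component choices are arbitrary; substitute your own character filters, tokenizer, and token filters:
PUT parameter_overview_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "position_increment_gap": 100
        }
      }
    }
  }
}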
Examples
The following examples demonstrate various custom analyzer configurations.
Custom analyzer with a character filter for HTML stripping
The following example analyzer removes HTML tags from text before tokenization:
PUT simple_html_strip_analyzer_index
{
"settings": {
"analysis": {
"analyzer": {
"html_strip_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "whitespace",
"filter": ["lowercase"]
}
}
}
}
}
Use the following request to examine the tokens generated using the analyzer:
GET simple_html_strip_analyzer_index/_analyze
{
"analyzer": "html_strip_analyzer",
"text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
}
The response contains the generated tokens:
{
"tokens": [
{
"token": "opensearch",
"start_offset": 3,
"end_offset": 13,
"type": "word",
"position": 0
},
{
"token": "is",
"start_offset": 14,
"end_offset": 16,
"type": "word",
"position": 1
},
{
"token": "awesome!",
"start_offset": 25,
"end_offset": 42,
"type": "word",
"position": 2
}
]
}
Custom analyzer with a mapping character filter for synonym replacement
The following example analyzer replaces specific characters and patterns before applying the synonym filter:
PUT mapping_analyzer_index
{
"settings": {
"analysis": {
"analyzer": {
"synonym_mapping_analyzer": {
"type": "custom",
"char_filter": ["underscore_to_space"],
"tokenizer": "standard",
"filter": ["lowercase", "stop", "synonym_filter"]
}
},
"char_filter": {
"underscore_to_space": {
"type": "mapping",
"mappings": ["_ => ' '"]
}
},
"filter": {
"synonym_filter": {
"type": "synonym",
"synonyms": [
"quick, fast, speedy",
"big, large, huge"
]
}
}
}
}
}
Use the following request to examine the tokens generated using the analyzer:
GET mapping_analyzer_index/_analyze
{
"analyzer": "synonym_mapping_analyzer",
"text": "The slow_green_turtle is very large"
}
The response contains the generated tokens:
{
"tokens": [
{"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
{"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
{"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
{"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
{"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
{"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
{"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
]
}
Custom analyzer with a custom pattern-based character filter for number normalization
The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matches:
PUT advanced_pattern_replace_analyzer_index
{
"settings": {
"analysis": {
"analyzer": {
"phone_number_analyzer": {
"type": "custom",
"char_filter": ["phone_normalization"],
"tokenizer": "standard",
"filter": ["lowercase", "edge_ngram"]
}
},
"char_filter": {
"phone_normalization": {
"type": "pattern_replace",
"pattern": "[-\\s]",
"replacement": ""
}
},
"filter": {
"edge_ngram": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
}
}
}
}
}
Use the following request to examine the tokens generated using the analyzer:
GET advanced_pattern_replace_analyzer_index/_analyze
{
"analyzer": "phone_number_analyzer",
"text": "123-456 7890"
}
The response contains the generated tokens:
{
"tokens": [
{"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
]
}
Position increment gap
The `position_increment_gap` parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don’t match terms across separate values unless explicitly allowed. For example, a default gap of 100 specifies that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value or set it to `0` in order to allow phrases to span array values.
The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query.
1. Index a document in a `test-index`:

   PUT test-index/_doc/1
   {
     "names": [ "Slow green", "turtle swims" ]
   }

2. Query the document using a `match_phrase` query:

   GET test-index/_search
   {
     "query": {
       "match_phrase": {
         "names": {
           "query": "green turtle"
         }
       }
     }
   }

   The response returns no hits because the distance between the terms `green` and `turtle` is `100` (the default `position_increment_gap`).

3. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`:

   GET test-index/_search
   {
     "query": {
       "match_phrase": {
         "names": {
           "query": "green turtle",
           "slop": 101
         }
       }
     }
   }

   The response contains the matching document:

   {
     "took": 4,
     "timed_out": false,
     "_shards": {
       "total": 1,
       "successful": 1,
       "skipped": 0,
       "failed": 0
     },
     "hits": {
       "total": {
         "value": 1,
         "relation": "eq"
       },
       "max_score": 0.010358453,
       "hits": [
         {
           "_index": "test-index",
           "_id": "1",
           "_score": 0.010358453,
           "_source": {
             "names": [
               "Slow green",
               "turtle swims"
             ]
           }
         }
       ]
     }
   }