
Creating a custom analyzer

To create a custom analyzer, specify a combination of the following components:

  • Character filters (zero or more)

  • Tokenizer (one)

  • Token filters (zero or more)

Configuration

The following parameters can be used to configure a custom analyzer.

| Parameter | Required/Optional | Description |
|-----------|-------------------|-------------|
| type | Optional | The analyzer type. Default is custom. You can also specify a prebuilt analyzer using this parameter. |
| tokenizer | Required | A tokenizer to be included in the analyzer. |
| char_filter | Optional | A list of character filters to be included in the analyzer. |
| filter | Optional | A list of token filters to be included in the analyzer. |
| position_increment_gap | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see Position increment gap. Default is 100. |
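
For reference, the following request defines a custom analyzer that uses every parameter in the preceding table. This is an illustrative sketch: the index and analyzer names are made up, and the component choices are only one possible combination:

PUT parameter_reference_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "all_parameters_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "position_increment_gap": 100
        }
      }
    }
  }
}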

Examples

The following examples demonstrate various custom analyzer configurations.

Custom analyzer with a character filter for HTML stripping

The following example analyzer removes HTML tags from text before tokenization:

PUT simple_html_strip_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET simple_html_strip_analyzer_index/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 3,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "awesome!",
      "start_offset": 25,
      "end_offset": 42,
      "type": "word",
      "position": 2
    }
  ]
}
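
An analyzer defined in the index settings takes effect only when it is referenced from a field mapping or a query. For example, the following request maps a field to html_strip_analyzer so that the analyzer is applied at index time (the field name content is illustrative):

PUT simple_html_strip_analyzer_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "html_strip_analyzer"
    }
  }
}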

Custom analyzer with a mapping character filter for synonym replacement

The following example analyzer replaces specific characters and patterns before applying the synonym filter:

PUT mapping_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_mapping_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "synonym_filter"]
        }
      },
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => ' '"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick, fast, speedy",
            "big, large, huge"
          ]
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET mapping_analyzer_index/_analyze
{
  "analyzer": "synonym_mapping_analyzer",
  "text": "The slow_green_turtle is very large"
}

The response contains the generated tokens:

{
  "tokens": [
    {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
    {"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
    {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
    {"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
    {"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
    {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
    {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
  ]
}
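
Because the synonyms are expanded at analysis time, a search for any synonym in a group matches documents containing the others. The following requests sketch this end to end (the field name description and the document are hypothetical): map a field to the analyzer, index a document, and then search for a synonym of a word in the document:

PUT mapping_analyzer_index/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "synonym_mapping_analyzer"
    }
  }
}

PUT mapping_analyzer_index/_doc/1?refresh=true
{
  "description": "The slow_green_turtle is very large"
}

GET mapping_analyzer_index/_search
{
  "query": {
    "match": {
      "description": "huge"
    }
  }
}

The search returns the document because large was expanded to large, big, and huge when the document was indexed.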

Custom analyzer with a custom pattern-based character filter for number normalization

The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matches:

PUT advanced_pattern_replace_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET advanced_pattern_replace_analyzer_index/_analyze
{
  "analyzer": "phone_number_analyzer",
  "text": "123-456 7890"
}

The response contains the generated tokens:

{
  "tokens": [
    {"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
  ]
}
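
When using edge n-grams at index time, a common pattern is to pair the index analyzer with a search analyzer that applies the same normalization but skips the edge_ngram filter so that the query text is not itself expanded into n-grams. The following request is a minimal sketch of this pattern (the index, field, and analyzer names are illustrative):

PUT phone_number_search_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        },
        "phone_number_search_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phone": {
        "type": "text",
        "analyzer": "phone_number_analyzer",
        "search_analyzer": "phone_number_search_analyzer"
      }
    }
  }
}

With this configuration, a query such as 123-456 is normalized to the single token 123456 at search time, which matches the indexed edge n-gram 123456.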

Position increment gap

The position_increment_gap parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don’t match terms across separate values unless explicitly allowed. For example, with the default gap of 100, terms in different array entries are treated as being 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value or set it to 0 to allow phrases to span array values (see the mapping sketch at the end of this section).

The following example demonstrates the effect of position_increment_gap using a match_phrase query.

  1. Index a document containing a multi-valued field into test-index:

      PUT test-index/_doc/1
      {
        "names": [ "Slow green", "turtle swims"]
      }
    

  2. Query the document using a match_phrase query:

     GET test-index/_search
     {
       "query": {
         "match_phrase": {
           "names": {
             "query": "green turtle" 
           }
         }
       }
     }
    

    The response returns no hits because the terms green and turtle are separated by the default position_increment_gap of 100, so the phrase query does not consider them adjacent.

  3. Now query the document using a match_phrase query with a slop parameter that is higher than the position_increment_gap:

     GET test-index/_search
     {
       "query": {
         "match_phrase": {
           "names": {
             "query": "green turtle",
             "slop": 101
           }
         }
       }
     }
    

    The response contains the matching document:

     {
       "took": 4,
       "timed_out": false,
       "_shards": {
         "total": 1,
         "successful": 1,
         "skipped": 0,
         "failed": 0
       },
       "hits": {
         "total": {
           "value": 1,
           "relation": "eq"
         },
         "max_score": 0.010358453,
         "hits": [
           {
             "_index": "test-index",
             "_id": "1",
             "_score": 0.010358453,
             "_source": {
               "names": [
                 "Slow green",
                 "turtle swims"
               ]
             }
           }
         ]
       }
     }
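
If you want phrase queries to match across array values without specifying a slop, you can instead set position_increment_gap to 0 in the field mapping when creating the index. The following request is a minimal sketch (the index name is illustrative):

PUT test-index-no-gap
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

With this mapping, the match_phrase query for green turtle from the previous example matches without a slop parameter because no gap is inserted between array values.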