Phonetic token filter

The phonetic token filter transforms tokens into their phonetic representations, enabling more flexible matching of words that sound similar but are spelled differently. This is particularly useful for searching names, brands, or other entities that users might spell differently but pronounce similarly.

The phonetic token filter is not included in OpenSearch distributions by default. To use this token filter, you must first install the analysis-phonetic plugin as follows and then restart OpenSearch:

./bin/opensearch-plugin install analysis-phonetic

For more information about installing plugins, see Installing plugins.
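
To confirm that the plugin is loaded after the restart, one option is to list the installed plugins using the CAT plugins API. This is a minimal check, assuming the cluster is reachable from your client. The response should contain a row for analysis-phonetic for each node on which the plugin is installed:

```json
# Lists the plugins installed on each node; look for analysis-phonetic in the output
GET /_cat/plugins?v
```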

Parameters

The phonetic token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
encoder | Optional | String | Specifies the phonetic algorithm to use. Valid values are metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, and daitch_mokotoff.
replace | Optional | Boolean | Whether to replace the original token. If false, the original token is included in the output along with the phonetic encoding. Default is true.
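
As an illustration of how the two parameters combine, the following sketch sets replace to false so that original tokens are kept alongside their phonetic encodings. The index, filter, and analyzer names are hypothetical and chosen only for this illustration. A configuration like this can be useful when the same field should match both exact spellings and similar-sounding ones:

```json
# Hypothetical index, filter, and analyzer names; only "type", "encoder", and "replace" come from the parameter list
PUT /names_index_with_originals
{
  "settings": {
    "analysis": {
      "filter": {
        "my_phonetic_filter_keep_original": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "phonetic_analyzer_keep_original": {
          "tokenizer": "standard",
          "filter": [
            "my_phonetic_filter_keep_original"
          ]
        }
      }
    }
  }
}
```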

Example

The following example request creates a new index named names_index and configures an analyzer with a phonetic filter:

PUT /names_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_phonetic_filter": {
          "type": "phonetic",
          "encoder": "double_metaphone",
          "replace": true
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_phonetic_filter"
          ]
        }
      }
    }
  }
}

Generated tokens

Use the following requests to examine the tokens that the analyzer generates for the names Stephen and Steven:

POST /names_index/_analyze
{
  "text": "Stephen",
  "analyzer": "phonetic_analyzer"
}

POST /names_index/_analyze
{
  "text": "Steven",
  "analyzer": "phonetic_analyzer"
}

Both requests produce the same token, STFN. The following response is for Steven; the response for Stephen differs only in end_offset, which is 7:

{
  "tokens": [
    {
      "token": "STFN",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
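
Because both spellings are reduced to the same token, a search for one spelling can match a document that contains the other. The following sketch shows one way to try this on names_index. The name field and the document ID are assumptions made for this illustration and are not part of the index definition above. The requests map a text field to phonetic_analyzer, index a document containing Stephen, and then search for Steven:

```json
# Add a hypothetical "name" field that uses the phonetic analyzer
PUT /names_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "phonetic_analyzer"
    }
  }
}

# Index a sample document and refresh so that it is immediately searchable
PUT /names_index/_doc/1?refresh=true
{
  "name": "Stephen"
}

# Search using the alternate spelling
GET /names_index/_search
{
  "query": {
    "match": {
      "name": "Steven"
    }
  }
}
```

Because the match query analyzes the query text using the field's analyzer by default, the search for Steven is expected to return the document containing Stephen.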