
Dictionary decompounder token filter

The dictionary_decompounder token filter splits compound words into their constituent parts based on a predefined dictionary. It is particularly useful for languages such as German, Dutch, and Finnish, in which compound words are common, because breaking them into their parts can improve search relevance. For each token (word), the filter checks whether substrings of the token match words in the dictionary. If they do, the filter emits the matching subwords as additional tokens while keeping the original token.

Parameters

The dictionary_decompounder token filter has the following parameters.

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| word_list | Required unless word_list_path is configured | Array of strings | The dictionary of words that the filter uses to split compound words. |
| word_list_path | Required unless word_list is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the config directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line. |
| min_word_size | Optional | Integer | The minimum length of the entire compound word required for it to be considered for splitting. If a compound word is shorter than this value, it is not split. Default is 5. |
| min_subword_size | Optional | Integer | The minimum subword length. If a subword is shorter than this value, it is not included in the output. Default is 2. |
| max_subword_size | Optional | Integer | The maximum subword length. If a subword is longer than this value, it is not included in the output. Default is 15. |
| only_longest_match | Optional | Boolean | If set to true, only the longest matching subwords are returned. Default is false. |
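
For example, a filter that loads its dictionary from a file and returns only the longest matching subwords might be configured as follows. This is an illustrative sketch: the index name, filter name, and the path analysis/decompound_words.txt are hypothetical, and the file must exist under the config directory before the index is created:

PUT /decompound_file_example
{
  "settings": {
    "analysis": {
      "filter": {
        "file_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/decompound_words.txt",
          "min_subword_size": 3,
          "only_longest_match": true
        }
      }
    }
  }
}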

Example

The following example request creates a new index named decompound_example and configures an analyzer with the dictionary_decompounder filter:

PUT /decompound_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /decompound_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slowgreenturtleswim"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "slowgreenturtleswim",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "green",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
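
As the response shows, the filter keeps the original compound token and emits each matching subword as an additional token at the same position, sharing the start and end offsets of the original token. Because swim is not in the configured word_list, it does not appear as a separate subword.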