Kuromoji completion token filter

The kuromoji_completion token filter is used to stem Katakana words in Japanese, which are often used to represent foreign words or loanwords. This filter is especially useful for autocompletion or suggest queries, in which partial matches on Katakana words can be expanded to include their full forms.

To use this token filter, you must first install the analysis-kuromoji plugin on all nodes by running bin/opensearch-plugin install analysis-kuromoji and then restart the cluster. For more information about installing additional plugins, see Additional plugins.

Example

The following example request creates a new index named kuromoji_sample and configures an analyzer with a kuromoji_completion filter:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_completion"
          }
        }
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated using the analyzer with text that translates to “use a computer”:

POST /kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コンピューターを使う"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "コンピューター", // The original Katakana word "computer".
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "konpyuーtaー", // Romanized version (Romaji) of "コンピューター".
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "konnpyuーtaー", // Another possible romanized version of "コンピューター" (with a slight variation in the spelling).
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "を", // A Japanese particle, "wo" or "o"
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "wo", // Romanized form of the particle "を" (often pronounced as "o").
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "o", // Another version of the romanization.
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "使う", // The verb "use" in Kanji.
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 2
    },
    {
      "token": "tukau", // Romanized version of "使う"
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 2
    },
    {
      "token": "tsukau", // Another romanized version of "使う", where "tsu" is more phonetically correct
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 2
    }
  ]
}

Example
Generated tokens

WAS THIS PAGE HELPFUL?

✔ Yes ✖ No

Tell us why

350 characters left

Have a question? Ask us on the OpenSearch forum.

Want to contribute? Edit this page or create an issue.

Kuromoji completion token filter

Example

Generated tokens

OpenSearch Links

Get Involved

Resources

Contact Us