
Keyword tokenizer

The keyword tokenizer ingests text and outputs it as a single, unaltered token. This makes it particularly useful for structured data that must remain intact, such as names, product codes, or email addresses.

The keyword tokenizer can be paired with token filters to process the text, for example, to normalize it or to remove extraneous characters.
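
For a quick check, you can also call the tokenizer directly through the _analyze API without creating an index. The following request (with illustrative sample text) returns the entire input as one token:

POST _analyze
{
  "tokenizer": "keyword",
  "text": "John Smith"
}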

Example usage

The following example request creates a new index named my_index and configures an analyzer with a keyword tokenizer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_analyzer": {
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_keyword_analyzer"
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_keyword_analyzer",
  "text": "OpenSearch Example"
}

The response contains the single token representing the original text:

{
  "tokens": [
    {
      "token": "OpenSearch Example",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    }
  ]
}
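
Because the entire field value is indexed as one token, full-text queries must match the whole original string. As a sketch using the my_index mapping above, the following match query finds the document only when the query text, analyzed with the same keyword analyzer, equals the stored token:

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "OpenSearch Example"
    }
  }
}

Searching for just OpenSearch would produce a single token that does not equal OpenSearch Example, so no results would be returned.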

Parameters

The keyword tokenizer can be configured with the following parameter.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
buffer_size | Optional | Integer | The size of the character buffer, in characters. Default is 256. There is usually no need to change this setting.
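
To change buffer_size, define a custom tokenizer of type keyword in the index settings and reference it from an analyzer. The following sketch uses illustrative index, tokenizer, and analyzer names:

PUT /my_index_large_buffer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_large_buffer_tokenizer": {
          "type": "keyword",
          "buffer_size": 512
        }
      },
      "analyzer": {
        "my_large_buffer_analyzer": {
          "type": "custom",
          "tokenizer": "my_large_buffer_tokenizer"
        }
      }
    }
  }
}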

Combining the keyword tokenizer with token filters

To enhance the functionality of the keyword tokenizer, you can combine it with token filters. Token filters can transform the text, such as converting it to lowercase or removing unwanted characters.
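
For example, pairing the keyword tokenizer with the lowercase token filter normalizes the input while keeping it as a single token. The following request (with illustrative sample text) returns the single token opensearch example:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "OpenSearch Example"
}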

Example: Using the pattern_replace filter and keyword tokenizer

In this example, the pattern_replace filter uses a regular expression to replace all non-alphanumeric characters with an empty string:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "[^a-zA-Z0-9]",
      "replacement": ""
    }
  ],
  "text": "Product#1234-XYZ"
}

The pattern_replace filter removes non-alphanumeric characters and returns the following token:

{
  "tokens": [
    {
      "token": "Product1234XYZ",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
