Simple pattern split tokenizer
The `simple_pattern_split` tokenizer uses a regular expression to split text into tokens. The regular expression defines the pattern used to determine where to split the text. Each match in the text is treated as a delimiter, and the text between delimiters becomes a token. Use this tokenizer when you want to define the delimiters with a pattern and keep the text between them as tokens.
The tokenizer uses the matched parts of the input text only as boundaries for splitting; the matched portions themselves are not included in the resulting terms. For example, if the tokenizer is configured to split text at dot characters (`.`) and the input text is `one.two.three`, the generated terms are `one`, `two`, and `three`; the dots are not part of any term.
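You can reproduce this behavior with the `_analyze` API and an inline tokenizer definition, as in the following sketch. Because the pattern is a Lucene regular expression, the dot must be escaped (`\\.` in JSON) so that it matches a literal dot rather than any character:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "\\."
  },
  "text": "one.two.three"
}
```

This request should return the terms `one`, `two`, and `three`.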
Example usage
The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text on hyphens:
```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"
        }
      },
      "analyzer": {
        "my_pattern_split_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_split_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_pattern_split_analyzer"
      }
    }
  }
}
```
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
```json
POST /my_index/_analyze
{
  "analyzer": "my_pattern_split_analyzer",
  "text": "OpenSearch-2024-10-09"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "OpenSearch",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    },
    {
      "token": "2024",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    {
      "token": "10",
      "start_offset": 16,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "09",
      "start_offset": 19,
      "end_offset": 21,
      "type": "word",
      "position": 3
    }
  ]
}
```
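Note that the custom analyzer applies no token filters, so the generated tokens keep their original case. As a quick end-to-end check (the document ID and query text below are illustrative, not part of the original example), you can index a document and search for one of the split terms:

```json
PUT /my_index/_doc/1
{
  "content": "OpenSearch-2024-10-09"
}

POST /my_index/_search
{
  "query": {
    "match": {
      "content": "OpenSearch"
    }
  }
}
```

A lowercase query such as `opensearch` would not match because `simple_pattern_split` performs no case folding; add a `lowercase` token filter to the analyzer if you need case-insensitive matching.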
Parameters
The `simple_pattern_split` tokenizer can be configured with the following parameter.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`pattern` | Optional | String | The pattern used to split text into tokens, specified using a Lucene regular expression. Default is an empty string, which returns the input text as one token.
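Because the pattern is a Lucene regular expression, it can describe more than a single literal delimiter. The following sketch (the pattern and sample text are illustrative) splits on any run of hyphens, dots, or underscores:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "[-._]+"
  },
  "text": "log-file_2024.10.09"
}
```

This request should produce the terms `log`, `file`, `2024`, `10`, and `09`. Note that the tokenizer supports the restricted Lucene regular expression syntax, not full Java regular expressions.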