You're viewing version 2.12 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.

Tokenizers

A tokenizer receives a stream of characters and splits the text into individual tokens. A token consists of a term (usually, a word) and metadata about this term. For example, a tokenizer can split text on white space so that the text `Actions speak louder than words.` becomes [`Actions`, `speak`, `louder`, `than`, `words.`].

The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:

The order or position of each token: This information is used for word and phrase proximity queries.
The starting and ending positions (offsets) of the tokens in the text: This information is used for highlighting search terms.
The token type: Some tokenizers (for example, standard) classify tokens by type, for example, <ALPHANUM> or <NUM>. Simpler tokenizers (for example, letter) only classify tokens as type word.
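The metadata described above can be illustrated with a minimal Python sketch. This is not OpenSearch code: the `Token` class and the `whitespace_tokenize` function are hypothetical names introduced here only to show how a term, its position, its offsets, and its type fit together.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    position: int      # order of the token in the stream
    start_offset: int  # index of the token's first character in the text
    end_offset: int    # index one past the token's last character
    type: str          # simple tokenizers label every token as "word"


def whitespace_tokenize(text: str) -> list[Token]:
    """Split text on runs of white space, recording position and offsets."""
    return [
        Token(match.group(), position, match.start(), match.end(), "word")
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


tokens = whitespace_tokenize("Actions speak louder than words.")
for token in tokens:
    print(token)
```

In this sketch, the second token is `speak` at position 1 with offsets 8 to 13, which is the kind of information a highlighter or a phrase query would rely on.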

You can use tokenizers to define custom analyzers.
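As a rough illustration, the following Python snippet builds the kind of settings body that could be sent when creating an index in order to define a custom analyzer. The analyzer name `my_custom_analyzer` is a hypothetical placeholder, and the choice of the `whitespace` tokenizer with the `lowercase` token filter is an assumption made only for the example.

```python
import json

# "my_custom_analyzer" is an illustrative name, not a built-in analyzer.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",  # any built-in tokenizer name
                    "filter": ["lowercase"],    # optional token filters
                }
            }
        }
    }
}

# Print the JSON body that would accompany the index creation request.
print(json.dumps(index_settings, indent=2))
```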

Built-in tokenizers

The following tables list the built-in tokenizers that OpenSearch provides.

Word tokenizers

Word tokenizers parse full text into words.

Tokenizer	Description	Example
`standard`	- Parses strings into tokens at word boundaries - Removes most punctuation	`It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`]
`letter`	- Parses strings into tokens on any non-letter character - Removes non-letter characters	`It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`It`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `to`, `OpenSearch`]
`lowercase`	- Parses strings into tokens on any non-letter character - Removes non-letter characters - Converts terms to lowercase	`It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`it`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `to`, `opensearch`]
`whitespace`	- Parses strings into tokens at white space characters	`It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
`uax_url_email`	- Similar to the standard tokenizer - Unlike the standard tokenizer, leaves URLs and email addresses as single terms	`It’s fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `opensearch-project@github.com`]
`classic`	- Parses strings into tokens on: - Punctuation characters that are followed by a white space character - Hyphens if the term does not contain numbers - Removes punctuation - Leaves URLs and email addresses as single terms	`Part number PA-35234, single-use product (128.32)` becomes [`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`]
`thai`	- Parses Thai text into terms	`สวัสดีและยินดีต` becomes [`สวัสดี`, `และ`, `ยินดี`, `ต`]
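The behavior of the simpler word tokenizers in the preceding table can be approximated in a few lines of Python. This is only a sketch under the assumption that a "letter" is any Unicode word character that is not a digit or an underscore; the function names are hypothetical, and the real tokenizers are implemented in Lucene and may differ in edge cases.

```python
import re

TEXT = "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"


def letter_tokenize(text: str) -> list[str]:
    # [^\W\d_] matches word characters that are neither digits nor underscores,
    # so digits, punctuation, and white space all act as separators.
    return re.findall(r"[^\W\d_]+", text)


def lowercase_tokenize(text: str) -> list[str]:
    # Same splitting as the letter sketch, followed by lowercasing each term.
    return [term.lower() for term in letter_tokenize(text)]


def split_on_whitespace(text: str) -> list[str]:
    # str.split() with no arguments splits on runs of white space.
    return text.split()


print(letter_tokenize(TEXT))
print(lowercase_tokenize(TEXT))
print(split_on_whitespace(TEXT))
```

Note how the sketch for `letter` drops the `2` and splits at the apostrophe, while the white space split keeps `brand-new` and `OpenSearch!` intact, matching the examples in the table.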

Partial word tokenizers

Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.

Tokenizer	Description	Example
`ngram`	- Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word	`My repo` becomes [`M`, `My`, `y`, `y `, ` `, ` r`, `r`, `re`, `e`, `ep`, `p`, `po`, `o`] because the default n-gram length is 1–2 characters and, by default, no split characters are specified, so the white space is kept in the n-grams
`edge_ngram`	- Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word)	`My repo` becomes [`M`, `My`] because the default n-gram length is 1–2 characters
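The n-gram generation described in the preceding table can be sketched in Python as follows. The functions `ngrams` and `edge_ngrams` are hypothetical helpers, and the sketch assumes the default configuration, in which the text is not split on any characters and the gram length ranges from 1 to 2.

```python
def ngrams(text: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    """Emit every substring of length min_gram..max_gram, ordered by start index."""
    grams = []
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            if start + length <= len(text):
                grams.append(text[start:start + length])
    return grams


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    """Emit only the n-grams anchored at the beginning of the text."""
    return [
        text[:length]
        for length in range(min_gram, max_gram + 1)
        if length <= len(text)
    ]


print(ngrams("My repo"))
print(edge_ngrams("My repo"))
```

Because the sketch does not split on white space, the space in `My repo` appears inside some n-grams, and the edge n-grams are taken from the start of the whole string rather than from each word.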

Structured text tokenizers

Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.

Tokenizer	Description	Example
`keyword`	- No-op tokenizer - Outputs the entire string unchanged - Can be combined with token filters, like lowercase, to normalize terms	`My repo` becomes `My repo`
`pattern`	- Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms - Uses Java regular expressions	`https://opensearch.org/forum` becomes [`https`, `opensearch`, `org`, `forum`] because, by default, the tokenizer splits terms at runs of non-word characters (`\W+`). Can be configured with a regex pattern.
`simple_pattern`	- Uses a regular expression pattern to return matching text as terms - Uses Lucene regular expressions - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions	Returns an empty array by default. Must be configured with a pattern because the pattern defaults to an empty string.
`simple_pattern_split`	- Uses a regular expression pattern to split the text at matches rather than returning the matches as terms - Uses Lucene regular expressions - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions	No-op by default. Must be configured with a pattern.
`char_group`	- Parses on a set of configurable characters - Faster than tokenizers that run regular expressions	No-op by default. Must be configured with a list of characters.
`path_hierarchy`	- Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy	`one/two/three` becomes [`one`, `one/two`, `one/two/three`]
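Two of the structured text tokenizers in the preceding table can be approximated with short Python functions. This is a sketch only: the function names are hypothetical, Python's `re` module stands in for Java regular expressions, and empty terms are simply discarded, which may not match the real tokenizers in every edge case.

```python
import re


def pattern_tokenize(text: str, pattern: str = r"\W+") -> list[str]:
    """Split text wherever the pattern matches and drop empty terms."""
    return [term for term in re.split(pattern, text) if term]


def path_hierarchy_tokenize(path: str, delimiter: str = "/") -> list[str]:
    """Return the path to each component, from the root down to the full path."""
    parts = path.split(delimiter)
    prefixes = [delimiter.join(parts[:end]) for end in range(1, len(parts) + 1)]
    # A leading delimiter produces an empty first prefix, which is skipped here.
    return [prefix for prefix in prefixes if prefix]


print(pattern_tokenize("https://opensearch.org/forum"))
print(path_hierarchy_tokenize("one/two/three"))
```

Passing a different `pattern` or `delimiter` argument mirrors the way the corresponding tokenizers are configured, for example splitting on commas with `pattern_tokenize("a,b,c", ",")`.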
