You're viewing version 2.12 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.
Tokenizers
A tokenizer receives a stream of characters and splits the text into individual tokens. A token consists of a term (usually, a word) and metadata about this term. For example, a tokenizer can split text on white space so that the text Actions speak louder than words. becomes [Actions, speak, louder, than, words.].
The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:
- The order or position of each token: This information is used for word and phrase proximity queries.
- The starting and ending positions (offsets) of the tokens in the text: This information is used for highlighting search terms.
- The token type: Some tokenizers (for example, standard) classify tokens by type, for example, <ALPHANUM> or <NUM>. Simpler tokenizers (for example, letter) only classify tokens as type word.
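The metadata described above can be illustrated with a minimal Python sketch. It is not the OpenSearch implementation: the `Token` class and the `whitespace_tokenize` function are hypothetical names introduced here, and the sketch assumes 0-based positions and labels every token with the type word.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    position: int      # order of the token in the stream (0-based in this sketch)
    start_offset: int  # index of the token's first character in the text
    end_offset: int    # index just past the token's last character
    type: str          # this sketch labels every token "word"


def whitespace_tokenize(text: str) -> list[Token]:
    """Split text on runs of white space and record metadata for each token."""
    return [
        Token(match.group(), position, match.start(), match.end(), "word")
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


tokens = whitespace_tokenize("Actions speak louder than words.")
for token in tokens:
    print(token)
```

Positions support proximity queries because adjacent terms have consecutive position values, and offsets let a highlighter map a matched term back to its exact character span in the original text.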
You can use tokenizers to define custom analyzers.
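As a rough illustration, the following Python sketch builds an index settings body that defines a custom analyzer from a built-in tokenizer and a token filter. The analyzer name `my_custom_analyzer` is a hypothetical placeholder, and the choice of the whitespace tokenizer and the lowercase filter is only an example. The sketch prints the JSON body and does not send it to a cluster.

```python
import json

# "my_custom_analyzer" is a hypothetical name chosen for this example.
# "whitespace" is a built-in tokenizer and "lowercase" is a built-in token filter.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# Serialize the body so it can be passed to an index creation request.
print(json.dumps(index_body, indent=2))
```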
Built-in tokenizers
The following tables list the built-in tokenizers that OpenSearch provides.
Word tokenizers
Word tokenizers parse full text into words.
Tokenizer | Description | Example |
---|---|---|
standard | - Parses strings into tokens at word boundaries - Removes most punctuation | It’s fun to contribute a brand-new PR or 2 to OpenSearch! becomes [ It’s , fun , to , contribute , a , brand , new , PR , or , 2 , to , OpenSearch ] |
letter | - Parses strings into tokens on any non-letter character - Removes non-letter characters | It’s fun to contribute a brand-new PR or 2 to OpenSearch! becomes [ It , s , fun , to , contribute , a , brand , new , PR , or , to , OpenSearch ] |
lowercase | - Parses strings into tokens on any non-letter character - Removes non-letter characters - Converts terms to lowercase | It’s fun to contribute a brand-new PR or 2 to OpenSearch! becomes [ it , s , fun , to , contribute , a , brand , new , pr , or , to , opensearch ] |
whitespace | - Parses strings into tokens at white space characters | It’s fun to contribute a brand-new PR or 2 to OpenSearch! becomes [ It’s , fun , to , contribute , a , brand-new , PR , or , 2 , to , OpenSearch! ] |
uax_url_email | - Similar to the standard tokenizer - Unlike the standard tokenizer, leaves URLs and email addresses as single terms | It’s fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com! becomes [ It’s , fun , to , contribute , a , brand , new , PR , or , 2 , to , OpenSearch , opensearch-project@github.com ] |
classic | - Parses strings into tokens on: - Punctuation characters that are followed by a white space character - Hyphens if the term does not contain numbers - Removes punctuation - Leaves URLs and email addresses as single terms | Part number PA-35234, single-use product (128.32) becomes [ Part , number , PA-35234 , single , use , product , 128.32 ] |
thai | - Parses Thai text into terms | สวัสดีและยินดีต becomes [ สวัสดี , และ , ยินดี , ต ] |
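The differences among the letter, lowercase, and whitespace tokenizers can be approximated with a short Python sketch. The function names are hypothetical, and the regular expression only approximates the notion of a letter, so treat this as an illustration of the table rows rather than the actual tokenizer logic.

```python
import re

TEXT = "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"


def letter_tokenize(text):
    # [^\W\d_] matches word characters that are neither digits nor
    # underscores, which approximates "letters" in this sketch.
    return re.findall(r"[^\W\d_]+", text)


def lowercase_tokenize(text):
    # Same split as the letter sketch, then lowercase each term.
    return [term.lower() for term in letter_tokenize(text)]


def split_on_whitespace(text):
    # Split on runs of white space and keep punctuation attached to terms.
    return text.split()


print(letter_tokenize(TEXT))
print(lowercase_tokenize(TEXT))
print(split_on_whitespace(TEXT))
```

The letter and lowercase sketches drop the digit 2 and break It’s at the apostrophe, while the white space split keeps brand-new and OpenSearch! intact.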
Partial word tokenizers
Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.
Tokenizer | Description | Example |
---|---|---|
ngram | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | My repo becomes [ M , My , y , "y " , " " , " r" , r , re , e , ep , p , po , o ] because the default n-gram length is 1–2 characters and, by default, white space is kept, so some n-grams contain a space |
edge_ngram | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | My repo becomes [ M , My ] because the default n-gram length is 1–2 characters |
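The n-gram examples in the table can be reproduced with a minimal Python sketch. The functions `ngrams` and `edge_ngrams` are hypothetical helpers, and the sketch assumes the whole input is treated as a single word (no splitting characters configured) with gram lengths of 1 to 2.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Return every substring of length min_gram..max_gram, ordered by start index."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    ]


def edge_ngrams(text, min_gram=1, max_gram=2):
    """Return only the n-grams anchored at the beginning of the text."""
    return [
        text[:size]
        for size in range(min_gram, max_gram + 1)
        if size <= len(text)
    ]


print(ngrams("My repo"))
print(edge_ngrams("My repo"))
```

Because the space in My repo is not treated as a separator in this sketch, some n-grams contain it, and the edge n-grams come only from the start of the whole string.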
Structured text tokenizers
Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.
Tokenizer | Description | Example |
---|---|---|
keyword | - No-op tokenizer - Outputs the entire string unchanged - Can be combined with token filters, like lowercase, to normalize terms | My repo becomes My repo |
pattern | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms - Uses Java regular expressions | https://opensearch.org/forum becomes [ https , opensearch , org , forum ] because by default the tokenizer splits text at runs of non-word characters (\W+). Can be configured with a regex pattern |
simple_pattern | - Uses a regular expression pattern to return matching text as terms - Uses Lucene regular expressions - Faster than the pattern tokenizer because it uses a subset of the pattern tokenizer regular expressions | Returns an empty array by default. Must be configured with a pattern because the pattern defaults to an empty string |
simple_pattern_split | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms - Uses Lucene regular expressions - Faster than the pattern tokenizer because it uses a subset of the pattern tokenizer regular expressions | No-op by default. Must be configured with a pattern |
char_group | - Parses on a set of configurable characters - Faster than tokenizers that run regular expressions | No-op by default. Must be configured with a list of characters |
path_hierarchy | - Parses text on the path separator (by default, / ) and returns a full path to each component in the tree hierarchy | one/two/three becomes [ one , one/two , one/two/three ] |
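Two of the rows above, pattern and path_hierarchy, can be approximated in a few lines of Python. The helper names are hypothetical, and Python regular expressions stand in for the Java ones, so this is a sketch of the default behavior shown in the table rather than the actual implementation.

```python
import re


def pattern_split(text, pattern=r"\W+"):
    """Split text at regex matches and drop any empty strings."""
    return [term for term in re.split(pattern, text) if term]


def path_hierarchy(path, delimiter="/"):
    """Return the path prefix ending at each component of the hierarchy."""
    parts = path.split(delimiter)
    prefixes = (delimiter.join(parts[:count]) for count in range(1, len(parts) + 1))
    # Skip the empty prefix that a leading delimiter would produce.
    return [prefix for prefix in prefixes if prefix]


print(pattern_split("https://opensearch.org/forum"))
print(path_hierarchy("one/two/three"))
```

Emitting every prefix of a path means a query for one/two can match any document stored under that branch of the tree.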