Token filters

Token filters receive the stream of tokens from the tokenizer and add, remove, or modify the tokens. For example, a token filter may lowercase the tokens so that `Actions` becomes `actions`, remove stop words like `than`, or add synonyms like `talk` for the word `speak`.
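
To make the add, remove, and modify operations concrete, the following is a minimal Python sketch of a filter chain. It is not OpenSearch or Lucene code: the function names (`lowercase_filter`, `make_stop_filter`, `make_synonym_filter`, `apply_filters`) are hypothetical, tokens are modeled as plain strings, and token positions are ignored, so a synonym is simply placed after the token it matches.

```python
from typing import Callable, Dict, List, Set

# A token filter is modeled here as a function from a token list to a token list.
TokenFilter = Callable[[List[str]], List[str]]


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Modify: convert every token to lowercase."""
    return [token.lower() for token in tokens]


def make_stop_filter(stop_words: Set[str]) -> TokenFilter:
    """Remove: drop every token that appears in the stop word set."""
    def stop_filter(tokens: List[str]) -> List[str]:
        return [token for token in tokens if token not in stop_words]
    return stop_filter


def make_synonym_filter(synonyms: Dict[str, List[str]]) -> TokenFilter:
    """Add: emit each token followed by its synonyms, if it has any."""
    def synonym_filter(tokens: List[str]) -> List[str]:
        output: List[str] = []
        for token in tokens:
            output.append(token)
            output.extend(synonyms.get(token, []))
        return output
    return synonym_filter


def apply_filters(tokens: List[str], filters: List[TokenFilter]) -> List[str]:
    """Run the filters in order, feeding each one the previous output."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Whitespace splitting stands in for the tokenizer in this sketch.
tokens = "Actions speak louder than words".split()
result = apply_filters(
    tokens,
    [
        lowercase_filter,
        make_stop_filter({"than"}),
        make_synonym_filter({"speak": ["talk"]}),
    ],
)
print(result)  # ['actions', 'speak', 'talk', 'louder', 'words']
```

The order of the filters matters in this sketch: because lowercasing runs first, the stop word set and the synonym map only need lowercase entries.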

The following table lists all token filters that OpenSearch supports.

Token filter	Underlying Lucene token filter	Description
`apostrophe`	ApostropheFilter	In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
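`asciifolding`	ASCIIFoldingFilter	Converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin block (the first 127 ASCII characters) into their ASCII equivalents, if such equivalents exist.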
`cjk_bigram`	CJKBigramFilter	Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
`cjk_width`	CJKWidthFilter	Normalizes Chinese, Japanese, and Korean (CJK) tokens by folding full-width ASCII character variants into their equivalent basic Latin characters and folding half-width katakana character variants into their equivalent kana characters.
`classic`	ClassicFilter	Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
`common_grams`	CommonGramsFilter	Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
`conditional`	ConditionalTokenFilter	Applies an ordered list of token filters to tokens that match the conditions provided in a script.
`decimal_digit`	DecimalDigitFilter	Converts all digits in the Unicode decimal number general category to basic Latin digits (0–9).
`delimited_payload`	DelimitedPayloadTokenFilter	Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters preceding the delimiter, and a payload consists of all characters following the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload.
`delimited_term_freq`	DelimitedTermFrequencyTokenFilter	Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency.
`dictionary_decompounder`	DictionaryCompoundWordTokenFilter	Decomposes compound words found in many Germanic languages.
`edge_ngram`	EdgeNGramTokenFilter	Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
`elision`	ElisionFilter	Removes the specified elisions from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
`fingerprint`	FingerprintFilter	Sorts and deduplicates the token list and concatenates tokens into a single token.
`flatten_graph`	FlattenGraphFilter	Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
`hunspell`	HunspellStemFilter	Uses Hunspell rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries.
`hyphenation_decompounder`	HyphenationCompoundWordTokenFilter	Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
`keep_types`	TypeTokenFilter	Keeps or removes tokens of a specific type.
`keep_words`	KeepWordFilter	Checks the tokens against the specified word list and keeps only those that are in the list.
`keyword_marker`	KeywordMarkerFilter	Marks specified tokens as keywords, preventing them from being stemmed.
`keyword_repeat`	KeywordRepeatFilter	Emits each incoming token twice: once as a keyword and once as a non-keyword.
`kstem`	KStemFilter	Provides KStem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
`kuromoji_completion`	JapaneseCompletionFilter	Adds Japanese romanized terms to a token stream (in addition to the original tokens). Usually used to support autocomplete of Japanese search terms. Note that the filter has a `mode` parameter that should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see Additional plugins.
`length`	LengthFilter	Removes tokens that are shorter or longer than the length range specified by `min` and `max`.
`limit`	LimitTokenCountFilter	Limits the number of output tokens. For example, document field value sizes can be limited based on the token count.
`lowercase`	LowerCaseFilter	Converts tokens to lowercase. The default LowerCaseFilter processes the English language. To process other languages, set the `language` parameter to `greek` (uses GreekLowerCaseFilter), `irish` (uses IrishLowerCaseFilter), or `turkish` (uses TurkishLowerCaseFilter).
`min_hash`	MinHashFilter	Uses the MinHash technique to estimate document similarity. Performs the following operations on a token stream sequentially: (1) hashes each token in the stream; (2) assigns the hashes to buckets, keeping only the smallest hashes of each bucket; (3) outputs the smallest hash from each bucket as a token stream.
`multiplexer`	N/A	Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens.
`ngram`	NGramTokenFilter	Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`.
Normalization	`arabic_normalization`: ArabicNormalizer; `german_normalization`: GermanNormalizationFilter; `hindi_normalization`: HindiNormalizer; `indic_normalization`: IndicNormalizer; `sorani_normalization`: SoraniNormalizer; `persian_normalization`: PersianNormalizer; `scandinavian_normalization`: ScandinavianNormalizationFilter; `scandinavian_folding`: ScandinavianFoldingFilter; `serbian_normalization`: SerbianNormalizationFilter	Normalizes the characters of one of the listed languages.
`pattern_capture`	N/A	Generates a token for every capture group in the provided regular expression. Uses Java regular expression syntax.
`pattern_replace`	N/A	Matches a pattern in the provided regular expression and replaces matching substrings. Uses Java regular expression syntax.
`phonetic`	N/A	Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin.
`porter_stem`	PorterStemFilter	Uses the Porter stemming algorithm to perform algorithmic stemming for the English language.
`predicate_token_filter`	N/A	Removes tokens that do not match the specified predicate script. Supports only inline Painless scripts.
`remove_duplicates`	RemoveDuplicatesTokenFilter	Removes duplicate tokens that are in the same position.
`reverse`	ReverseStringFilter	Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
`shingle`	ShingleFilter	Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but are generated using words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
`snowball`	N/A	Stems words using a Snowball-generated stemmer. The `snowball` token filter supports using the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
`stemmer`	N/A	Provides algorithmic stemming for the language specified in the `language` field. Supported values are `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override`	N/A	Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop`	StopFilter	Removes stop words from a token stream.
`synonym`	N/A	Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
`synonym_graph`	N/A	Supplies a synonym list, including multiword synonyms, for the analysis process.
`trim`	TrimFilter	Trims leading and trailing whitespace characters from each token in a stream.
`truncate`	TruncateTokenFilter	Truncates tokens with lengths exceeding the specified character limit.
`unique`	N/A	Ensures each token is unique by removing duplicate tokens from a stream.
`uppercase`	UpperCaseFilter	Converts tokens to uppercase.
`word_delimiter`	WordDelimiterFilter	Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules.
`word_delimiter_graph`	WordDelimiterGraphFilter	Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens.
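
Several of the filters in the table are easier to picture with a small worked example. The following Python sketch approximates the described behavior of `edge_ngram`, `ngram`, `shingle`, and `fingerprint` on plain strings. It is an illustration only, not the Lucene implementation: the function names and signatures are hypothetical, the output order is a choice made for this sketch, and `shingles` returns only the shingles, without the unigrams.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """N-grams anchored at the start of the token, lengths min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """All character n-grams of the token, lengths min_gram..max_gram."""
    grams: List[str] = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def shingles(tokens: List[str], min_size: int, max_size: int) -> List[str]:
    """Word-level n-grams: runs of min_size..max_size adjacent tokens."""
    result: List[str] = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


def fingerprint(tokens: List[str], separator: str = " ") -> str:
    """Sort and deduplicate the tokens, then join them into a single token."""
    return separator.join(sorted(set(tokens)))


print(edge_ngrams("search", 1, 3))
# ['s', 'se', 'sea']
print(ngrams("dog", 1, 2))
# ['d', 'do', 'o', 'og', 'g']
print(shingles(["contribute", "to", "opensearch"], 2, 2))
# ['contribute to', 'to opensearch']
print(fingerprint(["to", "contribute", "to", "opensearch"]))
# contribute opensearch to
```

The shingle call reproduces the example from the `shingle` row: the unigrams `contribute`, `to`, and `opensearch` yield the two-word shingles `contribute to` and `to opensearch`.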
