Normalizers

A normalizer functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can include only specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that operations such as synonym replacement and stemming are not supported.
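The difference between the two can be sketched in a few lines of Python. This is only a toy model, not OpenSearch code: `toy_analyzer` and `toy_normalizer` are hypothetical names, and a whitespace split stands in for a real tokenizer.

```python
def toy_analyzer(text: str) -> list[str]:
    """Tokenize on whitespace, then lowercase each token (many tokens out)."""
    return [token.lower() for token in text.split()]


def toy_normalizer(text: str) -> list[str]:
    """No tokenizer: lowercase the whole input and emit it as one token."""
    return [text.lower()]


print(toy_analyzer("Quick Brown Fox"))    # ['quick', 'brown', 'fox']
print(toy_normalizer("Quick Brown Fox"))  # ['quick brown fox']
```

Because the normalizer never splits its input, only character-level rewrites of that one token are meaningful.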

A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. The same filters are applied to the field value at index time and to the query input at search time. For instance, this makes it possible to match an incoming query `Naïve` with the index term `naive`.

Consider the following example.

Create a new index with a custom normalizer:

PUT /sample-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "normalized_keyword": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "asciifolding", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "approach": {
        "type": "keyword",
        "normalizer": "normalized_keyword"
      }
    }
  }
}

Index a document:

POST /sample-index/_doc/
{
  "approach": "naive"
}

The following query matches the document, as expected:

GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "naive"
    }
  }
}

However, the following query matches the document as well:

GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "Naïve"
    }
  }
}

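To understand why, consider the effect of the normalizer. The following request runs the query text through `normalized_keyword`, which folds `ï` to `i` and lowercases `N`, producing the single token `naive`: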

GET /sample-index/_analyze
{
  "normalizer" : "normalized_keyword",
  "text" : "Naïve"
}
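The same reasoning can be traced outside the cluster with a rough Python emulation. This is a sketch under stated assumptions: `ascii_fold` and `normalized_keyword` are hypothetical helpers, and Unicode decomposition only approximates `asciifolding`, which relies on its own mapping table and handles characters that this shortcut does not.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate `asciifolding`: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalized_keyword(text: str) -> str:
    """Apply the filters in the configured order: asciifolding, then lowercase."""
    return ascii_fold(text).lower()


# Index time: the field value is normalized before it is stored as a term.
indexed_terms = {normalized_keyword("naive")}

# Query time: the term query input passes through the same normalizer.
query_term = normalized_keyword("Naïve")

print(query_term)                   # naive
print(query_term in indexed_terms)  # True
```

Both sides reduce to the same term, so the exact-match lookup performed by the term query succeeds.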

Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository.

The `common-analysis` module

This module does not require installation; it is available by default.

Character filters: pattern_replace, mapping

Token filters: arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, persian_normalization, scandinavian_folding, scandinavian_normalization, serbian_normalization, sorani_normalization, trim, uppercase

The `analysis-icu` plugin

Character filters: icu_normalizer

Token filters: icu_normalizer, icu_folding, icu_transform

The `analysis-kuromoji` plugin

Character filters: normalize_kanji, normalize_kana

The `analysis-nori` plugin

Character filters: normalize_kanji, normalize_kana

These lists include only analysis components found in the modules and plugins that are part of the core OpenSearch repository.
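As an illustration of how the lists above might be used, the following Python sketch checks a normalizer definition before it is sent to the cluster. It is a hypothetical client-side convenience, not part of OpenSearch: the function name `incompatible_filters` is an assumption, the sets are simply copied from the lists above, and the check looks only at built-in filter names. A custom filter defined under its own name in the index settings would first need to be resolved to its type, which this sketch does not attempt.

```python
# Built-in filter names copied from the lists above.
COMPATIBLE_CHAR_FILTERS = {
    "pattern_replace", "mapping",        # common-analysis
    "icu_normalizer",                    # analysis-icu
    "normalize_kanji", "normalize_kana", # analysis-kuromoji, analysis-nori
}
COMPATIBLE_TOKEN_FILTERS = {
    "arabic_normalization", "asciifolding", "bengali_normalization",
    "cjk_width", "decimal_digit", "elision", "german_normalization",
    "hindi_normalization", "indic_normalization", "lowercase",
    "persian_normalization", "scandinavian_folding",
    "scandinavian_normalization", "serbian_normalization",
    "sorani_normalization", "trim", "uppercase",   # common-analysis
    "icu_normalizer", "icu_folding", "icu_transform",  # analysis-icu
}


def incompatible_filters(normalizer: dict) -> list[str]:
    """Return the filter names in a normalizer definition that are not listed above."""
    bad = [name for name in normalizer.get("char_filter", [])
           if name not in COMPATIBLE_CHAR_FILTERS]
    bad += [name for name in normalizer.get("filter", [])
            if name not in COMPATIBLE_TOKEN_FILTERS]
    return bad


ok = {"type": "custom", "char_filter": [], "filter": ["asciifolding", "lowercase"]}
not_ok = {"type": "custom", "filter": ["lowercase", "stemmer"]}

print(incompatible_filters(ok))      # []
print(incompatible_filters(not_ok))  # ['stemmer']
```

The second definition is flagged because stemming operates on the token as a whole, which a normalizer does not support.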
