Fingerprint analyzer
The fingerprint analyzer creates a text fingerprint. The analyzer sorts and deduplicates the terms (tokens) generated from the input and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for inputs containing the same words, regardless of word order.
The fingerprint analyzer comprises the following components:
- Standard tokenizer
- Lowercase token filter
- ASCII folding token filter
- Stop token filter
- Fingerprint token filter
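To see these components in action, you can test the built-in fingerprint analyzer directly with the _analyze API (no index required). Because the terms are lowercased, sorted, and deduplicated before being joined, the repeated and reordered words below collapse into a single normalized token:

POST /_analyze
{
  "analyzer": "fingerprint",
  "text": "The quick brown fox and the BROWN fox"
}

After lowercasing, "The"/"the" and "brown"/"BROWN" are deduplicated, so the analyzer should return the single token "and brown fox quick the".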
Parameters
The fingerprint analyzer can be configured with the following parameters.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
`separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is a space character (` `). |
`max_output_size` | Optional | Integer | Defines the maximum length of the output token. If the concatenated fingerprint exceeds this length, no token is emitted. Default is `255`. |
`stopwords` | Optional | String or list of strings | A custom or predefined list of stopwords. Default is `_none_`. |
`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. |
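For example, to use the predefined English stopword list instead of a custom one, you can set stopwords to _english_ (the index and analyzer names below are illustrative):

PUT /my_fingerprint_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}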
Example
Use the following command to create an index named `my_custom_fingerprint_index` with a fingerprint analyzer:
PUT /my_custom_fingerprint_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_fingerprint_analyzer": {
"type": "fingerprint",
"separator": "-",
"max_output_size": 50,
"stopwords": ["to", "the", "over", "and"]
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_custom_fingerprint_analyzer"
}
}
}
}
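After the index is created, any document indexed into my_field is analyzed with my_custom_fingerprint_analyzer, so the stored tokens are the fingerprint rather than the raw words. For example:

PUT /my_custom_fingerprint_index/_doc/1
{
  "my_field": "The slow turtle swims over to the dog"
}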
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
POST /my_custom_fingerprint_index/_analyze
{
"analyzer": "my_custom_fingerprint_analyzer",
"text": "The slow turtle swims over to the dog"
}
The response contains the generated tokens:
{
"tokens": [
{
"token": "dog-slow-swims-turtle",
"start_offset": 0,
"end_offset": 37,
"type": "fingerprint",
"position": 0
}
]
}
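The response contains a single token: the stopwords to, the, and over are removed, and the remaining terms are lowercased, sorted, deduplicated, and joined with the configured - separator. Because the fingerprint depends only on the set of words, reordering the input should produce the same token:

POST /my_custom_fingerprint_index/_analyze
{
  "analyzer": "my_custom_fingerprint_analyzer",
  "text": "the dog swims over to the slow turtle"
}

This request returns the same fingerprint, dog-slow-swims-turtle.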
Further customization
If further customization is needed, you can define a custom analyzer that combines the fingerprint analyzer's individual components:
PUT /custom_fingerprint_analyzer
{
"settings": {
"analysis": {
"analyzer": {
"custom_fingerprint": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"fingerprint"
]
}
}
}
}
}
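Note that this example omits the stop token filter that the built-in fingerprint analyzer includes. If you also need stopword removal, one option is to add the built-in stop filter to the chain before fingerprint (the index name below is illustrative):

PUT /custom_fingerprint_with_stop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop",
            "fingerprint"
          ]
        }
      }
    }
  }
}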