
Fingerprint analyzer

The fingerprint analyzer creates a text fingerprint. The analyzer sorts and deduplicates the terms (tokens) generated from the input and then concatenates them using a separator. It is commonly used for data deduplication because inputs that contain the same words produce the same output, regardless of word order or repetition.

The fingerprint analyzer comprises the following components:

  • Standard tokenizer
  • Lowercase token filter
  • ASCII folding token filter
  • Stop token filter
  • Fingerprint token filter
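
The component chain above can be approximated outside the cluster. The following is a minimal Python sketch, not the analyzer's actual implementation: the function name `fingerprint` is hypothetical, the regular expression is only a rough stand-in for the standard tokenizer, and Unicode decomposition is only an approximation of ASCII folding.

```python
import re
import unicodedata


def fingerprint(text, separator=" ", stopwords=()):
    """Approximate the fingerprint analyzer chain (illustrative only)."""
    # 1. Tokenize: runs of word characters, a rough stand-in for the standard tokenizer.
    tokens = re.findall(r"\w+", text)
    # 2. Lowercase every token.
    tokens = [t.lower() for t in tokens]
    # 3. Approximate ASCII folding: decompose characters and drop combining marks.
    tokens = [
        "".join(
            c for c in unicodedata.normalize("NFKD", t)
            if not unicodedata.combining(c)
        )
        for t in tokens
    ]
    # 4. Remove stopwords (none by default, matching the analyzer's default).
    stop = set(stopwords)
    tokens = [t for t in tokens if t not in stop]
    # 5. Fingerprint: deduplicate, sort, and join with the separator.
    return separator.join(sorted(set(tokens)))
```

Calling `fingerprint("Turtle turtle slow")` and `fingerprint("slow TURTLE")` yields the same string, which is the property that makes the analyzer useful for deduplication.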

Parameters

The fingerprint analyzer can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| separator | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is a single space ( ). |
| max_output_size | Optional | Integer | Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, no token is emitted. Default is 255. |
| stopwords | Optional | String or list of strings | A custom or predefined list of stopwords. Default is _none_. |
| stopwords_path | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. |

Example

Use the following command to create an index named my_custom_fingerprint_index with a fingerprint analyzer:

PUT /my_custom_fingerprint_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": "-",
          "max_output_size": 50,
          "stopwords": ["to", "the", "over", "and"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_custom_fingerprint_analyzer"
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /my_custom_fingerprint_index/_analyze
{
  "analyzer": "my_custom_fingerprint_analyzer",
  "text": "The slow turtle swims over to the dog"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "dog-slow-swims-turtle",
      "start_offset": 0,
      "end_offset": 37,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
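
Because the fingerprint ignores word order and repetition, it can serve as a grouping key for likely duplicates. The following Python sketch illustrates the idea under simplifying assumptions: `simple_fingerprint` and `group_duplicates` are hypothetical helpers, and tokenization is plain whitespace splitting rather than the analyzer's real token chain.

```python
from collections import defaultdict


def simple_fingerprint(text, separator=" "):
    # Simplified stand-in for the analyzer: lowercase, split on whitespace,
    # deduplicate, sort, and join.
    return separator.join(sorted(set(text.lower().split())))


def group_duplicates(records):
    # Map each fingerprint to the list of records that produced it.
    groups = defaultdict(list)
    for record in records:
        groups[simple_fingerprint(record)].append(record)
    return dict(groups)


records = [
    "slow turtle swims",
    "Turtle swims slow",
    "swims swims slow turtle",
    "fast dog runs",
]
groups = group_duplicates(records)
```

Here the first three records share one fingerprint and land in the same group, while the fourth forms its own group. In an index, the same effect could be obtained by aggregating on a field analyzed with the fingerprint analyzer.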

Further customization

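If you need further customization, you can define a custom analyzer built from the fingerprint analyzer's components. The following example combines the standard tokenizer with the lowercase, ASCII folding, and fingerprint token filters: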

PUT /custom_fingerprint_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
