You're viewing version 2.18 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.
Fingerprint analyzer
The fingerprint analyzer creates a text fingerprint. The analyzer sorts and deduplicates the terms (tokens) generated from the input and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for similar inputs containing the same words, regardless of word order.
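The deduplication use case can be illustrated with a minimal Python sketch. It is not how OpenSearch implements the analyzer: `simple_fingerprint` and `group_duplicates` are hypothetical helpers, and splitting on white space is a simplifying assumption that stands in for real tokenization.

```python
from collections import defaultdict
from typing import Dict, List


def simple_fingerprint(text: str, separator: str = " ") -> str:
    """Lowercase, split on white space, deduplicate, sort, and join the terms."""
    terms = sorted(set(text.lower().split()))
    return separator.join(terms)


def group_duplicates(records: List[str]) -> Dict[str, List[str]]:
    """Group records that share the same fingerprint (hypothetical helper)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        groups[simple_fingerprint(record)].append(record)
    return dict(groups)


records = [
    "quick brown fox",
    "Fox quick brown",
    "brown quick quick fox",
    "lazy dog",
]
groups = group_duplicates(records)
```

The first three records differ only in word order, letter case, and repetition, so they share the fingerprint `brown fox quick` and end up in the same group.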
The fingerprint analyzer comprises the following components:
- Standard tokenizer
- Lowercase token filter
- ASCII folding token filter
- Stop token filter (inactive unless stopwords are configured)
- Fingerprint token filter
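The order of these components can be sketched in Python as follows. This is an approximation under stated assumptions, not the OpenSearch implementation: the regular expression `\w+` is a rough stand-in for the standard tokenizer, stripping combining marks is a rough stand-in for ASCII folding, and the function names are hypothetical.

```python
import re
import unicodedata


def fold_to_ascii(term):
    """Approximate ASCII folding by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text, separator=" ", stopwords=()):
    """Emulate the analyzer chain and return the single fingerprint string."""
    # 1. Tokenize (rough stand-in for the standard tokenizer).
    tokens = re.findall(r"\w+", text)
    # 2. Lowercase each token.
    tokens = [t.lower() for t in tokens]
    # 3. Fold accented characters to their ASCII base letters.
    tokens = [fold_to_ascii(t) for t in tokens]
    # 4. Remove stopwords (none by default).
    blocked = set(stopwords)
    tokens = [t for t in tokens if t not in blocked]
    # 5. Deduplicate, sort, and join with the separator.
    return separator.join(sorted(set(tokens)))


result = fingerprint(
    "The slow turtle swims over to the dog",
    separator="-",
    stopwords=["to", "the", "over", "and"],
)
```

With these inputs, `result` is `dog-slow-swims-turtle`. Because folding and lowercasing run before deduplication, variants such as `Café`, `CAFE`, and `cafe` collapse into a single term.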
Parameters
The fingerprint analyzer can be configured with the following parameters.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
separator | Optional | String | Specifies the string used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is a single space. |
max_output_size | Optional | Integer | Defines the maximum size of the output token, in characters. If the concatenated fingerprint exceeds this size, no token is emitted. Default is 255. |
stopwords | Optional | String or list of strings | A custom or predefined list of stopwords. Default is _none_. |
stopwords_path | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. |
Example
Use the following command to create an index named my_custom_fingerprint_index with a fingerprint analyzer:
PUT /my_custom_fingerprint_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": "-",
          "max_output_size": 50,
          "stopwords": ["to", "the", "over", "and"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_custom_fingerprint_analyzer"
      }
    }
  }
}
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
POST /my_custom_fingerprint_index/_analyze
{
  "analyzer": "my_custom_fingerprint_analyzer",
  "text": "The slow turtle swims over to the dog"
}
The response contains the generated tokens:
{
  "tokens": [
    {
      "token": "dog-slow-swims-turtle",
      "start_offset": 0,
      "end_offset": 37,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
Further customization
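If further customization is needed, you can define a custom analyzer that assembles the fingerprint analyzer's components individually, adding or omitting token filters as required. The following example omits the stop token filter: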
PUT /custom_fingerprint_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
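The effect of changing the filter chain can be sketched in Python by modeling each token filter as a function applied in order. This is an illustration under assumptions, not OpenSearch code: the function names are hypothetical, `\w+` approximates the standard tokenizer, and the fingerprint step joins terms with a single space.

```python
import re
import unicodedata


def lowercase(tokens):
    return [t.lower() for t in tokens]


def asciifolding(tokens):
    # Rough stand-in: decompose characters and drop combining marks.
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t) if not unicodedata.combining(c))
        for t in tokens
    ]


def stop(stopwords):
    """Build a stop filter for the given words (hypothetical helper)."""
    blocked = set(stopwords)
    return lambda tokens: [t for t in tokens if t not in blocked]


def fingerprint(tokens, separator=" "):
    # Deduplicate, sort, and join the terms into a single output token.
    return [separator.join(sorted(set(tokens)))] if tokens else []


def analyze(text, filters):
    tokens = re.findall(r"\w+", text)  # rough stand-in for the standard tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "The slow turtle swims over to the dog"
# Same chain as the custom analyzer in the preceding request: no stop filter.
without_stop = analyze(text, [lowercase, asciifolding, fingerprint])
# Adding a stop filter before the fingerprint step removes the stopwords.
with_stop = analyze(
    text, [lowercase, asciifolding, stop(["to", "the", "over", "and"]), fingerprint]
)
```

Here `without_stop` is `["dog over slow swims the to turtle"]`, while `with_stop` is `["dog slow swims turtle"]`. The fingerprint step must come last in the chain because it collapses all terms into one token.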