You're viewing version 2.18 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.
Remove duplicates token filter
The `remove_duplicates` token filter removes duplicate tokens that are generated at the same position during analysis.
Example
The following example request creates an index with a `keyword_repeat` token filter. The filter adds a `keyword` version of each token at the same position as the token itself and then uses a `kstem` filter to create a stemmed version of the token:
```json
PUT /example-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "kstem"
          ]
        }
      }
    }
  }
}
```
Use the following request to analyze the string `Slower turtle`:
```json
GET /example-index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Slower turtle"
}
```
The response contains the token `turtle` twice in the same position:
```json
{
  "tokens": [
    {
      "token": "slower",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "turtle",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```
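The duplicate arises because `kstem` leaves `turtle` unchanged, so the stemmed copy is identical to the `keyword` copy emitted at the same position. The following Python sketch illustrates this effect using a hypothetical toy stemmer (`toy_stem` is a stand-in for illustration only, not the real `kstem` algorithm):

```python
def toy_stem(token):
    """Hypothetical stemmer: strips a trailing '-er' suffix.

    A stand-in for kstem, used only to show that stemming can
    leave some tokens unchanged.
    """
    return token[:-2] if token.endswith("er") else token


def keyword_repeat_then_stem(tokens):
    """Emit (token, position) pairs: the keyword copy of each token,
    followed by its stemmed copy at the same position."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((tok, pos))            # keyword copy
        out.append((toy_stem(tok), pos))  # stemmed copy (may be identical)
    return out


print(keyword_repeat_then_stem(["slower", "turtle"]))
# "slower" stems to "slow", but "turtle" is unchanged,
# so ("turtle", 1) appears twice.
```

Because the keyword and stemmed copies of `turtle` are identical, the token stream ends up with a true duplicate, which is exactly what `remove_duplicates` is designed to strip.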
The duplicate token can be removed by adding a `remove_duplicates` token filter to the index settings:
```json
PUT /index-remove-duplicate
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "kstem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
```
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
```json
GET /index-remove-duplicate/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Slower turtle"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "slower",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```
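The filter's behavior can be sketched in a few lines of Python: keep a token only if the same token text has not already been seen at the same position. This is an illustrative re-implementation of the idea, not OpenSearch's actual Lucene code:

```python
def remove_duplicates(tokens):
    """Drop tokens that repeat an earlier token at the same position.

    `tokens` is a list of dicts shaped like the _analyze response
    entries above; only 'token' and 'position' are compared.
    """
    seen = set()
    result = []
    for entry in tokens:
        key = (entry["token"], entry["position"])
        if key not in seen:
            seen.add(key)
            result.append(entry)
    return result


# The token stream produced without remove_duplicates, as in the
# first _analyze response above (offsets omitted for brevity).
stream = [
    {"token": "slower", "position": 0},
    {"token": "slow", "position": 0},
    {"token": "turtle", "position": 1},
    {"token": "turtle", "position": 1},
]

print([t["token"] for t in remove_duplicates(stream)])
# ['slower', 'slow', 'turtle']
```

Note that `slower` and `slow` both survive: they share position 0 but differ in text, so only the second, identical `turtle` is removed.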