
Delimited term frequency token filter

The delimited_term_freq token filter separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is |, then for the string foo|5, foo is the token and 5 is its term frequency. If there is no delimiter, the token filter does not modify the term frequency.
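The splitting behavior can be sketched in a few lines of Python. This is an illustration of the rule described above, not the OpenSearch implementation:

```python
def delimited_term_freq(token: str, delimiter: str = "|") -> tuple[str, int]:
    """Split a token into its text and term frequency at the first delimiter.
    Tokens without the delimiter keep the default term frequency of 1."""
    text, sep, freq = token.partition(delimiter)
    if sep == "":
        # No delimiter present: the token passes through unmodified.
        return token, 1
    return text, int(freq)

print(delimited_term_freq("foo|5"))      # ('foo', 5)
print(delimited_term_freq("foo"))        # ('foo', 1)
print(delimited_term_freq("foo^3", "^")) # ('foo', 3)
```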

You can either use a preconfigured delimited_term_freq token filter or create a custom one.

Preconfigured delimited_term_freq token filter

The preconfigured delimited_term_freq token filter uses the default delimiter, |. To analyze text with the preconfigured token filter, send the following request to the _analyze endpoint:

POST /_analyze
{
  "text": "foo|100",
  "tokenizer": "keyword",
  "filter": ["delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}

The attributes array specifies that you want to filter the output of the explain parameter to return only termFrequency. The response contains both the original token and the parsed output of the token filter that includes the term frequency:

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo|100",
          "start_offset": 0,
          "end_offset": 7,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0,
            "termFrequency": 100
          }
        ]
      }
    ]
  }
}
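If you consume this explain output programmatically, the parsed token and its term frequency live under detail.tokenfilters. The following sketch extracts them from the response shown above (abbreviated to the fields used):

```python
# Abbreviated version of the explain response above.
response = {
    "detail": {
        "tokenfilters": [
            {
                "name": "delimited_term_freq",
                "tokens": [
                    {"token": "foo", "termFrequency": 100}
                ],
            }
        ]
    }
}

# Pull the parsed token and its term frequency out of the filter's output.
token = response["detail"]["tokenfilters"][0]["tokens"][0]
print(token["token"], token["termFrequency"])  # foo 100
```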

Custom delimited_term_freq token filter

To configure a custom delimited_term_freq token filter, first specify the delimiter in the index settings. In this example, the delimiter is ^:

PUT /testindex
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      }
    }
  }
}

Then analyze text with the custom token filter you created:

POST /testindex/_analyze
{
  "text": "foo^3",
  "tokenizer": "keyword",
  "filter": ["my_delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}

The response contains both the original token and the parsed version with the term frequency:

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo^3",
          "start_offset": 0,
          "end_offset": 5,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "my_delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0,
            "termFrequency": 3
          }
        ]
      }
    ]
  }
}

Combining delimited_term_freq with scripts

You can write Painless scripts to calculate custom scores for the documents in the results.

First, create an index and provide the following mappings and settings:

PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "keyword_tokenizer": {
          "type": "keyword"
        }
      },
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      },
      "analyzer": {
        "custom_delimited_analyzer": {
          "tokenizer": "keyword_tokenizer",
          "filter": ["my_delimited_term_freq"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "f1": {
        "type": "keyword"
      },
      "f2": {
        "type": "text",
        "analyzer": "custom_delimited_analyzer",
        "index_options": "freqs"
      }
    }
  }
}

The test index uses a keyword tokenizer, a delimited term frequency token filter (where the delimiter is ^), and a custom analyzer that includes a keyword tokenizer and a delimited term frequency token filter. The mappings specify that the field f1 is a keyword field and the field f2 is a text field. The field f2 uses the custom analyzer defined in the settings for text analysis. Additionally, specifying index_options signals to OpenSearch to add the term frequencies to the inverted index. You’ll use the term frequencies to give documents with repeated terms a higher score.

Next, index two documents using bulk upload:

POST /_bulk?refresh=true
{"index": {"_index": "test", "_id": "doc1"}}
{"f1": "v0|100", "f2": "v1^30"}
{"index": {"_index": "test", "_id": "doc2"}}
{"f2": "v2|100"}
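Because the custom analyzer uses ^ as its delimiter, the two f2 values are indexed differently. The following sketch (an illustration of the rule, not the OpenSearch implementation) shows what each value becomes in the inverted index:

```python
def parse(token: str, delimiter: str = "^") -> tuple[str, int]:
    # Sketch of delimited_term_freq with the ^ delimiter configured above.
    text, sep, freq = token.partition(delimiter)
    return (token, 1) if sep == "" else (text, int(freq))

# doc1's f2 value contains the configured delimiter, so it is indexed
# as the term "v1" with a term frequency of 30.
print(parse("v1^30"))   # ('v1', 30)

# doc2's f2 value uses | rather than ^, so it is indexed unchanged as a
# single term with the default term frequency of 1.
print(parse("v2|100"))  # ('v2|100', 1)
```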

The following query searches for all documents in the index and calculates document scores as the term frequency of the term v1 in the field f2:

GET /test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": {
          "source": "termFreq(params.field, params.term)",
          "params": {
            "field": "f2",
            "term": "v1"
          }
        }
      }
    }
  }
}

In the response, document 1 has a score of 30 because the term frequency of the term v1 in the field f2 is 30. Document 2 has a score of 0 because the term v1 does not appear in f2:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 30,
    "hits": [
      {
        "_index": "test",
        "_id": "doc1",
        "_score": 30,
        "_source": {
          "f1": "v0|100",
          "f2": "v1^30"
        }
      },
      {
        "_index": "test",
        "_id": "doc2",
        "_score": 0,
        "_source": {
          "f2": "v2|100"
        }
      }
    ]
  }
}
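The scoring above reduces to a simple lookup: each document's score is the term frequency of "v1" in f2, or 0 when the term is absent. A minimal sketch, with the inverted index hand-built to match the two documents indexed earlier:

```python
# Term frequencies per document for field f2, as produced by the
# custom_delimited_analyzer on the two bulk documents above.
index = {
    "doc1": {"v1": 30},     # "v1^30" -> term "v1", frequency 30
    "doc2": {"v2|100": 1},  # "v2|100" -> no ^ delimiter, frequency 1
}

def term_freq(doc_id: str, term: str) -> int:
    """What the script score computes for each document."""
    return index[doc_id].get(term, 0)

print(term_freq("doc1", "v1"))  # 30
print(term_freq("doc2", "v1"))  # 0
```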

Parameters

The following table lists all parameters that the delimited_term_freq token filter supports.

Parameter Required/Optional Description
delimiter Optional The delimiter used to separate tokens from term frequencies. Must be a single non-null character. Default is |.