Pattern replace character filter

The pattern_replace character filter allows you to use regular expressions to define patterns for matching and replacing characters in the input text. It is a flexible tool for advanced text transformations, especially when dealing with complex string patterns.

This filter replaces all instances of a pattern with a specified replacement string, allowing for easy substitutions, deletions, or complex modifications of the input text. You can use it to normalize the input before tokenization.

Example

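The following example standardizes phone numbers using the regular expression [\\s()-]+. The backslash is doubled because the pattern is written inside a JSON string; the regular expression engine itself receives \s. The expression consists of the following parts: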

  • [ ]: Defines a character class, meaning it will match any one of the characters inside the brackets.
  • \\s: Matches any white space character, such as a space, tab, or newline.
  • (): Matches literal parentheses (( or )).
  • -: Matches a literal hyphen (-).
  • +: Specifies that the preceding character class should match one or more times in a row.

The pattern [\\s()-]+ matches any sequence of one or more white space characters, parentheses, or hyphens. Because the replacement is an empty string, the filter removes each match from the input text, so the normalized phone number contains only digits.

The following request standardizes phone numbers by removing spaces, dashes, and parentheses:

GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[\\s()-]+",
      "replacement": ""
    }
  ],
  "text": "(555) 123-4567"
}

The response contains the generated token:

{
  "tokens": [
    {
      "token": "5551234567",
      "start_offset": 1,
      "end_offset": 14,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
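
To see what the character filter does to the text before the tokenizer runs, the same substitution can be approximated locally. The following is a minimal sketch that uses Python's re module as a stand-in for the filter's regular expression engine; the function name strip_phone_punctuation is hypothetical and is not part of any API. For this simple pattern the two engines are expected to behave the same way, but that is an assumption and not a guarantee for every pattern.

```python
import re

# Same pattern as in the request. In JSON it is written "[\\s()-]+";
# in a Python raw string a single backslash is enough.
PHONE_PUNCTUATION = re.compile(r"[\s()-]+")


def strip_phone_punctuation(text: str) -> str:
    """Remove every run of white space, parentheses, and hyphens.

    Mirrors a pattern_replace filter whose replacement is "".
    """
    return PHONE_PUNCTUATION.sub("", text)


if __name__ == "__main__":
    print(strip_phone_punctuation("(555) 123-4567"))  # 5551234567
```

The sketch only reproduces the character-level replacement. It does not reproduce tokenization or the offset correction visible in the response above.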

Parameters

The pattern_replace character filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
pattern | Required | String | A regular expression used to match parts of the input text. The filter identifies and matches this pattern to perform replacement.
replacement | Optional | String | The string that replaces pattern matches. Use an empty string ("") to remove the matched text. Default is an empty string ("").

Creating a custom analyzer

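The following request creates an index with a custom analyzer configured with a pattern_replace character filter. The filter removes currency signs ($ and €) and every period and comma from the text. Because the pattern does not distinguish thousands separators from decimal separators, both are removed in either the European (1.100,75) or the American (1,200.50) notation: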

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "[$€,.]",
          "replacement": ""
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Total: $ 1,200.50 and € 1.100,75"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "Total",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "120050",
      "start_offset": 9,
      "end_offset": 17,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "and",
      "start_offset": 18,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "110075",
      "start_offset": 24,
      "end_offset": 32,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
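
The tokens above show that the decimal separators are removed along with the thousands separators. The following minimal sketch reproduces the character-level replacement locally, again using Python's re module as a stand-in for the filter's engine; the function name strip_currency_punctuation is hypothetical.

```python
import re

# Inside a character class, "$" and "." are literal characters,
# so this matches a dollar sign, a euro sign, a comma, or a period.
CURRENCY_PUNCTUATION = re.compile(r"[$€,.]")


def strip_currency_punctuation(text: str) -> str:
    """Remove currency signs, commas, and periods from the text."""
    return CURRENCY_PUNCTUATION.sub("", text)


if __name__ == "__main__":
    filtered = strip_currency_punctuation("Total: $ 1,200.50 and € 1.100,75")
    # The spaces that surrounded the currency signs remain in place.
    print(repr(filtered))
```

The colon after Total is not matched by the pattern, so it survives the character filter; it is absent from the tokens because the standard tokenizer discards it.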

Using capturing groups

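You can use capturing groups in the replacement parameter and reference them as $1, $2, and so on. For example, the following request creates a custom analyzer that uses a pattern_replace character filter to replace hyphens with dots in phone numbers. The pattern captures a run of digits in group 1, matches the hyphen that follows, and uses the lookahead (?=\\d) to require another digit after the hyphen without consuming it. The replacement $1. writes the captured digits back, followed by a dot. If my_index already exists from the previous example, delete it first or use a different index name: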

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1."
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Call me at 555-123-4567 or 555-987-6543"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "Call",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "me",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "at",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "555.123.4567",
      "start_offset": 11,
      "end_offset": 23,
      "type": "<NUM>",
      "position": 3
    },
    {
      "token": "or",
      "start_offset": 24,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "555.987.6543",
      "start_offset": 27,
      "end_offset": 39,
      "type": "<NUM>",
      "position": 5
    }
  ]
}
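
The capturing-group replacement can also be approximated locally. The following is a minimal sketch that uses Python's re module as a stand-in for the filter's engine; the function name hyphens_to_dots is hypothetical. Note one syntax difference: Python references the first group as \1 in the replacement string, while the filter configuration above uses $1.

```python
import re

# Group 1 captures a run of digits, "-" matches the hyphen, and the
# lookahead (?=\d) requires a following digit without consuming it.
DIGIT_HYPHEN = re.compile(r"(\d+)-(?=\d)")


def hyphens_to_dots(text: str) -> str:
    """Replace a hyphen between digits with a dot.

    Python writes the group reference as \\1; the pattern_replace
    filter configuration writes it as $1.
    """
    return DIGIT_HYPHEN.sub(r"\1.", text)


if __name__ == "__main__":
    print(hyphens_to_dots("Call me at 555-123-4567 or 555-987-6543"))
    # Call me at 555.123.4567 or 555.987.6543
```

Because the lookahead does not consume the digit after the hyphen, that digit is still available as the start of the next match, which is why both hyphens in each phone number are replaced. Hyphens that are not surrounded by digits are left unchanged.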