Link Search Menu Expand Document Documentation Menu

HTML strip character filter

The html_strip character filter removes HTML tags, such as <div>, <p>, and <a>, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as &nbsp;, into spaces.

Example

The following request applies an html_strip character filter to the provided text:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
}

The response contains the token in which HTML characters have been converted to their decoded values:

{
  "tokens": [
    {
      "token": """
Commonly used calculus symbols include α, β and θ 
""",
      "start_offset": 0,
      "end_offset": 74,
      "type": "word",
      "position": 0
    }
  ]
}

Parameters

The html_strip character filter can be configured with the following parameter.

Parameter Required/Optional Data type Description
escaped_tags Optional Array of strings An array of HTML element names, specified without the enclosing angle brackets (< >). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to ["b", "i"] will prevent the <b> and <i> elements from being stripped.

Example: Custom analyzer with lowercase filter

The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the html_strip analyzer and lowercase filter:

PUT /html_strip_and_lowercase_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET /html_strip_and_lowercase_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
}

In the response, the HTML tags have been removed and the plain text has been converted to lowercase:

{
  "tokens": [
    {
      "token": "welcome",
      "start_offset": 4,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "opensearch",
      "start_offset": 23,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Example: Custom analyzer that preserves HTML tags

The following example request creates a custom analyzer that preserves HTML tags:

PUT /html_strip_preserve_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET /html_strip_preserve_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}

In the response, the italic and bold tags have been retained, as specified in the custom analyzer request:

{
  "tokens": [
    {
      "token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
      "start_offset": 0,
      "end_offset": 52,
      "type": "word",
      "position": 0
    }
  ]
}
350 characters left

Have a question? .

Want to contribute? or .