You're viewing version 2.18 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.

HTML strip character filter

The html_strip character filter removes HTML tags, such as <div>, <p>, and <a>, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as  , into spaces.

Example

The following request applies an html_strip character filter to the provided text:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
}

The response contains the token in which HTML characters have been converted to their decoded values:

{
  "tokens": [
    {
      "token": """
Commonly used calculus symbols include α, β and θ 
""",
      "start_offset": 0,
      "end_offset": 74,
      "type": "word",
      "position": 0
    }
  ]
}

Parameters

The html_strip character filter can be configured with the following parameter.

Parameter	Required/Optional	Data type	Description
`escaped_tags`	Optional	Array of strings	An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped.

Example: Custom analyzer with lowercase filter

The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the html_strip analyzer and lowercase filter:

PUT /html_strip_and_lowercase_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET /html_strip_and_lowercase_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
}

In the response, the HTML tags have been removed and the plain text has been converted to lowercase:

{
  "tokens": [
    {
      "token": "welcome",
      "start_offset": 4,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "opensearch",
      "start_offset": 23,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Example: Custom analyzer that preserves HTML tags

The following example request creates a custom analyzer that preserves HTML tags:

PUT /html_strip_preserve_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET /html_strip_preserve_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}

In the response, the italic and bold tags have been retained, as specified in the custom analyzer request:

{
  "tokens": [
    {
      "token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
      "start_offset": 0,
      "end_offset": 52,
      "type": "word",
      "position": 0
    }
  ]
}

Example
Parameters
Example: Custom analyzer with lowercase filter
Example: Custom analyzer that preserves HTML tags

WAS THIS PAGE HELPFUL?

✔ Yes ✖ No

Tell us why

350 characters left

Have a question? Ask us on the OpenSearch forum.

Want to contribute? Edit this page or create an issue.

HTML strip character filter

Example

Parameters

Example: Custom analyzer with lowercase filter

Example: Custom analyzer that preserves HTML tags

OpenSearch Links

Get Involved

Resources

Contact Us