You're viewing version 2.18 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.
HTML strip character filter
The html_strip character filter removes HTML tags, such as <div>, <p>, and <a>, from the input text and renders plain text. It also decodes HTML entities, for example converting &nbsp; into a space, and can be configured to preserve certain tags.
Example
The following request applies an html_strip character filter to the provided text:
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
}
The response contains a single token in which the HTML tags have been removed and the HTML entities have been converted to their decoded values:
{
  "tokens": [
    {
      "token": """
Commonly used calculus symbols include α, β and θ
""",
      "start_offset": 0,
      "end_offset": 74,
      "type": "word",
      "position": 0
    }
  ]
}
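To make the behavior above easier to reason about, the following Python sketch approximates what the filter does to an input containing HTML entities such as `&alpha;`. It is not the filter's implementation: the function name `approx_html_strip`, the `BLOCK_TAGS` set, and the rule that block-level tags become a newline are assumptions made for illustration only.

```python
import html
import re

# Assumption: a simplified set of block-level tags that are replaced by a newline.
# The real filter's rules are more detailed than this.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "br", "li"}

# Matches an opening or closing tag and captures the element name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def approx_html_strip(text):
    """Rough approximation of html_strip: remove tags, then decode entities."""

    def replace_tag(match):
        # Block-level tags become a newline; inline tags are simply dropped.
        return "\n" if match.group(1).lower() in BLOCK_TAGS else ""

    without_tags = TAG_RE.sub(replace_tag, text)
    # html.unescape converts named entities such as &alpha; to their characters.
    return html.unescape(without_tags)


sample = "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
print(repr(approx_html_strip(sample)))
```

Because entities are decoded only after the tags are removed, escaped markup such as `&lt;b&gt;` stays in the output as literal text in this sketch.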
Parameters
The html_strip character filter can be configured with the following parameter.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
escaped_tags | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (< >). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to ["b", "i"] will prevent the <b> and <i> elements from being stripped. |
Example: Custom analyzer with lowercase filter
The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the html_strip character filter and the lowercase token filter:
PUT /html_strip_and_lowercase_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Use the following request to examine the tokens generated using the analyzer:
GET /html_strip_and_lowercase_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
}
In the response, the HTML tags have been removed and the plain text has been converted to lowercase:
{
  "tokens": [
    {
      "token": "welcome",
      "start_offset": 4,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "opensearch",
      "start_offset": 23,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
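The offsets in this response refer to character positions in the original, unfiltered text rather than in the stripped text. The following Python sketch slices the original input using the offsets reported above to show what each token maps back to. The helper name `original_span` is hypothetical; the text and offsets are copied from the example.

```python
# Original text sent to the _analyze API in the example above.
text = "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"

# (token, start_offset, end_offset) triples copied from the response above.
tokens = [("welcome", 4, 11), ("to", 12, 14), ("opensearch", 23, 42)]


def original_span(source, start, end):
    """Return the slice of the original text that a token's offsets cover."""
    return source[start:end]


for token, start, end in tokens:
    print(f"{token!r} -> {original_span(text, start, end)!r}")
# 'welcome' -> 'Welcome'
# 'to' -> 'to'
# 'opensearch' -> 'OpenSearch</strong>'
```

Note that the span reported for the last token extends over the closing </strong> tag, which matters if you use offsets for highlighting against the original HTML.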
Example: Custom analyzer that preserves HTML tags
The following example request creates a custom analyzer that strips HTML tags but preserves the <b> and <i> tags listed in escaped_tags:
PUT /html_strip_preserve_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
Use the following request to examine the tokens generated using the analyzer:
GET /html_strip_preserve_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}
In the response, the <b> (bold) and <i> (italic) tags have been retained, as specified by the escaped_tags parameter in the custom analyzer request, while the <p> tags have been removed:
{
  "tokens": [
    {
      "token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
      "start_offset": 0,
      "end_offset": 52,
      "type": "word",
      "position": 0
    }
  ]
}
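The effect of escaped_tags can be approximated outside OpenSearch with a short Python sketch. This is an illustration under stated assumptions, not the filter's actual implementation: the function name `approx_strip_with_escaped_tags`, the `BLOCK_TAGS` set, and the newline substitution for block-level tags are all hypothetical simplifications.

```python
import re

# Matches an opening or closing tag and captures the element name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Assumption: a simplified list of block-level tags replaced by a newline.
BLOCK_TAGS = {"p", "div", "h1", "br", "li"}


def approx_strip_with_escaped_tags(text, escaped_tags=()):
    """Strip HTML tags except those whose element name is in escaped_tags."""
    keep = {name.lower() for name in escaped_tags}

    def replace_tag(match):
        name = match.group(1).lower()
        if name in keep:
            # Escaped tags are left in the text unchanged.
            return match.group(0)
        # Other block-level tags become a newline; inline tags are dropped.
        return "\n" if name in BLOCK_TAGS else ""

    return TAG_RE.sub(replace_tag, text)


source = "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
print(repr(approx_strip_with_escaped_tags(source, ["b", "i"])))
# '\nThis is a <b>bold</b> and <i>italic</i> text.\n'
```

As in the parameter description, the element names are passed without angle brackets, and a listed name preserves both the opening and the closing tag.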