Pattern analyzer
The `pattern` analyzer allows you to define a custom analyzer that uses a regular expression (regex) to split input text into tokens. It also provides options for applying regex flags, converting tokens to lowercase, and filtering out stopwords.
Parameters
The `pattern` analyzer can be configured with the following parameters.
| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| `pattern` | Optional | String | A Java regular expression used to tokenize the input. Default is `\W+`. |
| `flags` | Optional | String | A string containing pipe-separated Java regex flags that modify the behavior of the regular expression. |
| `lowercase` | Optional | Boolean | Whether to convert tokens to lowercase. Default is `true`. |
| `stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`. |
| `stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. |
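For example, the following sketch shows how the remaining parameters might be combined in an index definition. The index name `my_flags_index`, the analyzer name, and the `analysis/stopwords.txt` path are illustrative placeholders (the stopword file is assumed to exist under your config directory); `CASE_INSENSITIVE` and `COMMENTS` are standard Java regex flag names:

```json
PUT /my_flags_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_flags_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "flags": "CASE_INSENSITIVE|COMMENTS",
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}
```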
Example
Use the following command to create an index named `my_pattern_index` with a `pattern` analyzer:
```json
PUT /my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "lowercase": true,
          "stopwords": ["and", "is"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_pattern_analyzer"
      }
    }
  }
}
```
Generated tokens
Use the following request to examine the tokens generated by the analyzer:
```json
POST /my_pattern_index/_analyze
{
  "analyzer": "my_pattern_analyzer",
  "text": "OpenSearch is fast and scalable"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    },
    {
      "token": "fast",
      "start_offset": 14,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "scalable",
      "start_offset": 23,
      "end_offset": 31,
      "type": "word",
      "position": 4
    }
  ]
}
```
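Note that the stopwords `is` and `and` are removed and that, because `lowercase` defaults to `true`, `OpenSearch` is emitted as `opensearch`. The removed stopwords still consume positions, which is why the remaining tokens have positions 0, 2, and 4 rather than 0, 1, and 2.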