# Pattern tokenizer
The `pattern` tokenizer is a highly flexible tokenizer that splits text into tokens based on a custom Java regular expression. Unlike the `simple_pattern` and `simple_pattern_split` tokenizers, which use the more limited Lucene regular expression syntax, the `pattern` tokenizer supports the full Java regex syntax, offering greater control over how text is tokenized.
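Because the tokenizer accepts full Java regular expressions, you can use constructs such as lookahead and lookbehind, which Lucene regular expressions do not support. As a minimal sketch (the inline tokenizer definition and sample text are illustrative, not part of the examples that follow), the following `_analyze` request splits camel-case text at lowercase-to-uppercase boundaries:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "(?<=[a-z])(?=[A-Z])"
  },
  "text": "camelCaseText"
}
```

This request should return the tokens `camel`, `Case`, and `Text`.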
## Example usage
The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. The tokenizer splits text on the `-`, `_`, or `.` characters:
```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "[-_.]"
        }
      },
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_pattern_analyzer"
      }
    }
  }
}
```
## Generated tokens
Use the following request to examine the tokens generated using the analyzer:
```json
POST /my_index/_analyze
{
  "analyzer": "my_pattern_analyzer",
  "text": "OpenSearch-2024_v1.2"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "OpenSearch",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    },
    {
      "token": "2024",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    {
      "token": "v1",
      "start_offset": 16,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "2",
      "start_offset": 19,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
```
## Parameters
The `pattern` tokenizer can be configured with the following parameters.
| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `pattern` | Optional | String | The pattern used to split text into tokens, specified using a Java regular expression. Default is `\W+`. |
| `flags` | Optional | String | A pipe-separated list of flags to apply to the regular expression, for example, `"CASE_INSENSITIVE\|MULTILINE\|DOTALL"`. |
| `group` | Optional | Integer | The capture group to be used as a token. Default is `-1` (split on a match). |
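To illustrate the `flags` parameter, the following request is a minimal sketch (the inline tokenizer definition and sample text are assumptions for illustration) that splits on the word `and` regardless of case:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\s*and\\s*",
    "flags": "CASE_INSENSITIVE"
  },
  "text": "apples AND oranges and pears"
}
```

With `CASE_INSENSITIVE` applied, both `AND` and `and` act as delimiters, so the expected tokens are `apples`, `oranges`, and `pears`.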
## Example using a group parameter
The following example request configures a `group` parameter so that only the second capture group is emitted as a token:
```json
PUT /my_index_group2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "([a-zA-Z]+)(\\d+)",
          "group": 2
        }
      },
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
POST /my_index_group2/_analyze
{
  "analyzer": "my_pattern_analyzer",
  "text": "abc123def456ghi"
}
```
The response contains only the contents of the second capture group; text outside a match, such as the trailing `ghi`, is discarded:
```json
{
  "tokens": [
    {
      "token": "123",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "456",
      "start_offset": 9,
      "end_offset": 12,
      "type": "word",
      "position": 1
    }
  ]
}
```
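For comparison, setting `group` to `0` emits the entire match, rather than a single capture group, as the token. The following request is an illustrative sketch using an inline tokenizer definition (not part of the index created above):

```json
POST /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "[a-zA-Z]+\\d+",
    "group": 0
  },
  "text": "abc123def456ghi"
}
```

The expected tokens are `abc123` and `def456`; the trailing `ghi` does not match the pattern and is again discarded.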