This documentation describes using the grok
processor in OpenSearch ingest pipelines. Consider using the OpenSearch Data Prepper grok
processor, which runs on the OpenSearch cluster, if your use case involves large or complex datasets.
Grok processor
The grok
processor is used to parse and structure unstructured data using pattern matching. You can use the grok
processor to extract fields from log messages, web server access logs, application logs, and other log data that follows a consistent format.
Grok basics
The grok
processor uses a set of predefined patterns to match parts of the input text. Each pattern consists of a name and a regular expression. For example, the pattern %{IP:ip_address}
matches an IP address and assigns it to the field ip_address
. You can combine multiple patterns to create more complex expressions. For example, the pattern %{IP:client} %{WORD:method} %{URIPATHPARM:request} %{NUMBER:bytes %NUMBER:duration}
matches a line from a web server access log and extracts the client IP address, the HTTP method, the request URI, the number of bytes sent, and the duration of the request.
The grok
processor is built on the Oniguruma regular expression library and supports all the patterns from that library. You can use the Grok Debugger tool to test and debug your grok expressions.
Syntax
The following is the basic syntax for the grok
processor:
{
"grok": {
"field": "your_message",
"patterns": ["your_patterns"]
}
}
Configuration parameters
To configure the grok
processor, you have various options that allow you to define patterns, match specific keys, and control the processor’s behavior. The following table lists the required and optional parameters for the grok
processor.
Parameter | Required/Optional | Description |
---|---|---|
field | Required | The name of the field containing the text to be parsed. |
patterns | Required | A list of grok expressions used to match and extract named captures. The first matching expression in the list is returned. |
pattern_definitions | Optional | A dictionary of pattern names and pattern tuples used to define custom patterns for the current processor. If a pattern matches an existing name, it overrides the pre-existing definition. |
trace_match | Optional | When the parameter is set to true , the processor adds a field named _grok_match_index to the processed document. This field contains the index of the pattern within the patterns array that successfully matched the document. This information can be useful for debugging and understanding which pattern was applied to the document. Default is false . |
description | Optional | A brief description of the processor. |
if | Optional | A condition for running the processor. |
ignore_failure | Optional | Specifies whether the processor continues execution even if it encounters errors. If set to true , failures are ignored. Default is false . |
ignore_missing | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. If set to true , the processor does not modify the document if the field does not exist or is null . Default is false . |
on_failure | Optional | A list of processors to run if the processor fails. |
tag | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type. |
Creating a pipeline
The following steps guide you through creating an ingest pipeline with the grok
processor.
Step 1: Create a pipeline
The following query creates a pipeline, named log_line
. It extracts fields from the message
field of the document using the specified pattern. In this case, it extracts the clientip
, timestamp
, and response_status
fields:
PUT _ingest/pipeline/log_line
{
"description": "Extract fields from a log line",
"processors": [
{
"grok": {
"field": "message",
"patterns": ["%{IPORHOST:clientip} %{HTTPDATE:timestamp} %{NUMBER:response_status:int}"]
}
}
]
}
Step 2 (Optional): Test the pipeline.
NOTE
It is recommended that you test your pipeline before you ingest documents.
To test the pipeline, run the following query:
POST _ingest/pipeline/log_line/_simulate
{
"docs": [
{
"_source": {
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200"
}
}
]
}
Response
The following response confirms that the pipeline is working as expected:
{
"docs": [
{
"doc": {
"_index": "_index",
"_id": "_id",
"_source": {
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200",
"response_status": 200,
"clientip": "198.126.12",
"timestamp": "10/Oct/2000:13:55:36 -0700"
},
"_ingest": {
"timestamp": "2023-09-13T21:41:52.064540505Z"
}
}
}
]
}
Step 3: Ingest a document
The following query ingests a document into an index named testindex1
:
PUT testindex1/_doc/1?pipeline=log_line
{
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200"
}
Step 4 (Optional): Retrieve the document
To retrieve the document, run the following query:
GET testindex1/_doc/1
Custom patterns
You can use default patterns, or you can add custom patterns to your pipelines using the patterns_definitions
parameter. Custom grok patterns can be used in a pipeline to extract structured data from log messages that do not match the built-in grok patterns. This can be useful for parsing log messages from custom applications or for parsing log messages that have been modified in some way. Custom patterns adhere to a straightforward structure: each pattern has a unique name and the corresponding regular expression that defines its matching behavior.
The following is an example of how to include a custom pattern in your configuration. In this example, the issue number is between 3 and 4 digits and is parsed into the issue_number
field and the status is parsed into the status
field:
PUT _ingest/pipeline/log_line
{
"processors": [
{
"grok": {
"field": "message",
"patterns": ["The issue number %{NUMBER:issue_number} is %{STATUS:status}"],
"pattern_definitions" : {
"NUMBER" : "\\d{3,4}",
"STATUS" : "open|closed"
}
}
}
]
}
Tracing which patterns matched
To trace which patterns matched and populated the fields, you can use the trace_match
parameter. The following is an example of how to include this parameter in your configuration:
PUT _ingest/pipeline/log_line
{
"description": "Extract fields from a log line",
"processors": [
{
"grok": {
"field": "message",
"patterns": ["%{HTTPDATE:timestamp} %{IPORHOST:clientip}", "%{IPORHOST:clientip} %{HTTPDATE:timestamp} %{NUMBER:response_status:int}"],
"trace_match": true
}
}
]
}
When you simulate the pipeline, OpenSearch returns the _ingest
metadata that includes the grok_match_index
, as shown in the following output:
{
"docs": [
{
"doc": {
"_index": "_index",
"_id": "_id",
"_source": {
"message": "127.0.0.1 198.126.12 10/Oct/2000:13:55:36 -0700 200",
"response_status": 200,
"clientip": "198.126.12",
"timestamp": "10/Oct/2000:13:55:36 -0700"
},
"_ingest": {
"_grok_match_index": "1",
"timestamp": "2023-11-02T18:48:40.455619084Z"
}
}
}
]
}