Text processing
Data Prepper provides text processing capabilities with the grok processor. The grok processor is based on the java-grok library and supports all compatible patterns. The java-grok library is built using the java.util.regex regular expression library.
You can add custom patterns to your pipelines by using the pattern_definitions option. When debugging custom patterns, the Grok Debugger can be helpful.
Basic usage
To get started with text processing, create the following pipeline:
pattern-matching-pipeline:
  source:
    ...
  processor:
    - grok:
        match:
          message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
  sink:
    - opensearch:
        # Provide an OpenSearch cluster endpoint
An incoming message might contain the following contents:
{"message": "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"}
In each incoming event, the pipeline will locate the value in the message key and attempt to match the pattern. The keywords IPORHOST, HTTPDATE, and NUMBER are built into the plugin.
When an incoming record matches the pattern, it generates an internal event such as the following with identification keys extracted from the original message:
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"response_status":200,
"clientip":"198.126.12",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
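The clientip value in this event is 198.126.12 rather than 127.0.0.1. That is consistent with the pattern being searched for anywhere in the string: only the second token is immediately followed by a space and an opening bracket. The following Python sketch reproduces this with the standard re module. The three regular expressions are simplified stand-ins for the built-in IPORHOST, HTTPDATE, and NUMBER definitions, and grok_match is a hypothetical helper name; none of this is Data Prepper or java-grok code.

```python
import re

# Simplified stand-ins for the built-in patterns (assumptions for illustration;
# the real definitions in the default patterns file are more thorough).
PATTERNS = {
    "IPORHOST": r"[A-Za-z0-9][A-Za-z0-9.\-]*",
    "HTTPDATE": r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}
REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?(?::(\w+))?\}")


def grok_match(expression, text):
    """Hypothetical helper: expand %{SYNTAX:SEMANTIC:type} and search text."""
    casts = {}

    def expand(ref):
        syntax, semantic, cast = ref.groups()
        if semantic is None:
            return f"(?:{PATTERNS[syntax]})"  # unnamed capture: not stored
        if cast:
            casts[semantic] = cast
        return f"(?P<{semantic}>{PATTERNS[syntax]})"

    regex = REFERENCE.sub(expand, expression)
    found = re.search(regex, text)  # unanchored: the match may start mid-string
    if found is None:
        return None
    captures = found.groupdict()
    for key, cast in casts.items():
        if cast == "int":
            captures[key] = int(captures[key])
    return captures


expression = r"%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}"
message = "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"
print(grok_match(expression, message))
```

The sketch only handles the int conversion used in this example; it is meant to show the search and capture steps, not to replace the processor.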
The match configuration for the grok processor specifies which record keys to match against which patterns.
In the following example, the match configuration checks incoming logs for a message key. If the key exists, it matches the key value against the SYSLOGBASE pattern and then against the COMMONAPACHELOG pattern. It then checks the logs for a timestamp key. If that key exists, it attempts to match the key value against the TIMESTAMP_ISO8601 pattern.
processor:
  - grok:
      match:
        message: ['%{SYSLOGBASE}', "%{COMMONAPACHELOG}"]
        timestamp: ["%{TIMESTAMP_ISO8601}"]
By default, the plugin continues until it finds a successful match. For example, if there is a successful match against the value in the message key for a SYSLOGBASE pattern, the plugin doesn't attempt to match the other patterns. If you want to match logs against every pattern, set the break_on_match option to false.
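The difference between the two break_on_match settings can be sketched as a loop over keys and patterns. The following Python example is an illustration only: it uses plain regular expressions with named groups instead of grok patterns, run_match is a hypothetical helper, and it assumes that a successful match stops the search across all keys, as the paragraph above describes.

```python
import re


def run_match(event, match, break_on_match=True):
    """Hypothetical helper: try each pattern per key and collect named groups."""
    captures = {}
    for key, patterns in match.items():
        value = event.get(key)
        if not value:
            continue  # key missing or empty: nothing to match against
        for pattern in patterns:
            found = re.search(pattern, value)
            if found:
                captures.update(found.groupdict())
                if break_on_match:
                    return captures  # stop at the first successful match
    return captures


event = {"message": "level=warn code=42", "timestamp": "2000-10-10T13:55:36"}
match = {
    "message": [r"level=(?P<level>\w+)", r"code=(?P<code>\d+)"],
    "timestamp": [r"(?P<year>\d{4})-"],
}
print(run_match(event, match))                        # {'level': 'warn'}
print(run_match(event, match, break_on_match=False))  # {'level': 'warn', 'code': '42', 'year': '2000'}
```

With the default setting, only the first pattern contributes keys. With break_on_match set to false, every pattern that matches adds its captures.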
Including named and empty captures
Include the keep_empty_captures option in your pipeline configuration to include null captures or the named_captures_only option to include only named captures. Named captures follow the pattern %{SYNTAX:SEMANTIC}, while unnamed captures follow the pattern %{SYNTAX}.
For example, you can modify the preceding Grok configuration to remove clientip from the %{IPORHOST} pattern:
processor:
  - grok:
      match:
        message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
The resulting grokked log will look like this:
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"response_status":200,
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
Notice that the clientip key no longer exists because the %{IPORHOST} pattern is now an unnamed capture.
However, if you set named_captures_only to false:
processor:
  - grok:
      named_captures_only: false
      match:
        message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
Then the resulting grokked log will look like this:
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"MONTH":"Oct",
"YEAR":"2000",
"response_status":200,
"HOUR":"13",
"TIME":"13:55:36",
"MINUTE":"55",
"SECOND":"36",
"IPORHOST":"198.126.12",
"MONTHDAY":"10",
"INT":"-0700",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
Note that the IPORHOST capture now shows up as a new key, along with some internal unnamed captures like MONTH and YEAR. The HTTPDATE keyword is defined in terms of these patterns, which you can see in the default patterns file.
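The extra keys appear because a pattern such as HTTPDATE is composed of smaller patterns, each of which is itself an unnamed capture. The following Python sketch shows the idea with a recursive expansion. The DEFINITIONS table is a simplified stand-in for the default patterns file, and expand and capture are hypothetical helper names, so treat it as an approximation of the behavior rather than the processor's implementation.

```python
import re

# Simplified stand-ins for entries in the default patterns file (assumptions).
DEFINITIONS = {
    "MONTHDAY": r"\d{2}",
    "MONTH": r"[A-Za-z]{3}",
    "YEAR": r"\d{4}",
    "HOUR": r"\d{2}",
    "MINUTE": r"\d{2}",
    "SECOND": r"\d{2}",
    "INT": r"[+-]?\d+",
    "TIME": r"%{HOUR}:%{MINUTE}:%{SECOND}",
    "HTTPDATE": r"%{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}",
}
REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(expression, named_captures_only=True):
    """Recursively turn %{SYNTAX} and %{SYNTAX:SEMANTIC} into regex groups."""
    def replace(ref):
        name, key = ref.groups()
        body = expand(DEFINITIONS[name], named_captures_only)
        if key is None and not named_captures_only:
            key = name  # unnamed captures are stored under the pattern name
        return f"(?P<{key}>{body})" if key else f"(?:{body})"
    return REFERENCE.sub(replace, expression)


def capture(expression, text, named_captures_only=True):
    found = re.search(expand(expression, named_captures_only), text)
    return found.groupdict() if found else {}


expr = r"\[%{HTTPDATE:timestamp}\]"
text = "[10/Oct/2000:13:55:36 -0700]"
print(capture(expr, text))
print(capture(expr, text, named_captures_only=False))
```

The first call returns only the timestamp key. The second also returns MONTHDAY, MONTH, YEAR, TIME, HOUR, MINUTE, SECOND, and INT, mirroring the additional keys in the event above.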
Overwriting keys
Include the keys_to_overwrite option to specify which existing record keys to overwrite if there is a capture with the same key.
For example, you can modify the preceding Grok configuration to replace %{NUMBER:response_status:int} with %{NUMBER:message:int} and add message to the list of keys to overwrite:
processor:
  - grok:
      keys_to_overwrite: ["message"]
      match:
        message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:message:int}']
In the resulting grokked log, the original message is overwritten with the number 200:
{
"message":200,
"clientip":"198.126.12",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
Using custom patterns
Include the pattern_definitions option in your Grok configuration to specify custom patterns.
The following configuration creates custom regex patterns named CUSTOM_PATTERN_1 and CUSTOM_PATTERN_2. By default, the plugin continues until it finds a successful match.
processor:
  - grok:
      pattern_definitions:
        CUSTOM_PATTERN_1: 'this-is-regex-1'
        CUSTOM_PATTERN_2: '%{CUSTOM_PATTERN_1} REGEX'
      match:
        message: ["%{CUSTOM_PATTERN_2:my_pattern_key}"]
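Because CUSTOM_PATTERN_2 refers to CUSTOM_PATTERN_1, the two definitions are combined into a single regular expression before matching. The Python sketch below illustrates that composition under the assumption that references are substituted recursively; the expand helper and the sample input string are hypothetical and not part of Data Prepper.

```python
import re

# The custom definitions from the configuration above.
PATTERN_DEFINITIONS = {
    "CUSTOM_PATTERN_1": "this-is-regex-1",
    "CUSTOM_PATTERN_2": "%{CUSTOM_PATTERN_1} REGEX",
}
REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(expression, definitions):
    """Hypothetical helper: substitute pattern references until none remain."""
    def replace(ref):
        name, key = ref.groups()
        body = expand(definitions[name], definitions)
        return f"(?P<{key}>{body})" if key else f"(?:{body})"
    return REFERENCE.sub(replace, expression)


regex = expand("%{CUSTOM_PATTERN_2:my_pattern_key}", PATTERN_DEFINITIONS)
print(regex)     # (?P<my_pattern_key>(?:this-is-regex-1) REGEX)

found = re.search(regex, "prefix this-is-regex-1 REGEX suffix")
captures = found.groupdict()
print(captures)  # {'my_pattern_key': 'this-is-regex-1 REGEX'}
```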
If you specify break_on_match as false, the pipeline attempts to match all patterns and extract keys from the incoming events:
processor:
  - grok:
      pattern_definitions:
        CUSTOM_PATTERN_1: 'this-is-regex-1'
        CUSTOM_PATTERN_2: 'this-is-regex-2'
        CUSTOM_PATTERN_3: 'this-is-regex-3'
        CUSTOM_PATTERN_4: 'this-is-regex-4'
      match:
        message: [ "%{CUSTOM_PATTERN_1}", "%{CUSTOM_PATTERN_2}" ]
        log: [ "%{CUSTOM_PATTERN_3}", "%{CUSTOM_PATTERN_4}" ]
      break_on_match: false
You can define your own custom patterns to use for pipeline pattern matching. In the first example in this section, the my_pattern_key key is extracted after the custom patterns match.
Storing captures with a parent key
Include the target_key option in your Grok configuration to wrap all record captures in an additional outer key.
For example, you can modify the preceding Grok configuration to add a target key named grokked:
processor:
  - grok:
      target_key: "grokked"
      match:
        message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
The resulting grokked log will look like this:
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"grokked": {
"response_status":200,
"clientip":"198.126.12",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
}
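The effect of target_key on the shape of the event can be pictured as a choice between merging the captures at the top level and nesting them under one key. The following Python sketch shows only that shaping step. The store_captures helper is hypothetical, and the sketch deliberately ignores how collisions with existing keys are handled.

```python
def store_captures(event, captures, target_key=None):
    """Hypothetical helper: return a copy of event with captures added."""
    merged = dict(event)
    if target_key is None:
        merged.update(captures)              # captures become top-level keys
    else:
        merged[target_key] = dict(captures)  # captures nested under one key
    return merged


event = {"message": "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"}
captures = {
    "response_status": 200,
    "clientip": "198.126.12",
    "timestamp": "10/Oct/2000:13:55:36 -0700",
}
print(store_captures(event, captures, target_key="grokked"))
```

Nesting keeps the extracted fields apart from the original record keys, which can help when downstream processors expect the original keys to stay unchanged.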