Announcing Data Prepper 2.7.0

Wed, Mar 27, 2024 · David Venable, George Chen

Data Prepper 2.7.0 is now available for download. This release supports extracting geographic locations from IP addresses, supports injectable secrets, and adds many new processors.

GeoIP processor

Data Prepper can now enrich events with geographical location data from an IP address using the new geoip processor. The geoip processor uses the MaxMind GeoLite2 databases to provide geographical location data from IP addresses.

Many OpenSearch and Data Prepper users want to enrich their data by adding geographical locations to events. This data is valuable for customer analytics, detecting anomalies in network access, understanding load across geographies, and more. A common way to determine a geographical location is to look it up from an IP address.

One example scenario is locating users of a web server. Data Prepper already supports parsing Apache Common Log Format for Apache HTTP servers in the grok processor. The following example shows how you can now locate the client making requests using the clientip property extracted from the grok processor:

  - grok:
      match:
        log: [ "%{COMMONAPACHELOG}" ]
  - geoip:
      entries:
        - source: clientip
          target: clientlocation
          include_fields: [latitude, longitude, location, postal_code, country_name, city_name]

When you ingest data using this pipeline, the OpenSearch index contains the geolocation fields listed in include_fields, such as latitude and city_name. Additionally, you can configure template mappings in OpenSearch so that you can display these events in OpenSearch Dashboards using the Maps feature.
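As a rough sketch of the template mapping mentioned above, the opensearch sink can be given an index template that maps the location field as a geo_point. This assumes that the geoip processor writes its fields under the configured target (clientlocation) and that the sink's template_type and template_content options behave as we understand them from the sink documentation. The host and the index name web-access-logs are hypothetical placeholders.

```yaml
# Sketch only: host and index name are placeholders, and the mapping
# assumes geoip output is nested under the "clientlocation" target.
sink:
  - opensearch:
      hosts: [ "https://localhost:9200" ]
      index: web-access-logs
      template_type: index-template
      template_content: |
        {
          "template": {
            "mappings": {
              "properties": {
                "clientlocation": {
                  "properties": {
                    "location": { "type": "geo_point" }
                  }
                }
              }
            }
          }
        }
```

With a geo_point mapping in place, the Maps feature should be able to plot the field. Verify the resulting mapping in your index before relying on it.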

AWS Secrets Manager support

Data Prepper now supports AWS Secrets Manager as an extension plugin that can be used by pipeline plugins (source, buffer, processor, and sink). You can configure the AWS Secrets Manager extension in the extensions section of data-prepper-config.yaml.

The following example shows how you can configure your secrets:

  extensions:
    aws:
      secrets:
        host-secret-config:
          secret_id: <YOUR_SECRET_ID_1>
          region: <YOUR_REGION_1>
          sts_role_arn: <YOUR_STS_ROLE_ARN_1>
          refresh_interval: <YOUR_REFRESH_INTERVAL_1>
        credential-secret-config:
          secret_id: <YOUR_SECRET_ID_2>
          region: <YOUR_REGION_2>
          sts_role_arn: <YOUR_STS_ROLE_ARN_2>
          refresh_interval: <YOUR_REFRESH_INTERVAL_2>

Users can also configure secrets in the pipeline_configurations section of a pipeline YAML file.
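A minimal sketch of the pipeline-file form might look like the following. It assumes that pipeline_configurations accepts the same aws and secrets structure as the extensions section of data-prepper-config.yaml. The values in angle brackets are placeholders.

```yaml
# Sketch of a pipeline YAML file that declares its own secrets.
# Assumes the nesting mirrors the extensions section in data-prepper-config.yaml.
pipeline_configurations:
  aws:
    secrets:
      credential-secret-config:
        secret_id: <YOUR_SECRET_ID>
        region: <YOUR_REGION>
        sts_role_arn: <YOUR_STS_ROLE_ARN>
        refresh_interval: <YOUR_REFRESH_INTERVAL>
```

Declaring secrets this way keeps a pipeline file self-contained, which can help when the same pipeline is deployed to several Data Prepper instances.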

The credential-secret-config term in the example above is a user-supplied secret configuration ID. Pipeline authors can reference secrets within pipeline plugin settings using the pattern ${{aws_secrets:<my-defined-secret>}}. The following example shows how to configure an OpenSearch sink with secret values:

  - opensearch:
      hosts: [ "${{aws_secrets:host-secret-config}}" ]
      username: "${{aws_secrets:credential-secret-config:username}}"
      password: "${{aws_secrets:credential-secret-config:password}}"

In this example, secrets under credential-secret-config are assumed to be stored as the following JSON key-value pairs:

  {
    "username": "<YOUR_USERNAME>",
    "password": "<YOUR_PASSWORD>"
  }

The secret under host-secret-config is assumed to be stored as plaintext. To support secret rotation for OpenSearch, the opensearch source automatically refreshes its basic credentials (username and password) by polling for the latest secret values at the configured refresh_interval.

For more information, please see the aws extension documentation.

Note that this feature is currently experimental, and we are working to add support for refreshing and dynamically updating certain fields. In particular, the opensearch sink and the kafka plugins do not automatically refresh secrets.

Other features

  • Data Prepper can now parse XML data in fields using the parse_xml processor.
  • The new parse_ion processor can parse fields in the Amazon Ion format.
  • Some users have data that is gzip-compressed at the field level. These users can decompress those fields using the decompress processor.
  • Data Prepper can now join multiple strings into a single string, optionally separated by a delimiter.
  • The new select_entries processor allows users to select only the necessary fields from events. This can simplify how users filter unnecessary data.
  • Users who wish to reduce the size of fields in OpenSearch can use the truncate processor, which truncates strings to a configurable maximum length.
  • The file source now supports codecs. This can help you test a pipeline locally before using the s3 source.
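
To illustrate how a few of these processors might be combined, here is a hedged sketch of a processor chain. The field names (compressed_payload, order_id, status, description) are hypothetical, and the option names reflect our reading of each processor's documentation, so check them against the documentation before use.

```yaml
# Sketch only: all field names are made-up placeholders.
processor:
  # Decompress a field that was gzip-compressed at the field level.
  - decompress:
      keys: [ "compressed_payload" ]
      type: gzip
  # Parse the decompressed XML string into event fields.
  - parse_xml:
      source: compressed_payload
  # Keep only the fields needed downstream.
  - select_entries:
      include_keys: [ "order_id", "status", "description" ]
  # Cap the length of a long text field before it reaches OpenSearch.
  - truncate:
      entries:
        - source_keys: [ "description" ]
          length: 256
```

Order matters in a chain like this: select_entries runs after parsing so that the parsed fields exist when the selection is applied.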

Getting started

Thanks to our contributors!

The following people contributed to this release. Thank you!