Announcing Data Prepper 2.10.0

Tue, Oct 15, 2024 · Souvik Bose, David Venable

Introduction

Data Prepper 2.10 is now available! Two major features include a source that lets clients send data to Data Prepper using an API mimicking the OpenSearch _bulk API and the ability to read from Amazon Kinesis Data Streams.

OpenSearch API source

Many OpenSearch clients that currently ingest directly into OpenSearch can now send that data to Data Prepper first. This means that you can use Data Prepper’s buffering and rich processor set before sending data to OpenSearch without having to change clients that already use the OpenSearch _bulk API.
A new Data Prepper source named opensearch_api accepts OpenSearch Document API bulk operation requests from REST clients and ingests the data into OpenSearch. Its behavior is quite similar to that of the existing http source: it supports industry-standard encryption in the form of TLS/HTTPS as well as HTTP basic authentication. The source parses incoming requests into Data Prepper events with associated event metadata, making it compatible with the opensearch sink. The request body is compatible with the OpenSearch Document API bulk operation and supports all four actions: index, create, delete, and update.
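
For example, to enable TLS and HTTP basic authentication on the source, the configuration might look like the following sketch. The ssl and authentication option names here are assumptions modeled on the existing http source; check the opensearch_api source documentation for the exact names.

version: "2"
opensearch-api-pipeline:
  source:
    opensearch_api:
      # Assumed TLS options, mirroring the http source.
      ssl: true
      ssl_certificate_file: "/etc/data-prepper/tls/cert.pem"
      ssl_key_file: "/etc/data-prepper/tls/key.pem"
      # Assumed HTTP basic authentication block, as in the http source.
      authentication:
        http_basic:
          username: "ingest-user"
          password: "ingest-password"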

The following two API endpoints are now supported:

POST _bulk
POST <index>/_bulk

The second endpoint specifies the index in the path, so you don’t need to include it in the request body.
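
For example, the following request targets the movies index through the path, so the action line can omit _index:

POST movies/_bulk
{ "index": { "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }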

Additionally, the following OpenSearch Document API bulk operation query parameters are supported:

  • pipeline
  • routing
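
For example, a client can route documents and apply an ingest pipeline by appending these parameters to the request. The pipeline name my-ingest-pipeline and routing value user-1 are placeholders:

POST _bulk?pipeline=my-ingest-pipeline&routing=user-1
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }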

The following example demonstrates how to use the source:

version: "2"
opensearch-api-pipeline:
  source:
    opensearch_api:
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        insecure: true
        username: "admin"
        password: "admin"
        index: "${getMetadata(\"opensearch_index\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        document_id: "${getMetadata(\"opensearch_id\")}"
        routing: "${getMetadata(\"opensearch_routing\")}"
        pipeline: "${getMetadata(\"opensearch_pipeline\")}"

Consider the following example request:

POST _bulk
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }

Data Prepper will ingest this request into OpenSearch, creating a new document in the movies index with the document ID tt1979320 and the document source { "title": "Rush", "year": 2013 }.

The Data Prepper maintainers are interested in further expanding this source to support other indexing APIs, allowing it to stand in for an OpenSearch cluster in ingestion workloads. To learn more or provide feedback, see Provide an OpenSearch API source #4180.

Kinesis source

Amazon Kinesis Data Streams is a high-throughput streaming data service. Data Prepper 2.10 introduces a new source named kinesis that ingests stream records from one or more Kinesis data streams into OpenSearch clusters. You can configure it to start reading from either the oldest untrimmed record or the most recent record. Moreover, if you enable end-to-end acknowledgments, the source checkpoints each stream to prevent duplicate processing of records, as shown in the second example below.

The following is an example pipeline:

version: "2"
kinesis-pipeline:
  source:
    kinesis:
      codec:
        newline:
      streams:
        - stream_name: "MyStream1"
          initial_position: LATEST
          checkpoint_interval: "PT5M"
        - stream_name: "MyStream2"
          # Set initial_position to EARLIEST to start reading from the beginning of the stream.
          initial_position: EARLIEST
          consumer_strategy: "polling"
          polling:
            max_polling_records: 100
            idle_time_between_reads: "250ms"
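
To enable end-to-end acknowledgments, set the acknowledgments flag on the source. This minimal sketch assumes kinesis uses the standard source-level acknowledgments option that other Data Prepper sources expose:

version: "2"
kinesis-pipeline:
  source:
    kinesis:
      # Assumed standard source-level flag: streams are checkpointed
      # only after the sink acknowledges delivery of the events.
      acknowledgments: true
      codec:
        newline:
      streams:
        - stream_name: "MyStream1"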

Other features and improvements

Data Prepper 2.10 has introduced a number of other improvements:

  • The kafka source now supports authentication with an Apache Kafka cluster using SASL/SCRAM in addition to the SASL/PLAIN authentication provided in previous versions (see the first sketch after this list).
  • Data Prepper can now parse OpenTelemetry logs from sources such as Amazon Simple Storage Service (Amazon S3). The new otel_logs codec parses data from OpenTelemetry Protocol (OTLP) JSON-formatted files, so you can write OpenTelemetry logs with the AWS S3 Exporter for OpenTelemetry Collector and read them using Data Prepper (see the second sketch after this list).
  • Additionally, the maintainers have improved performance by adding an internal cache for event keys. Data Prepper administrators can configure this cache as necessary.
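
The following sketch shows what SASL/SCRAM authentication for the kafka source might look like. The scram block and its mechanism, username, and password settings are assumptions modeled on the source's existing sasl plain block; consult the kafka source documentation for the exact option names.

version: "2"
kafka-pipeline:
  source:
    kafka:
      bootstrap_servers: ["kafka-broker:9092"]
      topics:
        - name: "my-topic"
          group_id: "data-prepper-group"
      authentication:
        sasl:
          # Assumed SCRAM block, analogous to the existing plain block.
          scram:
            mechanism: "SCRAM-SHA-512"
            username: "kafka-user"
            password: "kafka-password"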
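
And here is a sketch of reading OTLP JSON log files from Amazon S3 with the new otel_logs codec. The SQS queue URL and region are placeholders; the surrounding s3 source options follow that source's documented shape.

version: "2"
otel-logs-pipeline:
  source:
    s3:
      notification_type: "sqs"
      # Placeholder queue that receives S3 event notifications for new log files.
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/otel-logs-queue"
      aws:
        region: "us-east-1"
      # Parses OTLP JSON-formatted files written by the
      # AWS S3 Exporter for OpenTelemetry Collector.
      codec:
        otel_logs:
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        index: "otel-logs"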

Next steps

Thanks to our contributors!

The following community members contributed to this release. Thank you!