Derived field type
Introduced 2.15
Derived fields allow you to create new fields dynamically by executing scripts on existing fields. The existing fields can be retrieved either from the _source field, which contains the original document, or from a field’s doc values. Once you define a derived field, either in an index mapping or within a search request, you can use the field in a query in the same way you would use a regular field.
When to use derived fields
Derived fields offer flexibility in field manipulation and prioritize storage efficiency. However, because they are computed at query time, they can reduce query performance. Derived fields are particularly useful in scenarios requiring real-time data transformation, such as:
- Log analysis: Extracting timestamps and log levels from log messages.
- Performance metrics: Calculating response times from start and end timestamps.
- Security analytics: Real-time IP geolocation and user-agent parsing for threat detection.
- Experimental use cases: Testing new data transformations, creating temporary fields for A/B testing, or generating one-time reports without altering mappings or reindexing data.
Despite the potential performance impact of query-time computations, the flexibility and storage efficiency of derived fields make them a valuable tool for these applications.
Current limitations
Currently, derived fields have the following limitations:
- Scoring and sorting: Not yet supported.
- Aggregations: Starting with OpenSearch 2.17, derived fields support most aggregation types. The following aggregations are not supported: geographic (geodistance, geohash grid, geohex grid, geotile grid, geobounds, geocentroid), significant terms, significant text, and scripted metric.
- Dashboard support: These fields are not displayed in the list of available fields in OpenSearch Dashboards. However, you can still use them for filtering if you know the derived field name.
- Chained derived fields: One derived field cannot be used to define another derived field.
- Join field type: Derived fields are not supported for the join field type.
We are planning to address these limitations in future versions.
Prerequisites
Before using a derived field, be sure to satisfy the following prerequisites:
- Enable _source or doc_values: Ensure that either the _source field or doc values is enabled for the fields used in your script.
- Enable expensive queries: Ensure that search.allow_expensive_queries is set to true.
- Feature control: Derived fields are enabled by default. You can enable or disable derived fields by using the following settings:
  - Index level: Update the index.query.derived_field.enabled setting.
  - Cluster level: Update the search.derived_field.enabled setting.
  Both settings are dynamic, so they can be changed without reindexing or node restarts.
- Performance considerations: Before using derived fields, evaluate the performance implications to ensure that derived fields meet your scale requirements.
Defining derived fields
You can define derived fields in index mappings or directly within a search request.
Example setup
To try the examples on this page, first create the following logs index:
PUT logs
{
"mappings": {
"properties": {
"request": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"clientip": {
"type": "keyword"
}
}
}
}
Add sample documents to the index:
POST _bulk
{ "index" : { "_index" : "logs", "_id" : "1" } }
{ "request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778", "clientip": "61.177.2.0" }
{ "index" : { "_index" : "logs", "_id" : "2" } }
{ "request": "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474", "clientip": "185.92.2.0" }
{ "index" : { "_index" : "logs", "_id" : "3" } }
{ "request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711", "clientip": "61.177.2.0" }
{ "index" : { "_index" : "logs", "_id" : "4" } }
{ "request": "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140", "clientip": "129.178.2.0" }
{ "index" : { "_index" : "logs", "_id" : "5" } }
{ "request": "894470400 DELETE /images/102384s.gif HTTP/1.0 200 785", "clientip": "227.177.2.0" }
Defining derived fields in index mappings
To derive the timestamp, method, and size fields from the request field indexed in the logs index, configure the following mappings:
PUT /logs/_mapping
{
"derived": {
"timestamp": {
"type": "date",
"format": "MM/dd/yyyy",
"script": {
"source": """
emit(Long.parseLong(doc["request.keyword"].value.splitOnToken(" ")[0]))
"""
}
},
"method": {
"type": "keyword",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[1])
"""
}
},
"size": {
"type": "long",
"script": {
"source": """
emit(Long.parseLong(doc["request.keyword"].value.splitOnToken(" ")[5]))
"""
}
}
}
}
Note that the timestamp field has an additional format parameter that specifies the format in which to display date fields. If you don’t include a format parameter, then the format defaults to strict_date_time_no_millis. For more information about supported date formats, see Parameters.
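To see what each script in the preceding mapping emits, the following Python sketch applies the same tokenization to the first sample document. It is only a client-side model of the Painless logic (Python's split stands in for splitOnToken) and does not run inside OpenSearch:

```python
# Client-side model of the three derived field scripts; not OpenSearch code.
request = "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778"

# Mirrors doc["request.keyword"].value.splitOnToken(" ")
tokens = request.split(" ")

derived = {
    "timestamp": int(tokens[0]),  # token 0: epoch milliseconds, emitted as a long
    "method": tokens[1],          # token 1: HTTP method, emitted as a string
    "size": int(tokens[5]),       # token 5: response size, emitted as a long
}

print(derived)
```

Each script reads one position of the space-separated request line, so a document whose request has fewer tokens would fail in the script rather than produce an empty value.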
Parameters
The following table lists the parameters accepted by derived field types. All parameters are dynamic and can be modified without reindexing documents.
Parameter | Required/Optional | Description |
---|---|---|
type | Required | The type of the derived field. Supported types are boolean , date , geo_point , ip , keyword , text , long , double , float , and object . |
script | Required | The script associated with the derived field. Any value emitted from the script must be emitted using emit() . The type of the emitted value must match the type of the derived field. Scripts have access to both the doc_values and _source fields if those are enabled. The doc value of a field can be accessed using doc['field_name'].value , and the source can be accessed using params._source["field_name"] . |
format | Optional | The format used for parsing dates. Only applicable to date fields. Valid values are strict_date_time_no_millis , strict_date_optional_time , and epoch_millis . For more information, see Formats. |
ignore_malformed | Optional | A Boolean value that specifies whether to ignore malformed values when running a query on a derived field. Default value is false (throw an exception when encountering malformed values). |
prefilter_field | Optional | An indexed text field provided to boost the performance of derived fields. Specifies an existing indexed field on which to filter prior to filtering on the derived field. For more information, see Prefilter field. |
Emitting values in scripts
The emit() function is available only within the derived field script context. It is used to emit one or multiple (for a multi-valued field) script values for a document on which the script runs.
The following table lists the emit() function formats for the supported field types.
Type | Emit format | Multi-valued fields supported |
---|---|---|
boolean | emit(boolean) | No |
double | emit(double) | Yes |
date | emit(long timeInMillis) | Yes |
float | emit(float) | Yes |
geo_point | emit(double lat, double lon) | Yes |
ip | emit(String ip) | Yes |
keyword | emit(String) | Yes |
long | emit(long) | Yes |
object | emit(String json) (valid JSON) | Yes |
text | emit(String) | Yes |
By default, a type mismatch between a derived field and its emitted value will result in the search request failing with an error. If ignore_malformed is set to true, then the failing document is skipped and the search request succeeds.
The size limit of the emitted values is 1 MB per document.
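The effect of ignore_malformed described above can be modeled in a few lines of Python. This is a simplified sketch, not the OpenSearch implementation: the evaluate function, the sample documents, and the malformed size value are all hypothetical and exist only to illustrate the two outcomes.

```python
def evaluate(docs, script, expected_type, ignore_malformed=False):
    """Run `script` on every document and type-check the emitted value."""
    values = {}
    for doc_id, source in docs.items():
        value = script(source)
        if not isinstance(value, expected_type):
            if ignore_malformed:
                continue  # models the failing document being skipped
            # models the whole search request failing with an error
            raise TypeError(f"document {doc_id} emitted {value!r}")
        values[doc_id] = value
    return values


def size_script(source):
    # Emits a number when the last token is numeric, otherwise the raw string.
    token = source["request"].split(" ")[5]
    return int(token) if token.isdigit() else token


docs = {
    "1": {"request": "894030400 GET /a.gif HTTP/1.0 200 778"},
    "2": {"request": "894140400 GET /b.gif HTTP/1.1 200 -"},  # hypothetical bad size
}

lenient = evaluate(docs, size_script, int, ignore_malformed=True)
print(lenient)
```

With ignore_malformed left at its default of false, the same call raises on document 2 instead of returning a partial result.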
Searching derived fields defined in index mappings
To search derived fields, use the same syntax as when searching regular fields. For example, the following request searches for documents whose derived timestamp field falls within the specified range:
POST /logs/_search
{
"query": {
"range": {
"timestamp": {
"gte": "1970-01-11T08:20:30.400Z",
"lte": "1970-01-11T08:26:00.400Z"
}
}
},
"fields": ["timestamp"]
}
The response contains the matching documents:
Response
{
"took": 315,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "logs",
"_id": "1",
"_score": 1,
"_source": {
"request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
"clientip": "61.177.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:20:30.400Z"
]
}
},
{
"_index": "logs",
"_id": "2",
"_score": 1,
"_source": {
"request": "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474",
"clientip": "185.92.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:22:20.400Z"
]
}
},
{
"_index": "logs",
"_id": "3",
"_score": 1,
"_source": {
"request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
"clientip": "61.177.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:24:10.400Z"
]
}
},
{
"_index": "logs",
"_id": "4",
"_score": 1,
"_source": {
"request": "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140",
"clientip": "129.178.2.0"
},
"fields": {
"timestamp": [
"1970-01-11T08:26:00.400Z"
]
}
}
]
}
}
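The timestamps in the preceding response follow from reading the first token of each request as epoch milliseconds. The following Python sketch reproduces that conversion and the range check on the client side. It assumes the emitted long is interpreted as milliseconds since the epoch in UTC, as the emit(long) format for date fields indicates, and the helper name millis_to_iso is made up for this example:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_iso(epoch_millis):
    """Format epoch milliseconds as an ISO 8601 UTC string with a Z suffix."""
    moment = EPOCH + timedelta(milliseconds=epoch_millis)
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# First token of each sample document's request field.
emitted = {"1": 894030400, "2": 894140400, "3": 894250400,
           "4": 894360400, "5": 894470400}

# Same bounds as the range query, expressed in epoch milliseconds.
gte, lte = 894030400, 894360400
matches = [doc_id for doc_id, ms in emitted.items() if gte <= ms <= lte]

for doc_id in matches:
    print(doc_id, millis_to_iso(emitted[doc_id]))
```

Document 5 falls just past the upper bound, which is why the response contains four hits rather than five.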
Defining and searching derived fields in a search request
You can also define derived fields directly in a search request and query them along with regular indexed fields. For example, the following request creates the url and status derived fields and searches those fields along with the regular request and clientip fields:
POST /logs/_search
{
"derived": {
"url": {
"type": "text",
"script": {
"source": """
emit(doc["request"].value.splitOnToken(" ")[2])
"""
}
},
"status": {
"type": "keyword",
"script": {
"source": """
emit(doc["request"].value.splitOnToken(" ")[4])
"""
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"clientip": "61.177.2.0"
}
},
{
"match": {
"url": "images"
}
},
{
"term": {
"status": "200"
}
}
]
}
},
"fields": ["request", "clientip", "url", "status"]
}
The response contains the matching documents:
Response
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2.8754687,
"hits": [
{
"_index": "logs",
"_id": "1",
"_score": 2.8754687,
"_source": {
"request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/images/france98_venues.gif"
],
"status": [
"200"
]
}
},
{
"_index": "logs",
"_id": "3",
"_score": 2.8754687,
"_source": {
"request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/venues/images/venue_header.gif"
],
"status": [
"200"
]
}
}
]
}
}
Derived fields use the default analyzer specified in the index analysis settings during search. You can override the default analyzer or specify a search analyzer within a search request in the same way as with regular fields. For more information, see Analyzers.
When both an index mapping and a search definition are present for a field, the search definition takes precedence.
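The precedence rule can be pictured as a dictionary merge in which request-level definitions are applied last. The following Python sketch is an illustration only: the text-typed method override is hypothetical, and treating the override as a replacement of the whole definition is an assumption of this sketch rather than documented behavior.

```python
# Derived fields defined in the index mapping (scripts taken from this page).
mapping_derived = {
    "method": {
        "type": "keyword",
        "script": 'emit(doc["request.keyword"].value.splitOnToken(" ")[1])',
    },
    "size": {
        "type": "long",
        "script": 'emit(Long.parseLong(doc["request.keyword"].value.splitOnToken(" ")[5]))',
    },
}

# Hypothetical definition supplied in a search request for the same field name.
request_derived = {
    "method": {
        "type": "text",
        "script": 'emit(doc["request.keyword"].value.splitOnToken(" ")[1])',
    },
}

# Later entries win, so the search definition takes precedence.
effective = {**mapping_derived, **request_derived}

print(effective["method"]["type"])
print(effective["size"]["type"])
```

Fields that appear only in the mapping, such as size here, are unaffected by the request-level definitions.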
Retrieving fields
You can retrieve derived fields using the fields parameter in the search request in the same way as with regular fields, as shown in the preceding examples. You can also use wildcards to retrieve all derived fields that match a given pattern.
Highlighting
Derived fields of type text support highlighting using the unified highlighter. For example, the following request highlights the derived url field:
POST /logs/_search
{
"derived": {
"url": {
"type": "text",
"script": {
"source": """
emit(doc["request"].value.splitOnToken(" ")[2])
"""
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"clientip": "61.177.2.0"
}
},
{
"match": {
"url": "images"
}
}
]
}
},
"fields": ["request", "clientip", "url"],
"highlight": {
"fields": {
"url": {}
}
}
}
The response specifies highlighting in the url field:
Response
{
"took": 45,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.8754687,
"hits": [
{
"_index": "logs",
"_id": "1",
"_score": 1.8754687,
"_source": {
"request": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/images/france98_venues.gif"
]
},
"highlight": {
"url": [
"/english/<em>images</em>/france98_venues.gif"
]
}
},
{
"_index": "logs",
"_id": "3",
"_score": 1.8754687,
"_source": {
"request": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
"clientip": "61.177.2.0"
},
"fields": {
"request": [
"894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711"
],
"clientip": [
"61.177.2.0"
],
"url": [
"/english/venues/images/venue_header.gif"
]
},
"highlight": {
"url": [
"/english/venues/<em>images</em>/venue_header.gif"
]
}
}
]
}
}
Aggregations
Starting with OpenSearch 2.17, derived fields support most aggregation types.
Geographic, significant terms, significant text, and scripted metric aggregations are not supported.
For example, the following request creates a simple terms aggregation on the method derived field:
POST /logs/_search
{
"size": 0,
"aggs": {
"methods": {
"terms": {
"field": "method"
}
}
}
}
The response contains the following buckets:
Response
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"methods" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "GET",
"doc_count" : 2
},
{
"key" : "POST",
"doc_count" : 2
},
{
"key" : "DELETE",
"doc_count" : 1
}
]
}
}
}
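As a cross-check of the preceding buckets, the following Python sketch derives the method value from each sample request and counts the occurrences. It is a client-side model of the terms aggregation over the five sample documents, not a description of how OpenSearch computes it:

```python
from collections import Counter

# The request field of the five sample documents.
requests = [
    "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
    "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474",
    "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
    "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140",
    "894470400 DELETE /images/102384s.gif HTTP/1.0 200 785",
]

# Token 1 is the value that the method derived field emits.
methods = Counter(request.split(" ")[1] for request in requests)

# Shape the counts like terms buckets, largest first.
buckets = [{"key": key, "doc_count": count} for key, count in methods.most_common()]
print(buckets)
```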
Performance
Derived fields are not indexed but are computed dynamically by retrieving values from the _source field or doc values. As a result, queries on derived fields run more slowly than queries on indexed fields. To improve performance, try the following:
- Prune the search space by adding query filters on indexed fields in conjunction with derived fields.
- Use doc values instead of _source in the script for faster access, whenever applicable.
- Consider using a prefilter_field to automatically prune the search space without explicit filters in the search request.
Prefilter field
Specifying a prefilter field helps to prune the search space without adding explicit filters in the search request. The prefilter field specifies an existing indexed field (prefilter_field) on which to filter automatically when constructing the query. The prefilter_field must be a text field (either text or match_only_text).
For example, you can add a prefilter_field to the method derived field. Update the index mapping so that it prefilters on the request field:
PUT /logs/_mapping
{
"derived": {
"method": {
"type": "keyword",
"script": {
"source": """
emit(doc["request.keyword"].value.splitOnToken(" ")[1])
"""
},
"prefilter_field": "request"
}
}
}
Now search using a query on the method derived field:
POST /logs/_search
{
"profile": true,
"query": {
"term": {
"method": {
"value": "GET"
}
}
},
"fields": ["method"]
}
OpenSearch automatically adds a filter on the request field to your query:
"#request:GET #DerivedFieldQuery (Query: [ method:GET])"
You can use the profile option to analyze derived field performance, as shown in the preceding example.
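To illustrate why prefiltering helps, the following Python sketch counts how many times the derived field script has to run with and without a prefilter. It is a simplified model under stated assumptions: the search_method function is hypothetical, and lowercasing and whitespace splitting stand in for the analyzed text index on the request field.

```python
docs = {
    "1": "894030400 GET /english/images/france98_venues.gif HTTP/1.0 200 778",
    "2": "894140400 GET /french/playing/mascot/mascot.html HTTP/1.1 200 5474",
    "3": "894250400 POST /english/venues/images/venue_header.gif HTTP/1.0 200 711",
    "4": "894360400 POST /images/home_fr_button.gif HTTP/1.1 200 2140",
    "5": "894470400 DELETE /images/102384s.gif HTTP/1.0 200 785",
}


def search_method(docs, wanted, use_prefilter):
    """Return matching IDs and the number of script evaluations performed."""
    script_runs = 0
    hits = []
    for doc_id, request in docs.items():
        # Cheap stage: term lookup on the indexed field (modeled as a token check).
        if use_prefilter and wanted.lower() not in request.lower().split():
            continue
        # Expensive stage: run the derived field script on the survivors.
        script_runs += 1
        if request.split(" ")[1] == wanted:
            hits.append(doc_id)
    return hits, script_runs


print(search_method(docs, "GET", use_prefilter=False))
print(search_method(docs, "GET", use_prefilter=True))
```

Both calls return the same hits, but the prefiltered call evaluates the script on two documents instead of five. The saving grows with the share of documents that the indexed filter can rule out.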
Derived object fields
A script can emit a valid JSON object so that you can query subfields without indexing them, in the same way as with regular fields. This is useful for large JSON objects that require occasional searches on some subfields. In this case, indexing the subfields is expensive, while defining derived fields for each subfield also adds a lot of resource overhead. If you don’t explicitly provide the subfield type, then the subfield type is inferred.
For example, the following request defines a derived_request_object derived field as an object type:
PUT logs_object
{
"mappings": {
"properties": {
"request_object": { "type": "text" }
},
"derived": {
"derived_request_object": {
"type": "object",
"script": {
"source": "emit(params._source[\"request_object\"])"
}
}
}
}
}
Consider the following documents, in which the request_object is a string representation of a JSON object:
POST _bulk
{ "index" : { "_index" : "logs_object", "_id" : "1" } }
{ "request_object": "{\"@timestamp\": 894030400, \"clientip\":\"61.177.2.0\", \"request\": \"GET /english/venues/images/venue_header.gif HTTP/1.0\", \"status\": 200, \"size\": 711}" }
{ "index" : { "_index" : "logs_object", "_id" : "2" } }
{ "request_object": "{\"@timestamp\": 894140400, \"clientip\":\"129.178.2.0\", \"request\": \"GET /images/home_fr_button.gif HTTP/1.1\", \"status\": 200, \"size\": 2140}" }
{ "index" : { "_index" : "logs_object", "_id" : "3" } }
{ "request_object": "{\"@timestamp\": 894240400, \"clientip\":\"227.177.2.0\", \"request\": \"GET /images/102384s.gif HTTP/1.0\", \"status\": 400, \"size\": 785}" }
{ "index" : { "_index" : "logs_object", "_id" : "4" } }
{ "request_object": "{\"@timestamp\": 894340400, \"clientip\":\"61.177.2.0\", \"request\": \"GET /english/images/venue_bu_city_on.gif HTTP/1.0\", \"status\": 400, \"size\": 1397}\n" }
{ "index" : { "_index" : "logs_object", "_id" : "5" } }
{ "request_object": "{\"@timestamp\": 894440400, \"clientip\":\"132.176.2.0\", \"request\": \"GET /french/news/11354.htm HTTP/1.0\", \"status\": 200, \"size\": 3460, \"is_active\": true}" }
The following query searches the @timestamp subfield of the derived_request_object:
POST /logs_object/_search
{
"query": {
"range": {
"derived_request_object.@timestamp": {
"gte": "894030400",
"lte": "894140400"
}
}
},
"fields": ["derived_request_object.@timestamp"]
}
The response contains the matching documents:
Response
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "logs_object",
"_id": "1",
"_score": 1,
"_source": {
"request_object": """{"@timestamp": 894030400, "clientip":"61.177.2.0", "request": "GET /english/venues/images/venue_header.gif HTTP/1.0", "status": 200, "size": 711}"""
},
"fields": {
"derived_request_object.@timestamp": [
894030400
]
}
},
{
"_index": "logs_object",
"_id": "2",
"_score": 1,
"_source": {
"request_object": """{"@timestamp": 894140400, "clientip":"129.178.2.0", "request": "GET /images/home_fr_button.gif HTTP/1.1", "status": 200, "size": 2140}"""
},
"fields": {
"derived_request_object.@timestamp": [
894140400
]
}
}
]
}
}
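Conceptually, a derived object field parses the emitted JSON string for each document and then evaluates the query against the requested subfield. The following Python sketch models that for the preceding range query. The JSON strings are abbreviated versions of the sample documents (the request subfield is left out), and range_on_subfield is a hypothetical helper, not an OpenSearch API:

```python
import json

# Abbreviated request_object strings from the sample documents.
docs = {
    "1": '{"@timestamp": 894030400, "clientip": "61.177.2.0", "status": 200, "size": 711}',
    "2": '{"@timestamp": 894140400, "clientip": "129.178.2.0", "status": 200, "size": 2140}',
    "3": '{"@timestamp": 894240400, "clientip": "227.177.2.0", "status": 400, "size": 785}',
    "4": '{"@timestamp": 894340400, "clientip": "61.177.2.0", "status": 400, "size": 1397}',
    "5": '{"@timestamp": 894440400, "clientip": "132.176.2.0", "status": 200, "size": 3460}',
}


def range_on_subfield(docs, subfield, gte, lte):
    """Parse each JSON string and keep documents whose subfield is in range."""
    hits = []
    for doc_id, raw in docs.items():
        parsed = json.loads(raw)  # the emitted value must be valid JSON
        value = parsed.get(subfield)
        if value is not None and gte <= value <= lte:
            hits.append(doc_id)
    return hits


print(range_on_subfield(docs, "@timestamp", 894030400, 894140400))
```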
You can also highlight a derived object field:
POST /logs_object/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"derived_request_object.clientip": "61.177.2.0"
}
},
{
"match": {
"derived_request_object.request": "images"
}
}
]
}
},
"fields": ["derived_request_object.*"],
"highlight": {
"fields": {
"derived_request_object.request": {}
}
}
}
The response adds highlighting to the derived_request_object.request field:
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2,
"hits": [
{
"_index": "logs_object",
"_id": "1",
"_score": 2,
"_source": {
"request_object": """{"@timestamp": 894030400, "clientip":"61.177.2.0", "request": "GET /english/venues/images/venue_header.gif HTTP/1.0", "status": 200, "size": 711}"""
},
"fields": {
"derived_request_object.request": [
"GET /english/venues/images/venue_header.gif HTTP/1.0"
],
"derived_request_object.clientip": [
"61.177.2.0"
]
},
"highlight": {
"derived_request_object.request": [
"GET /english/venues/<em>images</em>/venue_header.gif HTTP/1.0"
]
}
},
{
"_index": "logs_object",
"_id": "4",
"_score": 2,
"_source": {
"request_object": """{"@timestamp": 894340400, "clientip":"61.177.2.0", "request": "GET /english/images/venue_bu_city_on.gif HTTP/1.0", "status": 400, "size": 1397}
"""
},
"fields": {
"derived_request_object.request": [
"GET /english/images/venue_bu_city_on.gif HTTP/1.0"
],
"derived_request_object.clientip": [
"61.177.2.0"
]
},
"highlight": {
"derived_request_object.request": [
"GET /english/<em>images</em>/venue_bu_city_on.gif HTTP/1.0"
]
}
}
]
}
}
Inferred subfield type
Type inference is based on the same logic as Dynamic mapping. However, instead of inferring the subfield type from the first document, OpenSearch infers it from a random sample of documents. If the subfield isn’t found in any document in the sample, type inference fails and a warning is logged. For subfields that seldom occur in documents, consider defining the field type explicitly. Relying on dynamic type inference for such subfields may cause a query to return no results, as it would for a missing field.
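The following Python sketch shows why a rarely occurring subfield can defeat sample-based inference. It is a toy model only: the infer_type helper, the sample size of 2, and the mapping from Python types to field type names are assumptions made for illustration and do not describe how OpenSearch draws or sizes its sample.

```python
import json
from math import comb


def infer_type(sample, subfield):
    """Return a field type name from the first sampled document that has the subfield."""
    for raw in sample:
        value = json.loads(raw).get(subfield)
        if value is None:
            continue
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "boolean"
        if isinstance(value, int):
            return "long"
        if isinstance(value, float):
            return "float"
        return "text"
    return None  # subfield absent from the whole sample: inference fails


common = '{"status": 200, "size": 711}'
rare = '{"status": 200, "size": 3460, "is_active": true}'
docs = [common, common, common, common, rare]

print(infer_type(docs[:2], "is_active"))  # sample that misses the rare document
print(infer_type(docs, "is_active"))      # sample that includes it

# Chance that a random sample of 2 out of 5 documents misses the one rare document.
miss_probability = comb(4, 2) / comb(5, 2)
print(miss_probability)
```

In this toy setup the subfield is missed more often than it is found, which is the situation in which an explicit subfield type is the safer choice.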
Explicit subfield type
To define the explicit subfield type, provide the type parameter in the properties object. In the following example, the derived_request_object.is_active field is defined as boolean. Because this field is present in only one of the documents, its type inference might fail, so it’s important to define the explicit type:
POST /logs_object/_search
{
"derived": {
"derived_request_object": {
"type": "object",
"script": {
"source": "emit(params._source[\"request_object\"])"
},
"properties": {
"is_active": "boolean"
}
}
},
"query": {
"term": {
"derived_request_object.is_active": true
}
},
"fields": ["derived_request_object.is_active"]
}
The response contains the matching documents:
Response
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "logs_object",
"_id": "5",
"_score": 1,
"_source": {
"request_object": """{"@timestamp": 894440400, "clientip":"132.176.2.0", "request": "GET /french/news/11354.htm HTTP/1.0", "status": 200, "size": 3460, "is_active": true}"""
},
"fields": {
"derived_request_object.is_active": [
true
]
}
}
]
}
}