Use flat object in OpenSearch

Tue, Jun 13, 2023 · Mingshi Liu, Fanit Kolchina

OpenSearch 2.7 introduced the new flat_object field type. This field type is useful for objects with a large number of fields or when you are not familiar with the field names in your documents. The flat_object field type treats the entire JSON object as a string. Subfields within the JSON object are accessible using the flat_object field name or the standard dot path notation, but they are not indexed for fast lookup. In this post, we explore how flat object simplifies mapping data structures and enhances the search experience in OpenSearch.

Dynamic mapping

In OpenSearch, a mapping defines the structure of your data. It specifies field names, types, and indexing and analysis settings, ensuring that your data is organized and interpreted correctly. If you don’t specify a custom mapping, OpenSearch infers the structure of your document automatically when you upload the document. This process is called dynamic mapping, where OpenSearch detects the document data structure and generates the corresponding mapping file.

When dynamic mapping falls flat

When documents have complex data structures or deeply nested fields, relying on dynamic mapping can lead to the number of mapped fields in an index quickly growing to hundreds or even thousands. This “mapping explosion” negatively impacts the performance of your cluster.

Symptoms of mapping explosion include:

  • OutOfMemoryError: If the mapping becomes too large to fit in memory, you may encounter an OutOfMemoryError, resulting in the unavailability of the cluster or nodes.
  • MapperParsingException: If the number of unique field names becomes too large, the cluster may throw exceptions such as MapperParsingException or IllegalArgumentException, indicating that the mapping update has failed.
  • Performance degradation: As the mapping grows, the performance of indexing and search operations can degrade. Handling a large number of fields requires more resources and processing time, potentially leading to slower indexing and increased query latencies.
  • Increased storage requirements: Each field in the mapping needs storage space. With a mapping explosion, the index’s storage requirements can significantly increase, impacting disk space utilization and potentially leading to resource constraints.

Additionally, searching through deeply nested indexes with lengthy dot paths can be inconvenient, especially if you are unfamiliar with the document structure. Flat object solves both of these problems.

Use case

To demonstrate a real-life use case for the flat_object field type, we’ll use the new ML Commons remote model inference project, in which you can store and search template documents. Some of the fields in the machine learning model template are user-defined key-value pairs. Because those are created by the user on the fly, it is difficult to predefine the mappings for the index that stores these documents.

Example documents

For example, consider the following two template documents that connect OpenSearch to OpenAI and Amazon Bedrock for model inference:

PUT test-index/_doc/1 
{
    "Metadata":{
        "connector_name": "OpenAI Connector",
        "description": "The connector to public OpenAI model service for GPT 3.5",
        "version": 1
    },
    "Parameters": {
        "endpoint": "api.openai.com",
        "protocol": "HTTP",
        "auth": "API_Key",
        "content_type" : "application/json",
        "model": "gpt-3.5-turbo"
    }
}
PUT test-index/_doc/2 
{
    "Metadata":{
        "connector_name": "Amazon BedRock",
        "description": "The connector to Bedrock for the generative AI models",
        "version": 2
    },
    "Parameters": {
        "label": "default_label",
        "host": "localhost",
        "port": 8080,
        "protocol": "HTTP",
        "auth": "API_Key",
        "content_type" : "application/json",
        "policy":{
            "policy_id":"p_0001",
            "policy_name":"default_policy"
        }
    }
}

Mapping without flat object

If you don’t specify mappings for test-index and let OpenSearch apply dynamic mapping, OpenSearch generates a mapping entry for every field and subfield of each document you upload. For the preceding documents, it produces the following mapping, in which you can trace every field and subfield:

{
  "test-index": {
    "mappings": {
      "properties": {
        "Metadata": {
          "properties": {
            "connector_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "version": {
              "type": "long"
            }
          }
        },
        "Parameters": {
          "properties": {
            "auth": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "content_type": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "endpoint": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "host": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "label": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "model": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "policy": {
              "properties": {
                "policy_id": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "policy_name": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "port": {
              "type": "long"
            },
            "protocol": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      }
    }
  }
}

However, a model service often has many parameters, and because every model service defines its own parameters, the number of distinct parameters keeps growing as you add model services. With dynamic mapping, every subfield becomes an indexable field, so the mapping can grow enormously.
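To get a feel for how quickly the field count grows, the following Python sketch counts the distinct leaf dot paths across the two example documents. This only approximates the number of entries dynamic mapping creates: it ignores the keyword subfield added to each text field and treats arrays as single leaves. The helper names `leaf_paths` and `mapped_paths` are hypothetical, and the `flat_fields` argument simply models a top-level field that is mapped as one field. None of this is OpenSearch code.

```python
def leaf_paths(obj, prefix=""):
    """Yield the dot path of every leaf value in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path


def mapped_paths(docs, flat_fields=()):
    """Collect distinct paths; any top-level field in flat_fields counts once."""
    paths = set()
    for doc in docs:
        for path in leaf_paths(doc):
            top = path.split(".", 1)[0]
            paths.add(top if top in flat_fields else path)
    return paths


# The two example documents from this post (descriptions shortened).
docs = [
    {
        "Metadata": {"connector_name": "OpenAI Connector",
                     "description": "...", "version": 1},
        "Parameters": {"endpoint": "api.openai.com", "protocol": "HTTP",
                       "auth": "API_Key", "content_type": "application/json",
                       "model": "gpt-3.5-turbo"},
    },
    {
        "Metadata": {"connector_name": "Amazon BedRock",
                     "description": "...", "version": 2},
        "Parameters": {"label": "default_label", "host": "localhost",
                       "port": 8080, "protocol": "HTTP", "auth": "API_Key",
                       "content_type": "application/json",
                       "policy": {"policy_id": "p_0001",
                                  "policy_name": "default_policy"}},
    },
]

dynamic_count = len(mapped_paths(docs))
flat_count = len(mapped_paths(docs, flat_fields={"Parameters"}))
print(dynamic_count, flat_count)
```

For these two documents, the sketch counts 13 distinct paths when every subfield is mapped and 4 when Parameters is treated as a single field. Every new connector with its own parameter names adds to the first number but not to the second.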

Searching without flat object

When searching for a model parameter, you need to know the dot path to the subfield in advance. For example, if you are searching for a policy with the ID p_0001, you need to use the exact dot path Parameters.policy.policy_id:

GET /test-index/_search
{
  "query": {
    "match": {"Parameters.policy.policy_id": "p_0001"}
  }
}
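If you have a sample document but don't know the path, one workaround is to walk the document on the client side before writing the query. The following Python sketch returns the dot path of every leaf that holds a given value. The `find_paths` helper is hypothetical and is not part of any OpenSearch client.

```python
def find_paths(obj, target, prefix=""):
    """Return the dot paths of all leaves whose value equals target."""
    found = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, extending the dot path.
            found.extend(find_paths(value, target, path))
        elif value == target:
            found.append(path)
    return found


# The second example document from this post (Metadata omitted).
doc = {
    "Parameters": {
        "label": "default_label",
        "host": "localhost",
        "port": 8080,
        "protocol": "HTTP",
        "auth": "API_Key",
        "content_type": "application/json",
        "policy": {"policy_id": "p_0001", "policy_name": "default_policy"},
    }
}

paths = find_paths(doc, "p_0001")
print(paths)
```

This kind of lookup has to be repeated for every document shape, which is the inconvenience that searching by the flat object field name avoids.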

Mapping with flat object

Using the flat_object field type, you can store the entire Parameters field as a string rather than as a JSON object, without specifying the names of its subfields:

PUT /test-index/
{
  "mappings": {
    "properties": {
      "Parameters": {
        "type": "flat_object"
      }
    }
  }
}

After uploading the same documents, you can check the mappings for test-index:

GET /test-index/_mappings

The Parameters field is now mapped as a single flat_object field. Its subfields do not get mapping entries of their own, which effectively prevents a mapping explosion:

{
  "test-index": {
    "mappings": {
      "properties": {
        "Metadata": {
          "properties": {
            "connector_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "version": {
              "type": "long"
            }
          }
        },
        "Parameters": {
          "type": "flat_object"
        }
      }
    }
  }
}

Searching with flat object

When searching for a model parameter, you can use the flat_object field name, Parameters:

GET /test-index/_search
{
  "query": {
    "match": {"Parameters": "p_0001"}
  }
}

Alternatively, you can choose to use the standard dot path notation for convenient exact match search:

GET /test-index/_search
{
  "query": {
    "match": {"Parameters.policy.policy_id": "p_0001"}
  }
}

In both cases, the correct document is returned:

{
  "took" : 142,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0601075,
    "hits" : [
      {
        "_index" : "test-index",
        "_id" : "2",
        "_score" : 1.0601075,
        "_source" : {
          "Metadata" : {
            "connector_name" : "Amazon BedRock",
            "description" : "The connector to Bedrock for the generative AI models",
            "version" : 2
          },
          "Parameters" : {
            "label" : "default_label",
            "host" : "localhost",
            "port" : 8080,
            "protocol" : "HTTP",
            "auth" : "API_Key",
            "content_type" : "application/json",
            "policy" : {
              "policy_id" : "p_0001",
              "policy_name" : "default_policy"
            }
          }
        }
      }
    ]
  }
}
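One way to picture why both queries find the same document is to treat the flat object as a collection of leaf values plus a collection of path and value pairs. The Python sketch below is a toy model of that idea, assuming exact, case-sensitive matching on values converted to strings. It is not how OpenSearch implements flat_object, and `flatten` and `matches` are hypothetical helper names.

```python
def flatten(obj, prefix):
    """Yield (dot_path, value_as_string) for every leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, str(value)


def matches(doc, flat_field, query_field, term):
    """Toy exact-match check against one flat field of a document."""
    pairs = list(flatten(doc[flat_field], flat_field))
    if query_field == flat_field:
        # Query on the field name: compare against every leaf value.
        return any(value == term for _, value in pairs)
    # Query on a dot path: both the path and the value must agree.
    return (query_field, term) in pairs


# Parameters of the two example documents, keyed by document ID.
docs = {
    "1": {"Parameters": {"endpoint": "api.openai.com", "protocol": "HTTP",
                         "auth": "API_Key",
                         "content_type": "application/json",
                         "model": "gpt-3.5-turbo"}},
    "2": {"Parameters": {"label": "default_label", "host": "localhost",
                         "port": 8080, "protocol": "HTTP", "auth": "API_Key",
                         "content_type": "application/json",
                         "policy": {"policy_id": "p_0001",
                                    "policy_name": "default_policy"}}},
}

by_name = [doc_id for doc_id, doc in docs.items()
           if matches(doc, "Parameters", "Parameters", "p_0001")]
by_path = [doc_id for doc_id, doc in docs.items()
           if matches(doc, "Parameters", "Parameters.policy.policy_id", "p_0001")]
print(by_name, by_path)
```

In this model, both lookups select only document 2, mirroring the response above. The field-name query trades precision for convenience: it matches the value wherever it appears in the object, while the dot path query also pins down where the value lives.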

Next steps

For more information about capabilities and limitations of flat object, see the flat object documentation.

We’re adding the ability to search subfields in a flat object using a Painless script. See the GitHub issue for details. Also, we are adding support for open parameters to flat object.

To learn more about the new ML Commons remote model inference project mentioned in this post, see Extensibility for OpenSearch Machine Learning.

Our contributors

The following community members contributed to the flat object implementation:

Thank you for your contribution!