Streaming Analytics in OpenSearch

Mon, Aug 02, 2021 · Sudipto Guha, Joshua Tokle

This post introduces an updated Random Cut Forest library, RCF 2.0. OpenSearch uses RCF 1.0 in its existing High Cardinality Anomaly Detection (HCAD) framework [1,2,3]. The sequel discusses the perspective motivating RCF 2.0.

Anomaly detection is a quintessential search problem. A typical use case corresponds to a high cardinality dataset, where some attributes split the data into a large number of individual, and potentially incomparable, time series and one simultaneously monitors each of those time series. Anomalies are often only explainable in the context of past data specific to each time series. The use case exemplifies the needle-in-an-unfamiliar-haystack search task. The HCAD in OpenSearch [2,3] provides users an out of the box solution for this use case. See [4] for discussions on using the feature in OpenSearch.

However a truly high cardinality scenario requires that models are stored on disk and loaded into memory on demand to save heap space. Loading the models repeatedly however, hearkens back to rebuilding models on every data value which is a bottleneck for scale. Imagine rebuilding hash tables every time a new value is seen — further, imagine rebuilding an ensemble of 50 such hash tables with the same entries in them!

RCF is a subclass of Random Forests and some of the critical criteria for choosing the RCF models in OpenSearch correspond to light footprint, accuracy, explainability as well as the ability of fast streaming updates where models are not rebuilt from scratch on every new observation [1]. Technically, RCF uses the theme of stochastic coupling in a strong sense — where assuming one has sampled from a distribution of trees, the update preserves the distributional assumption. Just as streaming algorithms can support continuous learning without additional expansive pipelines, RCF 1.0 supports some examples of continuous learning over Random Forests. However as is typical in many streaming algorithms, the existing RCF 1.0 library does not address stop-store-reload-and-go behavior well. Trees are more complicated data structures in comparison to hash tables and the repeated reloading of ensemble forests impacts scalability. Furthermore there are multiple potential tradeoffs between representation and update efficiency for trees. RCF 2.0 threads that tradeoff needle – on synthetic benchmarks, available with the library, the model size reduces by 5-10x and speed of serialization/deserialization increases by an order of magnitude in comparison to RCF 1.0, thereby improving the scalability of HCAD and potentially reducing cost.

RCF 2.0 provides a new compact, efficiently serializable ensemble forest that supports the application where the forest itself can be a streamed input in addition to new data. Before discussing the improvements, let us reconsider RCFs in greater detail for a moment. The central tenet behind RCF had been to “make ingest faster and move complexity to the inference step” [5]. This was achieved by stochastic coupling mentioned earlier. RCF 2.0 doubles down on the same idea, and “attempt to make updates information theoretically leaner” instead of just improving computational speed. Random Forests are traditionally viewed as top-down partitioning algorithms; a collection of points are recursively partitioned based on some criterion. RCF 1.0 espoused a similar conceptual top-down view for updates, but inference (by necessity) was bottom up. RCF 2.0 reuses stochastic coupling to simulate the top down update via a bottom up process — ensuring that the end result of the new update is distributionally indistinguishable from the conceptual top-down model update. As a consequence, RCF 2.0 avoids the necessity of storing bounding boxes at every node, and can recreate exactly those bounding boxes which are necessary. This provides a significant reduction of heap memory and allows greater multitenancy of models. On the aforementioned synthetic benchmarks, the heap memory improvement ranges from 1.5x to 7x, with corresponding changes in throughput (based on input parameters, not necessarily linear). In the specific parameter settings of HCAD in OpenSearch, there appears to be a 2-4x improvement in heap size, allowing greater scalability. Stay tuned for the upcoming releases for more specific details.

RCF 2.0 provides a few additional benefits. A model can be restored to exactly replicate an unrestored model which may be of independent use from a broader systems perspective. From an inference perspective, the functions remain mostly unchanged — the extrapolation is refactored to expose intermediate computations which act as if providing a conditional field which is likely to be use in the future. We reiterate that anomaly detection is typically the mere tip of the proverbial iceberg and the starting point of inquiry [5]. A typical use case of detection is followed up with variety of explanatory queries such as predicted/imputed value, density around the point, nearest neighbor — and indeed multitudes of good anomaly detection algorithms exist which leverage these explanatory facets to detect anomalies. However this provides a very interesting opportunity. Suppose that there is an unspecified statistical measure that indicates anomalousness in some scenario; suppose further we have a reasonably good dynamic anomaly detection algorithm over streaming data — then it stands to reason that the dynamic data structures and information used by said streaming anomaly detector should contain relevant information about that unspecified measure. This therefore can be a reasonable recipe to provide a streaming algorithm for that same said measure by reusing the dynamic data structure with minimal extra effort. And drawing on inspirations from Random Forests, this opens up a broad array of possibilities.The RCF library provides some examples of dynamic imputation, dynamic density estimation and dynamic near neighbors, as mentioned above. The interested reader can try out the Apache 2.0 library available at [6].

This task of finding a vanishing-needle-in-an-everchanging-haystack; deriving ephemeral insights from data, possibly with some approximation, as the data is being observed and transformed on the fly, is a core aspiration of both streaming analytics and OpenSearch. Size and scalability of ML/Analytics models remains a challenging frontier and insights from streaming algorithms provides useful tools. The sequential decision making in streaming, where we decide on processing the current input completely before seeing the next input, is the ultimate what-you-compute-is-what-you-have in the context of analytics. Costs, latencies, and general behavior is significantly more predictable with a streaming approach and the possibility of dynamic computation of different measures provides exciting opportunities ahead for analytics in OpenSearch.