SQL search relevance, selective aggregation, and window functions in OpenSearch

Mon, Sep 12, 2022 · Chen Dai, Joshua Bright

According to a review by IEEE Spectrum in 2022, SQL is the sixth most popular programming language. IEEE came to this conclusion by pulling and weighting data across GitHub, Google, Stack Overflow, Twitter, and IEEE Xplore. Did you know that OpenSearch offers a way to query OpenSearch using SQL? In recent releases, the SQL plugin included support for search relevance, selective aggregation, and window functions. You can use the REST API or use the OpenSearch Query Workbench to query OpenSearch using the SQL plugin. Today, we will use the Query Workbench inside the OpenSearch Playground environment to walk through search relevance, selective aggregation, and window functions.

We’ll use sample data that everyone can access through the playground environment or the demo Docker install. First, add “Sample web logs” if they aren’t already added. Next, open the Query Workbench. Now let’s dive in. Let’s say that you were a security researcher and wanted to keep track of your system activity. You can use search relevance, selective aggregation, and window functions to monitor this activity.

Search relevance

Search relevance is a powerful tool that allows you to search through data, such as logs, to find specific documents. In the example below you can use count to determine how many logs fit your selected criteria.

Search through logs by client IP addresses and error counts for GET requests resulting in a 503 server error:

SELECT clientip, COUNT(*) AS cnt
FROM opensearch_dashboards_sample_data_logs
WHERE MATCH(message, "GET 503", operator="AND")
GROUP BY clientip
ORDER BY cnt DESC

Additionally, the highlight function returns text with matched terms highlighted:

SELECT clientip, HIGHLIGHT(message)
FROM opensearch_dashboards_sample_data_logs
WHERE MATCH_PHRASE(message, "Linux x86_64")

Selective aggregation

Selective aggregation is useful when you want to understand a total of something across an entire dataset. The SQL plugin will also allow you to filter with certain criteria. Say you want to keep track of large file transfers in your system in order to react to someone moving sensitive data outside of your place of work. In the example below, selective aggregation is used to determine how many HTTP requests were submitted where the bytes count was higher than 10,000:

SELECT
    COUNT(*) AS totalReq,
    COUNT(*) FILTER(WHERE bytes > 10000) AS totalLargeReq
FROM opensearch_dashboards_sample_data_logs

Window functions

Aggregate window functions provide users an easy way to calculate aggregate results over a custom window of time. For example, if you wanted to understand how many web calls occurred each day in order to identify abnormal traffic patterns, you could use the following statement:

SELECT
    date,
    SUM(dailyCnt) OVER(ORDER BY date) AS cumulative
FROM (
    SELECT
     DATE_FORMAT(timestamp, '%Y/%m/%d') AS date,
     COUNT(*) AS dailyCnt
    FROM opensearch_dashboards_sample_data_logs
    GROUP BY date
) AS tmp
ORDER BY date

But what if you wanted to monitor the top 10 websites visited by employees of your company? You could use a classic top-K elements per group problem. The following statement queries the top 10 most visited URLs for each day by using the ranking window function:

SELECT date, url, cnt
FROM (
    SELECT
     date, url, cnt,
     RANK() OVER(PARTITION BY date ORDER BY cnt DESC) AS rnk
    FROM (
      SELECT
        DATE_FORMAT(timestamp, '%Y/%m/%d') AS date,
        url AS url,
        COUNT(*) AS cnt
      FROM opensearch_dashboards_sample_data_logs
      GROUP BY date, url
    ) AS a
) AS r
WHERE rnk <= 10
ORDER BY date, rnk

How are you using OpenSearch’s SQL in your environment? Do you have use cases that have worked for you? We would love to hear about it! As always, if you don’t see functionality you would like or want to get involved, please visit the SQL Github repository to file an issue.