OpenSearchCon 2024 Session: You can't test everything but you should monitor it

I want to present an incident that happened at our warehouse and led to an OpenSearch use case for metrics and monitoring. We rent out thousands of photo booths every year, with tens of thousands of bookings. Most of our processes are fully automated, such as configuring the photo booths, which are connected to the network in fulfillment, or downloading the photos once a photo booth has been returned.

In the high season, photo booths that come back via shipping are configured and sent out to the next customer on the same day.

For this to work, the download of several gigabytes of photos must be fast. Normally, there are only 10 minutes between the download and the photo booth being configured and back on the shelf.

From one day to the next, the download no longer finished in time and took almost 30 minutes. This disrupted the business considerably. After several hours of debugging, we found a network issue. After aggregating the data, which took some time, we discovered that the error had been present since 2020, a full two years. We had not noticed because we had no monitoring for it and, because of Covid-19, we had not been sending out that many photo booths.

That was the day we decided we needed monitoring for everything. We set up OpenSearch, pushed in the metrics, including the historical ones, and added alerts so that we are notified in time. This way, we spot problems very early and can act to resolve them.
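
To make the idea of pushing metrics and alerting on them more concrete, here is a minimal sketch in Python. It is not our actual setup: the index name, the field names, and the helper functions are all assumptions made for illustration. It only shows how download timings could be serialized as newline-delimited JSON in the shape the OpenSearch `_bulk` endpoint accepts, and how a simple threshold based on the 10-minute window could flag slow downloads. In a real deployment the threshold check would more likely be an alerting monitor inside OpenSearch rather than local code.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List

# Hypothetical names: the index and the document fields are illustrative assumptions.
INDEX_NAME = "photobooth-download-metrics"
MAX_DOWNLOAD_SECONDS = 10 * 60  # the 10-minute window mentioned in the text


def make_metric(booth_id: str, duration_seconds: float,
                size_bytes: int, finished_at: datetime) -> Dict:
    """Build one metric document for a finished photo download."""
    return {
        "booth_id": booth_id,
        "duration_seconds": duration_seconds,
        "size_bytes": size_bytes,
        "@timestamp": finished_at.astimezone(timezone.utc).isoformat(),
    }


def to_bulk_body(metrics: List[Dict], index: str = INDEX_NAME) -> str:
    """Serialize metrics as NDJSON: one action line, then one document line."""
    lines = []
    for metric in metrics:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(metric))
    # A bulk request body must end with a newline.
    return "\n".join(lines) + "\n"


def too_slow(metrics: List[Dict], limit: float = MAX_DOWNLOAD_SECONDS) -> List[Dict]:
    """Local stand-in for an alert condition: downloads slower than the limit."""
    return [m for m in metrics if m["duration_seconds"] > limit]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    samples = [
        make_metric("booth-1", 9 * 60, 4_000_000_000, now),
        make_metric("booth-2", 30 * 60, 4_000_000_000, now),  # the slow case
    ]
    print(to_bulk_body(samples), end="")
    for metric in too_slow(samples):
        print("slow download:", metric["booth_id"])
```

The resulting body could be sent with a POST to the cluster's `_bulk` endpoint. Historical measurements, like the two years of data we had to aggregate by hand, can be loaded the same way as long as each document carries its original timestamp.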