Executive Summary
At enterprise scale, infrastructure teams were struggling with alert fatigue, slow incident response, and fragmented operational knowledge. Led by Dzung Le at Noventiq, the team implemented a self-healing infrastructure model using agentic AI and vector search in OpenSearch. This approach reduced manual intervention, shortened resolution times, and laid the groundwork for autonomous operations while keeping humans in control of high-risk actions.
Challenge
Infrastructure operations had reached a breaking point.
Despite advanced observability systems, engineers were still required to manually investigate and resolve incidents, often at inconvenient hours. Many issues were simple to fix but time-consuming to diagnose.
Key challenges included:
- Alert fatigue from excessive monitoring noise
- High mean time to resolution (MTTR) due to manual log analysis
- Tribal knowledge dependency, with fixes buried in tickets or individual expertise
- Customer impact, especially in systems serving millions of users
“We have built systems that are great at yelling at us when they break, but they have no idea of how to fix themselves.”
In one banking-scale deployment supporting over 7 million users, even minor disruptions in AI-driven services could significantly impact user experience and business outcomes.
Solution
Under the guidance of Dzung Le, the team designed a self-healing infrastructure architecture combining agentic AI with vector search capabilities in OpenSearch.
Core Components
Agentic AI System
- Moves beyond chatbots to autonomous agents
- Capable of reasoning, planning, and executing actions
- Structured as:
  - Brain: LLM for decision-making
  - Eyes: OpenSearch for observability
  - Hands: Execution layer for remediation
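The Brain / Eyes / Hands split can be pictured as three pluggable pieces behind one agent. The following is a minimal sketch, not the team's implementation: the `Agent` class, the `fake_eyes` and `fake_brain` functions, and the `restart_service` action name are all hypothetical stand-ins for an OpenSearch query, an LLM call, and a real remediation script.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Hypothetical agent wiring; every field is an illustrative stand-in."""
    eyes: Callable[[str], List[str]]        # fetch evidence (e.g. from OpenSearch)
    brain: Callable[[str, List[str]], str]  # LLM: choose an action name
    hands: Dict[str, Callable[[], str]]     # actions the agent is allowed to run

    def handle(self, alert: str) -> str:
        evidence = self.eyes(alert)
        action = self.brain(alert, evidence)
        # The brain may only pick from registered actions; anything else escalates.
        if action not in self.hands:
            return f"escalate: unknown action {action!r}"
        return self.hands[action]()


def fake_eyes(alert: str) -> List[str]:
    # Stand-in for a log query.
    return ["OOMKilled in payment-api pod"]


def fake_brain(alert: str, evidence: List[str]) -> str:
    # Stand-in for an LLM decision.
    return "restart_service" if any("OOMKilled" in e for e in evidence) else "page_human"


agent = Agent(
    eyes=fake_eyes,
    brain=fake_brain,
    hands={"restart_service": lambda: "restarted"},
)
result = agent.handle("payment-api latency high")
print(result)
```

Keeping the action registry (`hands`) separate from the decision-maker means the model can only trigger actions that were explicitly registered.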
Vector Search with OpenSearch
- Transforms logs, runbooks, and incidents into semantic embeddings
- Enables contextual matching between errors and solutions
- Eliminates reliance on exact keyword matches
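To illustrate semantic matching, the sketch below ranks runbooks by cosine similarity and builds an OpenSearch k-NN query body. The three-dimensional vectors are made up for readability (real embeddings would come from an embedding model), and the `embedding` field name and the runbook titles are assumptions rather than details from the deployment.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy embeddings: hypothetical values chosen by hand, not model output.
RUNBOOKS: Dict[str, List[float]] = {
    "restart service after out-of-memory kill": [0.9, 0.1, 0.0],
    "rotate expired TLS certificate": [0.0, 0.2, 0.9],
    "clear stale cache entries": [0.3, 0.9, 0.1],
}


def nearest_runbook(query_vec: List[float]) -> str:
    """Return the runbook whose embedding is closest to the query."""
    return max(RUNBOOKS, key=lambda name: cosine(query_vec, RUNBOOKS[name]))


def knn_query(query_vec: List[float], k: int = 3, field: str = "embedding") -> dict:
    """Build a k-NN search body; `field` is an assumed knn_vector field name."""
    return {"size": k, "query": {"knn": {field: {"vector": query_vec, "k": k}}}}


# Pretend embedding of "container exceeded memory limit": no shared keywords
# with the runbook title, but close to it in vector space.
error_vec = [0.8, 0.2, 0.1]
best = nearest_runbook(error_vec)
body = knn_query(error_vec)
print(best)
```

In a real cluster the `body` dictionary would be sent to an index search endpoint, and the local `cosine` ranking would be done by OpenSearch itself.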
Retrieval-Augmented Generation (RAG)
- Anchors AI decisions in real operational data
- Grounds recommendations in verified runbooks rather than the model's unaided guesses
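One way to anchor the model in retrieved runbooks is to assemble the prompt only from what retrieval returned and to refuse to ask the model when nothing was found. This is a sketch under assumptions: `build_grounded_prompt`, the `title` and `steps` keys, and the `ESCALATE` keyword are hypothetical names, and the prompt wording is illustrative.

```python
from typing import Dict, List, Optional


def build_grounded_prompt(
    error: str, runbooks: List[Dict[str, str]], max_docs: int = 3
) -> Optional[str]:
    """Return a prompt grounded in retrieved runbooks, or None if there are none."""
    if not runbooks:
        # Nothing verified to ground on: the caller should escalate to a human.
        return None
    context = "\n".join(
        f"[{i}] {rb['title']}: {rb['steps']}"
        for i, rb in enumerate(runbooks[:max_docs], start=1)
    )
    return (
        "Answer using ONLY the runbooks below and cite the number you relied on. "
        "If none applies, reply ESCALATE.\n\n"
        f"Runbooks:\n{context}\n\n"
        f"Error:\n{error}\n"
    )


# Hypothetical retrieval result.
retrieved = [
    {"title": "Service OOM", "steps": "restart the service, then check heap settings"}
]
prompt = build_grounded_prompt("payment-api: container exceeded memory limit", retrieved)
print(prompt)
```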
Multi-Agent Orchestration
- Specialized agents for monitoring, investigation, and remediation
- Central orchestrator coordinates workflows
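A central orchestrator can be as simple as a function that passes a shared incident record through each specialized agent in turn. The sketch below assumes hypothetical agent functions, field names (`error_rate`, `log_excerpt`, `proposed_fix`), and a 5% error-rate threshold; none of these come from the original system.

```python
from typing import Any, Callable, Dict, List

Incident = Dict[str, Any]
AgentFn = Callable[[Incident], Incident]


def monitoring_agent(incident: Incident) -> Incident:
    # Flag an anomaly when the error rate exceeds an assumed 5% threshold.
    incident["anomaly"] = incident["error_rate"] > 0.05
    return incident


def investigation_agent(incident: Incident) -> Incident:
    # Stand-in for log analysis; a real agent would query logs and an LLM.
    incident["root_cause"] = (
        "memory leak" if "OOM" in incident["log_excerpt"] else "unknown"
    )
    return incident


def remediation_agent(incident: Incident) -> Incident:
    # Map a known root cause to a fix; anything else goes to a human.
    fixes = {"memory leak": "restart_service"}
    incident["proposed_fix"] = fixes.get(incident["root_cause"], "escalate_to_human")
    return incident


def orchestrate(incident: Incident, agents: List[AgentFn]) -> Incident:
    """Run agents in order, stopping early if monitoring finds nothing wrong."""
    for agent in agents:
        incident = agent(incident)
        if incident.get("anomaly") is False:
            break
    return incident


pipeline = [monitoring_agent, investigation_agent, remediation_agent]
final = orchestrate(
    {"error_rate": 0.2, "log_excerpt": "OOM killer invoked"}, pipeline
)
print(final["proposed_fix"])
```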
Human-in-the-Loop Safeguards
- Automates low-risk fixes (service restarts, cache clearing)
- Requires approval for high-risk actions (database changes, traffic routing)
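The two risk tiers above could be enforced with a small policy check that runs before any action executes. This is a minimal sketch: the action names are hypothetical labels for the examples in the list, and rejecting unlisted actions is an added assumption (failing closed), not something the source specifies.

```python
from typing import Optional

# Hypothetical action names mirroring the examples in the list above.
LOW_RISK = {"restart_service", "clear_cache"}
HIGH_RISK = {"database_change", "traffic_reroute"}


def decide(action: str, approved_by: Optional[str] = None) -> str:
    """Return 'execute', 'await_approval', or 'reject' for a proposed action."""
    if action in LOW_RISK:
        return "execute"
    if action in HIGH_RISK:
        # High-risk actions run only once a named human has approved them.
        return "execute" if approved_by else "await_approval"
    # Assumption: actions that are not classified at all are never run.
    return "reject"


print(decide("clear_cache"))
print(decide("database_change"))
print(decide("database_change", approved_by="on-call engineer"))
```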
How It Works
The system operates as a continuous self-healing loop:
- Detect anomalies through monitoring systems
- Investigate logs automatically via OpenSearch
- Retrieve similar incidents using vector search
- Generate root cause analysis and recommended fix
- Execute remediation, automatically for low-risk fixes and after human approval for high-risk actions
- Store outcomes to continuously improve the knowledge base
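The six steps can be traced in a single pass of a loop. The sketch below is illustrative only: `self_heal_once`, the 5% threshold, and the record shapes are assumptions, and a plain substring match stands in for vector search so the example runs without a cluster.

```python
from typing import Callable, Dict, List


def self_heal_once(
    error_rate: float,
    logs: List[str],
    kb: List[Dict[str, str]],
    history: List[dict],
    approve: Callable[[str], bool],
    threshold: float = 0.05,
) -> str:
    # 1. Detect: compare a metric against an assumed threshold.
    if error_rate <= threshold:
        return "healthy"
    # 2. Investigate: pull the error lines out of the logs.
    errors = [line for line in logs if "ERROR" in line]
    # 3. Retrieve: substring match as a stand-in for vector search.
    match = next(
        (entry for entry in kb if any(entry["symptom"] in line for line in errors)),
        None,
    )
    # 4. Generate: here the "recommended fix" is simply the matched entry's fix.
    fix = match["fix"] if match else None
    # 5. Execute: only when a fix exists and the approval callback allows it.
    if fix is not None and approve(fix):
        outcome = f"applied {fix}"
    else:
        outcome = "escalated"
    # 6. Store: record the outcome so later incidents can draw on it.
    history.append({"errors": errors, "fix": fix, "outcome": outcome})
    return outcome


kb = [{"symptom": "OOMKilled", "fix": "restart_service"}]
history: List[dict] = []
outcome = self_heal_once(
    0.2,
    ["INFO request ok", "ERROR OOMKilled pod payment-api"],
    kb,
    history,
    approve=lambda fix: fix == "restart_service",
)
print(outcome)
```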
Results
The implementation delivered tangible improvements across operations:
- Faster incident resolution by automating investigation and diagnosis
- Reduced operational load, minimizing manual intervention and late-night escalations
- Improved knowledge accessibility, converting static runbooks into a living AI-driven system
- Increased system resilience through multi-model AI and failover strategies
“AI will not replace humans, but humans using AI will replace humans who don’t.”
Why It Matters
This approach represents a shift from reactive infrastructure management to autonomous, intelligent operations.
Instead of systems that only alert humans, organizations can build systems that:
- Understand issues in context
- Learn from past incidents
- Take action safely and efficiently
For enterprises operating at scale, this is the difference between constant firefighting and sustainable reliability.
Call to Action
To start building self-healing infrastructure:
- Centralize your logs, metrics, and runbooks in OpenSearch
- Implement vector search to unlock semantic understanding
- Introduce agentic AI with strict human-in-the-loop controls
- Begin with low-risk automation and scale gradually
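As a possible starting point for the first two steps, the sketch below builds an index definition with a `knn_vector` field alongside the raw text. The field names (`text`, `source`, `embedding`), the helper names, and the toy dimension of 4 are assumptions; the dimension must match whichever embedding model is used.

```python
from typing import Any, Dict


def runbook_index_body(dimension: int) -> Dict[str, Any]:
    """Index definition with k-NN enabled and one vector field per document."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},       # runbook or log text
                "source": {"type": "keyword"},  # e.g. "runbook", "ticket", "log"
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def is_indexable(doc: Dict[str, Any], dimension: int) -> bool:
    """Check a document locally before sending it to the cluster."""
    emb = doc.get("embedding")
    return isinstance(emb, list) and len(emb) == dimension and bool(doc.get("text"))


DIMENSION = 4  # toy value for the example only
index_body = runbook_index_body(DIMENSION)
doc = {
    "text": "Restart the service after an out-of-memory kill",
    "source": "runbook",
    "embedding": [0.1, 0.2, 0.3, 0.4],
}
print(is_indexable(doc, DIMENSION))
```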
Learn more:
- Explore the full talk
- Discover Noventiq solutions and services