Error log: This error will not appear in your OpenSearch logs. It appears exclusively in your client application’s logs (e.g., your Java/Python/Go application, Logstash, etc.).
java.net.SocketTimeoutException: Read timed out
Or (depending on the client library):
ConnectionTimeout: ReadTimeoutError(HTTPConnectionPool(...):
Read timed out. (read timeout=10))
Why… is this happening? This is a client-side network error. It is not an OpenSearch error. It means your application sent a request to OpenSearch, and then waited for a response, but the network connection timed out before any response was received.
This is different from the timed_out: true error (Blog #16), which is a successful response from OpenSearch telling you the query gave up. A SocketTimeoutException means the network connection itself gave up.
Common causes:
- A very long query: You sent a query that takes a very long time (e.g., 60+ seconds), but your client’s network timeout is set to 10 or 30 seconds. The client gives up waiting before the server is finished (the sketch after this list shows what that looks like from the client side).
- Long garbage collection (GC) pause: The OpenSearch node processing your request froze for 10-30+ seconds due to a long “Stop-the-World” JVM garbage collection pause. During this pause, it can’t even send an “I’m busy” response, so the client’s socket simply times out.
- Network hardware: A firewall, load balancer, or other network device between your client and OpenSearch has its own idle timeout (e.g., 30 seconds) and is killing the connection.
- Node crash: The OpenSearch node crashed (e.g., with an OutOfMemoryError) after receiving your request but before sending a response.
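To make the client side of this concrete, here is a minimal sketch, assuming the opensearch-py client (the host and index name are placeholders), that catches the timeout and records exactly when it happened. That timestamp is what you will correlate against the node logs in the best practices below.

```python
from datetime import datetime, timezone

from opensearchpy import OpenSearch
from opensearchpy.exceptions import ConnectionTimeout

# Hypothetical client; opensearch-py's socket read timeout defaults to roughly 10 seconds.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

try:
    client.search(index="my-logs", body={"query": {"match_all": {}}})
except ConnectionTimeout as err:
    # No response arrived before the socket gave up. OpenSearch may still be
    # working on the request, stuck in a GC pause, or down entirely.
    print(f"{datetime.now(timezone.utc).isoformat()} client read timeout: {err}")
```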
Best Practice:
- Check OpenSearch logs for GC: This is the #1 suspect. Look in your opensearch.log and gc.log files on the nodes at the exact time of the client timeout. Look for messages about long GC pauses (e.g., “gc,” “pause,” “duration”). If you see pauses lasting seconds, you need to tune your JVM heap.
- Increase client timeout (carefully): You can increase your client’s socket timeout (e.g., from 10s to 60s), but this is a temporary fix. It just makes your application wait longer for a slow node (see the first sketch after this list).
- Optimize the query: If the query is just slow, optimize it! (See Blog #16). Use the Profile API to find the bottleneck (see the second sketch after this list).
- Check load balancer timeouts: If you use a load balancer (like an AWS ELB/ALB), check its “idle timeout” setting and make sure it is longer than the runtime of your slowest expected queries.
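If you do raise the client timeout, the exact option depends on your client library. As a hedged sketch with opensearch-py (the host details and the 60-second value are illustrative):

```python
from opensearchpy import OpenSearch

# Raise the socket read timeout for every request made through this client.
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    timeout=60,  # the default is around 10 seconds
)

# Or override it only for a request you already know is slow:
resp = client.search(
    index="my-logs",
    body={"query": {"match_all": {}}},
    request_timeout=60,  # per-request override
)
```

A per-request override is usually the safer choice, since it keeps your fast requests failing fast instead of hiding a slow node behind a long global timeout.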
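For finding the bottleneck, the Profile API only needs "profile": true in the search body. A minimal sketch, again assuming opensearch-py and a placeholder index and query:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

resp = client.search(
    index="my-logs",
    body={
        "profile": True,  # ask OpenSearch to time each query component per shard
        "query": {"match": {"message": "error"}},
    },
)

# The response gains a "profile" section with per-shard timing breakdowns
# for queries, collectors, and aggregations.
for shard in resp["profile"]["shards"]:
    print(shard["id"])
```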
What else can I do? If you suspect long GC pauses but aren’t sure how to read the gc.log, the OpenSearch community can help you analyze it. For deep dives into JVM tuning, reach out to us in the #general channel of the OpenSearch Slack.