Resolving Thread Exhaustion and asyncio Event Loop Delays Under Heavy I/O Load During Peak Event Traffic on OVH

Diagnosing Thread Exhaustion and Event Loop Delays on OVH Under Peak Load

This document outlines a systematic approach to diagnosing and resolving thread exhaustion and asyncio event loop delays, particularly when operating under heavy I/O loads during peak event traffic on OVH infrastructure. We will focus on practical, production-ready techniques and tools.

Identifying the Root Cause: System-Level Metrics

The first step is to establish a baseline and identify whether the bottleneck is at the OS level (thread exhaustion) or within the application’s event loop (asyncio delays). We’ll leverage standard Linux tools for this.

Monitoring Thread and Process Counts

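Thread exhaustion typically shows up as a thread count that approaches a limit, such as the per-user process limit (ulimit -u, which also counts threads), kernel.threads-max, or a cgroup pids.max ceiling. To watch it interactively, run top -H -p <PID> to list a process's threads, or read the NLWP (number of threads) column in htop or from ps -o nlwp -p <PID>. A more programmatic approach involves querying the process information directly.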

Script for Real-time Thread Count Monitoring

This Bash script periodically checks the total number of threads for a specific process (identified by PID) and logs it. This is crucial for correlating spikes with application behavior.

#!/bin/bash

TARGET_PID=$1
INTERVAL_SECONDS=5
LOG_FILE="/var/log/thread_monitor_${TARGET_PID}.log"  # Writing here usually requires root

if [ -z "$TARGET_PID" ]; then
  echo "Usage: $0 <PID>"
  exit 1
fi

echo "Monitoring threads for PID: $TARGET_PID every $INTERVAL_SECONDS seconds. Logging to $LOG_FILE"

# Stop automatically once the target process exits
while [ -d "/proc/$TARGET_PID" ]; do
  # nlwp = number of threads; the trailing '=' suppresses the header line
  THREAD_COUNT=$(ps -o nlwp= -p "$TARGET_PID" | tr -d ' ')
  TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
  echo "$TIMESTAMP - PID: $TARGET_PID, Threads: $THREAD_COUNT" >> "$LOG_FILE"
  sleep "$INTERVAL_SECONDS"
done

To use this, first find your application’s main process ID (e.g., using pgrep -f your_app_name) and then run the script: ./monitor_threads.sh <PID> &. Keep an eye on the log file during peak traffic.
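The same count can also be read from inside a Python process or a monitoring agent. The sketch below assumes a Linux host, where /proc/<pid>/status exposes a Threads: field; the function names parse_thread_count and thread_count are illustrative, not part of any library.

```python
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Return the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'Threads:' field found")


def thread_count(pid: int) -> int:
    """Read the live thread count of a process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
```

Calling thread_count(os.getpid()) from a periodic task would let the application export its own thread count as a metric instead of relying on an external script.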

Analyzing Event Loop Delays (Python asyncio)

For Python applications using asyncio, event loop delays usually mean that synchronous code (blocking I/O calls or CPU-heavy work) is running on the loop thread and preventing other tasks from being scheduled. We can instrument the application to measure these delays.

Instrumenting Callbacks for Delay Measurement

Outside debug mode, asyncio does not expose a public hook for timing individual callbacks, and call_exception_handler only reports unhandled exceptions, so it cannot be used to detect slow callbacks. On the default event loop, every callback and task step is executed through asyncio.events.Handle._run, so one workable technique is to wrap that method and log any invocation that takes too long. Note that _run is a private API that may change between Python versions, and the patch has no effect on alternative loops such as uvloop, which do not use this class.

import asyncio
import asyncio.events
import logging
import time

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

SLOW_CALLBACK_THRESHOLD = 0.05  # Log callbacks that run longer than 50ms

_original_run = asyncio.events.Handle._run

def _timed_run(self):
    # Wraps the private Handle._run, which executes each scheduled callback or task step
    start_time = time.perf_counter()
    try:
        return _original_run(self)
    finally:
        duration = time.perf_counter() - start_time
        if duration > SLOW_CALLBACK_THRESHOLD:
            logging.warning(f"Long-running callback: {self!r} took {duration:.4f}s")

def install_slow_callback_logging():
    # Call once at startup, before the event loop begins running
    asyncio.events.Handle._run = _timed_run

# Usage:
# install_slow_callback_logging()
# asyncio.run(main())

A more robust solution involves using libraries like aiomonitor or custom middleware that intercepts task execution and logs significant delays. For production, consider integrating with APM tools that support asyncio tracing.
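Event loop latency can also be sampled without touching asyncio internals. The sketch below is a heartbeat coroutine: it sleeps for a fixed interval and measures how late the loop wakes it up. The name monitor_loop_lag, its parameters, and the on_lag callback are assumptions made for illustration.

```python
import asyncio
import time


async def monitor_loop_lag(interval=0.1, threshold=0.05, on_lag=print):
    """Sleep for `interval` repeatedly and report wake-ups later than `threshold`."""
    loop = asyncio.get_running_loop()
    while True:
        expected = loop.time() + interval
        await asyncio.sleep(interval)
        # Any time past the expected wake-up was spent waiting for the loop
        lag = max(0.0, loop.time() - expected)
        if lag > threshold:
            on_lag(lag)
```

Started once with asyncio.create_task(monitor_loop_lag()), the coroutine runs alongside the application; passing a metrics client's observe function as on_lag would turn the samples into a time series suitable for alerting.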

Investigating Specific Bottlenecks

Once we’ve identified the general area of the problem (OS threads vs. event loop), we need to pinpoint the specific operations causing the strain.

High I/O Wait Times

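High I/O wait times (%wa in top) indicate that CPUs are idle while tasks wait for block I/O to complete, which points to local disks or network-backed storage. Time spent waiting on network sockets is not counted as iowait, so a low %wa does not rule out a network bottleneck; check both areas below.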
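For reference, the iowait percentage can be derived from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch assuming the Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); parse_cpu_line and iowait_percent are hypothetical helper names.

```python
def parse_cpu_line(stat_text):
    """Return the first eight counters of the aggregate 'cpu' line in /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0
```

Sampling /proc/stat a few seconds apart and feeding both readings through these helpers gives a number comparable to the %wa figure, which can then be exported as a metric.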

Network I/O Analysis

On OVH, network saturation can occur due to traffic spikes. Tools like iftop, nethogs, and tcpdump are invaluable.

# Real-time bandwidth usage per connection
sudo apt-get update && sudo apt-get install -y iftop
sudo iftop -i eth0 # Replace eth0 with your primary network interface

# Bandwidth usage per process
sudo apt-get update && sudo apt-get install -y nethogs
sudo nethogs eth0 # Replace eth0 with your primary network interface

# Packet capture for deep inspection (e.g., identify slow responses)
sudo apt-get update && sudo apt-get install -y tcpdump
sudo tcpdump -i eth0 -s 0 -w /tmp/capture.pcap host <client_ip> and port <port_number>

Analyze the captured packets using Wireshark or tshark to identify excessive retransmissions, high latency, or slow application-level responses from external services.
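Before reaching for a packet capture, a rough retransmission rate can be computed from the kernel's TCP counters. The sketch below assumes the Linux /proc/net/snmp layout, in which a Tcp: header line of field names is followed by a Tcp: line of values; parse_tcp_counters and retransmit_ratio are illustrative names.

```python
def parse_tcp_counters(snmp_text):
    """Map TCP counter names to values from /proc/net/snmp text."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(before, after):
    """Fraction of segments sent between two samples that were retransmissions."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0
```

A ratio that climbs during peak traffic suggests packet loss or congestion on the path, at which point a targeted tcpdump capture is worth the overhead.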

Disk I/O Analysis

If disk I/O is the culprit, tools like iotop and iostat are essential.

# Real-time disk I/O usage per process
sudo apt-get update && sudo apt-get install -y iotop
sudo iotop -o

# Detailed disk I/O statistics
sudo apt-get update && sudo apt-get install -y sysstat
sudo iostat -xz 5 # Report extended statistics every 5 seconds

Look for high %util, high await times, and excessive read/write operations per second (r/s, w/s) on specific devices. This might point to database contention or inefficient file access patterns.
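To make the %util and await figures less opaque, the sketch below derives approximations of both from two samples of /proc/diskstats. It assumes the standard Linux field layout (reads completed, time reading, writes completed, time writing, and time spent doing I/O at split positions 3, 6, 7, 10, and 12); the helper names are hypothetical.

```python
def parse_diskstats(text, device):
    """Extract cumulative counters for one device from /proc/diskstats text."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            return {
                "ios": int(f[3]) + int(f[7]),       # reads + writes completed
                "wait_ms": int(f[6]) + int(f[10]),  # ms spent reading + writing
                "busy_ms": int(f[12]),              # ms the device had I/O in flight
            }
    raise ValueError(f"device {device!r} not found")


def disk_pressure(before, after, interval_s):
    """Approximate utilization and average wait per I/O between two samples."""
    ios = after["ios"] - before["ios"]
    wait_ms = after["wait_ms"] - before["wait_ms"]
    busy_ms = after["busy_ms"] - before["busy_ms"]
    return {
        "util_percent": 100.0 * busy_ms / (interval_s * 1000.0),
        "await_ms": wait_ms / ios if ios else 0.0,
    }
```

The two samples must be taken interval_s seconds apart for the utilization figure to be meaningful.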

Blocking Operations in Asyncio

Even with asynchronous code, synchronous, blocking operations within an async function will halt the event loop. This is a common pitfall.

Identifying Blocking Calls

Use Python’s built-in profiling tools or specialized async profiling libraries. In addition, asyncio's debug mode logs every callback or task step that runs longer than loop.slow_callback_duration (100ms by default). The attribute has no effect unless debug mode is enabled, for example with asyncio.run(main(), debug=True) or the PYTHONASYNCIODEBUG=1 environment variable.

import asyncio
import logging
import time

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

async def my_async_task():
    # Simulate a blocking operation
    time.sleep(2) # This will block the event loop!
    logging.info("Task finished")

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1 # Log callbacks longer than 100ms

    task = asyncio.create_task(my_async_task())
    await task

if __name__ == "__main__":
    # Slow-callback warnings are only emitted in debug mode
    asyncio.run(main(), debug=True)

With debug mode enabled, asyncio logs a warning of the form "Executing <Task ...> took 2.0xx seconds", because the task step containing time.sleep(2) exceeds slow_callback_duration. The correct approach is to use asynchronous I/O libraries (e.g., aiohttp for HTTP, asyncpg for PostgreSQL) or run blocking operations in a separate thread pool using loop.run_in_executor() (or asyncio.to_thread() on Python 3.9+).
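The following sketch shows the executor approach under simple assumptions: blocking_io is a hypothetical stand-in for any synchronous call, and count_ticks is a helper that exists only to demonstrate that the loop keeps running while the blocking call executes in the default thread pool.

```python
import asyncio
import time


def blocking_io(seconds):
    """Hypothetical stand-in for a synchronous driver or file call."""
    time.sleep(seconds)
    return "done"


async def count_ticks(stop, interval=0.01):
    """Count how many times the loop schedules us until `stop` is set."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def main(block_for=0.2):
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    # None selects the loop's default ThreadPoolExecutor
    result = await loop.run_in_executor(None, blocking_io, block_for)
    stop.set()
    ticks = await ticker
    return result, ticks
```

Running asyncio.run(main()) returns the blocking call's result together with a tick count well above zero, which confirms that the loop stayed responsive. Keep in mind that the default executor has a bounded number of threads, so offloading many long blocking calls can itself become a source of thread exhaustion.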

Database Connection Pooling and Query Performance

Inefficient database queries or a lack of connection pooling can lead to significant delays. Ensure your database driver is asynchronous and that connection pools are properly configured and sized.

import asyncio
import logging
import time

import asyncpg

logging.basicConfig(level=logging.INFO)

async def fetch_data():
    conn = None
    try:
        # Assuming a connection pool is managed elsewhere
        # For demonstration, we connect directly
        start_time = time.perf_counter()
        conn = await asyncpg.connect(user='user', password='password',
                                     database='database', host='db.example.com')
        
        # Slow query example
        query_start = time.perf_counter()
        rows = await conn.fetch("SELECT * FROM large_table WHERE some_condition = 'value'")
        query_duration = time.perf_counter() - query_start
        logging.info(f"Query executed in {query_duration:.4f}s, returned {len(rows)} rows.")

        # Simulate processing
        await asyncio.sleep(0.5) 

    except Exception as e:
        logging.error(f"Error: {e}")
    finally:
        if conn:
            await conn.close()
        end_time = time.perf_counter()
        logging.info(f"fetch_data took {end_time - start_time:.4f}s")

# In a real app, use a pool:
# async def get_pool():
#     return await asyncpg.create_pool(user='user', password='password',
#                                      database='database', host='db.example.com',
#                                      min_size=5, max_size=10)
#
# async def fetch_data_pooled(pool):
#     async with pool.acquire() as conn:
#         # ... query ...

Monitor query execution times directly in your database (e.g., using PostgreSQL’s pg_stat_statements) and optimize slow queries. Ensure your application is not holding database connections open longer than necessary.
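One way to check how long connections are held is to wrap the acquire step in a timing context manager. The sketch below is driver-agnostic and relies only on the standard library: track_hold_time is a hypothetical helper that accepts any async context manager (for example the object returned by an asyncpg pool's acquire()) and records how long the caller keeps the resource.

```python
import asyncio
import contextlib
import logging
import time


@contextlib.asynccontextmanager
async def track_hold_time(acquire_cm, hold_times, warn_after=0.25):
    """Yield the acquired resource and record how long the caller holds it."""
    async with acquire_cm as resource:
        start = time.perf_counter()
        try:
            yield resource
        finally:
            held = time.perf_counter() - start
            hold_times.append(held)
            if held > warn_after:
                logging.warning("resource held for %.3fs", held)
```

With a pool, usage would look like async with track_hold_time(pool.acquire(), hold_times) as conn: followed by the query. Long hold times relative to query times usually mean that unrelated work (such as the simulated processing step above) is happening while the connection is checked out.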

Optimizing Resource Utilization on OVH

OVH instances, like any cloud provider, have finite resources. Effective configuration and tuning are key to handling peak loads.

Kernel Tuning (sysctl)

Adjusting kernel parameters can help a server accept and track more concurrent connections. The settings below focus on listen and SYN backlogs, the connection tracking table, and file descriptor limits.

# View current settings
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog
sysctl net.netfilter.nf_conntrack_max
sysctl fs.file-max

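# Example tuning for high-traffic servers (apply with caution and test)
# Put the following settings in /etc/sysctl.conf or in a file under /etc/sysctl.d/
#
# Increase the ceiling for listen() backlogs
net.core.somaxconn = 4096
# Increase TCP SYN backlog
net.ipv4.tcp_max_syn_backlog = 2048
# Increase connection tracking table size (adjust based on expected connections;
# this key only exists while the nf_conntrack module is loaded)
net.netfilter.nf_conntrack_max = 1000000
# Increase the system-wide limit on open file handles
fs.file-max = 200000

# Apply changes
sudo sysctl -p          # reloads /etc/sysctl.conf only
sudo sysctl --system    # reloads /etc/sysctl.conf and all files under /etc/sysctl.d/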

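Note that fs.file-max is only the system-wide ceiling. Per-process limits are set separately, via nofile entries in /etc/security/limits.conf for login sessions or LimitNOFILE= in the unit file for systemd services. Raise those as well if you encounter “Too many open files” errors.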
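To see how close a running Python process is to its descriptor limit, the limit and the current usage can be compared at runtime. This sketch assumes a Unix host for the resource module and Linux for /proc/self/fd; fd_headroom is an illustrative name.

```python
import os
import resource


def fd_headroom():
    """Report the file descriptor limits and current usage of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    # /proc/self/fd lists one entry per open descriptor (Linux only)
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None
    usage = open_fds / soft if open_fds is not None and soft > 0 else None
    return {"soft_limit": soft, "hard_limit": hard, "open": open_fds, "usage": usage}
```

Exporting the usage value as a metric makes it possible to alert before the process starts failing with “Too many open files”.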

Application-Level Concurrency and Threading Models

The choice of concurrency model significantly impacts performance. For I/O-bound workloads, an event-driven, non-blocking model (like asyncio) is generally preferred over a thread-per-request model.

Gunicorn/Uvicorn Configuration for Python

When deploying Python web applications (e.g., Flask, FastAPI), the WSGI/ASGI server configuration is critical. For asyncio (ASGI) applications such as FastAPI, Uvicorn is a widely used server. For traditional WSGI applications such as Flask, Gunicorn is common.

# Uvicorn (for ASGI/asyncio apps like FastAPI)
# --workers: Number of worker processes. The (2 * num_cores) + 1 rule comes from Gunicorn's
#   guidance for sync workers; for async workers, about one per core is a common starting point.
# --loop uvloop: Use uvloop for better performance (requires the uvloop package).
# --limit-concurrency: Max concurrent connections or tasks before Uvicorn answers with HTTP 503.
# --limit-max-requests: Restart worker after this many requests to contain memory leaks.
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4 --loop uvloop --limit-concurrency 1000 --limit-max-requests 5000

# Gunicorn (for WSGI apps)
# --workers: Number of worker processes.
# --threads: Number of threads per worker (for sync apps).
# --worker-connections: Max concurrent connections per worker (for async workers like gevent/eventlet).
# --timeout: Worker timeout.
# Use 'sync' worker class for CPU-bound, 'eventlet' or 'gevent' for I/O-bound sync apps.
# For pure async apps, Uvicorn is preferred.
gunicorn -w 4 -k sync myapp.wsgi:application --bind 0.0.0.0:8000 --timeout 120

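For asyncio applications, treat --limit-concurrency as a safety valve rather than a number to maximize: requests beyond the limit receive HTTP 503, so set it to what the application and its downstream services (databases, external APIs) can actually sustain, and make sure your application code is truly non-blocking. If you have mixed sync/async code, consider running sync parts in executors.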
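The same idea can be applied inside the application to protect a specific downstream dependency. The sketch below caps in-flight work with asyncio.Semaphore; run_bounded is a hypothetical helper, and jobs is assumed to be a list of zero-argument coroutine functions.

```python
import asyncio


async def run_bounded(jobs, limit):
    """Run coroutine functions with at most `limit` executing at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(job):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return results, peak
```

The returned peak value is only there to make the cap observable; in a real service the semaphore would typically be created once at startup and shared by all request handlers that call the protected dependency.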

Proactive Monitoring and Alerting

Reactive troubleshooting is insufficient for critical systems. Implement comprehensive monitoring and alerting.

Key Metrics to Monitor

System CPU Usage (overall and per-core)
System Load Average
Memory Usage (RAM and Swap)
Network I/O (bandwidth, packet loss, errors)
Disk I/O (IOPS, latency, throughput)
Process Thread Count
Application-specific metrics (request latency, error rates, queue lengths)
Asyncio Event Loop Latency (if applicable)

Alerting Strategies

Set up alerts for:

High CPU utilization (sustained > 80%)
High I/O wait times (> 15%)
Low available memory
High network error rates or packet loss
Sustained event loop delays (e.g., > 100ms)
Thread count approaching OS limits
Application error rate spikes
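Several of these conditions are phrased as sustained thresholds, which means a single spike should not fire an alert. Monitoring systems usually provide this natively (for example through a duration attached to an alert rule); the sketch below only illustrates the logic, and the class name SustainedThreshold is an assumption.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record a sample and return True if the alert condition holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)
```

For example, with a threshold of 80 and a window of three samples, the sequence 85, 90, 70, 81, 82, 83 triggers only on the final sample, because that is the first point at which three consecutive readings exceed 80.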

Tools like Prometheus with Alertmanager, Datadog, or Grafana Cloud provide robust solutions for metrics collection, visualization, and alerting. Ensure your monitoring agent is configured to collect the necessary system and application-level metrics.

Conclusion

Resolving thread exhaustion and event loop delays under heavy load requires a multi-faceted approach. Start with system-level diagnostics to pinpoint the bottleneck, then dive into application-specific code and configurations. Continuous monitoring and proactive tuning are essential for maintaining stability during peak traffic events on OVH or any cloud infrastructure.