Step-by-Step: Diagnosing Memory Leaks and Socket Exhaustion in Daemon Processes on AWS Servers

Initial Triage: Identifying the Symptoms

When daemon processes on AWS servers exhibit erratic behavior, often manifesting as slow response times, unresponsiveness, or outright crashes, the first suspects are typically memory leaks and socket exhaustion. These issues are insidious, growing over time and eventually crippling application performance and availability. This guide provides a step-by-step approach to diagnosing and resolving these common problems.

The initial triage involves correlating observed symptoms with system resource utilization. High memory usage by a specific process and a steadily increasing number of open file descriptors (which include sockets) are strong indicators. We’ll start by gathering baseline metrics and then move on to specific diagnostic tools.

Monitoring Memory Usage

Before going further, establish a clear picture of memory consumption using standard Linux tools. The examples below need the PID of your daemon; if you don’t know it, find it with pgrep or ps aux | grep [process_name].

Identifying the Process ID (PID)

If you don’t have the PID readily available, use pgrep. For example, to find the PID of a process named ‘my_daemon’:

pgrep my_daemon

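This will output one or more PIDs. pgrep matches against the process name by default; add -f to match against the full command line, which helps when the daemon runs under an interpreter such as python or php. If multiple PIDs are returned, pgrep -a my_daemon lists each one with its command line so you can pick the correct one.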

Inspecting Memory with `top` and `ps`

The top command provides a dynamic, real-time view of running processes. To focus on your daemon, you can filter it. First, get the PID (let’s assume it’s 12345):

top -p 12345

Observe the VIRT (Virtual Memory Size), RES (Resident Set Size), and %MEM columns. A steadily increasing RES value over time, even when the process is idle, is a classic sign of a memory leak. VIRT can also increase, but RES is a more direct indicator of physical RAM consumption.

For a snapshot, ps is invaluable. To get detailed memory information for PID 12345:

ps -o pid,ppid,cmd,%mem,rss,vsz -p 12345

Here:

%mem: Percentage of physical memory used.
rss: Resident Set Size (in kilobytes).
vsz: Virtual Memory Size (in kilobytes).

To track memory growth over time, you can script this:

while ps -p 12345 > /dev/null; do echo "$(date +%FT%T) $(ps -o rss= -p 12345)"; sleep 60; done > memory_log.txt

Analyze memory_log.txt for trends. A consistent upward trend in rss is a strong indicator of a leak.
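
To turn a series of samples into a single number, a least-squares slope is one simple option. The sketch below assumes you collect (seconds, RSS in KiB) pairs yourself, for example by calling read_rss_kib() once a minute; both function names are illustrative and not part of any library.

```python
from typing import List, Tuple


def read_rss_kib(pid: int) -> int:
    """Return the VmRSS value (KiB) from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:     1234 kB"
                return int(line.split()[1])
    raise ValueError(f"VmRSS not found for PID {pid}")


def rss_slope_kib_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Least-squares slope of RSS over time, in KiB per hour.

    samples holds (seconds_since_start, rss_kib) pairs.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    numerator = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator * 3600
```

A slope that stays clearly positive across several hours of stable load is worth investigating; a slope near zero after a warm-up period usually is not.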

Diagnosing Memory Leaks with Application-Level Tools

While system tools show *that* memory is increasing, application-level tools help pinpoint *where* it’s being allocated and not freed. The specific tools depend heavily on the programming language of your daemon.

PHP Example: Using Xdebug and Cachegrind

For PHP daemons (e.g., those built on ReactPHP or Swoole), Xdebug can generate profiling data in Cachegrind format. The settings below use Xdebug 3 names; Xdebug 2 used xdebug.profiler_enable_trigger and xdebug.profiler_output_dir instead, and Xdebug 3 does not honor those old names. Add the following to your php.ini:

[xdebug]
xdebug.mode = profile
xdebug.start_with_request = trigger
xdebug.output_dir = /tmp/xdebug_profiles

With start_with_request = trigger, profiling starts only when a trigger named XDEBUG_TRIGGER is present. For a web request, pass it as a cookie or a GET/POST parameter (e.g., XDEBUG_TRIGGER=1); for a daemon started from the command line, set it as an environment variable before launching the process. The output files (cachegrind.out.*) can be analyzed with tools like KCacheGrind or QCacheGrind (desktop) or Webgrind (web-based).

Look for functions that are called frequently and consume significant memory, especially those that might be accumulating data in global arrays or objects without proper cleanup. If your daemon runs continuously, you might need to periodically trigger profiling and analyze the diffs between profiles to identify growth.

Python Example: Using `objgraph` and `memory_profiler`

For Python daemons, memory_profiler is excellent for line-by-line memory usage analysis. Install it:

pip install memory_profiler

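Decorate the functions you want to inspect with @profile and run python -m memory_profiler your_script.py to get a line-by-line report. To record total memory usage over time instead, run mprof run your_script.py and visualize the results with mprof plot (plotting requires matplotlib).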

objgraph is powerful for visualizing object references and finding reference cycles or unexpected object growth. Install it:

pip install objgraph

In your application, you can periodically dump object counts:

import objgraph
import gc

# ... inside your daemon's loop or a periodic task ...
gc.collect() # Force garbage collection to get a cleaner snapshot
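print(objgraph.count('MyLeakyObject'))
objgraph.show_most_common_types(limit=20)  # prints its own table and returns None
objgraph.show_growth(limit=10)  # prints types whose counts grew since the previous call
# For more detailed analysis, consider objgraph.show_backrefs()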

Monitor the output of these calls over time. A steadily increasing count for a specific object type, especially one that shouldn’t be accumulating indefinitely, points to a leak.

Diagnosing Socket Exhaustion

Socket exhaustion occurs when a process opens so many network sockets that it reaches its file descriptor limit. Once that happens, new connections can no longer be accepted or established, leading to application failures.

Checking File Descriptor Limits

First, understand the limits. The system-wide limit and per-process limits are crucial. Check the system-wide limit:

cat /proc/sys/fs/file-max

Check the limits for your specific process (using PID 12345):

cat /proc/12345/limits

Look for the “Max open files” row, which lists the soft and hard limits. A process that reaches its soft limit receives “Too many open files” (EMFILE) errors from calls such as socket(), accept(), and open().
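
To compare the limit against actual usage in one step, the limits file can be parsed programmatically. This is a minimal sketch that assumes a Linux /proc filesystem and permission to read the target process’s entries; parse_max_open_files() and fd_usage() are illustrative names.

```python
import os
from typing import Optional, Tuple


def _to_limit(value: str) -> Optional[int]:
    """Convert a limits column to int; None stands for 'unlimited'."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) from the content of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units
            soft, hard = line.split()[3:5]
            return _to_limit(soft), _to_limit(hard)
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid: int) -> Tuple[int, Optional[int]]:
    """Return (open descriptor count, soft limit) for a PID."""
    with open(f"/proc/{pid}/limits") as limits_file:
        soft, _hard = parse_max_open_files(limits_file.read())
    return len(os.listdir(f"/proc/{pid}/fd")), soft
```

Calling fd_usage(12345) periodically and alerting when the count passes, say, 80% of the soft limit gives you warning before the daemon starts failing.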

Counting Open File Descriptors

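The lsof (list open files) command is your primary tool here. To get a rough count of everything PID 12345 has open:

lsof -p 12345 | wc -l

This number is approximate: it includes lsof’s header line and entries that are not descriptors, such as the working directory and memory-mapped libraries. For an exact descriptor count, list the process’s fd directory instead:

ls /proc/12345/fd | wc -l

To specifically count network sockets (TCP and UDP), add -a so that lsof ANDs the -p and -i filters. Without -a, lsof ORs them and lists every network socket it can see plus every file the process has open:

lsof -a -p 12345 -i | wc -l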

A rapidly increasing count here, especially when the application is expected to be idle or handling a stable load, indicates a problem with socket management.

To monitor this over time:

while true; do echo "$(date): $(lsof -a -p 12345 -i | wc -l)"; sleep 60; done > socket_log.txt
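
When lsof is not installed, or when you want a breakdown by descriptor type, the symlinks under /proc/<pid>/fd carry the same information. The following is a small sketch under the assumption of a Linux host; classify_fd_target() and fd_breakdown() are hypothetical helper names.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"  # epoll, eventfd, timerfd and similar
    return "file"


def fd_breakdown(pid: int) -> Counter:
    """Count the open descriptors of a process by type."""
    counts: Counter = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir() and readlink()
        counts[classify_fd_target(target)] += 1
    return counts
```

If the "socket" count climbs while "file" and "pipe" stay flat, the leak is in network code rather than in file handling.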

Analyzing Socket Usage with `netstat` and `ss`

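netstat (deprecated on most modern distributions in favor of ss) and ss both provide detailed network connection information. To see all TCP connections for PID 12345, in every state:

sudo netstat -tanp | grep ' 12345/'

sudo ss -tanp | grep 'pid=12345,'

The -a flag matters here: -l on its own restricts the output to listening sockets, which hides exactly the connected sockets you are trying to inspect. Add -u if the daemon also uses UDP.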

Look for a large number of sockets stuck in CLOSE_WAIT, or ESTABLISHED connections that persist long after they should have been closed (ss abbreviates these states as CLOSE-WAIT and ESTAB). A high number of CLOSE_WAIT sockets indicates that the remote end has sent a FIN but the application has not yet closed its own side of the connection, which points to a resource leak or a bug in the application’s connection teardown logic.
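
The same per-state tally can be produced without netstat or ss by joining the socket inodes in /proc/<pid>/fd with the kernel’s TCP table. The sketch below assumes a Linux host and IPv4 sockets; socket_inodes() and tally_tcp_states() are illustrative names, and the state codes are the kernel’s hexadecimal TCP state values.

```python
import os
import re
from collections import Counter
from typing import Set

TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def socket_inodes(pid: int) -> Set[str]:
    """Collect the inode numbers of every socket the process holds open."""
    inodes = set()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor disappeared while we were scanning
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match:
            inodes.add(match.group(1))
    return inodes


def tally_tcp_states(table_text: str, inodes: Set[str]) -> Counter:
    """Count TCP states for rows of a /proc/net/tcp-style table owned by inodes."""
    counts: Counter = Counter()
    for line in table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # Field 3 is the state code, field 9 is the socket inode.
        if len(fields) > 9 and fields[9] in inodes:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

A typical call reads /proc/12345/net/tcp and passes its text together with socket_inodes(12345); IPv6 sockets live in the tcp6 file next to it, which uses the same column layout.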

Application-Level Socket Leak Diagnosis

Similar to memory leaks, socket leaks are often bugs within the application code. The key is to identify where sockets are being opened but not closed.

Python Example: Tracking Socket Objects

You can use objgraph again to track socket objects. If your application uses standard library sockets or libraries like requests, you can monitor the count of socket.socket objects.

import gc
import objgraph

# ... inside your daemon's loop or a periodic task ...
gc.collect()
socket_count = objgraph.count('socket.socket')
print(f"Current socket count: {socket_count}")
# If you suspect a specific type of socket (e.g., from a library),
# count it by the module path where the class is actually defined:
# print(objgraph.count('urllib3.connection.HTTPConnection'))

If the count of socket.socket objects grows unboundedly, it signifies a leak. You’ll need to trace the code path that opens these sockets to find where they are not being closed (e.g., missing .close() calls, unhandled exceptions preventing cleanup).
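
One way to make cleanup unconditional is to let a with statement own the socket, so it is closed on every exit path, including exceptions. The sketch below uses socket.socketpair() purely so it runs without a network; send_and_receive() is a hypothetical helper and the empty-payload error only simulates a failure in the middle of an operation.

```python
import socket


def send_and_receive(left: socket.socket, right: socket.socket, payload: bytes) -> bytes:
    """Send payload from left to right, closing both sockets on every exit path."""
    with left, right:
        left.sendall(payload)
        if not payload:
            # Simulated mid-operation failure; the with block still closes both.
            raise ValueError("empty payload")
        return right.recv(len(payload))


# Normal path: data is exchanged and both sockets end up closed.
a, b = socket.socketpair()
reply = send_and_receive(a, b, b"ping")

# Error path: the exception propagates, yet the sockets are closed as well.
c, d = socket.socketpair()
try:
    send_and_receive(c, d, b"")
except ValueError:
    pass
```

The same pattern applies to real connections, since objects returned by socket.create_connection() are also context managers.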

Node.js Example: Monitoring Network Events

In Node.js, unclosed network resources (like HTTP clients or server sockets) can lead to leaks. Tools like the built-in process.memoryUsage() can show overall memory growth, but for specific socket leaks, you might need to instrument your code.

const http = require('http');
const server = http.createServer((req, res) => {
  // Example: an outbound request made for every incoming request
  const client = http.request({ host: 'example.com', port: 80, method: 'GET' }, (clientRes) => {
    // The response body must be consumed (or clientRes.destroy() called);
    // an unread response keeps its socket tied up.
    clientRes.on('data', () => {});
    clientRes.on('end', () => {
      console.log('Client response received');
    });
  });
  client.on('error', (err) => {
    console.error('Client error:', err);
    // Ensure client is destroyed even on error
    client.destroy();
  });
  client.end(); // Finish sending the request; this does not close the socket

  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
});

server.listen(8080, () => {
  console.log('Server listening on 8080');
});

// Periodically check open handles (includes sockets)
setInterval(() => {
  const activeHandles = process._getActiveHandles();
  console.log(`Active handles: ${activeHandles.length}`);
  // You might need to inspect activeHandles to identify specific leaking socket types
}, 5000);

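process._getActiveHandles() is an undocumented internal API that returns every active handle, such as sockets and servers, not only network connections; newer Node.js releases also provide the documented process.getActiveResourcesInfo(), which returns the types of the resources keeping the event loop alive. If the count grows continuously, examine the code that initiates network requests or listens on ports to ensure proper cleanup (e.g., consuming or destroying responses, calling .destroy() on sockets that are no longer needed, and handling server socket events correctly).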

AWS Specific Considerations

On AWS, consider the context of your daemon. Is it running on EC2, ECS, or EKS? Each has implications:

EC2 Instances: Standard Linux/Windows troubleshooting applies. Ensure your EC2 instance type has sufficient RAM and that OS-level limits (e.g., ulimit) are appropriately configured, especially for high-traffic applications.
ECS/EKS Containers: Container resource limits (CPU, memory) are critical. A memory leak eventually pushes the container past its memory limit, at which point the kernel’s OOM killer terminates it and the orchestrator restarts the task or pod, so a leak often shows up as periodic restarts rather than as a slow host. Containers also get their own file descriptor limit (for example, the nofile entry under ulimits in an ECS container definition), which may differ from the host’s. Monitor container metrics via CloudWatch Container Insights.
Network Configuration: Security Groups and Network ACLs can indirectly affect socket behavior by dropping packets, but they typically don’t cause leaks themselves unless your application logic incorrectly retries connections indefinitely upon transient network errors.
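
From inside a container, the daemon can check how close it is to its memory limit by reading the cgroup files directly. The sketch below assumes cgroup v2 mounted at /sys/fs/cgroup, where the files are named memory.max and memory.current; hosts still on cgroup v1 expose memory.limit_in_bytes and memory.usage_in_bytes under a memory/ subdirectory instead. The function names are illustrative.

```python
from pathlib import Path
from typing import Optional

CGROUP_V2_ROOT = Path("/sys/fs/cgroup")  # assumption: cgroup v2 unified hierarchy


def parse_memory_max(raw: str) -> Optional[int]:
    """Parse memory.max; the literal 'max' means no limit is set."""
    value = raw.strip()
    return None if value == "max" else int(value)


def memory_usage_fraction(root: Path = CGROUP_V2_ROOT) -> Optional[float]:
    """Return current usage as a fraction of the limit, or None if unlimited."""
    limit = parse_memory_max((root / "memory.max").read_text())
    if limit is None:
        return None
    current = int((root / "memory.current").read_text())
    return current / limit
```

Logging this fraction alongside application metrics makes it easier to tell a leak that is approaching the limit from a container that was simply sized too small.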

Resolution Strategies

Once the root cause is identified:

Memory Leaks:
- Fix the code: Ensure allocated memory is freed, references to objects that are no longer needed are dropped so the garbage collector can reclaim them, and resources like file handles or network connections are closed.
- Review data structures: Avoid unbounded growth of collections. Implement pagination, caching with eviction policies, or periodic pruning.
- Use language-specific memory management tools and best practices.
Socket Exhaustion:
- Fix the code: Ensure all opened sockets are explicitly closed, especially in error handling paths. Use try-finally blocks or context managers (like Python’s with statement) to guarantee cleanup.
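- Increase limits: If legitimate usage requires more file descriptors, adjust ulimit settings for the daemon’s user or system-wide limits (/etc/security/limits.conf, /proc/sys/fs/file-max). Note that limits.conf is applied through PAM login sessions and does not affect daemons started by systemd; for those, set LimitNOFILE= in the unit file. For containers, this might involve adjusting the container runtime configuration or orchestrator settings.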
- Connection pooling: For outgoing connections, use connection pooling to reuse existing connections rather than opening new ones for every request.
- Tune keep-alive settings: For incoming connections, ensure appropriate keep-alive timeouts are set to close idle connections.
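
As an illustration of “caching with eviction policies”, the sketch below bounds a cache by entry count and evicts the least recently used entry when full. BoundedCache is a hypothetical class written for this example; for memoizing pure functions, functools.lru_cache from the standard library covers the same need.

```python
from collections import OrderedDict
from typing import Generic, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class BoundedCache(Generic[K, V]):
    """Least-recently-used cache that never holds more than max_entries items."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries: "OrderedDict[K, V]" = OrderedDict()

    def get(self, key: K, default: Optional[V] = None) -> Optional[V]:
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: K, value: V) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._entries)
```

Because the size is capped, memory used by the cache stays bounded no matter how long the daemon runs, which is the property an unbounded dict or list lacks.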

Regularly review your application’s resource consumption and implement robust monitoring and alerting to catch these issues before they impact production systems.