How to Debug and Fix Memory Leaks and Socket Exhaustion in Daemon Processes in Modern C Applications

Diagnosing Memory Leaks in C Daemon Processes

Memory leaks in long-running C daemon processes are insidious. They manifest as a gradual increase in memory consumption, eventually leading to failed allocations (malloc returning NULL), process termination by the OOM killer, or severe performance degradation due to excessive swapping. The root cause is typically dynamically allocated memory that is never released on some code path. Because a daemon may run for weeks, even a few bytes leaked per request add up. Identifying these leaks requires a systematic approach, often involving specialized tools.

A common first step is to monitor the process’s memory footprint. Tools like top, htop, or ps can provide a snapshot, but for tracking trends, a simple script can periodically log the Resident Set Size (RSS) and Virtual Memory Size (VMS) of the target daemon.

Systematic Memory Monitoring Script

The Bash script in this section can be run via cron to log memory usage over time. It needs the Process ID of your daemon, which it reads from a PID file, takes as its first argument, or looks up with pgrep.

First, find the PID. If your daemon has a PID file, use that. Otherwise, pgrep is your friend.

Finding the Daemon’s PID

# If you have a PID file
PID=$(cat /var/run/my_daemon.pid)

# Or using pgrep (be specific with the process name)
PID=$(pgrep -f "my_daemon_executable_name")

if [ -z "$PID" ]; then
    echo "Error: Daemon not found."
    exit 1
fi
echo "Monitoring PID: $PID"

Memory Logging Script

#!/bin/bash

DAEMON_NAME="my_daemon" # Or the executable name
LOG_FILE="/var/log/${DAEMON_NAME}_memory.log"
PID_FILE="/var/run/${DAEMON_NAME}.pid" # Optional, if your daemon creates one

# Try to get PID from PID file first, then fallback to pgrep.
# pgrep -o returns only the oldest matching process, so PID holds a single value
# even if several processes match the pattern.
if [ -f "$PID_FILE" ]; then
    PID=$(cat "$PID_FILE")
    if ! ps -p "$PID" > /dev/null; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') - PID file exists but process $PID is not running. Attempting pgrep."
        PID=$(pgrep -o -f "${DAEMON_NAME}_executable_name") # Adjust this to be specific
    fi
elif [ -n "$1" ]; then # Allow PID as argument
    PID="$1"
else
    PID=$(pgrep -o -f "${DAEMON_NAME}_executable_name") # Adjust this to be specific
fi

if [ -z "$PID" ]; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - Error: Daemon process not found." >> "$LOG_FILE"
    exit 1
fi

# Get memory usage (RSS in KB, VMS in KB)
MEM_INFO=$(ps -p "$PID" -o rss=,vsz=)
RSS=$(echo "$MEM_INFO" | awk '{print $1}')
VMS=$(echo "$MEM_INFO" | awk '{print $2}')

# Log the information
echo "$(date '+%Y-%m-%d %H:%M:%S') - PID: $PID, RSS(KB): $RSS, VMS(KB): $VMS" >> "$LOG_FILE"

Schedule this script using cron. For example, to run it every minute:

* * * * * /path/to/your/memory_monitor.sh >> /var/log/memory_monitor_cron.log 2>&1

Analyze the ${DAEMON_NAME}_memory.log file. A steadily increasing RSS value, especially if it’s not tied to expected workload increases, strongly indicates a memory leak. The VMS might also increase, but RSS is usually the more critical metric for physical memory pressure.
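
The same measurement can also be taken from inside the daemon, which is convenient when you want RSS recorded in the daemon's own log next to its request counters. The following is a minimal sketch that assumes Linux, where /proc/self/status contains a VmRSS line reported in kB. The helper name read_self_rss_kb is hypothetical and not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Returns the calling process's resident set size in kB, or -1 on failure.
 * Assumption: Linux procfs, which exposes a "VmRSS:" line in /proc/self/status. */
long read_self_rss_kb(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (!fp)
        return -1;

    char line[256];
    long rss_kb = -1;
    while (fgets(line, sizeof line, fp)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            /* The value follows the label, padded with whitespace. */
            if (sscanf(line + 6, "%ld", &rss_kb) != 1)
                rss_kb = -1;
            break;
        }
    }
    fclose(fp);
    return rss_kb;
}
```

A daemon could call this once per minute from its main loop and log the result; a value that keeps climbing under a steady workload is the same signal the cron script looks for.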

Using Valgrind for Leak Detection

Valgrind is an indispensable tool for detecting memory management errors, including leaks. It needs no recompilation: it runs the unmodified executable on a synthetic CPU and instruments the code at run time, which lets it track every memory allocation and deallocation. The cost is speed. Programs commonly run an order of magnitude slower, or worse, under Valgrind, so running a daemon this way can be tricky. It’s best used in a development or staging environment, or on a production system during a low-traffic period or a controlled test.

Valgrind Memcheck Usage

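The primary tool within Valgrind for memory leak detection is memcheck. To run your daemon under Valgrind, start it through the valgrind command, preferably in a foreground (non-forking) mode if the daemon offers one. If the daemon forks, add --trace-children=yes and put %p in the log file name so that each process writes its own log.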

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=/tmp/valgrind_output.log /path/to/your/daemon_executable [daemon_args]

Key Valgrind options:

--leak-check=full: Performs a thorough leak check.
--show-leak-kinds=all: Reports all types of leaks (definite, indirect, possible, reachable).
--track-origins=yes: Helps track where uninitialized values come from, which can be related to memory issues.
--verbose: Provides more detailed output.
--log-file=...: Directs output to a file, essential for long-running processes.

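After running your daemon under Valgrind for a sufficient period (long enough for the leak to manifest), shut it down cleanly, for example with the signal it handles for graceful shutdown. Memcheck writes its leak summary when the process exits, so a daemon killed with SIGKILL produces no report. Then examine the /tmp/valgrind_output.log file. Valgrind reports memory blocks that were allocated but never freed, including the size of each leaked block and the call stack at the time of allocation. Source file names and line numbers appear only if the binary was built with debug information (-g).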
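
An allocation site that often turns up in such a report is an error path that returns before the matching free. The sketch below is an illustration only: parse_request_leaky and parse_request are hypothetical functions, not taken from any real daemon. It shows the leaky shape and one common fix, a single cleanup label that every path passes through.

```c
#include <stdlib.h>
#include <string.h>

/* Leaky version: the early return on the error path skips free(copy). */
int parse_request_leaky(const char *input)
{
    char *copy = malloc(strlen(input) + 1);
    if (!copy)
        return -1;
    strcpy(copy, input);

    if (copy[0] != 'G')
        return -1;              /* BUG: copy is never freed on this path */

    free(copy);
    return 0;
}

/* Fixed version: every path after the allocation leaves through cleanup. */
int parse_request(const char *input)
{
    int rc = -1;
    char *copy = malloc(strlen(input) + 1);
    if (!copy)
        return -1;
    strcpy(copy, input);

    if (copy[0] != 'G')
        goto cleanup;           /* rejected request: still releases copy */

    rc = 0;

cleanup:
    free(copy);
    return rc;
}
```

In a daemon, the leaky variant loses one buffer per rejected request, which is exactly the slow, workload-dependent growth that is hard to spot without a tool.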

Code-Level Debugging with GDB

Once Valgrind points to specific allocation sites, or if you suspect a leak in a particular code path, gdb (GNU Debugger) can be used for more granular inspection. Attaching gdb to a running process or starting the process under gdb allows you to set breakpoints, inspect memory, and step through code execution.

Attaching GDB to a Running Process

# Find the PID (as shown before)
PID=$(pgrep -f "my_daemon_executable_name")

# Attach gdb
gdb -p $PID

Once attached, you can set breakpoints. For example, if you suspect a leak in a function that handles incoming requests:

# Assuming your function is called process_request
break process_request

You can then continue execution (c) and when the breakpoint is hit, inspect variables, memory, and the call stack. To specifically look for leaks at a code level, you might instrument your own code with allocation/deallocation counters or use custom allocators that log allocations.
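
The lightest form of such instrumentation is a pair of counters. The sketch below assumes a C11 compiler with <stdatomic.h>; the names counted_malloc, counted_free, and live_allocation_count are hypothetical. It only tells you whether the number of live blocks keeps growing, not where they were allocated, but it is cheap and safe to call from multiple threads.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Number of blocks handed out by counted_malloc and not yet released. */
static atomic_long live_allocations = 0;

void *counted_malloc(size_t size)
{
    void *p = malloc(size);
    if (p)
        atomic_fetch_add(&live_allocations, 1);
    return p;
}

void counted_free(void *p)
{
    if (!p)
        return;                 /* free(NULL) is a no-op, so do not count it */
    atomic_fetch_sub(&live_allocations, 1);
    free(p);
}

long live_allocation_count(void)
{
    return atomic_load(&live_allocations);
}
```

Logging live_allocation_count() at a quiet point, such as between requests, gives a number that should return to the same baseline; a baseline that drifts upward suggests a leak somewhere in the code paths that use these wrappers.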

Custom Memory Allocator for Leak Tracking

For highly critical or complex applications, implementing a custom memory allocator can provide fine-grained control and detailed logging of memory operations. This involves replacing the standard malloc, free, and realloc with your own versions that track allocations.

Here’s a simplified example in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Structure to track allocations
typedef struct AllocationInfo {
    void *ptr;
    size_t size;
    const char *file;
    int line;
    struct AllocationInfo *next;
} AllocationInfo;

static AllocationInfo *alloc_list = NULL;
static size_t total_allocated = 0;
static size_t total_freed = 0;
static size_t current_allocations = 0;

// Function to get the caller's file and line number (simplified, real implementation is complex)
// This often requires compiler-specific intrinsics or linker tricks.
// For demonstration, we'll pass file/line explicitly.

// Replace malloc
void *my_malloc(size_t size, const char *file, int line) {
    void *ptr = malloc(size);
    if (!ptr) {
        perror("my_malloc failed");
        return NULL;
    }

    AllocationInfo *info = malloc(sizeof(AllocationInfo));
    if (!info) {
        free(ptr); // Clean up allocated memory if info allocation fails
        perror("my_malloc: Failed to allocate tracking info");
        return NULL;
    }

    info->ptr = ptr;
    info->size = size;
    info->file = file;
    info->line = line;
    info->next = alloc_list;
    alloc_list = info;

    total_allocated += size;
    current_allocations++;

    // Optional: Log allocation
    // fprintf(stderr, "ALLOC: %p, Size: %zu, File: %s, Line: %d\n", ptr, size, file, line);

    return ptr;
}

// Replace free
void my_free(void *ptr, const char *file, int line) {
    if (!ptr) return;

    AllocationInfo *current = alloc_list;
    AllocationInfo *prev = NULL;

    while (current) {
        if (current->ptr == ptr) {
            total_freed += current->size;
            current_allocations--;

            // Remove from list
            if (prev) {
                prev->next = current->next;
            } else {
                alloc_list = current->next;
            }

            // Optional: Log deallocation
            // fprintf(stderr, "FREE: %p, Size: %zu, File: %s, Line: %d\n", ptr, current->size, file, line);

            free(current); // Free the tracking info
            free(ptr);     // Free the actual memory
            return;
        }
        prev = current;
        current = current->next;
    }

    // If we reach here, it means free was called on a pointer not tracked or already freed
    fprintf(stderr, "FREE ERROR: Attempted to free untracked or double-freed pointer %p at %s:%d\n", ptr, file, line);
    // In a real scenario, you might want to abort or log this more severely.
    free(ptr); // Still attempt to free, though it's likely an error.
}

// Replace realloc
void *my_realloc(void *ptr, size_t new_size, const char *file, int line) {
    if (!ptr) {
        return my_malloc(new_size, file, line);
    }
    if (new_size == 0) {
        my_free(ptr, file, line);
        return NULL;
    }

    // Find the existing allocation info
    AllocationInfo *info = alloc_list;
    while (info && info->ptr != ptr) {
        info = info->next;
    }

    if (!info) {
        fprintf(stderr, "REALLOC ERROR: Reallocating untracked pointer %p at %s:%d\n", ptr, file, line);
        // There is no tracking entry to update, so just pass the call through.
        // The block stays untracked; this is a sign of trouble.
        return realloc(ptr, new_size);
    }

    // Call realloc first and touch the bookkeeping only if it succeeds.
    // On failure the original block is still valid and still tracked as before.
    void *new_ptr = realloc(ptr, new_size);
    if (!new_ptr) {
        perror("my_realloc failed");
        return NULL;
    }

    // Account for the size change
    if (new_size > info->size) {
        total_allocated += (new_size - info->size);
    } else {
        total_freed += (info->size - new_size); // Effectively freeing the difference
    }
    info->size = new_size;
    info->ptr = new_ptr; // realloc may have moved the block

    return new_ptr;
}

// Function to report leaks at exit
void report_leaks() {
    fprintf(stderr, "\n--- Memory Leak Report ---\n");
    fprintf(stderr, "Total Allocated: %zu bytes\n", total_allocated);
    fprintf(stderr, "Total Freed:     %zu bytes\n", total_freed);
    fprintf(stderr, "Current Allocations: %zu\n", current_allocations);
    fprintf(stderr, "Remaining Leaked Bytes: %zu\n", total_allocated - total_freed);

    if (alloc_list) {
        fprintf(stderr, "\nLeaked Blocks:\n");
        AllocationInfo *current = alloc_list;
        while (current) {
            fprintf(stderr, "  - Leaked %zu bytes allocated at %s:%d (block address %p)\n",
                    current->size, current->file, current->line, current->ptr);
            current = current->next;
        }
    } else {
        fprintf(stderr, "No memory leaks detected.\n");
    }
    fprintf(stderr, "--------------------------\n");
}

// Use macros to automatically capture file and line
#define malloc(size) my_malloc(size, __FILE__, __LINE__)
#define free(ptr) my_free(ptr, __FILE__, __LINE__)
#define realloc(ptr, size) my_realloc(ptr, size, __FILE__, __LINE__)

// Ensure report_leaks is called at program exit
// This requires using a constructor function or atexit()
__attribute__((constructor))
static void init_memory_tracker() {
    // Optional: Initialize any other tracking structures here
    atexit(report_leaks); // Register report_leaks to be called on normal program exit
}

// Example usage in your daemon code:
/*
void process_data() {
    char *buffer = malloc(1024); // Will call my_malloc
    if (!buffer) return;
    // ... use buffer ...
    // If you forget free(buffer); it will be reported as a leak.
}
*/

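To use this, compile the tracker into your daemon and put the three macros in a header that every source file includes after <stdlib.h>, so that calls to malloc, free, and realloc in your own code expand to the tracking versions. Memory allocated inside libraries (for example by strdup) bypasses the macros and is reported as untracked when it is freed. The __attribute__((constructor)) ensures init_memory_tracker runs before main, and atexit(report_leaks) registers the leak reporting function to be called when the program exits normally, that is, by returning from main or calling exit. It does not run when the process is killed by a signal. The list and counters are not protected by a lock, so a multithreaded daemon needs a mutex around them. Within those limits, this custom allocator provides detailed information about where leaked memory was allocated.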
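
A daemon may run for weeks before it exits, so it can help to request the report on demand. One way to do that, sketched here under the assumption of a POSIX system, is a SIGUSR1 handler that only sets a flag, which the main loop then checks. The names install_report_signal and consume_report_request are hypothetical. The handler deliberately does no printing, because fprintf and malloc are not async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the signal handler, cleared by the main loop. */
static volatile sig_atomic_t report_requested = 0;

static void on_sigusr1(int signo)
{
    (void)signo;
    report_requested = 1;       /* nothing else: keep the handler async-signal-safe */
}

/* Installs the SIGUSR1 handler. Returns 0 on success, -1 on failure. */
int install_report_signal(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls where possible */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Call once per main-loop iteration. Returns 1 if a report was requested
 * since the last call, 0 otherwise. */
int consume_report_request(void)
{
    if (!report_requested)
        return 0;
    report_requested = 0;
    return 1;
}
```

In the main loop, a line such as `if (consume_report_request()) report_leaks();` would print the current list of live blocks whenever an operator runs `kill -USR1` against the daemon.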

Addressing Socket Exhaustion in Daemon Processes

Socket exhaustion occurs when a process or the system runs out of available network sockets. This can happen due to several reasons:

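Rapidly opening and closing short-lived connections. The side that closes first leaves each connection in the TIME_WAIT state for a while, which is normal TCP behavior but can use up ephemeral ports under high connection churn.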
Holding onto sockets longer than necessary.
Insufficient system-wide limits on the number of open file descriptors (sockets are file descriptors).
Resource leaks within the application that prevent sockets from being closed.

Monitoring Socket Usage

The first step is to understand how many sockets your daemon is using and in what state they are. The ss command is the modern replacement for netstat and provides detailed socket statistics.

Using `ss` to Inspect Sockets

# Show all TCP sockets for a specific process (replace <PID>).
# Matching "pid=<PID>," avoids false hits on port numbers and addresses.
ss -tpn | grep "pid=<PID>,"

# Show all listening sockets for a specific process
ss -ltnp | grep "pid=<PID>,"

# Show all sockets in TIME_WAIT state. These no longer belong to any process,
# so they cannot be filtered by PID; filter by state instead.
ss -tan state time-wait

# Count sockets by state for a specific process
ss -tanp | grep "pid=<PID>," | awk '{print $1}' | sort | uniq -c

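A high number of sockets in TIME_WAIT state usually means the host is actively closing many short-lived connections. Each entry lingers for about 60 seconds on Linux and, for outgoing connections, ties up an ephemeral port for that time. How these entries are reused and capped is configurable at the kernel level. A growing number of sockets in CLOSE_WAIT is a different problem: the peer has closed the connection but the daemon has not yet called close() on its end, which points to a descriptor leak in the application rather than to kernel tuning.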
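
The same state counts can be collected programmatically. The sketch below assumes the Linux /proc/net/tcp text format, in which the fourth column is the connection state as a hexadecimal code (06 for TIME_WAIT, 08 for CLOSE_WAIT). The function name count_tcp_state is hypothetical, and the path is a parameter so that /proc/net/tcp6 or a saved copy can be inspected as well. Note that the file covers the whole network namespace, not a single process.

```c
#include <stdio.h>

/* State codes used in the "st" column of /proc/net/tcp on Linux. */
enum { TCP_STATE_TIME_WAIT = 0x06, TCP_STATE_CLOSE_WAIT = 0x08 };

/* Counts the rows in a /proc/net/tcp style file whose state equals
 * wanted_state. Returns -1 if the file cannot be opened. */
long count_tcp_state(const char *path, unsigned wanted_state)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    char line[512];
    long count = 0;
    while (fgets(line, sizeof line, fp)) {
        unsigned state;
        /* Row layout: "sl: local_addr:port remote_addr:port st ...".
         * The header row does not match the pattern and is skipped. */
        if (sscanf(line, " %*d: %*64[0-9A-Fa-f]:%*x %*64[0-9A-Fa-f]:%*x %x",
                   &state) == 1 && state == wanted_state)
            count++;
    }
    fclose(fp);
    return count;
}
```

Calling count_tcp_state("/proc/net/tcp", TCP_STATE_CLOSE_WAIT) at intervals gives a trend line; a CLOSE_WAIT count that only ever rises is a strong hint that some code path is not closing its sockets.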

System-Wide File Descriptor Limits

Daemons often require a large number of file descriptors. If the system-wide or per-process limits are too low, this can lead to socket exhaustion even if the application is managing its resources correctly. These limits are controlled via ulimit and system configuration files.

Checking and Setting Limits

# Check current limits for the running shell
ulimit -n

# Check limits for a specific process (the effective limits of the running
# daemon, which may differ from those of your shell)
# Find PID first
PID=$(pgrep -f "my_daemon_executable_name")
grep "Max open files" /proc/$PID/limits

To permanently increase these limits for login sessions, you typically edit /etc/security/limits.conf or files in /etc/security/limits.d/. Those files are applied by PAM at login and do not affect services started by systemd; for systemd services, set the limit within the service unit file.
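
A daemon can also raise its own soft limit up to the hard limit at startup, which needs no extra privileges. The sketch below uses the POSIX getrlimit and setrlimit calls; the function name raise_nofile_limit is hypothetical. It cannot exceed the hard limit, so the configuration described above is still what sets the ceiling.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>

/* Raises the soft limit on open file descriptors to the hard limit.
 * Returns 0 on success (or if nothing needed changing), -1 on failure. */
int raise_nofile_limit(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) == -1)
        return -1;

    if (lim.rlim_cur == lim.rlim_max)
        return 0;               /* already at the ceiling */

    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Calling this early in main, before any worker threads or sockets are created, means the daemon does not depend on the soft limit of whatever shell or supervisor happened to start it.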

Configuring Systemd Service Limits

[Unit]
Description=My Daemon Service

[Service]
ExecStart=/path/to/my_daemon_executable
User=myuser
Group=mygroup
# Set file descriptor limit for this service
LimitNOFILE=65536
# Optional: Set memory lock limit if needed
# LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target

After modifying the systemd unit file, reload the systemd daemon and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart my_daemon.service

Kernel Tuning for TCP/IP Performance

If you observe a high number of sockets in TIME_WAIT, tuning the kernel’s TCP/IP stack can help. These parameters are typically set in /etc/sysctl.conf or files in /etc/sysctl.d/.

Relevant sysctl Parameters

# Allow new outgoing connections to reuse sockets in TIME_WAIT.
# Relies on TCP timestamps being enabled.
net.ipv4.tcp_tw_reuse = 1

# net.ipv4.tcp_tw_recycle broke clients behind NAT and was removed in Linux 4.12.
# Do not set it; on current kernels the key no longer exists.

# Upper bound on the accept queue, i.e. the backlog passed to listen().
# Adjust based on expected load.
net.core.somaxconn = 4096

# Maximum number of connection requests remembered while the handshake
# has not yet completed (the SYN queue). Adjust based on expected load.
net.ipv4.tcp_max_syn_backlog = 2048

# Maximum number of sockets the kernel keeps in TIME_WAIT at once.
# The default varies by kernel and memory size; check the current value
# with sysctl before changing it.
net.ipv4.tcp_max_tw_buckets = 180000

# System-wide maximum number of open file handles
fs.file-max = 2097152

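Apply these changes with the command below. Note that sysctl -p reads only /etc/sysctl.conf, whereas --system also loads the files in /etc/sysctl.d/:

sudo sysctl --system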
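
A daemon can check at startup whether the expected tuning is in effect, because every sysctl key is mirrored as a file under /proc/sys with the dots replaced by slashes (net.core.somaxconn becomes /proc/sys/net/core/somaxconn). The sketch below is a small helper for reading such a single-integer file; read_long_from_file is a hypothetical name.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads one decimal integer from the first line of a file, such as
 * /proc/sys/net/core/somaxconn. Returns 0 and stores the value in *out
 * on success, -1 if the file is missing, empty, or not numeric. */
int read_long_from_file(const char *path, long *out)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    char buf[64];
    char *line = fgets(buf, sizeof buf, fp);
    fclose(fp);
    if (!line)
        return -1;

    errno = 0;
    char *end;
    long value = strtol(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -1;              /* out of range or no digits found */

    *out = value;
    return 0;
}
```

A daemon that passes a large backlog to listen() could, for example, log a warning when the value read from /proc/sys/net/core/somaxconn is smaller than that backlog, since the kernel silently caps the backlog at that setting.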

Application-Level Socket Management

Even with system tuning, the application itself must manage sockets efficiently. This involves:

Ensuring sockets are closed promptly when no longer needed.
Using non-blocking I/O and event notification mechanisms (like epoll, kqueue, or select/poll) to avoid blocking threads on socket operations.
Implementing connection pooling or reusing existing connections where appropriate.
Handling socket errors gracefully and cleaning up associated resources.

Example: Non-blocking Sockets with epoll (Conceptual C)

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>

#define MAX_EVENTS  64    // Events fetched per epoll_wait call (illustrative value)
#define BUFFER_SIZE 4096  // Size of the per-read buffer (illustrative value)

// ... (assume listen_socket_fd has been created, bound, and put into the listening state) ...

int epoll_fd = epoll_create1(0);
if (epoll_fd == -1) {
    perror("epoll_create1");
    // Handle error
}

struct epoll_event event;
// Add listening socket to epoll
event.events = EPOLLIN;
event.data.fd = listen_socket_fd;
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_socket_fd, &event) == -1) {
    perror("epoll_ctl: listen_socket_fd");
    // Handle error
}

struct epoll_event events[MAX_EVENTS];

while (1) {
    int num_events = epoll_wait(epoll_fd, events, MAX_EVENTS, -1); // -1 for infinite timeout
    if (num_events == -1) {
        if (errno == EINTR) continue; // Interrupted by signal, retry
        perror("epoll_wait");
        // Handle error
        break;
    }

    for (int i = 0; i < num_events; ++i) {
        int current_fd = events[i].data.fd;

        if (current_fd == listen_socket_fd) {
            // New connection
            struct sockaddr_in client_addr;
            socklen_t client_len = sizeof(client_addr);
            int client_fd = accept(listen_socket_fd, (struct sockaddr*)&client_addr, &client_len);
            if (client_fd == -1) {
                if (errno == EAGAIN || errno == EWOULDBLOCK) {
                    // No more connections to accept right now
                    continue;
                }
                perror("accept");
                // Handle error
                continue;
            }

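            // Make the client socket non-blocking, preserving its other status flags
            int flags = fcntl(client_fd, F_GETFL, 0);
            if (flags == -1 || fcntl(client_fd, F_SETFL, flags | O_NONBLOCK) == -1) {
                perror("fcntl F_SETFL O_NONBLOCK");
                close(client_fd); // Close if we can't make it non-blocking
                continue;
            }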

            // Add client socket to epoll
            event.events = EPOLLIN | EPOLLET; // EPOLLIN for data, EPOLLET for edge-triggered
            event.data.fd = client_fd;
            if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &event) == -1) {
                perror("epoll_ctl: client_fd");
                close(client_fd);
                continue;
            }
        } else if (events[i].events & EPOLLIN) {
            // Data received on a client socket. With EPOLLET we are notified only
            // once per state change, so keep reading until read() reports EAGAIN.
            char buffer[BUFFER_SIZE];
            for (;;) {
                ssize_t bytes_read = read(current_fd, buffer, sizeof(buffer) - 1);

                if (bytes_read == -1) {
                    if (errno == EINTR) continue; // Interrupted by signal, retry
                    if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        break; // Socket drained, wait for the next event
                    }
                    perror("read");
                    // Error reading: release the descriptor so it does not leak
                    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, current_fd, NULL);
                    close(current_fd);
                    break;
                } else if (bytes_read == 0) {
                    // Connection closed by peer
                    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, current_fd, NULL);
                    close(current_fd);
                    break;
                } else {
                    // Process received data (buffer, bytes_read)
                    buffer[bytes_read] = '\0';
                    printf("Received: %s\n", buffer);
                    // Example: Echo back
                    // write(current_fd, buffer, bytes_read);
                }
            }
        } else if (events[i].events & (EPOLLERR | EPOLLHUP)) {
            // Error or hang-up with nothing left to read: close the descriptor
            epoll_ctl(epoll_fd, EPOLL_CTL_DEL, current_fd, NULL);
            close(current_fd);
        }
        // Handle other events such as EPOLLOUT (pending writes) here.
    }
}

// Cleanup: close(listen_socket_fd); close(epoll_fd);

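By using non-blocking I/O and an event loop, the daemon can manage thousands of connections efficiently with a small number of threads, which improves overall responsiveness. The event loop does not prevent socket exhaustion by itself: what keeps the descriptor count bounded is that every path that ends a connection (peer close, read error, hang-up) also closes the descriptor.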