Step-by-Step: Diagnosing Segmentation Fault (core dumped) in multi-threaded C/C++ daemons on DigitalOcean Servers

Enabling Core Dumps on DigitalOcean

Segmentation faults in multi-threaded C/C++ daemons are notoriously difficult to debug, especially in production environments. The first hurdle is ensuring that core dumps are actually generated when a crash occurs. By default, many Linux distributions, including those commonly used on DigitalOcean, have core dump generation disabled or severely restricted. We need to explicitly enable it and configure its location and size limits.

On a DigitalOcean droplet, you can check the current limits using the ulimit command. For persistent changes across reboots, we’ll modify system configuration files.

Checking Current Limits

Log into your DigitalOcean server via SSH and run the following commands to see the current resource limits, specifically for core file size:

ulimit -c

If this outputs 0, core dumps are disabled. If it outputs a number, that’s the maximum core file size in 1024-byte blocks (in bash); if it outputs unlimited, there is no cap. To enable unlimited core dump size (or a sufficiently large one), we need to adjust the system’s configuration.

Configuring System-Wide Limits

A common way to enable core dumps for login sessions is the /etc/security/limits.conf file. It sets per-user and per-group resource limits, which PAM (pam_limits) applies when a session starts. We’ll add entries to set the core file size limit for all users.

Edit the file with root privileges:

sudo nano /etc/security/limits.conf

Add the following lines to the end of the file. The first two lines set the soft and hard core file size limits to unlimited for all users. The * wildcard does not apply to root, so the last two lines set the same limits for root explicitly.

*   soft    core    unlimited
*   hard    core    unlimited
root  soft    core    unlimited
root  hard    core    unlimited

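After saving the file, the changes take effect for new login sessions; processes that are already running keep their old limits until they are restarted. Because limits.conf is applied by PAM at login, a daemon started by systemd does not inherit it: for such a service, set LimitCORE=infinity in the [Service] section of its unit file, run systemctl daemon-reload, and restart the service. It’s also good practice to configure where core dumps are saved.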
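
Since a running process keeps whatever limits it was started with, it can be worth confirming what the daemon actually has rather than what the configuration says. A minimal sketch, assuming a Linux /proc filesystem; the function names are hypothetical, and reading another user’s /proc/<pid>/limits may require root:

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_core_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) from the 'Max core file size' row, or None."""
    for line in limits_text.splitlines():
        if line.startswith("Max core file size"):
            # Row layout: four-word name, soft limit, hard limit, units.
            fields = line.split()
            return fields[4], fields[5]
    return None


def core_limit_of(pid: int) -> Optional[Tuple[str, str]]:
    """Read the effective core limits of a live process by PID."""
    return parse_core_limit(Path(f"/proc/{pid}/limits").read_text())
```

If core_limit_of(pid) still reports a soft limit of 0 for the daemon after the change, the process was most likely started before the new limits applied, or by a service manager that sets its own.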

Configuring Core Dump Location and Naming

By default, core dumps are often saved in the current working directory of the crashing process. This can be problematic if the daemon runs from a directory with limited write permissions or if you have multiple instances. We can use /proc/sys/kernel/core_pattern to control this.

First, check the current pattern:

cat /proc/sys/kernel/core_pattern

If it starts with a pipe (|), core dumps are being piped to a handler program (such as systemd-coredump or, on Ubuntu, apport) rather than written as plain files. For debugging with tools like gdb, it’s often easier to have the core file dumped directly. We can set it to dump to a specific directory with a descriptive filename.

To set a pattern that includes the executable name, PID, and timestamp, and to save it in a dedicated directory (e.g., /var/crash/cores), you can use the following commands. Ensure the directory exists and is writable by the user the daemon runs as.

sudo mkdir -p /var/crash/cores
sudo chmod 1777 /var/crash/cores # World-writable with the sticky bit, like /tmp; tighten this if the daemon runs as a single known user
echo '/var/crash/cores/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern

In this pattern, %e is the executable filename, %p is the PID, and %t is the time of the dump in seconds since the Unix epoch. To make this persistent across reboots, add the following line to /etc/sysctl.conf:

kernel.core_pattern = /var/crash/cores/core.%e.%p.%t

Then, apply the changes immediately:

sudo sysctl -p
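
To see what filename a given pattern will produce before waiting for a real crash, the expansion can be previewed in a few lines. This is only a sketch covering the specifiers used above; expand_core_pattern is a hypothetical helper, not a kernel or library API.

```python
import re


def expand_core_pattern(pattern: str, exe: str, pid: int, timestamp: int) -> str:
    """Preview the core filename for the %e, %p, %t and %% specifiers."""
    values = {
        "e": exe[:15],        # the kernel truncates this name to 15 characters
        "p": str(pid),
        "t": str(timestamp),  # seconds since the Unix epoch
        "%": "%",             # '%%' yields a literal percent sign
    }
    # Specifiers this sketch does not handle (such as %h or %u) are left as-is.
    return re.sub(r"%(.)", lambda m: values.get(m.group(1), m.group(0)), pattern)


print(expand_core_pattern("/var/crash/cores/core.%e.%p.%t", "mydaemon", 12345, 1678886400))
```

The 15-character truncation of %e matters when several daemons share a long common name prefix, since their core files become distinguishable only by PID and timestamp.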

Reproducing the Segmentation Fault

Once core dumps are enabled, the next step is to trigger the segmentation fault in a controlled manner to capture the core file. This often involves identifying the specific user actions, API calls, or data inputs that lead to the crash.

Simulating Load and Edge Cases

For multi-threaded daemons, race conditions are a common cause of segmentation faults. These are difficult to reproduce consistently. Tools like stress-ng can be useful for generating high CPU, memory, or I/O load, which might expose latent race conditions.

# Example: stress CPU, I/O, and memory for 60 seconds
sudo apt-get update && sudo apt-get install -y stress-ng
stress-ng --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 60s

Alternatively, if you have identified specific API endpoints or data structures that are prone to issues, you can write custom scripts to bombard your daemon with requests or malformed data. For instance, if your daemon processes JSON payloads:

import requests
import threading
import time

URL = "http://your-daemon-api-endpoint"

def send_request(payload):
    """POST one payload; a crashed daemon shows up here as a connection error."""
    try:
        # The timeout keeps a hung daemon from blocking this thread forever.
        requests.post(URL, json=payload, timeout=5)
    except requests.RequestException as e:
        print(f"Error sending request: {e}")

# Example payloads: one well-formed baseline followed by edge cases
malformed_payloads = [
    {"key": "value", "nested": {"array": [1, 2, 3]}}, # Well-formed baseline
    {"key": "value", "nested": {"array": "not_an_array"}}, # Type mismatch
    {"key": "value", "nested": None}, # Null nested object
    {"key": "value", "long_string": "A" * 1000000}, # Large data
    # Add more edge cases specific to your daemon's logic
]

threads = []
for _ in range(20): # Number of concurrent threads
    for payload in malformed_payloads:
        t = threading.Thread(target=send_request, args=(payload,))
        threads.append(t)
        t.start()
        time.sleep(0.01) # Small delay so the client does not open every connection at once

for t in threads:
    t.join()

print("Finished sending requests.")

Run this script while your daemon is active. Monitor its logs and the core dump directory for the appearance of a core file.
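
Watching the directory by hand gets tedious during long stress runs. The following sketch polls for new core files; it assumes the /var/crash/cores directory and core.* naming configured earlier, and the function names and two-second interval are arbitrary choices.

```python
import time
from pathlib import Path
from typing import List, Set


def new_core_files(directory: str, seen: Set[str]) -> List[str]:
    """Return core files not reported before and remember them in `seen`."""
    found = sorted(str(p) for p in Path(directory).glob("core.*"))
    fresh = [path for path in found if path not in seen]
    seen.update(fresh)
    return fresh


def watch(directory: str, interval: float = 2.0) -> None:
    """Poll forever, printing each new core file once. Stop with Ctrl+C."""
    seen: Set[str] = set()
    new_core_files(directory, seen)  # skip dumps that already exist
    while True:
        for path in new_core_files(directory, seen):
            print(f"New core dump: {path}")
        time.sleep(interval)
```

Calling watch("/var/crash/cores") in a second terminal reports each dump as it appears. A large core may still be being written when it is first listed, so give it a moment before opening it.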

Analyzing the Core Dump with GDB

Once a core file is generated (e.g., /var/crash/cores/core.mydaemon.12345.1678886400), you can use the GNU Debugger (GDB) to inspect it. You’ll need the exact executable binary that generated the core dump (the same build, not a later recompile), ideally compiled with debugging symbols (-g flag).

Setting up GDB

If GDB is not installed, install it:

sudo apt-get update && sudo apt-get install -y gdb

To analyze the core dump, you need both the executable and the core file. Navigate to the directory containing your daemon’s executable (or ensure it’s in your PATH) and run GDB:

# Assuming your daemon executable is 'mydaemon' and core file is in /var/crash/cores/
gdb /path/to/your/mydaemon /var/crash/cores/core.mydaemon.12345.1678886400

Essential GDB Commands for Core Analysis

Once GDB loads, you’ll be presented with a (gdb) prompt. Here are the most critical commands:

bt (or backtrace): Displays the call stack of the thread that caused the crash. This is usually the first command to run.
bt full: Shows the call stack with local variables for each frame.
info threads: Lists all threads in the process.
thread apply all bt: Executes the bt command for all threads. This is crucial for multi-threaded applications to see where other threads were.
frame [N]: Switches to stack frame number N.
info locals: Displays local variables in the current frame.
p [variable_name]: Prints the value of a variable.
p/x [variable_name]: Prints the value of a variable in hexadecimal.
disassemble: Shows the assembly code around the current instruction pointer.
list: Shows the source code around the current line.
quit: Exits GDB.
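
The commands above can also be run non-interactively, which is useful for attaching backtraces to a bug report or collecting them automatically after a crash. A minimal sketch using GDB’s --batch and -ex options; the function names are hypothetical, and gdb is assumed to be on the PATH:

```python
import subprocess
from typing import List


def gdb_batch_command(executable: str, core_file: str) -> List[str]:
    """Build a gdb invocation that lists threads, prints every backtrace, and exits."""
    return [
        "gdb", "--batch",
        "-ex", "info threads",
        "-ex", "thread apply all bt full",
        executable, core_file,
    ]


def collect_backtraces(executable: str, core_file: str) -> str:
    """Run gdb on a core file and return everything it printed to stdout."""
    result = subprocess.run(
        gdb_batch_command(executable, core_file),
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Saving the returned text next to the core file keeps a readable record even if the core itself is later deleted to reclaim disk space.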

Example GDB Session Snippet:

(gdb) bt
#0  0x00007f1234567890 in some_function (arg1=0x10, arg2=0x0) at my_source.c:150
#1  0x00007f1234567abc in another_function (data=0x7ffc12345678) at my_source.c:205
#2  0x00007f1234567def in thread_worker (arg=0x0) at my_source.c:300
#3  0x00007f1234567f01 in start_thread (arg=0x7f1234567890) at pthread_create.c:312
#4  0x00007f1234567a23 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:101

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f1234567890 (LWP 12345) "mydaemon" 0x00007f1234567890 in some_function (arg1=0x10, arg2=0x0) at my_source.c:150
  2    Thread 0x7f1234567900 (LWP 12346) "mydaemon" 0x00007f1234567bcd in some_other_thread_func () at other_source.c:50
  3    Thread 0x7f1234567910 (LWP 12347) "mydaemon" 0x00007f1234567cde in another_thread_func () at yet_another_source.c:75

(gdb) thread apply all bt
(gdb) p arg1
$1 = 16
(gdb) p arg2
$2 = (void *) 0x0
(gdb) frame 1
#1  0x00007f1234567abc in another_function (data=0x7ffc12345678) at my_source.c:205
(gdb) info locals
data = 0x7ffc12345678
(gdb) p *data
$3 = { member1 = 10, member2 = "some_string" }

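The key is to identify the exact line of code that faulted and to examine the state of variables and other threads at that precise moment. When GDB loads the core it reports the terminating signal (for a segmentation fault, signal 11, SIGSEGV), and frame #0 of the crashing thread is where the invalid access happened. Look for null pointers (such as arg2=0x0 in the example above), out-of-bounds array accesses, or corrupted data structures.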
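
With dozens of threads, the output of thread apply all bt is long, and it often helps to see first where the threads are clustered. The sketch below counts threads by the function in frame #0. It assumes GDB’s usual text layout for backtraces, and the simple regular expression will cut off C++ names that contain spaces, so treat the result as a rough overview.

```python
import re
from collections import Counter

# Frame #0 looks like "#0  0x... in func (args) ..." or "#0  func (args) ...".
FRAME0 = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+ in )?([^\s(]+)", re.MULTILINE)


def top_frame_counts(backtrace_text: str) -> Counter:
    """Count how many threads have each function as their innermost frame."""
    return Counter(FRAME0.findall(backtrace_text))
```

Many threads sharing a top frame such as pthread_cond_wait are often just idle workers; the threads that stand out, or several threads inside the same data-structure code as the crashing one, are usually the ones worth reading in full.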

Advanced Debugging Techniques

For complex multi-threaded issues, especially race conditions, core dumps might not always provide enough context. Consider these additional tools and strategies.

Valgrind and AddressSanitizer

Valgrind is excellent for detecting memory errors such as use-after-free and buffer overflows (with its Memcheck tool) and data races (with Helgrind or DRD), but its slowdown, often an order of magnitude or more, makes it impractical on a production daemon. It’s invaluable during development or staging. AddressSanitizer (ASan) is a much faster alternative that is compiled into your application.

Compile your C/C++ code with ASan enabled:

# For GCC/Clang
g++ -fsanitize=address -fno-omit-frame-pointer -g my_source.cpp -o mydaemon -pthread

When a program compiled with ASan crashes with a segmentation fault, it often provides a much more detailed report than a standard core dump, pinpointing the exact memory access violation and its history.
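
One caveat when combining ASan with the core dump workflow: on 64-bit Linux the ASan runtime by default disables core dumps (its disable_coredump option) and exits without calling abort(). If you want both the ASan report and a core file, the runtime options can be set through the ASAN_OPTIONS environment variable. The sketch below builds such an environment for launching the daemon from a test harness; asan_core_dump_env is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional


def asan_core_dump_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and ask the ASan runtime to abort and allow core dumps."""
    env = dict(os.environ if base is None else base)
    extra = "abort_on_error=1:disable_coredump=0"
    existing = env.get("ASAN_OPTIONS")
    # Options are colon-separated; append so earlier settings are preserved.
    env["ASAN_OPTIONS"] = f"{existing}:{extra}" if existing else extra
    return env
```

A harness could then start the instrumented binary with subprocess.Popen(["./mydaemon"], env=asan_core_dump_env()). Cores from ASan builds can be very large because of the shadow memory mapping, so check free disk space first.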

ThreadSanitizer (TSan)

For race conditions specifically, ThreadSanitizer is the go-to tool. It instruments your code to detect data races at runtime. Like ASan, it requires recompilation.

# For GCC/Clang
g++ -fsanitize=thread -g my_source.cpp -o mydaemon -pthread

TSan has a higher runtime and memory overhead than ASan, and the two cannot be enabled in the same build, but it can effectively find subtle bugs in multi-threaded code that are hard to reproduce with core dumps alone.

Logging and Assertions

Robust logging is your best friend. Ensure your daemon logs critical events, thread starts/stops, and any data processing steps. Use assert() statements liberally during development to catch invalid states early; keep in mind that assert() compiles to nothing when NDEBUG is defined, as it typically is in release builds. In production, consider using conditional compilation (e.g., #ifdef DEBUG) to enable more verbose logging or extra checks only when needed.

For multi-threaded applications, logging needs to be thread-safe. Using a thread-safe logging library or a mutex around standard output can prevent interleaved log messages.

Remote Debugging with GDB Server

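If you can reproduce the crash in a staging environment or even locally, you can debug the live process remotely instead of relying on a core dump. This involves running gdbserver on the target machine and connecting to it from GDB on your development machine, which needs a copy of the same binary with debug symbols. A core dump generated on a remote machine does not need gdbserver: copy the core file and the matching binary to your development machine and open them with gdb as shown earlier.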

On the server:

# Compile with -g for debug info; -rdynamic additionally exports symbols for in-process backtraces (e.g., backtrace_symbols)
g++ -g -rdynamic my_source.cpp -o mydaemon -pthread

# Start gdbserver and attach to a running process (PID 12345)
gdbserver --attach :1234 12345
# Or, to run and debug from the start
# gdbserver :1234 ./mydaemon

On your development machine:

# Connect to the gdbserver
gdb ./mydaemon
(gdb) target remote your_server_ip:1234

This allows you to set breakpoints, inspect memory, and step through code on the remote server in real-time, which is invaluable for debugging elusive multi-threaded issues.