Troubleshooting Nginx ‘502 Bad Gateway’ and ‘504 Gateway Timeout’ errors: Diagnosing upstream socket allocations on Rocky Linux 9

Understanding Nginx 502 and 504 Errors

The ‘502 Bad Gateway’ and ‘504 Gateway Timeout’ errors in Nginx are symptomatic of upstream communication failures. A 502 indicates that Nginx, acting as a reverse proxy, received an invalid response from an upstream server (e.g., a PHP-FPM process, a Node.js application, or another web service). A 504, conversely, signifies that Nginx did not receive a timely response from the upstream server within its configured timeout period.

While these errors point to upstream issues, the root cause can often lie in the network configuration, resource exhaustion on the upstream server, or even misconfigurations within Nginx itself, particularly concerning how it manages connections to these upstream services. This post focuses on diagnosing these issues on Rocky Linux 9, with a specific emphasis on the underlying socket allocation and network stack behavior.

Initial Diagnostic Steps: Nginx and System Logs

The first line of defense is always log analysis. On Rocky Linux 9, Nginx logs are typically found in /var/log/nginx/. We’ll be looking at both the error.log and potentially the access.log.

Nginx Error Log Analysis

Upstream failures behind a 502 or 504 are recorded in the Nginx error log. With the stock Rocky Linux 9 package, error_log points at /var/log/nginx/error.log, so journalctl -u nginx only shows service start/stop messages unless you have redirected error_log to stderr or syslog.

Tail the Nginx error log in real time:

sudo tail -f /var/log/nginx/error.log

Look for entries similar to these:

[error] 12345#12345: *678 connect() to unix:/run/php-fpm/www.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 192.168.1.100, server: example.com, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock", host: "example.com"
[error] 12345#12345: *679 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.1.101, server: example.com, request: "POST /api/data HTTP/1.1", upstream: "http://127.0.0.1:8080/api/data", host: "example.com"

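In the first example, connect() on the Unix socket failed with errno 11 (EAGAIN, “Resource temporarily unavailable”). On Linux this means the socket’s listen backlog is full: the upstream (here PHP-FPM) is not accepting connections as fast as they arrive, typically because every worker is busy. Nginx answers this request with a 502. In the second example, errno 110 (ETIMEDOUT) while reading the response header means the connection was established but the upstream sent nothing within the read timeout (proxy_read_timeout for this HTTP upstream), so Nginx answers with a 504.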
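
To see which failure dominates, it can help to tally upstream errors by errno. The following is a minimal sketch, assuming the default error log format shown above; the function name summarize_upstream_errors is hypothetical.

```shell
# Tally upstream-related errors by "(errno: message)" from an Nginx error log on stdin.
summarize_upstream_errors() {
    grep -E 'upstream' \
        | grep -oE '\([0-9]+: [^)]*\)' \
        | sort | uniq -c | sort -rn \
        | awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count "\t" $0 }'
}

# Usage (log path assumed to be the Rocky Linux 9 default):
#   sudo cat /var/log/nginx/error.log | summarize_upstream_errors
```

A log dominated by errno 11 points toward worker or backlog exhaustion, while one dominated by errno 110 points toward slow upstream responses.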

Upstream Service Logs

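Always check the logs of the upstream service itself. For PHP-FPM on Rocky Linux 9, PHP errors from the pool typically go to /var/log/php-fpm/www-error.log (or similar, depending on your FPM pool configuration), while the FPM master writes process-manager warnings, such as the pool reaching pm.max_children, to the error_log set in the [global] section (commonly /var/log/php-fpm/error.log). For other services, consult their specific logging locations.

sudo tail -f /var/log/php-fpm/error.log /var/log/php-fpm/www-error.log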

Look for errors indicating worker exhaustion, segmentation faults, or other critical failures within the upstream application.

Diagnosing Socket Allocation and Resource Issues

The Nginx error mentioning “Resource temporarily unavailable” (errno 11) when connecting to a Unix socket (like /run/php-fpm/www.sock) is a strong indicator of problems at the operating system level or within the upstream service’s ability to manage its sockets.

PHP-FPM Worker Pool Exhaustion

PHP-FPM runs a bounded number of worker processes per pool, capped by the pm.max_children directive in the pool configuration file (e.g., /etc/php-fpm.d/www.conf). When every worker is busy, new connections wait in the socket’s listen backlog; once that backlog is also full, Nginx’s connect() fails with “Resource temporarily unavailable”.

[global]
pid = /run/php-fpm/php-fpm.pid
error_log = /var/log/php-fpm/error.log
log_level = notice

[www]
user = nginx
group = nginx
listen = /run/php-fpm/www.sock
listen.owner = nginx
listen.group = nginx
listen.mode = 0660
pm = dynamic
; Tune this value
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 2
pm.max_spare_servers = 10
pm.process_idle_timeout = 10s
request_terminate_timeout = 60s

To diagnose this, monitor the number of active PHP-FPM workers. You can do this using pgrep or by leveraging PHP-FPM’s status page if enabled. The command below counts only the workers of the www pool and excludes the master process:

pgrep -c -f "php-fpm: pool www"

If this count regularly sits at pm.max_children during peak load, the pool is saturated and you need to increase pm.max_children. However, be mindful of server memory: each PHP-FPM worker consumes memory. A common approach is to set pm.max_children to a value that, when multiplied by the average memory footprint of a PHP-FPM worker, stays within your available RAM, leaving room for the OS and Nginx.
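
The sizing rule above is simple arithmetic, sketched below. The function names and the example figures (8192 MB total, 2048 MB reserved, 60 MB per worker) are illustrative assumptions, not measurements from any particular host.

```shell
# Estimate pm.max_children from total RAM, RAM reserved for the OS, Nginx, and
# other services, and the average resident size of one PHP-FPM worker (all in MB).
estimate_max_children() {
    total_mb=$1; reserved_mb=$2; worker_mb=$3
    echo $(( (total_mb - reserved_mb) / worker_mb ))
}

# Average worker RSS in MB, read from "ps -o rss=" output (KB, one value per line).
avg_rss_mb() {
    awk '{ sum += $1; n++ } END { if (n) printf "%d\n", sum / n / 1024 }'
}

# Usage on a live host (pool name "www" as in the configuration above):
#   ps -o rss= -p "$(pgrep -d, -f 'php-fpm: pool www')" | avg_rss_mb
#   estimate_max_children 8192 2048 60
```

With those example figures the estimate is (8192 - 2048) / 60, which rounds down to 102 workers. Measure the worker size under realistic load, since RSS grows with the heaviest requests a worker has served.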

Unix Socket Permissions and State

The Nginx configuration must correctly reference the PHP-FPM socket, and the socket must be accessible. Ensure the listen.owner, listen.group, and listen.mode in the PHP-FPM pool configuration align with the user/group Nginx runs as (typically nginx on Rocky Linux 9) and that Nginx has read/write permissions.

# Nginx configuration snippet (e.g., in /etc/nginx/conf.d/your_app.conf)
location ~ \.php$ {
    try_files $uri =404;
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    # Ensure this path matches the "listen" value in the PHP-FPM pool config
    fastcgi_pass unix:/run/php-fpm/www.sock;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    include fastcgi_params;
}

Verify the socket’s existence and permissions:

ls -l /run/php-fpm/www.sock

If the socket is missing, restart PHP-FPM: sudo systemctl restart php-fpm. If permissions are incorrect, adjust them in the PHP-FPM pool configuration and restart PHP-FPM.
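
The existence check can be scripted so it is easy to repeat during an incident. This is a minimal sketch; socket_status is a hypothetical helper name, and the sudo -u nginx line assumes Nginx runs as the nginx user.

```shell
# Report whether a path exists and is a Unix domain socket.
socket_status() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -S "$1" ]; then
        echo "not-a-socket"
    else
        echo "ok"
    fi
}

# Usage:
#   socket_status /run/php-fpm/www.sock
#   # Confirm the nginx user can reach and write to the socket:
#   sudo -u nginx test -w /run/php-fpm/www.sock && echo "nginx can write"
```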

TCP Socket Issues (for HTTP Upstreams)

If your upstream is a TCP service (e.g., a Node.js app on port 8080), the Nginx configuration will use an IP address and port.

location / {
    proxy_pass http://127.0.0.1:8080;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}

The ‘504 Gateway Timeout’ error here often means the upstream application is slow to respond or is overloaded. Use ss (netstat is only available on Rocky Linux 9 if the net-tools package is installed) to check whether the service is listening on the upstream port.

sudo ss -tlnp | grep ':8080'

This shows whether the upstream service is listening on the expected port and which process owns it. If the upstream application is crashing or not starting, nothing is listening, so Nginx’s connect() is refused (111: Connection refused in the error log) and the client gets a 502. Check the upstream application’s logs for startup errors.
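
Beyond the listener itself, the distribution of connection states toward the upstream is informative. The sketch below assumes the usual ss -tan column layout (State, Recv-Q, Send-Q, local address, peer address); conn_states is a hypothetical helper name.

```shell
# Count TCP connection states for a given peer port from "ss -tan" output on stdin.
# Columns assumed: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
conn_states() {
    port=$1
    awk -v port="$port" '
        NR > 1 && $5 ~ (":" port "$") { count[$1]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Usage: sudo ss -tan | conn_states 8080
```

Many connections stuck in SYN-SENT suggest the upstream is not accepting, while a large TIME-WAIT count suggests heavy connection churn between Nginx and the upstream.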

Nginx Timeout and Buffer Settings

While the root cause is often upstream, Nginx’s own timeout and buffer settings can exacerbate or mask the problem, or even be the direct cause of a 504 if the upstream is slow but functional.

Key Nginx Directives

proxy_connect_timeout: Timeout for establishing a connection with the upstream server (default 60s).
proxy_send_timeout: Timeout between two successive write operations while transmitting a request to the upstream server, not for the whole request (default 60s).
proxy_read_timeout: Timeout between two successive read operations while reading the response from the upstream server, not for the whole response (default 60s). This is often the most relevant for 504 errors.
fastcgi_read_timeout: The same as proxy_read_timeout but for FastCGI upstreams such as PHP-FPM.
proxy_buffers and proxy_buffer_size: Control the buffering of responses from the upstream. proxy_buffer_size must hold the upstream’s response header; if the header is larger, Nginx logs “upstream sent too big header” and returns a 502.

These directives are typically set within your http block or within specific server or location blocks in your Nginx configuration (e.g., /etc/nginx/nginx.conf or files in /etc/nginx/conf.d/).

http {
    # ... other http settings ...

    proxy_connect_timeout       60s;
    proxy_send_timeout          60s;
    proxy_read_timeout          300s;  # Increased timeout for slow upstream responses
    fastcgi_read_timeout        300s;  # For FastCGI/PHP-FPM

    proxy_buffer_size           16k;
    proxy_buffers               4 32k;
    proxy_busy_buffers_size     64k;

    # ... other http settings ...
}

Caution: Indiscriminately increasing timeouts can mask underlying performance issues in your upstream application. It’s better to address the root cause of slowness. However, for long-running processes or complex queries, longer timeouts might be necessary.
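
Before changing any timeout, it is worth confirming which values are already set, since directives can be spread across several included files. A minimal sketch, relying on nginx -T to dump the full configuration; effective_timeouts is a hypothetical helper name.

```shell
# Print every uncommented proxy/fastcgi timeout directive found in the
# configuration dump that "nginx -T" writes, read from stdin.
effective_timeouts() {
    grep -E '^[[:space:]]*(proxy|fastcgi)_(connect|send|read)_timeout[[:space:]]'
}

# Usage: sudo nginx -T 2>/dev/null | effective_timeouts
```

The output lists directives as written, so a value set in a location block still overrides one inherited from the http block.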

Network Stack and System Limits

In high-concurrency scenarios, the operating system’s network stack can become a bottleneck. Rocky Linux 9, like other modern Linux distributions, has configurable limits that can affect socket handling.

Ephemeral Port Range

When Nginx connects to an upstream service via TCP, it uses an ephemeral port from the OS. If Nginx is making a very large number of outbound connections rapidly, it can exhaust the available ephemeral ports, in which case the error log shows connect() failing with (99: Cannot assign requested address). This is less common for typical web server setups but possible in complex proxy chains or high-traffic environments.

# Check current ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range

# Temporarily increase the range (e.g., from the default 32768-60999 to 1024-65535)
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Make it permanent by editing /etc/sysctl.conf or a file in /etc/sysctl.d/
# Example: echo "net.ipv4.ip_local_port_range = 1024 65535" | sudo tee /etc/sysctl.d/99-ports.conf
# Then apply: sudo sysctl -p /etc/sysctl.d/99-ports.conf
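
To judge whether the range is actually a constraint, compare its size with the number of sockets held in TIME-WAIT toward the upstream. The sketch below is illustrative; port_range_size is a hypothetical helper name and port 8080 is the example upstream port used earlier.

```shell
# Number of ephemeral ports in an "ip_local_port_range" line ("low high") on stdin.
port_range_size() {
    awk '{ print $2 - $1 + 1 }'
}

# Usage:
#   port_range_size < /proc/sys/net/ipv4/ip_local_port_range
#   # Sockets currently in TIME-WAIT toward the upstream (port 8080 assumed):
#   ss -tan state time-wait '( dport = :8080 )' | tail -n +2 | wc -l
```

The default range of 32768 to 60999 gives 28232 ports. Because a closed connection stays in TIME-WAIT for 60 seconds on Linux, that is roughly 470 new connections per second to a single upstream address before ports run short, assuming no upstream keepalive.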

File Descriptor Limits

Every open socket, file, or pipe consumes a file descriptor. If Nginx or the upstream service hits its per-process or system-wide file descriptor limit, it cannot open new connections or files, leading to errors.

# Check current limits for the Nginx process
sudo cat /proc/$(pgrep -f "nginx: worker process" | head -n 1)/limits

# Check system-wide limits
cat /proc/sys/fs/file-max

# /etc/security/limits.conf is applied by PAM at login and does NOT affect
# services started by systemd. For the nginx service, use a systemd drop-in:
#   sudo systemctl edit nginx
#   [Service]
#   LimitNOFILE=65536
# or set "worker_rlimit_nofile 65536;" in the main context of nginx.conf.
# Then restart the service: sudo systemctl restart nginx

# For system-wide changes:
sudo sysctl -w fs.file-max=2000000

# Make permanent: edit /etc/sysctl.conf or a file in /etc/sysctl.d/
# Example: echo "fs.file-max = 2000000" | sudo tee /etc/sysctl.d/99-files.conf
# Then apply: sudo sysctl -p /etc/sysctl.d/99-files.conf

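Ensure that both Nginx and your upstream service (e.g., PHP-FPM) have sufficient file descriptors allocated. For PHP-FPM, this is managed with LimitNOFILE in a systemd drop-in for the php-fpm service or with the rlimit_files directive in the pool configuration; /etc/security/limits.conf does not apply to systemd-managed services.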
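
A limit only matters relative to current usage, so compare the two for a running worker. This is a minimal sketch that parses the "Max open files" row of /proc/<pid>/limits; soft_nofile is a hypothetical helper name.

```shell
# Extract the soft "Max open files" limit from /proc/<pid>/limits content on stdin.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Usage: compare open descriptors against the limit for one Nginx worker.
#   pid=$(pgrep -f "nginx: worker process" | head -n 1)
#   echo "open: $(sudo ls /proc/$pid/fd | wc -l)  limit: $(sudo cat /proc/$pid/limits | soft_nofile)"
```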

TCP Connection Backlog

When a server is under heavy load, incoming connections wait in a queue (the listen backlog) until the application accepts them. If the backlog is full, new attempts are dropped or refused. For a TCP upstream, the kernel typically drops the SYN, so Nginx sees a connect timeout and returns a 504 once proxy_connect_timeout expires; for a Unix socket, connect() fails immediately with EAGAIN and Nginx returns a 502.

# Check current TCP backlog settings
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog

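# Increase backlog (example values; check the current values first, since
# recent kernels such as the one in Rocky Linux 9 already default somaxconn to 4096)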
sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096

# Make permanent by editing /etc/sysctl.conf or a file in /etc/sysctl.d/
# Example:
# echo "net.core.somaxconn = 4096" | sudo tee /etc/sysctl.d/99-tcp.conf
# echo "net.ipv4.tcp_max_syn_backlog = 4096" | sudo tee -a /etc/sysctl.d/99-tcp.conf
# Then apply: sudo sysctl -p /etc/sysctl.d/99-tcp.conf

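The kernel silently caps any listen() backlog at net.core.somaxconn, so somaxconn must be at least as large as the listen.backlog setting in your PHP-FPM pool configuration (e.g., /etc/php-fpm.d/www.conf) or the equivalent backlog setting of your TCP upstream; otherwise the larger application value has no effect. The cap applies to both Unix and TCP sockets. After changing either value, restart the upstream service so that it calls listen() again.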
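
Whether a backlog is actually overflowing can be checked directly. For listening TCP sockets, ss reports the current accept-queue length in Recv-Q and the configured backlog in Send-Q. The sketch below flags listeners whose queue has reached the backlog; full_listen_queues is a hypothetical helper name.

```shell
# List TCP listeners whose accept queue is at or over its backlog,
# from "ss -ltn" output on stdin. Prints "address:port queued/backlog".
full_listen_queues() {
    awk 'NR > 1 && $3 > 0 && $2 >= $3 { print $4, $2 "/" $3 }'
}

# Usage:
#   ss -ltn | full_listen_queues
#   # Kernel counters for dropped connection attempts (nstat ships with iproute2):
#   nstat -az TcpExtListenOverflows TcpExtListenDrops
```

A listener that appears here repeatedly, or a ListenOverflows counter that keeps growing, indicates the application is accepting too slowly; raising the backlog only buys time unless the upstream itself gets faster or gains workers.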

Advanced Troubleshooting: tcpdump and strace

When log analysis and configuration checks don’t reveal the issue, lower-level network and system call tracing can be invaluable.

Using tcpdump for Network Analysis

tcpdump allows you to capture network packets. This is useful for seeing if Nginx is even attempting to connect to the upstream, and what kind of responses (or lack thereof) it’s receiving.

# Capture traffic on the upstream port (e.g., 8080 for a TCP upstream)
sudo tcpdump -i any -nn -s0 'port 8080' -w upstream_traffic.pcap

# Unix domain sockets (such as /run/php-fpm/www.sock) do not carry IP packets,
# so tcpdump cannot capture them. For Unix-socket upstreams, use strace (see below).
# tcpdump only helps with PHP-FPM if the pool listens on a TCP address.

# If Nginx is connecting to a TCP upstream on localhost:
sudo tcpdump -i lo -nn -s0 'tcp port 8080' -w nginx_to_upstream.pcap

Analyze the captured .pcap files using Wireshark or tshark. Look for SYN packets from Nginx, SYN-ACKs from the upstream, RST packets (resets), or simply a lack of response, which would indicate a timeout.
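
For a quick first pass without opening Wireshark, the flag combinations in a capture can be counted from tcpdump's text output. A minimal sketch, assuming tcpdump's standard "Flags [...]" notation; flag_counts is a hypothetical helper name.

```shell
# Count TCP flag combinations from "tcpdump -nn -r <file>" text output on stdin.
flag_counts() {
    grep -oE 'Flags \[[^]]*\]' | LC_ALL=C sort | uniq -c | awk '{ print $3, $1 }'
}

# Usage: tcpdump -nn -r upstream_traffic.pcap | flag_counts
```

Here [S] is a SYN, [S.] a SYN-ACK, and [R] or [R.] a reset. Many more [S] than [S.] entries suggest dropped connection attempts, while resets right after SYNs suggest nothing is listening on the port.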

Using strace for System Call Tracing

strace traces system calls made by a process. This is excellent for diagnosing issues with socket operations, file access, and resource allocation at the OS level.

# Trace socket and file activity in one Nginx worker for 30 seconds.
# %network covers connect, accept4, sendto, recvfrom, and related calls;
# openat replaces open, which modern glibc no longer issues.
NGINX_WORKER_PID=$(pgrep -f "nginx: worker process" | head -n 1)
sudo timeout 30 strace -p "$NGINX_WORKER_PID" -s 1024 -e trace=%network,openat,close -f -o /tmp/nginx_strace.log

# Trace the PHP-FPM workers of the www pool for 30 seconds, one log per worker
for pid in $(pgrep -f "php-fpm: pool www"); do
    sudo timeout 30 strace -p "$pid" -s 1024 -e trace=%network,openat,close -f -o "/tmp/php-fpm_${pid}_strace.log" &
done
wait  # the background traces stop on their own when the timeout expires

# Analyze the logs once the traces have finished.
# Look for 'connect()' calls that fail with EAGAIN (Resource temporarily unavailable) or ETIMEDOUT.
# For Unix sockets, you'll see 'connect()' calls targeting the socket file.
# Example strace output snippet for a failed connect to a Unix socket:
# connect(3, {sa_family=AF_UNIX, sun_path="/run/php-fpm/www.sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)

strace output can be verbose. Filtering by specific system calls (like connect, accept, read, write) and focusing on errors (return values of -1) is key.
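
Focusing on failures can be automated by summarizing every call that returned -1. The sketch below assumes log lines in the format strace writes with -f -o (an optional PID prefix, then the call); failed_syscalls is a hypothetical helper name.

```shell
# Summarize failed system calls in an strace log on stdin.
# Prints "count syscall errno", most frequent first.
failed_syscalls() {
    sed -nE 's/^([0-9]+ +)?([a-z0-9_]+)\(.*\) += -1 (E[A-Z0-9]+).*/\2 \3/p' \
        | LC_ALL=C sort | uniq -c | sort -rn | awk '{ print $1, $2, $3 }'
}

# Usage: failed_syscalls < /tmp/nginx_strace.log
```

Some EAGAIN results on recvfrom or accept4 are normal for non-blocking sockets, so the entries to investigate are failures on connect.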

Conclusion

Troubleshooting Nginx 502 and 504 errors requires a systematic approach, starting from application logs and moving down to the operating system’s network stack and resource limits. On Rocky Linux 9, understanding how systemd manages services, along with standard Linux tuning parameters for networking and file descriptors, is crucial. By correlating Nginx error logs with upstream service logs and utilizing tools like strace and tcpdump, you can effectively pinpoint the root cause, whether it’s upstream process exhaustion, network misconfiguration, or OS-level resource constraints.