The Ultimate DevOps Playbook: Tuning Nginx, Gunicorn/FPM, and Elasticsearch on AWS for C

Nginx Tuning for High Throughput on AWS EC2

Optimizing Nginx as a reverse proxy and static file server is crucial for any high-traffic application. On AWS, leveraging EC2 instances requires careful consideration of kernel parameters and Nginx configuration directives to maximize I/O and network throughput. We’ll focus on tuning for a typical web application serving dynamic content via Gunicorn/FPM and static assets.

Kernel Parameter Tuning

Before touching Nginx, ensure the underlying operating system is configured for high concurrency. For Amazon Linux 2 or Ubuntu on EC2, these parameters are critical:

File Descriptors

Nginx, especially with many worker processes and connections, can consume a large number of file descriptors. Increase the system-wide limit and the limit for the Nginx user.

System-wide Limits

Edit /etc/security/limits.conf. Add these lines to set a high limit for all users, particularly the user Nginx runs as (often ‘nginx’ or ‘www-data’).

* soft nofile 65536
* hard nofile 65536

Nginx User Limits (if applicable)

If you have specific limits for the Nginx user, you can add them like this:

nginx soft nofile 65536
nginx hard nofile 65536

Apply Limits

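These changes apply to new login sessions, because limits.conf is read by PAM at login; log out and back in, or reboot. Services started by systemd, which includes Nginx on Amazon Linux 2 and Ubuntu, do not read limits.conf. For those, set LimitNOFILE=65536 in the service unit (for example via systemctl edit nginx) and rely on worker_rlimit_nofile in nginx.conf. To change the limit of a process that is already running, use prlimit.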
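To confirm which limit a process actually received, a quick check from Python can help. The following is a minimal sketch using the standard-library resource module (Unix only); the helper names are hypothetical and the 65536 target simply mirrors the value used above.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_target(limits, target=65536):
    """True if the soft limit is unlimited or at least `target`."""
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY or soft >= target


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} ok={meets_target((soft, hard))}")
```

This reports the limits of the Python process itself. For a running Nginx worker, the same numbers appear in /proc/<pid>/limits.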

Network Stack Tuning

Optimize TCP/IP settings for high connection rates and efficient data transfer. Edit /etc/sysctl.conf.

# System-wide maximum number of open file handles
fs.file-max = 2097152

# Per-process ceiling on open file descriptors (upper bound for any nofile limit)
fs.nr_open = 2097152

# Maximum length of the accept queue of a listening socket (listen backlog).
# Nginx defaults to a backlog of 511 on Linux; raise it with backlog= on the listen directive.
net.core.somaxconn = 4096

# Maximum number of half-open connections (SYN received, handshake not yet complete)
net.ipv4.tcp_max_syn_backlog = 4096

# Enable TCP Fast Open (requires kernel support and client support)
net.ipv4.tcp_fastopen = 3

# Maximum number of received packets queued per CPU when the interface delivers them faster than the kernel can process them
net.core.netdev_max_backlog = 2000

# Increase the maximum number of TCP sockets in TIME-WAIT state
net.ipv4.tcp_max_tw_buckets = 180000

# Reduce the time an orphaned TCP socket stays in FIN-WAIT-2 state (this does not change the TIME-WAIT duration)
net.ipv4.tcp_fin_timeout = 30

# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1

# Enable selective acknowledgment
net.ipv4.tcp_sack = 1

# Enable duplicate selective acknowledgment (D-SACK)
net.ipv4.tcp_dsack = 1

# Maximum receive buffer size, in bytes, that any socket may request
net.core.rmem_max = 16777216

# Maximum send buffer size, in bytes, that any socket may request
net.core.wmem_max = 16777216

# Increase the default receive buffer size
net.core.rmem_default = 8388608

# Increase the default send buffer size
net.core.wmem_default = 8388608

# Widen the range of local (ephemeral) ports available for outgoing connections, such as those to upstreams
net.ipv4.ip_local_port_range = 1024 65535

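# Tune TCP keepalive timing for sockets that enable SO_KEEPALIVE:
# first probe after 600s idle, then one every 60s, giving up after 5 unanswered probes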
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5

# Disable ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0

# Disable source routing
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0

# Enable IP forwarding (if Nginx acts as a router, usually not needed for reverse proxy)
# net.ipv4.ip_forward = 1

Apply Sysctl Changes

Apply these changes immediately without a reboot:

sudo sysctl -p
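After applying the settings, it can be worth confirming that the kernel accepted them. The sketch below parses sysctl.conf-style text and compares each key with the live value under /proc/sys. The function names are hypothetical, and the key-to-path mapping assumes keys without dots inside a component (true for every key listed above, but not for interface names such as eth0.100).

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: value}, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Collapse internal whitespace so "1024   65535" compares equal to "1024 65535".
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a sysctl key such as net.core.somaxconn to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def find_mismatches(expected):
    """Return {key: (expected, live)} for keys whose live value differs or is unreadable."""
    mismatches = {}
    for key, want in expected.items():
        try:
            live = " ".join(proc_path(key).read_text().split())
        except OSError:
            live = None  # Key missing on this kernel or not readable
        if live != want:
            mismatches[key] = (want, live)
    return mismatches
```

A typical use would be find_mismatches(parse_sysctl(Path("/etc/sysctl.conf").read_text())), which returns an empty dict when every listed key matches.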

Nginx Configuration Tuning

Now, let’s tune the nginx.conf file, typically located at /etc/nginx/nginx.conf or within /etc/nginx/conf.d/.

Global Directives

The directives below span three contexts: the main (top-level) context, the events block, and the http block.

# --- Main (top-level) context ---

# Maximum number of open file descriptors for each Nginx worker process
worker_rlimit_nofile 65536;

# Number of worker processes. 'auto' uses the number of available CPU cores.
# Example: an m5.large (2 vCPUs) gets 2 workers; an m5.4xlarge (16 vCPUs) gets 16.
worker_processes auto; # Or specify a number, e.g., worker_processes 16;

# Nginx has no worker_threads directive. Blocking file reads can instead be
# offloaded to a thread pool (requires a build with --with-threads):
# thread_pool default threads=32 max_queue=65536; # main context
# aio threads; # http, server, or location context

# --- events block ---
events {
# epoll is the efficient event method on Linux. Nginx selects it automatically,
# so this line only makes the choice explicit.
use epoll;

# Maximum number of simultaneous connections that each worker process can handle.
# The count includes connections to upstreams, so a proxied request uses two.
# A common starting point is 1024 or 2048, but it can be higher.
# Keep it below worker_rlimit_nofile, which is itself a per-worker limit.
worker_connections 4096;
}

# --- http block ---
http {
# ... other http directives ...

# Buffers for reading responses from upstream servers (Gunicorn/FPM).
# proxy_buffer_size holds the first part of the response, which contains the
# headers; increase it if your backend sends large headers.
proxy_buffer_size 4k;
proxy_buffers 16 4k; # Number of buffers and size per buffer

# Keepalive connections to upstream servers.
# This reduces the overhead of establishing new connections to the backend.
# Upstream keepalive needs HTTP/1.1, an empty Connection header on the proxying
# location, and a 'keepalive' directive in the upstream block.
proxy_http_version 1.1;
proxy_connect_timeout 60s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;

# Enable gzip compression for static and dynamic content.
# Tune these parameters for optimal compression and CPU usage.
gzip on;
gzip_vary on;
gzip_proxied any; # Compress responses for proxied requests
gzip_comp_level 6; # Compression level (1-9)
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript image/svg+xml;
gzip_min_length 1000; # Minimum response length to compress

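# Enable HTTP/2 for multiplexing and header compression.
# The http2 directive requires Nginx 1.25.1 or newer; on older versions use
# 'listen 443 ssl http2;' instead. Browsers only negotiate HTTP/2 over TLS,
# so this takes effect on HTTPS listeners.
http2 on;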

# Client request body settings
client_max_body_size 100M; # Adjust as needed for file uploads

# Timeout for receiving client request body
client_body_timeout 60s;

# Timeout for receiving client request headers
client_header_timeout 60s;

# Enable access log buffering to reduce disk I/O.
access_log /var/log/nginx/access.log combined buffer=16k flush=1m;

# Error log level. 'warn' records warnings and anything more severe; switch to 'debug' only while troubleshooting.
error_log /var/log/nginx/error.log warn;

# Open file cache for static assets. Significantly speeds up serving static files.
open_file_cache max=20000 inactive=20s;
open_file_cache_valid 30s;
open_file_cache_min_uses 2;
open_file_cache_errors on;

# ... other http directives ...
}
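The relationship between worker_processes, worker_connections, and the per-worker descriptor limit is easy to get wrong, so a small back-of-the-envelope calculation may help. This is a rough sketch with hypothetical function names; it assumes every request is proxied (two connections per request) and ignores descriptors used for log files and cached static files.

```python
def max_proxied_clients(worker_processes, worker_connections):
    """Approximate concurrent clients when every request is proxied.

    Each proxied request holds two connections in a worker:
    one to the client and one to the upstream.
    """
    return worker_processes * worker_connections // 2


def fd_headroom(worker_rlimit_nofile, worker_connections):
    """Descriptors left per worker once every connection slot is in use.

    worker_rlimit_nofile is a per-worker limit and each connection
    consumes one descriptor.
    """
    return worker_rlimit_nofile - worker_connections


if __name__ == "__main__":
    # Values from the configuration above on a 2-vCPU instance.
    print(max_proxied_clients(2, 4096))
    print(fd_headroom(65536, 4096))
```

A negative or very small headroom suggests raising worker_rlimit_nofile or lowering the connection limit.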

Server Block (Example for Proxying)

In your specific server configuration (e.g., /etc/nginx/sites-available/your_app):

server {
    listen 80;
    listen [::]:80;
    server_name your_domain.com www.your_domain.com;

    # Redirect HTTP to HTTPS (if SSL is configured)
    # return 301 https://$host$request_uri;

    # Serve static files directly from Nginx for maximum performance.
    # Configure this path to your static assets directory.
    location /static/ {
        alias /var/www/your_app/static/;
        expires 30d; # Cache static assets for 30 days
        access_log off; # Disable access logging for static files if desired
        add_header Cache-Control "public";
    }

    location /media/ {
        alias /var/www/your_app/media/;
        expires 30d;
        access_log off;
        add_header Cache-Control "public";
    }

    # Proxy dynamic requests to your backend application (Gunicorn/FPM).
    location / {
        proxy_pass http://your_backend_app; # This is an upstream group name
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
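        proxy_set_header X-Forwarded-Proto $scheme;
        # Clear the Connection header so keepalive connections to the upstream can be reused
        proxy_set_header Connection "";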

        # Increase timeouts for potentially long-running requests
        proxy_connect_timeout 75s;
        proxy_send_timeout 75s;
        proxy_read_timeout 75s;

        # Buffering settings for proxy
        proxy_buffering on;
        proxy_buffers 8 16k; # Adjust buffer size and count
        proxy_buffer_size 32k;
        proxy_busy_buffers_size 64k;
    }

    # Optional: Handle specific API endpoints differently
    # location /api/ {
    #     proxy_pass http://your_backend_app;
    #     # ... other proxy settings ...
    # }

    # Optional: Health check endpoint for load balancers
    location /healthz {
        access_log off;
        default_type text/plain; # Sets the Content-Type of the response below
        return 200 'OK';
    }

    # Optional: Deny access to hidden files
    location ~ /\. {
        deny all;
    }
}

# Define your upstream backend(s)
upstream your_backend_app {
    # For Gunicorn (Python WSGI)
    # If using a Unix socket:
    # server unix:/run/gunicorn/your_app.sock fail_timeout=0;
    # If using TCP:
    server 127.0.0.1:8000 fail_timeout=0; # Adjust port if Gunicorn uses a different one

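    # For PHP-FPM
    # PHP-FPM speaks FastCGI rather than HTTP, so the location must use
    # fastcgi_pass and the fastcgi_* directives instead of proxy_pass.
    # If using a Unix socket:
    # server unix:/var/run/php/php7.4-fpm.sock;
    # If using TCP:
    # server 127.0.0.1:9000;

    # Load balancing method (round-robin is the default and needs no directive)
    # least_conn; # Use least_conn for better distribution if connections vary in duration
    # ip_hash; # Use ip_hash if session stickiness is required

    # Idle keepalive connections to cache per worker (32 is an example value).
    # Must come after any load balancing method.
    keepalive 32;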
}

Gunicorn Tuning (Python WSGI)

Gunicorn is a popular WSGI HTTP Server for Python. Tuning it involves managing worker processes and threads.

Worker Processes and Threads

The optimal number of worker processes depends on your application’s I/O bound vs. CPU bound nature and the number of CPU cores available. For I/O bound applications (most web apps), a common recommendation is (2 * Number of CPU Cores) + 1. For CPU-bound tasks, you might stick closer to the number of CPU cores.

Gunicorn also supports a --threads option for each worker, allowing concurrency within a single process. Setting it above 1 switches the default sync worker to the gthread worker class. This is useful if your application is I/O bound and can benefit from handling multiple requests concurrently per worker.

Command-line Example

Assuming you have 4 CPU cores on your EC2 instance:

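# For I/O bound applications, using threads
# (5 workers x 2 threads handle up to 10 requests at once; without threads,
# the formula above gives (2 * 4) + 1 = 9 workers)
gunicorn --workers 5 --threads 2 --bind 127.0.0.1:8000 your_app.wsgi:application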

# For CPU bound applications, fewer threads or no threads
# gunicorn --workers 4 --bind 127.0.0.1:8000 your_app.wsgi:application
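Because Gunicorn configuration files are plain Python, the worker count can be derived from the CPU count instead of being hard-coded. The following is a minimal sketch of a gunicorn.conf.py; the suggested_workers helper is hypothetical, while bind, workers, and threads are standard Gunicorn setting names.

```python
import multiprocessing


def suggested_workers(cpu_cores):
    """Apply the (2 * cores) + 1 rule of thumb for I/O-bound applications."""
    return 2 * cpu_cores + 1


# Gunicorn reads these module-level names as settings.
bind = "127.0.0.1:8000"
workers = suggested_workers(multiprocessing.cpu_count())
threads = 2
```

Start Gunicorn with gunicorn -c gunicorn.conf.py your_app.wsgi:application. Treat the computed value as a starting point and adjust it after measuring memory use and latency under load.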

Systemd Service File Example

A typical systemd service file for Gunicorn (e.g., /etc/systemd/system/gunicorn.service):

[Unit]
Description=Gunicorn instance to serve your_app
After=network.target

[Service]
User=your_user
Group=your_group
WorkingDirectory=/path/to/your_app
ExecStart=/path/to/your_venv/bin/gunicorn \
    --workers 5 \
    --threads 2 \
    --bind unix:/run/gunicorn/your_app.sock \
    --access-logfile /var/log/gunicorn/access.log \
    --error-logfile /var/log/gunicorn/error.log \
    your_app.wsgi:application

# If binding to TCP:
# ExecStart=/path/to/your_venv/bin/gunicorn \
#     --workers 5 \
#     --threads 2 \
#     --bind 127.0.0.1:8000 \
#     --access-logfile /var/log/gunicorn/access.log \
#     --error-logfile /var/log/gunicorn/error.log \
#     your_app.wsgi:application

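# Create /run/gunicorn at start so the socket path exists (/run is cleared on reboot)
RuntimeDirectory=gunicorn
Restart=always
# Gunicorn writes its own log files via the options above; these lines capture
# any other process output. The append: prefix requires systemd 240 or newer.
StandardOutput=append:/var/log/gunicorn/access.log
StandardError=append:/var/log/gunicorn/error.log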

[Install]
WantedBy=multi-user.target

Gunicorn Timeout Settings

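The --timeout setting controls how long the Gunicorn master waits before killing and restarting a worker that has gone silent (the default is 30 seconds). With sync workers this effectively caps how long a single request may run. Set it higher than Nginx’s proxy_read_timeout (75s in the location block above) so that Nginx gives up first and a worker is not killed in the middle of a response, but avoid excessively high values.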

# Example: Set timeout to 90 seconds
gunicorn --workers 5 --threads 2 --timeout 90 --bind 127.0.0.1:8000 your_app.wsgi:application
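The ordering of the two timeouts can be checked mechanically, for instance in a deployment script. This is a small sketch with a hypothetical function name; it only encodes the rule stated above and takes both values in seconds.

```python
def check_timeouts(proxy_read_timeout, backend_timeout):
    """Return warnings for a proxy/backend timeout pair, both in seconds."""
    warnings = []
    if proxy_read_timeout <= 0 or backend_timeout <= 0:
        warnings.append("timeouts must be positive")
    elif backend_timeout <= proxy_read_timeout:
        warnings.append(
            f"backend timeout ({backend_timeout}s) should exceed "
            f"proxy_read_timeout ({proxy_read_timeout}s)"
        )
    return warnings


if __name__ == "__main__":
    # Values used in this guide: Nginx 75s, Gunicorn 90s.
    print(check_timeouts(75, 90))
```

An empty list means the pair is consistent with the guidance in this section.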

PHP-FPM Tuning

For PHP applications, PHP-FPM (FastCGI Process Manager) is the standard. Tuning involves managing process pools.

Process Manager Settings

Edit your PHP-FPM pool configuration file, typically found in /etc/php/X.Y/fpm/pool.d/www.conf (replace X.Y with your PHP version).

; Example configuration for a pool

[www]
user = www-data
group = www-data
listen = /var/run/php/php7.4-fpm.sock ; Or use TCP: listen = 127.0.0.1:9000

; Process management settings
; 'dynamic' is recommended for most cases. 'static' can be faster but less flexible.
pm = dynamic

; Maximum number of child processes that may exist at the same time.
; With pm = dynamic this is the upper limit; with pm = static it is the fixed count.
pm.max_children = 50 ; Adjust based on server memory and expected load

; Number of children created at startup, followed by the minimum and maximum
; number of idle children to keep available (used only with pm = dynamic).
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 5

; The maximum number of requests each child process should execute before respawning.
; This helps to free up resources and prevent memory leaks.
pm.max_requests = 500

; How long an idle child may live before it is killed. Used only with pm = ondemand.
; pm.process_idle_timeout = 10s ; Default is 10s

; Request termination timeout.
; If a single request runs longer than this, the worker process is killed.
; Nginx talks to PHP-FPM over FastCGI, so keep this higher than the Nginx fastcgi_read_timeout.
request_terminate_timeout = 90s

; Slowlog settings (useful for debugging)
; slowlog = /var/log/php/php-fpm-slow.log
; request_slowlog_timeout = 10s
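A common way to choose pm.max_children is to divide the memory available to PHP by the average memory of one PHP-FPM child. The sketch below shows that arithmetic; the function name is hypothetical and every figure in the example is illustrative, so measure real process sizes on your own server first.

```python
def max_children(total_ram_mb, reserved_mb, avg_process_mb):
    """Largest number of PHP-FPM children that fit in the memory left for PHP.

    reserved_mb is memory set aside for the OS, Nginx, and other services.
    """
    if avg_process_mb <= 0:
        raise ValueError("avg_process_mb must be positive")
    available = total_ram_mb - reserved_mb
    return max(1, available // avg_process_mb)


if __name__ == "__main__":
    # Illustrative numbers: 8 GiB instance, 2 GiB reserved, 64 MiB per child.
    print(max_children(8192, 2048, 64))
```

Leave some margin below the computed value, since individual requests can use more memory than the average.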

PHP Settings

Ensure your php.ini settings are also appropriate. These are typically found in /etc/php/X.Y/fpm/php.ini.

memory_limit = 256M ; Adjust based on your application's needs
upload_max_filesize = 100M
post_max_size = 100M
max_execution_time = 90 ; Should align with request_terminate_timeout and Nginx timeouts

Elasticsearch Tuning on AWS

Optimizing Elasticsearch, especially on AWS, involves JVM heap settings, shard allocation, and instance type selection. For production, consider using Amazon Elasticsearch Service (now Amazon OpenSearch Service) for managed operations, but if self-hosting on EC2:

JVM Heap Size

This is the most critical setting. Set Xms and Xmx to the same value to prevent heap resizing. The heap size should not exceed 50% of the total system memory, and never more than 30-32GB due to compressed ordinary object pointers (compressed oops).

Edit /etc/elasticsearch/jvm.options (or equivalent path):

-Xms8g
-Xmx8g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=75
-XX:+DisableExplicitGC

Note: The example above sets an 8GB heap. Adjust 8g based on your EC2 instance’s RAM. For an r5.large (16 GiB RAM), an 8GB heap is reasonable. For an r5.xlarge (32 GiB RAM), you could go up to 15-16GB.
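The sizing rule can be written down as a short calculation: half of system memory, capped below the compressed-oops threshold. This is a sketch with hypothetical function names; the 31GB cap is an assumption chosen from within the 30-32GB range mentioned above.

```python
def heap_gb(total_ram_gb, cap_gb=31):
    """Half of system RAM, capped to stay under the compressed-oops threshold."""
    return min(total_ram_gb // 2, cap_gb)


def jvm_flags(total_ram_gb):
    """Return matching -Xms/-Xmx flags so the heap never resizes."""
    size = heap_gb(total_ram_gb)
    return [f"-Xms{size}g", f"-Xmx{size}g"]


if __name__ == "__main__":
    for ram in (16, 32, 128):
        print(ram, jvm_flags(ram))
```

The remaining memory is left to the operating system, which Elasticsearch relies on for the filesystem cache.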

Shard Allocation and Size

Avoid excessively large shards. A common recommendation is to keep shard sizes between 10GB and 50GB. Too many small shards can also impact performance due to overhead.

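Index settings (example for creating an index). Adjust number_of_shards to your data volume and read/write needs; 1 or 2 replicas is common for high availability. The default refresh_interval is 1s; a longer interval reduces refresh overhead during heavy indexing, at the cost of new documents taking longer to become searchable. The request body must be plain JSON, which does not allow comments.

PUT /my-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    }
  }
}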
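The shard size guideline translates into a simple estimate of the primary shard count. The sketch below assumes a hypothetical target of 30GB per shard, a value inside the 10GB to 50GB range given above, and builds a settings body in the same shape as the example request.

```python
import json
import math


def primary_shards(index_size_gb, target_shard_gb=30):
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def index_settings(index_size_gb, replicas=1, refresh_interval="5s"):
    """Build a create-index request body for the expected index size."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards(index_size_gb),
                "number_of_replicas": replicas,
                "refresh_interval": refresh_interval,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(index_settings(90), indent=2))
```

The number of primary shards cannot be changed in place after index creation, so estimate the expected size over the lifetime of the index rather than its initial size.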

EC2 Instance Type Selection

For Elasticsearch nodes, choose instance types optimized for memory and I/O. EC2 r5, i3, or i4i instances are excellent choices. i3/i4i instances offer NVMe SSDs which provide very high IOPS and low latency, ideal for Elasticsearch data nodes.

Network Configuration

Ensure your AWS Security Groups allow traffic on port 9200 (HTTP) from your application servers to the Elasticsearch nodes, and on port 9300 (transport) between the Elasticsearch nodes themselves. For the lowest latency, place your Elasticsearch nodes and application servers within the same VPC and Availability Zone if possible, or use placement groups. Weigh this against spreading nodes across Availability Zones for resilience.

Monitoring and Iteration

Performance tuning is an iterative process. Continuously monitor your system’s metrics:

Nginx: Active connections, requests per second, error rates, worker connections, cache hit rates. Use nginx-module-vts for detailed Nginx metrics.
Gunicorn/PHP-FPM: Worker utilization, request latency, error rates, memory usage.
Elasticsearch: JVM heap usage, garbage collection activity, indexing rate, search latency, CPU utilization, disk I/O.
System: CPU load, memory usage, network I/O, disk I/O, file descriptor usage.

Use tools like AWS CloudWatch, Prometheus with Grafana, or dedicated APM solutions to collect and visualize these metrics. Make incremental changes and observe their impact.
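As a lightweight starting point for the Nginx metrics, the built-in stub_status module exposes basic connection counters as plain text. The sketch below parses that output. It assumes you have enabled stub_status on some internal location, which this guide does not configure, and the function name is hypothetical.

```python
import re


def parse_stub_status(text):
    """Parse the plain-text output of Nginx's stub_status module into a dict."""
    active = int(re.search(r"Active connections:\s*(\d+)", text).group(1))
    # The counters line holds three integers: accepts, handled, requests.
    counters = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    accepts, handled, requests = (int(value) for value in counters.groups())
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    reading, writing, waiting = (int(value) for value in states.groups())
    return {
        "active": active,
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        # Accepted but not handled usually means a resource limit was reached.
        "dropped": accepts - handled,
    }
```

A non-zero dropped count is a hint that worker_connections or the file descriptor limits are too low for the current load.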