The Ultimate DevOps Playbook: Tuning Nginx, Gunicorn/FPM, and MySQL on AWS for Shopify

Nginx as a High-Performance Frontend for Shopify Applications

When deploying a custom application or a heavily modified Shopify setup on AWS, Nginx serves as an indispensable frontend. Its strengths lie in efficient static file serving, reverse proxying, load balancing, and SSL termination. For a Shopify-centric application, we’ll focus on tuning Nginx for maximum throughput and minimal latency.

Nginx Configuration Tuning

The core of Nginx performance tuning lies within its nginx.conf file, typically located at /etc/nginx/nginx.conf or within /etc/nginx/conf.d/. We’ll focus on key directives that impact concurrency and resource utilization.

Worker Processes and Connections

The worker_processes directive determines how many worker processes Nginx spawns. Setting it to auto is generally recommended on multi-core systems, since Nginx then starts one worker per detected CPU core. The worker_connections directive sets the maximum number of simultaneous connections each worker process can open, and that count includes connections to upstream servers as well as to clients. The theoretical ceiling is therefore worker_processes times worker_connections, and roughly half of that when Nginx is proxying, because each proxied request holds both a client connection and an upstream connection.

worker_processes auto;

events {
    worker_connections 4096; # Adjust based on system limits and expected load
    multi_accept on;
}

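Note: Each connection consumes a file descriptor, so the limit available to Nginx worker processes must be at least as large as worker_connections. Check the current limit with ulimit -n, and raise it through /etc/security/limits.conf, the worker_rlimit_nofile directive in nginx.conf, or LimitNOFILE in the systemd unit when Nginx runs as a service.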
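
The capacity arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a measurement: the function names are hypothetical, and the descriptor check inspects the limit of the process running the script, which may differ from the limit Nginx itself runs under.

```python
import resource


def max_clients(worker_processes: int, worker_connections: int, proxying: bool = True) -> int:
    """Rough upper bound on concurrent clients for a given Nginx configuration."""
    total = worker_processes * worker_connections
    # A proxied request holds two connections: one to the client, one to the upstream.
    return total // 2 if proxying else total


def fd_limit_ok(worker_connections: int) -> bool:
    """Check whether this process's soft file descriptor limit covers worker_connections."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or soft >= worker_connections


if __name__ == "__main__":
    # Assumed example: 4 CPU cores, worker_connections 4096, Nginx acting as a reverse proxy.
    print("estimated max clients:", max_clients(4, 4096))
    print("descriptor limit sufficient:", fd_limit_ok(4096))
```

Treat the result as a ceiling only. Memory, upstream capacity, and kernel settings usually limit real concurrency well before this number is reached.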

HTTP Request Buffering and Timeouts

Buffering directives control how Nginx handles client request bodies. For large uploads, raise client_max_body_size above its 1 MB default; otherwise Nginx rejects larger bodies with a 413 (Request Entity Too Large) response. Timeouts are vital to prevent resource exhaustion from slow or misbehaving clients, so set client_header_timeout and client_body_timeout to values that match your slowest legitimate clients.

http {
    # ... other http directives ...

    client_max_body_size 100M; # Example: Allow up to 100MB for uploads
    client_header_timeout 10s;
    client_body_timeout 60s;
    send_timeout 60s;
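    lingering_close off; # Closes without draining unread client data; the Nginx docs advise against this in normal use, so test before keeping it
    lingering_time 30s;  # Has no effect while lingering_close is off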

    # ... server blocks ...
}

Keep-Alive Connections

Enabling keep-alive connections significantly reduces the overhead of establishing new TCP connections for subsequent requests from the same client. The keepalive_timeout directive controls how long an idle keep-alive connection remains open. Typical values range from about 15 seconds up to the 75-second default.

http {
    # ... other http directives ...

    keepalive_timeout 65; # Default is 75, can be tuned
    keepalive_requests 1000; # Max requests per keep-alive connection

    # ... server blocks ...
}

Gzip Compression

Gzip compression can drastically reduce the size of responses sent to clients, improving page load times. Ensure it’s enabled and configured appropriately for your content types.

http {
    # ... other http directives ...

    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6; # Compression level (1-9)
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript image/svg+xml;

    # ... server blocks ...
}
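
To get a feel for what gzip_comp_level trades off, the following Python sketch compresses a synthetic JSON payload at a few levels using the standard library. The payload is invented for illustration. Python's gzip module is DEFLATE-based like Nginx's gzip module, but the exact byte counts will differ from Nginx output.

```python
import gzip
import json

# Hypothetical payload: a repetitive JSON product list, similar in shape to an API response.
payload = json.dumps(
    [{"id": i, "title": "Sample product", "price": "19.99"} for i in range(500)]
).encode("utf-8")


def compressed_size(data: bytes, level: int) -> int:
    """Return the gzip-compressed size of data at the given compression level (1-9)."""
    return len(gzip.compress(data, compresslevel=level))


if __name__ == "__main__":
    print("raw bytes:", len(payload))
    for level in (1, 6, 9):
        print(f"level {level}: {compressed_size(payload, level)} bytes")
```

Higher levels cost more CPU per response for progressively smaller gains, which is why a mid-range level such as 6 is a common starting point.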

Gunicorn/PHP-FPM: The Application Server Layer

The choice between Gunicorn (for Python/Django/Flask) and PHP-FPM (for PHP) dictates the application server configuration. Both aim to manage worker processes that execute your application code.

Gunicorn Tuning (Python)

Gunicorn’s performance is heavily influenced by its worker type and the number of workers. For I/O-bound applications, the asynchronous gevent or eventlet worker types are preferred. The number of workers is typically set to (2 * number_of_cores) + 1 as a starting point, then adjusted based on profiling under realistic load.

# Example Gunicorn command line
gunicorn --workers 3 \
         --worker-class gevent \
         --bind 0.0.0.0:8000 \
         --timeout 120 \
         your_project.wsgi:application

Key Gunicorn Directives:

--workers: Number of worker processes.
--worker-class: Type of worker (sync, gthread, eventlet, gevent, tornado, or a third-party class such as uvicorn.workers.UvicornWorker).
--bind: Address and port to bind to.
--timeout: Worker timeout in seconds.
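--threads: Number of threads per worker. Applies to the gthread worker class; setting a value above 1 with the default sync class switches the worker to gthread.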
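
The same options can live in a Python configuration file, which lets the worker count follow the (2 * number_of_cores) + 1 rule automatically. This is a minimal sketch: the file name gunicorn.conf.py is an assumption, the values mirror the command line shown earlier, and the gevent worker class requires the gevent package to be installed.

```python
# gunicorn.conf.py (assumed file name), loaded with:
#   gunicorn -c gunicorn.conf.py your_project.wsgi:application
import multiprocessing

# Starting point only: (2 * cores) + 1, to be adjusted after profiling.
workers = (2 * multiprocessing.cpu_count()) + 1

# Asynchronous worker for I/O-bound applications; requires the gevent package.
worker_class = "gevent"

bind = "0.0.0.0:8000"

# Seconds a worker may stay silent before Gunicorn kills and restarts it.
timeout = 120
```

Computing the worker count from the core count keeps the setting sensible when the same file is deployed to differently sized EC2 instances.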

PHP-FPM Tuning

PHP-FPM configuration is managed in php-fpm.conf and pool configuration files (e.g., www.conf). The pm (process manager) settings are critical. For production, pm = dynamic or pm = ondemand are common choices.

; Example php-fpm pool configuration (e.g., /etc/php/8.1/fpm/pool.d/www.conf)
[www]
user = www-data
group = www-data
listen = /run/php/php8.1-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660

pm = dynamic
pm.max_children = 50       ; Max number of children at any one time
pm.start_servers = 5       ; Children created at startup (pm = dynamic only)
pm.min_spare_servers = 2   ; Min number of idle children kept ready
pm.max_spare_servers = 10  ; Max number of idle children kept alive
pm.process_idle_timeout = 10s ; Idle time before a child is killed (pm = ondemand only)
pm.max_requests = 500      ; Max requests before respawning a child

Tuning Considerations for PHP-FPM:

pm.max_children: This is the most critical setting. It should be set based on available RAM. A common formula is max_children = (Total RAM - RAM for OS/other services) / Average RAM per PHP process.
pm.start_servers, pm.min_spare_servers, pm.max_spare_servers: These control how PHP-FPM scales dynamically.
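pm.process_idle_timeout: Applies only when pm = ondemand, where it frees resources by killing children that have been idle for the given time. It is ignored with pm = dynamic.
pm.max_requests: Limits the impact of memory leaks by respawning each worker after it has served a set number of requests.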
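
The pm.max_children formula above is simple enough to script. The sketch below is illustrative only: the function name is hypothetical and the memory figures are assumed, so measure the real per-process footprint on your own servers (for example with ps or smem) before relying on the result.

```python
def fpm_max_children(total_ram_mb: int, reserved_mb: int, avg_process_mb: int) -> int:
    """Estimate pm.max_children from available RAM and average PHP-FPM process size."""
    if avg_process_mb <= 0:
        raise ValueError("avg_process_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    # Always allow at least one child, even if the reserve exceeds total RAM.
    return max(usable_mb // avg_process_mb, 1)


if __name__ == "__main__":
    # Assumed figures: 8 GiB instance, 2 GiB reserved for the OS and Nginx, 64 MiB per PHP process.
    print(fpm_max_children(8192, 2048, 64))  # prints 96
```

It is safer to round down and leave headroom, since a pool that exhausts RAM pushes the instance into swap and degrades every request.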

MySQL RDS Tuning for Shopify Workloads

Shopify applications, especially those with custom apps or heavy data manipulation, can put significant load on the database. AWS RDS (Relational Database Service) offers managed MySQL instances, and tuning is key.

Parameter Groups

AWS RDS uses Parameter Groups to manage database engine configuration. You’ll need to create a custom parameter group to modify these settings. Key parameters to consider:

-- Example parameters to tune in RDS Custom Parameter Group
-- innodb_buffer_pool_size: Crucial for InnoDB performance. Set to 50-75% of instance RAM.
-- innodb_log_file_size: Larger values can improve write performance but increase recovery time. On MySQL 8.0.30 and later, innodb_redo_log_capacity replaces this setting.
-- innodb_flush_log_at_trx_commit: Set to 2 for better performance at a slight risk of data loss on crash. 1 is ACID compliant.
-- max_connections: Adjust based on application needs and instance size.
-- query_cache_size: Not applicable on MySQL 8.0+, where the query cache was removed. On 5.7 it is usually best left disabled due to contention issues.
-- tmp_table_size & max_heap_table_size: Affects performance of complex queries with temporary tables.
-- sort_buffer_size & join_buffer_size: Per-connection buffers, tune cautiously.

Tuning innodb_buffer_pool_size: This is arguably the most important parameter, because the buffer pool caches table data and indexes in memory. For a dedicated RDS instance, setting it to 70-75% of the instance’s RAM is a common best practice, and the default RDS parameter group already uses the formula {DBInstanceClassMemory*3/4}. For example, on a db.r5.large (16 GiB RAM), you might set it to around 11-12 GiB.

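-- Example: innodb_buffer_pool_size value for an RDS custom parameter group.
-- RDS does not grant the privilege that SET GLOBAL requires here, so set the value (in bytes) in the parameter group.
innodb_buffer_pool_size = 12884901888 -- ~12 GiB (12 * 1024 * 1024 * 1024)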
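
Parameter groups take this value in bytes, so a small helper avoids arithmetic slips when sizing for different instance classes. This is a sketch with a hypothetical function name. It applies only the percentage rule discussed above and does not account for MySQL rounding the value to a multiple of its buffer pool chunk size.

```python
GIB = 1024 ** 3


def buffer_pool_bytes(instance_ram_gib: int, percent: int = 75) -> int:
    """Return a buffer pool size in bytes as a percentage of instance RAM."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer arithmetic keeps the result exact for large byte counts.
    return instance_ram_gib * GIB * percent // 100


if __name__ == "__main__":
    # 75% of a 16 GiB instance.
    print(buffer_pool_bytes(16))  # prints 12884901888
```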

Read Replicas and Sharding

For read-heavy Shopify workloads, implementing RDS Read Replicas is a straightforward way to offload read traffic from the primary instance. Replication is asynchronous, so send only reads that tolerate slightly stale data to the replicas. For extremely high write volumes or very large datasets, consider database sharding, though this adds significant architectural complexity.

Slow Query Analysis and Optimization

Regularly analyze slow queries to identify bottlenecks. Enable the slow query log in your RDS parameter group and use tools like pt-query-digest or MySQL’s built-in Performance Schema to pinpoint problematic queries. Optimize these queries by adding suitable indexes, rewriting them, or denormalizing data where the read savings justify the extra write cost.

-- Enable slow query log in RDS Parameter Group
slow_query_log = 1
long_query_time = 2 -- Log queries taking longer than 2 seconds
log_output = FILE
-- Consider using 'TABLE' for easier querying via SQL, but it can have performance impact.

# Example of analyzing slow queries with pt-query-digest (download the slow log from RDS first, e.g. to an EC2 instance)
pt-query-digest /path/to/mysql-slow.log > /path/to/report.txt
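
To illustrate the kind of grouping such a report performs, here is a deliberately simplified Python sketch that totals Query_time per normalized statement. It is not a substitute for pt-query-digest: the sample log text is invented, the fingerprinting is crude, and only single-line statements are handled.

```python
import re
from collections import defaultdict

# Invented sample in the general shape of a MySQL slow query log.
SAMPLE_LOG = """\
# Time: 2024-01-01T00:00:01.000000Z
# User@Host: app[app] @  [10.0.0.5]  Id:    12
# Query_time: 3.500000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 500000
SET timestamp=1704067201;
SELECT * FROM orders WHERE customer_email = 'a@example.com';
# Time: 2024-01-01T00:00:05.000000Z
# User@Host: app[app] @  [10.0.0.5]  Id:    13
# Query_time: 2.250000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 120000
SET timestamp=1704067205;
SELECT * FROM orders WHERE customer_email = 'b@example.com';
"""

QUERY_TIME = re.compile(r"^# Query_time: (\d+\.\d+)")


def fingerprint(sql: str) -> str:
    """Replace literals with '?' so similar statements group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted strings
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare numbers
    return " ".join(sql.split()).lower()


def summarize(log_text: str) -> dict:
    """Map each statement fingerprint to [count, total_query_time_seconds]."""
    totals = defaultdict(lambda: [0, 0.0])
    pending_time = None
    for line in log_text.splitlines():
        match = QUERY_TIME.match(line)
        if match:
            pending_time = float(match.group(1))
            continue
        if pending_time is None or line.startswith("#") or line.startswith("SET timestamp"):
            continue
        entry = totals[fingerprint(line)]
        entry[0] += 1
        entry[1] += pending_time
        pending_time = None  # Only the first statement line after a header is counted.
    return dict(totals)


if __name__ == "__main__":
    for statement, (count, total) in summarize(SAMPLE_LOG).items():
        print(f"{count}x {total:.2f}s {statement}")
```

Statements that dominate total time, rather than the single slowest query, are usually the best candidates for indexing or rewriting.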

Connection Pooling

Ensure your application uses connection pooling. For Python applications running under Gunicorn, libraries like SQLAlchemy provide robust connection pooling within each worker process. PHP's PDO does not pool connections; it offers persistent connections (PDO::ATTR_PERSISTENT) that are reused within each PHP-FPM worker. For high-traffic sites, consider an external pooling layer such as ProxySQL, MaxScale, or Amazon RDS Proxy if direct RDS connections become a bottleneck.
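
Because each application worker process typically keeps its own pool, the worst-case connection count multiplies quickly. The sketch below checks that total against max_connections. The function names and fleet sizes are hypothetical; pool_size and max_overflow refer to the SQLAlchemy pool settings of the same names, whose defaults are 5 and 10.

```python
def total_db_connections(app_hosts: int, workers_per_host: int,
                         pool_size: int, max_overflow: int) -> int:
    """Worst-case connections if every worker's pool is fully used."""
    return app_hosts * workers_per_host * (pool_size + max_overflow)


def fits_budget(total: int, max_connections: int, headroom_pct: int = 20) -> bool:
    """True if total stays within max_connections minus a reserve for admin and monitoring."""
    return total <= max_connections * (100 - headroom_pct) // 100


if __name__ == "__main__":
    # Assumed fleet: 4 app hosts, 9 Gunicorn workers each, SQLAlchemy default pool settings.
    total = total_db_connections(4, 9, pool_size=5, max_overflow=10)
    print(total)                     # prints 540
    print(fits_budget(total, 1000))  # prints True
```

If the check fails, shrinking the per-worker pool or adding an external pooling layer is usually cheaper than moving to a larger RDS instance class.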

AWS Infrastructure Considerations

The underlying AWS infrastructure plays a crucial role. Choosing the right EC2 instance types for Nginx/application servers and appropriate RDS instance sizes is fundamental. Utilize Elastic Load Balancers (ELBs) for distributing traffic and Auto Scaling Groups to dynamically adjust capacity.

Instance Sizing and Type Selection

For Nginx and application servers (EC2), consider compute-optimized (C-series) or memory-optimized (R-series) instances based on whether your bottleneck is CPU or memory. For RDS, memory-optimized instances (R-series) are generally preferred due to the importance of the buffer pool.

Network Throughput

Ensure your EC2 instances have sufficient network bandwidth. Instances with “Enhanced Networking” (available on most modern instance types) provide higher packets-per-second (PPS) performance and lower inter-instance latency. For RDS, network performance is tied to the instance class and EBS volume type.

EBS Volume Types for RDS

For RDS, use gp3 (General Purpose SSD) or io1/io2 (Provisioned IOPS SSD) volumes. gp3 offers a good balance of cost and performance, allowing independent scaling of IOPS and throughput. io1/io2 are for I/O-intensive workloads where consistent high performance is critical.

Monitoring and Alerting

Continuous monitoring is non-negotiable. Utilize AWS CloudWatch for metrics on EC2 (CPU utilization, network I/O), RDS (CPU utilization, IOPS, connections, latency), and ELB (request count, latency, healthy hosts). Set up alarms for critical thresholds.

# Example CloudWatch Alarms to consider:
# EC2: High CPU Utilization (>80% for 5 mins)
# EC2: Low Disk Space (<10% free; requires the CloudWatch agent, since disk usage is not a default EC2 metric)
# RDS: High CPU Utilization (>80% for 5 mins)
# RDS: High Database Connections (>90% of max_connections)
# RDS: High Read/Write Latency (>100ms for 5 mins)
# ELB: High Request Latency (>2s for 5 mins)
# ELB: Low Healthy Host Count (<2 for 5 mins)
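
As one way to turn the CPU entries in this list into real alarms, the sketch below builds the keyword arguments for the CloudWatch PutMetricAlarm API. The helper name, alarm naming scheme, and resource identifiers are hypothetical placeholders. The code only constructs dictionaries and makes no AWS calls.

```python
def cpu_alarm_params(namespace: str, dimension_name: str, resource_id: str,
                     threshold: float = 80.0) -> dict:
    """Build PutMetricAlarm arguments for 'CPU above threshold for 5 minutes'."""
    return {
        "AlarmName": f"{resource_id}-high-cpu",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": dimension_name, "Value": resource_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluation window
        "EvaluationPeriods": 1,   # one 5-minute window above the threshold triggers the alarm
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


if __name__ == "__main__":
    # Placeholder identifiers; substitute your own instance ID and DB identifier.
    ec2_alarm = cpu_alarm_params("AWS/EC2", "InstanceId", "i-0123456789abcdef0")
    rds_alarm = cpu_alarm_params("AWS/RDS", "DBInstanceIdentifier", "shop-db")
    print(ec2_alarm["AlarmName"], rds_alarm["AlarmName"])
```

With boto3 installed and credentials configured, each dictionary can be passed as boto3.client("cloudwatch").put_metric_alarm(**params). An AlarmActions entry pointing at an SNS topic would normally be added so the alarm notifies someone.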

Integrate application-level monitoring (e.g., Sentry, New Relic) to capture errors and performance issues within your Shopify application code itself. Correlating infrastructure metrics with application performance is key to effective troubleshooting.