Server Monitoring Best Practices: Keeping Your Shopify App and PostgreSQL Clusters Alive on AWS
Establishing a Robust Monitoring Foundation with AWS CloudWatch
For any production Shopify application, especially those with a PostgreSQL backend on AWS, a comprehensive monitoring strategy is non-negotiable. AWS CloudWatch serves as the foundational layer for collecting and tracking metrics, collecting log files, and setting alarms. We’ll focus on key metrics for both EC2 instances hosting your application and RDS instances for your PostgreSQL cluster.
EC2 Instance Metrics for Application Servers
Your Shopify app likely runs on EC2 instances. Essential metrics to monitor include CPU utilization, network in/out, disk read/write operations, and disk-free space. High CPU can indicate inefficient code or increased traffic. Network saturation can bottleneck requests. Disk I/O and space are critical for application performance and stability.
Beyond the default CloudWatch metrics, consider installing the CloudWatch Agent to collect custom metrics and logs. This allows for deeper insights into application-specific performance. For instance, you might want to track the number of active HTTP requests or the latency of specific API endpoints.
Configuring the CloudWatch Agent for Custom Metrics and Logs
The CloudWatch agent can be configured via a JSON file. Here’s an example for collecting Nginx access logs and a custom application metric (e.g., a counter for failed API requests). Ensure the agent is installed and running on your EC2 instances.
First, create the configuration file, typically located at /opt/aws/amazon-cloudwatch-agent/bin/config.json.
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"metrics": {
"namespace": "ShopifyApp/EC2",
"metrics_collected": {
"cpu": {
"measurement": [
"cpu_usage_idle",
"cpu_usage_iowait",
"cpu_usage_user",
"cpu_usage_system"
],
"totalcpu_time_metrics": {
"measurement": [
"cpu_usage_total"
],
"unit": "Percent"
}
},
"disk": {
"measurement": [
"used_percent",
"inodes_free"
],
"resources": [
"dev/xvda1"
],
"ignore_file_system_types": [
"sysfs",
"devtmpfs",
"tmpfs",
"devfs",
"iso9660",
"overlay",
"nfs",
"cgroup"
]
},
"mem": {
"measurement": [
"mem_used_percent"
]
},
"net": {
"measurement": [
"bytes_sent",
"bytes_recv",
"packets_sent",
"packets_recv"
]
},
"statsd": {
"service_address": "udp:localhost:8125",
"metrics_collection_interval": 60
}
}
},
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "ShopifyApp/Nginx/Access",
"log_stream_name": "{instance_id}/access.log",
"timezone": "UTC"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "ShopifyApp/Nginx/Error",
"log_stream_name": "{instance_id}/error.log",
"timezone": "UTC"
}
]
}
}
}
}
After saving the configuration, start the agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
To send custom application metrics (e.g., from a PHP application using StatsD), ensure your application is configured to send metrics to udp://localhost:8125. For example, in PHP, you might use a library like php-statsd.
RDS PostgreSQL Cluster Metrics
For your PostgreSQL cluster hosted on Amazon RDS, CloudWatch provides a rich set of metrics. Key indicators include CPU Utilization, Database Connections, Read/Write IOPS, Read/Write Latency, Freeable Memory, and Disk Queue Depth. High connection counts can lead to performance degradation, while low freeable memory can cause excessive swapping. Monitoring IOPS and latency is crucial for identifying storage bottlenecks.
Essential RDS Metrics to Monitor
- CPUUtilization: Percentage of active CPU capacity.
- DatabaseConnections: Number of active database connections.
- ReadIOPS, WriteIOPS: Number of read/write operations per second.
- ReadLatency, WriteLatency: Average time taken for read/write operations.
- FreeableMemory: Amount of RAM available for caching and buffers.
- DiskQueueDepth: Number of outstanding I/O requests waiting to be processed.
- EngineUptime: Uptime of the database engine.
- AuroraConnections (if using Aurora PostgreSQL): Number of active connections to the Aurora cluster.
Setting Up CloudWatch Alarms
Proactive alerting is paramount. Configure CloudWatch Alarms to notify your team when critical thresholds are breached. This allows for timely intervention before issues impact users.
Example Alarms for EC2 Instances
These alarms can be set up via the AWS Management Console or using Infrastructure as Code tools like Terraform or CloudFormation.
Alarm: High CPU Utilization on Application Servers
- Metric:
CPUUtilization(under namespaceAWS/EC2) - Statistic:
Average - Period:
5 Minutes - Threshold Type:
Static - Condition:
> 85% - Evaluation Periods:
3(meaning the condition must be met for 15 consecutive minutes) - Actions: Send notification to SNS topic (e.g., for Slack or PagerDuty integration).
Alarm: Low Disk Space on Application Servers
- Metric:
used_percent(custom metric from CloudWatch Agent, under namespaceShopifyApp/EC2) - Statistic:
Average - Period:
1 Hour - Threshold Type:
Static - Condition:
> 90% - Evaluation Periods:
1 - Actions: Send notification to SNS topic.
Example Alarms for RDS PostgreSQL Clusters
Alarm: High Database Connections
- Metric:
DatabaseConnections(under namespaceAWS/RDS) - Statistic:
Maximum - Period:
5 Minutes - Threshold Type:
Static - Condition:
> 80% of max_connections(You’ll need to know yourmax_connectionssetting for your RDS instance. For example, ifmax_connectionsis 500, set the threshold to 400). - Evaluation Periods:
2 - Actions: Send notification to SNS topic.
Alarm: Low Freeable Memory
- Metric:
FreeableMemory(under namespaceAWS/RDS) - Statistic:
Average - Period:
5 Minutes - Threshold Type:
Static - Condition:
< 500MB(Adjust based on your instance size and workload) - Evaluation Periods:
3 - Actions: Send notification to SNS topic.
Advanced PostgreSQL Performance Tuning and Monitoring
While CloudWatch provides essential infrastructure-level metrics, deep PostgreSQL performance analysis requires delving into the database itself. This involves understanding query performance, connection pooling, and vacuuming.
Leveraging `pg_stat_statements`
The pg_stat_statements extension is invaluable for identifying slow or frequently executed queries. It tracks execution statistics for all SQL statements executed by the server.
Enabling and Configuring `pg_stat_statements`
First, ensure the extension is installed and enabled. This is typically done in your postgresql.conf (or via RDS parameter groups). You’ll need to set shared_preload_libraries and then restart PostgreSQL.
# In your postgresql.conf or RDS Parameter Group shared_preload_libraries = 'pg_stat_statements' pg_stat_statements.max = 10000 # Number of statements to track pg_stat_statements.track = all # Track all statements pg_stat_statements.track_utility = off # Exclude utility commands pg_stat_statements.track_planning = off # Track planning time if needed
After restarting, create the extension in your database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Now, you can query the pg_stat_statements view to find problematic queries. Sort by total_time to find the most time-consuming queries, or by calls to find the most frequent ones.
-- Top 10 most time-consuming queries
SELECT
query,
calls,
total_time,
rows,
mean_time,
stddev_time
FROM
pg_stat_statements
ORDER BY
total_time DESC
LIMIT 10;
-- Top 10 most frequently executed queries
SELECT
query,
calls,
total_time,
rows,
mean_time,
stddev_time
FROM
pg_stat_statements
ORDER BY
calls DESC
LIMIT 10;
To integrate these insights into your monitoring, you can periodically run these queries and push the results as custom metrics to CloudWatch using the CloudWatch Agent or a custom script. For example, a Python script could query pg_stat_statements and use the AWS SDK to publish metrics.
Connection Pooling with PgBouncer
High connection counts to PostgreSQL can exhaust server resources. PgBouncer is a lightweight connection pooler that sits between your application and the database, managing a pool of connections and reusing them for multiple application requests. This significantly reduces the overhead of establishing new connections.
Setting up and Configuring PgBouncer
Install PgBouncer on your application servers or a dedicated instance. The primary configuration files are pgbouncer.ini and userlist.txt.
[database] ; Database connection string. ; Example: "dbname=mydb host=localhost port=5432" ; For RDS, you'll typically connect to the RDS endpoint. ; Ensure your RDS instance allows connections from the PgBouncer host. ; Example for RDS: ; host=your-rds-endpoint.region.rds.amazonaws.com port=5432 dbname=your_db_name [pgbouncer] listen_addr = 0.0.0.0:6432 listen_port = 6432 auth_type = md5 auth_file = /etc/pgbouncer/userlist.txt pool_mode = session ; or transaction, depending on your app's needs max_client_conn = 1000 ; Max connections from clients (your app servers) default_pool_size = 20 ; Default pool size per database min_pool_size = 5 pool_timeout = 300 ; seconds server_idle_timeout = 60 ; seconds log_connections = 0 log_disconnections = 0 log_pooler_errors = 1 pidfile = /var/run/pgbouncer/pgbouncer.pid logfile = /var/log/pgbouncer/pgbouncer.log
The userlist.txt file contains credentials for connecting to PostgreSQL. PgBouncer uses these to establish its pool.
"your_db_user" "md5_hashed_password_for_your_db_user"
You’ll need to generate the MD5 hash of the password. Then, start PgBouncer and configure your application to connect to your_pgbouncer_host:6432 instead of the direct PostgreSQL endpoint.
Monitoring PgBouncer
PgBouncer exposes its status via a special database named pgbouncer. You can connect to this database (using psql or any SQL client) to get statistics.
-- Connect to the pgbouncer database: -- psql -h localhost -p 6432 -U your_pgbouncer_admin_user -d pgbouncer -- Show pool statistics SHOW POOLS; -- Show server statistics SHOW SERVERS; -- Show client statistics SHOW CLIENTS; -- Show configuration SHOW CONFIG; -- Show database list SHOW DATABASES;
You can script these queries to collect metrics like pool_size, current_transactions, avg_wait, and max_wait. These can then be sent to CloudWatch as custom metrics. For example, a simple shell script could execute these queries and parse the output.
Automated Backups and Disaster Recovery
While not strictly “monitoring,” robust backup and recovery strategies are essential for keeping your cluster alive. AWS RDS handles automated backups for you, but understanding and testing your recovery process is critical.
RDS Automated Backups
RDS automatically performs a full, incremental backup of your DB instance daily during your configured backup window. Transaction logs are backed up every five minutes, enabling point-in-time recovery (PITR) to any second within your retention period (up to 35 days).
Point-in-Time Recovery (PITR)
In the event of accidental data deletion or corruption, PITR is your best friend. You can restore your DB instance to a specific point in time. This process creates a new RDS instance, leaving your original instance intact.
# Example using AWS CLI to restore to a specific time
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier your-db-instance-id \
--target-db-instance-identifier your-db-instance-id-restored \
--restore-time "2023-10-27T10:30:00Z" \
--db-subnet-group-name your-db-subnet-group \
--vpc-security-group-ids sg-xxxxxxxxxxxxxxxxx \
--db-instance-class db.r5.large \
--tags Key=Environment,Value=Staging Key=RestoredFrom,Value=Production
It’s crucial to regularly test your PITR process by restoring to a staging environment to ensure data integrity and that your application can connect to the restored database. This validates your recovery time objectives (RTO) and recovery point objectives (RPO).
Proactive Health Checks and Maintenance
Beyond reactive monitoring, implementing proactive health checks and scheduled maintenance can prevent many common issues.
PostgreSQL Vacuuming and Autovacuum
PostgreSQL uses MVCC (Multi-Version Concurrency Control), which means that when rows are updated or deleted, the old versions are not immediately removed. The VACUUM command reclaims storage occupied by dead tuples and can prevent table bloat. Autovacuum is a background process that handles this automatically.
Ensure autovacuum is enabled and properly tuned. Default settings might not be sufficient for high-throughput applications. Monitor autovacuum activity using pg_stat_activity and pg_stat_user_tables.
-- Check autovacuum status
SELECT
datname,
pid,
usename,
query,
state,
now() - query_start AS duration
FROM
pg_stat_activity
WHERE
query LIKE 'autovacuum:%'
ORDER BY
duration DESC;
-- Check table bloat
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
pg_size_pretty(pg_table_size(oid)) AS table_size,
pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
CASE
WHEN n_live_tup <> 0 THEN (n_dead_tup::numeric / n_live_tup) * 100
ELSE 0
END AS dead_tuple_ratio
FROM
pg_stat_user_tables
WHERE
n_live_tup > 0 AND n_dead_tup > 0
ORDER BY
dead_tuple_ratio DESC
LIMIT 20;
If autovacuum is struggling, you might need to adjust parameters like autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, autovacuum_analyze_threshold, and autovacuum_analyze_scale_factor in your RDS parameter group. For very large tables or specific workloads, manual `VACUUM FULL` (which requires an exclusive lock) might be necessary during maintenance windows, but this should be done with extreme caution.
Regularly Reviewing Logs
Both application logs (e.g., Nginx, your Shopify app’s logs) and PostgreSQL logs are critical for diagnosing issues. Ensure logs are being collected by CloudWatch Logs and that you have a strategy for analyzing them. Consider using CloudWatch Logs Insights for powerful querying capabilities.
Example CloudWatch Logs Insights Query for PostgreSQL Errors
fields @timestamp, @message | filter @message like /ERROR/ or @message like /FATAL/ or @message like /PANIC/ | sort @timestamp desc | limit 50
Regularly scheduled log reviews, even if automated alerts haven’t fired, can uncover subtle issues before they escalate. This includes looking for repeated warnings, unusual error patterns, or performance degradation hints.