Server Monitoring Best Practices: Keeping Your Shopify App and Redis Clusters Alive on AWS
Establishing a Robust Monitoring Foundation with AWS CloudWatch
For any production Shopify app, especially one leveraging external services like Redis, a comprehensive monitoring strategy is non-negotiable. AWS CloudWatch serves as the foundational layer for this, providing metrics, logs, and alarms. We’ll focus on key metrics for EC2 instances hosting your app and ElastiCache for Redis, ensuring proactive issue detection and rapid response.
Monitoring EC2 Instances for Your Shopify App Backend
Your application servers, typically running on EC2, are the heart of your Shopify app. Monitoring their health and performance is paramount. We’ll configure CloudWatch to collect essential metrics and set up alarms for critical thresholds.
Key EC2 Metrics to Track
- CPU Utilization: High CPU can indicate inefficient code, traffic spikes, or resource contention.
- Memory Utilization: Crucial for application performance. While CloudWatch doesn’t natively collect memory metrics from EC2 instances without an agent, we’ll address this.
- Network In/Out: High network traffic can signal increased user activity or potential DDoS attacks.
- Disk Read/Write Operations & Bytes: Important for identifying I/O bottlenecks.
- Disk Queue Length: A sustained high queue length indicates the disk can’t keep up with demand.
- Status Checks (System & Instance): CloudWatch automatically monitors these. A failed status check requires immediate investigation.
Collecting Memory Metrics with the CloudWatch Agent
To get memory utilization metrics, you need to install and configure the CloudWatch agent on your EC2 instances. This involves creating a configuration file and ensuring the agent runs as a service.
CloudWatch Agent Configuration (Amazon Linux 2 Example)
First, install the agent:
sudo yum install amazon-cloudwatch-agent -y
Next, create the configuration file. A common location is /opt/aws/amazon-cloudwatch-agent/bin/config.json, which is where the agent's configuration wizard writes its output. This configuration collects system-level metrics, including memory.
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"metrics": {
"namespace": "ShopifyApp/EC2",
"metrics_collected": {
"cpu": {
"measurement": [
"cpu_usage_idle",
"cpu_usage_iowait",
"cpu_usage_user",
"cpu_usage_system"
],
"totalcpu": true
},
"disk": {
"measurement": [
"used_percent",
"inodes_free"
],
"resources": [
"/"
],
"ignore_file_system_types": [
"sysfs",
"devtmpfs",
"tmpfs",
"devpts",
"nfs",
"local"
]
},
"mem": {
"measurement": [
"mem_used_percent"
]
},
"swap": {
"measurement": [
"swap_used_percent"
]
},
"netstat": {
"measurement": [
"tcp_established",
"tcp_syn_sent",
"tcp_close_wait"
]
},
"statsd": {
"service_address": ":8125"
},
"diskio": {
"measurement": [
"read_bytes",
"write_bytes",
"reads",
"writes"
],
"resources": [
"nvme0n1",
"nvme1n1"
]
}
}
}
}
Start the agent with this configuration:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
Verify the agent is running:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
Setting Up CloudWatch Alarms for EC2
Once metrics are flowing, create alarms in CloudWatch. Navigate to the CloudWatch console, select “Alarms,” and then “Create alarm.”
Example Alarm Configurations
- High CPU Utilization:
  - Metric: CPUUtilization in the AWS/EC2 namespace (the agent itself publishes cpu_usage_* metrics, such as cpu_usage_user and cpu_usage_system, under ShopifyApp/EC2)
  - Threshold: Greater than 85% for 15 minutes.
  - Action: Send notification to an SNS topic (e.g., for PagerDuty or Slack integration).
- High Memory Utilization:
  - Metric: mem_used_percent in the ShopifyApp/EC2 namespace
  - Threshold: Greater than 85% (that is, less than 15% free) for 10 minutes.
  - Action: SNS notification.
- High Disk Queue Length:
  - Metric: VolumeQueueLength in the AWS/EBS namespace, for the EBS volume backing the relevant disk (the Linux agent does not publish a DiskQueueLength metric)
  - Threshold: Greater than 2 for 5 minutes.
  - Action: SNS notification.
- Instance Status Check Failed:
  - Metric: StatusCheckFailed_System or StatusCheckFailed_Instance in the AWS/EC2 namespace
  - Threshold: Greater than 0 for 1 minute.
  - Action: SNS notification.
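The alarm settings above can also be kept as code. The following is a minimal sketch, assuming Python with boto3 installed on the machine that applies the alarms. The topic ARN, instance ID, host name, and alarm names are hypothetical placeholders, and build_alarm is an illustrative helper, not part of any AWS SDK. It also assumes the agent configuration shown earlier, which sets no append_dimensions, so agent metrics carry the agent's default host dimension.

```python
from __future__ import annotations

from typing import Any


def build_alarm(name: str, namespace: str, metric: str, threshold: float,
                period_s: int, periods: int, dimensions: dict[str, str],
                sns_topic_arn: str, statistic: str = "Average") -> dict[str, Any]:
    """Return keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": statistic,
        "Period": period_s,                # seconds per evaluation period
        "EvaluationPeriods": periods,      # consecutive periods that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical identifiers: replace with your own values.
TOPIC = "arn:aws:sns:us-east-1:123456789012:ShopifyApp-Production-Alerts"
INSTANCE = {"InstanceId": "i-0123456789abcdef0"}
AGENT_HOST = {"host": "ip-10-0-0-12.ec2.internal"}

EC2_ALARMS = [
    # CPU above 85% for 15 minutes (3 periods of 5 minutes).
    build_alarm("shopify-app-high-cpu", "AWS/EC2", "CPUUtilization",
                85.0, 300, 3, INSTANCE, TOPIC),
    # Memory used above 85% for 10 minutes (agent metric, custom namespace).
    build_alarm("shopify-app-high-memory", "ShopifyApp/EC2", "mem_used_percent",
                85.0, 300, 2, AGENT_HOST, TOPIC),
    # Any failed system status check within 1 minute.
    build_alarm("shopify-app-status-check", "AWS/EC2", "StatusCheckFailed_System",
                0.0, 60, 1, INSTANCE, TOPIC, statistic="Maximum"),
]


def apply_alarms(alarms: list[dict[str, Any]]) -> None:
    """Create or update each alarm. Requires AWS credentials and boto3."""
    import boto3  # imported lazily so the definitions can be built without the SDK

    cloudwatch = boto3.client("cloudwatch")
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
```

Keeping the definitions separate from the API call makes them easy to review and check without AWS credentials. Calling apply_alarms(EC2_ALARMS) from a host with suitable IAM permissions would create or update the alarms.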
Monitoring AWS ElastiCache for Redis Clusters
Redis is a critical component for caching and session management in many Shopify apps. ElastiCache provides managed Redis, but monitoring its performance and availability is still your responsibility.
Key ElastiCache for Redis Metrics
- Engine CPU Utilization: High CPU on Redis nodes can lead to slow responses or timeouts.
- Cache Hits/Misses: A low cache hit ratio indicates ineffective caching or insufficient memory.
- Evictions: High eviction rates mean Redis is running out of memory and discarding keys, impacting performance.
- Curr Connections: Monitor for unexpected spikes or sustained high connection counts.
- Replication Lag: For read replicas, ensure lag is minimal to maintain data consistency.
- Network Bytes In/Out: Similar to EC2, tracks traffic to/from the Redis nodes.
- Freeable Memory: Crucial for understanding available memory headroom.
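The cache hit ratio mentioned above is derived from two counters: hits divided by the sum of hits and misses. As a small sketch, assuming you have already fetched the CacheHits and CacheMisses sums for the same time window (the function name is illustrative):

```python
from __future__ import annotations

from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None  # no traffic in the window, so the ratio is undefined
    return hits / total


# Example: 9,500 hits and 500 misses in the same window.
ratio = cache_hit_ratio(9500, 500)
print(ratio)  # 0.95
```

Returning None for an idle window avoids treating a quiet period as a 0% hit ratio, which would otherwise trigger a misleading alert.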
Setting Up CloudWatch Alarms for ElastiCache
ElastiCache automatically publishes metrics to CloudWatch. You can create alarms directly on these metrics.
Example ElastiCache Alarm Configurations
- High Redis CPU Utilization:
  - Metric: CPUUtilization (under the AWS/ElastiCache namespace); on node types with 4 or more vCPUs, EngineCPUUtilization reflects the load on the Redis engine thread more accurately
  - Threshold: Greater than 80% for 10 minutes.
  - Action: SNS notification.
- High Evictions:
  - Metric: Evictions
  - Threshold: Greater than 1000 in 5 minutes (adjust based on your dataset size and traffic).
  - Action: SNS notification.
- Low Freeable Memory:
  - Metric: FreeableMemory
  - Threshold: Less than 50 MB for 15 minutes (the metric is reported in bytes, so convert the threshold accordingly).
  - Action: SNS notification.
- High Replication Lag:
  - Metric: ReplicationLag (for replica nodes)
  - Threshold: Greater than 5 seconds for 2 minutes.
  - Action: SNS notification.
- High Connection Count:
  - Metric: CurrConnections
  - Threshold: Greater than 80% of your node’s connection limit for 5 minutes, expressed as an absolute connection count.
  - Action: SNS notification.
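Two of the thresholds above need a small conversion before they can be entered as alarm values, because CloudWatch compares against absolute numbers. The sketch below shows the arithmetic. It assumes "MB" means mebibytes and uses a connection limit of 65,000, which is an assumption to check against the maxclients value for your own engine version and node type.

```python
BYTES_PER_MIB = 1024 * 1024

# Assumed connection limit per node: verify against your cluster's maxclients.
ASSUMED_MAX_CONNECTIONS = 65_000


def freeable_memory_threshold_bytes(mebibytes: int) -> int:
    """Convert a threshold in MiB to bytes, the unit FreeableMemory uses."""
    return mebibytes * BYTES_PER_MIB


def connection_threshold(max_connections: int, percent: int = 80) -> int:
    """Return the absolute connection count at the given percentage of the limit."""
    return max_connections * percent // 100  # integer math avoids float rounding


print(freeable_memory_threshold_bytes(50))            # 52428800
print(connection_threshold(ASSUMED_MAX_CONNECTIONS))  # 52000
```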
Centralized Logging with AWS CloudWatch Logs
Beyond metrics, logs are indispensable for debugging and understanding application behavior. Centralizing logs from your EC2 instances and potentially Redis slow logs (if enabled) into CloudWatch Logs provides a single pane of glass for analysis.
Configuring EC2 Instance Logs for CloudWatch Logs
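The CloudWatch agent can also be configured to tail log files and send them to CloudWatch Logs. Update your config.json file (the same one used for metrics) to include a logs section. In the listing below, the comment line stands in for the metrics_collected block shown earlier; replace it with that block in your real file, because JSON does not allow comments.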
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"metrics": {
"namespace": "ShopifyApp/EC2",
"metrics_collected": {
// ... (previous metrics configuration) ...
}
},
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "ShopifyApp/Nginx/Access",
"log_stream_name": "{instance_id}/nginx_access"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "ShopifyApp/Nginx/Error",
"log_stream_name": "{instance_id}/nginx_error"
},
{
"file_path": "/var/log/your_app.log",
"log_group_name": "ShopifyApp/Application",
"log_stream_name": "{instance_id}/app_log"
}
]
}
}
}
}
After updating the configuration, restart the agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
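Before reloading the agent, it can help to confirm that the edited file is still strict JSON and contains the sections this article relies on. The following is a rough sketch of such a check. The function name and the set of keys it expects in each log entry are conventions of this example, not requirements of the agent.

```python
from __future__ import annotations

import json

# Keys used by every collect_list entry in this article's configuration.
EXPECTED_LOG_KEYS = {"file_path", "log_group_name", "log_stream_name"}


def validate_agent_config(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # Typical cause: a leftover // comment or a trailing comma.
        return [f"not valid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    if not config.get("metrics", {}).get("metrics_collected"):
        problems.append("metrics.metrics_collected is missing or empty")

    entries = (config.get("logs", {}).get("logs_collected", {})
               .get("files", {}).get("collect_list", []))
    if not entries:
        problems.append("logs.logs_collected.files.collect_list is missing or empty")
    for index, entry in enumerate(entries):
        missing = EXPECTED_LOG_KEYS - entry.keys()
        if missing:
            problems.append(f"collect_list[{index}] lacks {sorted(missing)}")
    return problems
```

A typical use is to read the file, pass its contents to validate_agent_config, and only run the fetch-config command when the returned list is empty.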
Leveraging CloudWatch Logs Insights
Once logs are ingested, CloudWatch Logs Insights becomes your primary tool for querying and analyzing them. This is invaluable for debugging specific errors, tracing requests, or identifying performance bottlenecks.
Example Log Insights Queries
- Find all HTTP 5xx errors from Nginx in the last hour:
fields @timestamp, @message | filter @message like /HTTP\/1\.[01]\" 5\d\d/ | sort @timestamp desc | limit 100
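- Count unique client IP addresses per hour (set the query time range to the last 24 hours; this assumes the log line starts with the client IP, as in Nginx’s default access log format):
parse @message /^(?<client_ip>\S+)/ | stats count_distinct(client_ip) as uniqueIps by bin(1h)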
- Analyze application log errors and group by error message:
filter @message like /ERROR/ | stats count(*) as errorCount by @message | sort errorCount desc | limit 50
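These queries can also be run outside the console. The sketch below assumes Python with boto3 and uses the CloudWatch Logs start_query and get_query_results calls. The helper names are illustrative, and the log group is the ShopifyApp/Application group defined in the agent configuration above.

```python
from __future__ import annotations

import time
from typing import Any, Optional

ERROR_QUERY = (
    "filter @message like /ERROR/ "
    "| stats count(*) as errorCount by @message "
    "| sort errorCount desc | limit 50"
)


def build_query_request(log_group: str, query: str, hours: int,
                        now: Optional[float] = None) -> dict[str, Any]:
    """Return start_query arguments covering the last `hours` hours."""
    end = int(time.time() if now is None else now)
    return {
        "logGroupName": log_group,
        "startTime": end - hours * 3600,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


def rows_to_dicts(results: list[list[dict[str, str]]]) -> list[dict[str, str]]:
    """Flatten each result row from [{field, value}, ...] into one dict."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_query(request: dict[str, Any], poll_seconds: float = 1.0) -> list[dict[str, str]]:
    """Start the query and poll until it finishes. Requires AWS credentials."""
    import boto3  # imported lazily so the helpers above work without the SDK

    logs = boto3.client("logs")
    query_id = logs.start_query(**request)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

For example, run_query(build_query_request("ShopifyApp/Application", ERROR_QUERY, hours=1)) would return one dictionary per distinct error message, which is a convenient shape for a scheduled report.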
Proactive Health Checks and Synthetic Monitoring
While CloudWatch metrics and logs are reactive and diagnostic, proactive health checks ensure your application is not only running but also responding correctly to external requests. AWS offers services like Route 53 Health Checks and CloudWatch Synthetics for this purpose.
Route 53 Health Checks
Configure Route 53 health checks to periodically request a specific endpoint on your application (e.g., /health). When an endpoint fails a configured number of consecutive checks, Route 53 marks it unhealthy, and routing policies that use health checks (failover routing, for example) stop returning it in DNS responses, so users are not sent to broken servers. Pair this with load balancer health checks on your Auto Scaling groups so that unhealthy instances are also replaced.
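What the /health endpoint returns matters, because Route 53 HTTP health checks treat 2xx and 3xx status codes as healthy. The following is a minimal sketch using only Python's standard library. The check_dependencies hook and the response body shape are hypothetical, and a real app would normally expose this route through its own web framework.

```python
from __future__ import annotations

import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> dict[str, bool]:
    """Hypothetical hook: replace with real checks (database, Redis, ...)."""
    return {"app": True}


def health_response(checks: dict[str, bool]) -> tuple[int, bytes]:
    """Return (status code, JSON body): 200 if every check passed, else 503."""
    healthy = all(checks.values())
    body = json.dumps({"healthy": healthy, "checks": checks}).encode("utf-8")
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep frequent health-check polling out of the server log


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), HealthHandler)
```

Returning 503 when a dependency check fails lets the health check distinguish "process is up" from "app can actually serve requests".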
CloudWatch Synthetics Canaries
CloudWatch Synthetics lets you deploy canaries: scripts that run on Lambda and simulate user interactions. You can create canaries to:
- Make HTTP requests to your application’s critical endpoints.
- Verify API responses.
- Check if Redis is reachable and responding to PING commands (the canary must run inside a VPC that can reach the cluster).
- Test key user flows (e.g., adding an item to the cart).
These Canaries run on a schedule and publish metrics (success/failure, duration) to CloudWatch, which can then trigger alarms. This provides an external perspective on your application’s availability and performance.
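As an illustration of the Redis reachability check listed above, here is the core probe a canary could wrap. It is a sketch in plain Python, not the Synthetics runtime API, and it assumes a Redis endpoint without AUTH or in-transit encryption; a cluster with either enabled would need credentials or a TLS-wrapped socket. The function names are illustrative.

```python
from __future__ import annotations

import socket
import time


def is_pong(reply: bytes) -> bool:
    """True if the reply is Redis's simple-string answer to PING."""
    return reply.startswith(b"+PONG")


def redis_ping(host: str, port: int = 6379,
               timeout_s: float = 2.0) -> tuple[bool, float]:
    """Send an inline PING and return (success, elapsed milliseconds)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"PING\r\n")  # Redis accepts inline commands
            reply = conn.recv(64)
        ok = is_pong(reply)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        ok = False
    return ok, (time.monotonic() - started) * 1000.0
```

The success flag and duration map directly onto the success/failure and duration values that canaries report, so the same alarm patterns apply.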
Integrating with Alerting and Incident Management
Raw alerts are only useful if they reach the right people at the right time. AWS Simple Notification Service (SNS) is the glue that connects your CloudWatch alarms to your incident response workflow.
SNS Topic Configuration
Create an SNS topic (e.g., ShopifyApp-Production-Alerts). Configure your CloudWatch alarms to publish notifications to this topic. Then, create subscriptions to this topic:
- Email: For direct notifications to engineers.
- AWS Lambda: To trigger custom alert processing, enrichment, or routing logic.
- AWS SQS: To queue alerts for a dedicated processing service.
- HTTP/S endpoints: To integrate with third-party incident management tools like PagerDuty, Opsgenie, or VictorOps.
For PagerDuty, you’ll typically add an integration to a PagerDuty service, which gives you an integration URL containing the integration key, and then subscribe that URL to the SNS topic as an HTTPS endpoint. This ensures critical alerts trigger on-call rotations.
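For the Lambda subscription, the function receives each alarm as a JSON string inside the SNS record. The sketch below shows one way to turn that into a short summary line for routing or chat notifications. The handler and helper names are illustrative, and the summary format is a choice of this example.

```python
from __future__ import annotations

import json
from typing import Any


def summarize_alarm(sns_message: str) -> str:
    """Build a one-line summary from a CloudWatch alarm notification."""
    alarm = json.loads(sns_message)
    trigger = alarm.get("Trigger", {})
    metric = f"{trigger.get('Namespace', '?')}/{trigger.get('MetricName', '?')}"
    return (f"[{alarm.get('NewStateValue', 'UNKNOWN')}] "
            f"{alarm.get('AlarmName', 'unnamed alarm')}: {metric} - "
            f"{alarm.get('NewStateReason', 'no reason given')}")


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point for an SNS-triggered Lambda function."""
    summaries = [summarize_alarm(record["Sns"]["Message"])
                 for record in event["Records"]]
    for line in summaries:
        print(line)  # Lambda forwards stdout to CloudWatch Logs
    return {"processed": len(summaries), "summaries": summaries}
```

From here, the handler could forward each summary to a chat webhook or apply routing rules based on the alarm name.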
Continuous Improvement and Review
Monitoring is not a set-it-and-forget-it discipline. Regularly review your:
- Alert thresholds: Are they too sensitive, producing noise? Are they too lax, leading to missed incidents?
- Key metrics: Are there new metrics that would provide better insight into application health?
- Log queries: Can you automate common diagnostic queries?
- Incident response: After an incident, analyze what monitoring gaps existed and how to address them.
By implementing these practices, you build a resilient monitoring system that keeps your Shopify app and its critical Redis infrastructure running smoothly on AWS, minimizing downtime and ensuring a positive experience for your users.