Server Monitoring Best Practices: Keeping Your Shopify App and Redis Clusters Alive on AWS

Establishing a Robust Monitoring Foundation with AWS CloudWatch

For any production Shopify app, especially one leveraging external services like Redis, a comprehensive monitoring strategy is non-negotiable. AWS CloudWatch serves as the foundational layer for this, providing metrics, logs, and alarms. We’ll focus on key metrics for EC2 instances hosting your app and ElastiCache for Redis, ensuring proactive issue detection and rapid response.

Monitoring EC2 Instances for Your Shopify App Backend

Your application servers, typically running on EC2, are the heart of your Shopify app. Monitoring their health and performance is paramount. We’ll configure CloudWatch to collect essential metrics and set up alarms for critical thresholds.

Key EC2 Metrics to Track

CPU Utilization: High CPU can indicate inefficient code, traffic spikes, or resource contention.
Memory Utilization: Crucial for application performance. EC2 does not publish memory metrics to CloudWatch on its own; they require the CloudWatch agent, which is covered below.
Network In/Out: High network traffic can signal increased user activity or potential DDoS attacks.
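Disk Read/Write Operations & Bytes: Important for identifying I/O bottlenecks. The EC2 DiskReadOps and DiskWriteOps metrics cover instance store volumes only; for EBS volumes, use the AWS/EBS metrics (VolumeReadOps, VolumeWriteOps) or the agent’s diskio measurements.
Disk Queue Length: A sustained high queue length indicates the disk can’t keep up with demand. For EBS volumes, this is the VolumeQueueLength metric in the AWS/EBS namespace.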
Status Checks (System & Instance): CloudWatch automatically monitors these. A failed status check requires immediate investigation.

Collecting Memory Metrics with the CloudWatch Agent

To get memory utilization metrics, you need to install and configure the CloudWatch agent on your EC2 instances. This involves creating a configuration file and ensuring the agent runs as a service.

CloudWatch Agent Configuration (Amazon Linux 2 Example)

First, install the agent:

sudo yum install amazon-cloudwatch-agent -y

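Next, create the configuration file. This guide stores it at /opt/aws/amazon-cloudwatch/agent/config.json (create the directory first; any path works as long as you pass the same one to fetch-config later). This configuration collects system-level metrics, including memory.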

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "ShopifyApp/EC2",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ],
        "ignore_file_system_types": [
          "sysfs",
          "devtmpfs",
          "tmpfs",
          "devpts",
          "nfs",
          "local"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ]
      },
      "netstat": {
        "measurement": [
          "tcp_established",
          "tcp_syn_sent",
          "tcp_close_wait"
        ]
      },
      "statsd": {
        "service_address": "127.0.0.1:8125"
      },
      "diskio": {
        "measurement": [
          "read_bytes",
          "write_bytes",
          "reads",
          "writes"
        ],
        "resources": [
          "nvme0n1",
          "nvme1n1"
        ]
      }
    }
  }
}

Start the agent with this configuration:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch/agent/config.json -s

Verify the agent is running:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
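Because a malformed file makes the agent fail at startup, it can help to check the configuration before running fetch-config. The following is a minimal sketch, not part of the agent tooling: the function name check_agent_config and the list of required sections are assumptions chosen for this guide.

```python
import json

# Hypothetical policy for this guide: the sections we expect to be present.
REQUIRED_METRIC_SECTIONS = ("cpu", "disk", "mem")


def check_agent_config(text):
    """Return a list of problems found in a CloudWatch agent config string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # Strict JSON: comments and trailing commas are rejected here.
        return [f"not valid JSON: {exc.msg} at line {exc.lineno}"]

    problems = []
    collected = config.get("metrics", {}).get("metrics_collected", {})
    for section in REQUIRED_METRIC_SECTIONS:
        if section not in collected:
            problems.append(f"missing metrics_collected.{section}")

    # Memory is the metric EC2 does not provide on its own, so check it explicitly.
    mem_measurements = collected.get("mem", {}).get("measurement", [])
    if "mem_used_percent" not in mem_measurements:
        problems.append("mem_used_percent is not being collected")
    return problems
```

To use it, read the file and print the result, for example check_agent_config(open(path).read()); an empty list means none of these checks failed.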

Setting Up CloudWatch Alarms for EC2

Once metrics are flowing, create alarms in CloudWatch. Navigate to the CloudWatch console, select “Alarms,” and then “Create alarm.”

Example Alarm Configurations

High CPU Utilization:

Metric: CPUUtilization in the AWS/EC2 namespace. The agent does not publish a CPUUtilization metric; from the ShopifyApp/EC2 namespace, alarm on cpu_usage_idle dropping below 15% instead.
Threshold: Greater than 85% for 15 minutes.
Action: Send notification to an SNS topic (e.g., for PagerDuty or Slack integration).

High Memory Utilization:

Metric: mem_used_percent (published by the agent in the ShopifyApp/EC2 namespace)
Threshold: Greater than 85% for 10 minutes.
Action: SNS notification.

High Disk Queue Length:

Metric: VolumeQueueLength in the AWS/EBS namespace (for the relevant volume, e.g., the one attached as /dev/xvda1)
Threshold: Greater than 2 for 5 minutes.
Action: SNS notification.

Instance Status Check Failed:

Metric: StatusCheckFailed_System or StatusCheckFailed_Instance
Threshold: Greater than 0 for 1 minute.
Action: SNS notification.
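The console steps can also be scripted. Below is a minimal sketch of the high-CPU alarm (above 85% for 15 minutes, expressed as three 5-minute periods) using the default AWS/EC2 CPUUtilization metric. The alarm name, instance ID, and topic ARN are placeholders, and boto3 is a third-party dependency that must be installed separately; treat this as an illustration rather than a finished deployment script.

```python
def cpu_alarm_params(instance_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"shopify-app-high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint
        "EvaluationPeriods": 3,    # 3 x 5 minutes = 15 minutes
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes the thresholds easy to review and unit test without touching an AWS account.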

Monitoring AWS ElastiCache for Redis Clusters

Redis is a critical component for caching and session management in many Shopify apps. ElastiCache provides managed Redis, but monitoring its performance and availability is still your responsibility.

Key ElastiCache for Redis Metrics

Engine CPU Utilization: High CPU on Redis nodes can lead to slow responses or timeouts.
Cache Hits/Misses: A low cache hit ratio indicates ineffective caching or insufficient memory.
Evictions: High eviction rates mean Redis is running out of memory and discarding keys, impacting performance.
Curr Connections: Monitor for unexpected spikes or sustained high connection counts.
Replication Lag: For read replicas, ensure lag is minimal to maintain data consistency.
Network Bytes In/Out: Similar to EC2, tracks traffic to/from the Redis nodes.
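Freeable Memory: Free memory on the host, useful as a rough indicator of headroom. DatabaseMemoryUsagePercentage shows more directly how close Redis is to its maxmemory limit.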
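The hit ratio mentioned above is derived from two of these metrics rather than read directly. As a small illustration, assuming you have already fetched the CacheHits and CacheMisses sums for the same time window:

```python
def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there was no traffic."""
    total = hits + misses
    if total == 0:
        # No lookups in the window: a ratio would be meaningless.
        return None
    return hits / total
```

For example, 900 hits and 100 misses give a ratio of 0.9. What counts as "low" depends on the workload, so compare against your own baseline rather than a fixed number.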

Setting Up CloudWatch Alarms for ElastiCache

ElastiCache automatically publishes metrics to CloudWatch. You can create alarms directly on these metrics.

Example ElastiCache Alarm Configurations

High Redis CPU Utilization:

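Metric: EngineCPUUtilization (under the AWS/ElastiCache namespace). Redis executes commands on a single thread, so on nodes with four or more vCPUs the host-level CPUUtilization metric can look low while the engine is saturated; on smaller nodes, alarm on CPUUtilization.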
Threshold: Greater than 80% for 10 minutes.
Action: SNS notification.

High Evictions:

Metric: Evictions
Threshold: Greater than 1000 in 5 minutes (adjust based on your dataset size and traffic).
Action: SNS notification.

Low Freeable Memory:

Metric: FreeableMemory
Threshold: Less than 50 MB for 15 minutes. The metric is reported in bytes, so enter 52,428,800.
Action: SNS notification.

High Replication Lag:

Metric: ReplicationLag (reported by replica nodes, in seconds)
Threshold: Greater than 5 seconds for 2 minutes.
Action: SNS notification.

High Connection Count:

Metric: CurrConnections
Threshold: Greater than 80% of your node’s connection limit for 5 minutes. Alarms take an absolute value, so with a maxclients limit of 65,000 this is 52,000.
Action: SNS notification.
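The five alarms above can be written down as data, which makes the units explicit: FreeableMemory is in bytes, Evictions is best summed per period, and the connection threshold must be an absolute count. This is a sketch under stated assumptions: the alarm names are invented, the 65,000 connection limit should be checked against your node type, and the metric names follow the list above (swap in EngineCPUUtilization on larger nodes if that fits your setup). Each dictionary uses PutMetricAlarm parameter names; add AlarmActions with your SNS topic before use.

```python
MIB = 1024 * 1024
ASSUMED_MAX_CLIENTS = 65_000  # assumption: verify the limit for your node type


def elasticache_alarm_specs(cluster_id):
    """Return one PutMetricAlarm-style spec per ElastiCache alarm."""

    def spec(name, metric, statistic, operator, threshold, period, periods):
        return {
            "AlarmName": f"{cluster_id}-{name}",  # hypothetical naming scheme
            "Namespace": "AWS/ElastiCache",
            "MetricName": metric,
            "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
            "Statistic": statistic,
            "ComparisonOperator": operator,
            "Threshold": float(threshold),
            "Period": period,              # seconds per datapoint
            "EvaluationPeriods": periods,  # consecutive datapoints required
        }

    connection_limit = ASSUMED_MAX_CLIENTS * 80 // 100  # 80% as an integer count
    return [
        spec("high-cpu", "CPUUtilization", "Average", "GreaterThanThreshold", 80, 300, 2),
        spec("high-evictions", "Evictions", "Sum", "GreaterThanThreshold", 1000, 300, 1),
        spec("low-freeable-memory", "FreeableMemory", "Average", "LessThanThreshold", 50 * MIB, 300, 3),
        spec("high-replication-lag", "ReplicationLag", "Maximum", "GreaterThanThreshold", 5, 60, 2),
        spec("high-connections", "CurrConnections", "Average", "GreaterThanThreshold", connection_limit, 300, 1),
    ]
```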

Centralized Logging with AWS CloudWatch Logs

Beyond metrics, logs are indispensable for debugging and understanding application behavior. Centralizing logs from your EC2 instances and potentially Redis slow logs (if enabled) into CloudWatch Logs provides a single pane of glass for analysis.

Configuring EC2 Instance Logs for CloudWatch Logs

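The CloudWatch agent can also tail log files and send them to CloudWatch Logs. Update your config.json file (the same one used for metrics) to include a logs section. In the listing below, the // line is only a placeholder for the metrics configuration shown earlier; JSON does not allow comments, so replace it with the actual metrics_collected contents. Because the agent runs as cwagent, that user also needs read access to each log file.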

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "ShopifyApp/EC2",
    "metrics_collected": {
      // ... (previous metrics configuration) ...
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "ShopifyApp/Nginx/Access",
            "log_stream_name": "{instance_id}/nginx_access"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "ShopifyApp/Nginx/Error",
            "log_stream_name": "{instance_id}/nginx_error"
          },
          {
            "file_path": "/var/log/your_app.log",
            "log_group_name": "ShopifyApp/Application",
            "log_stream_name": "{instance_id}/app_log"
          }
        ]
      }
    }
  }
}

After updating the configuration, restart the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch/agent/config.json -s
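If you would rather not edit the JSON by hand, the logs section can be merged into the existing metrics configuration programmatically. This is a minimal sketch; the function name with_logs_section is made up for this example, and it assumes the existing config has already been loaded into a dictionary with json.load.

```python
import json


def with_logs_section(config, log_files):
    """Return a copy of config with a logs section built from log_files.

    log_files is a list of (file_path, log_group_name, log_stream_name) tuples.
    """
    merged = dict(config)  # shallow copy; the original dict is left untouched
    merged["logs"] = {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": path,
                        "log_group_name": group,
                        "log_stream_name": stream,
                    }
                    for path, group, stream in log_files
                ]
            }
        }
    }
    return merged


def to_json(config):
    """Serialize the config as strict, readable JSON."""
    return json.dumps(config, indent=2)
```

Write the result of to_json back to config.json and rerun the fetch-config command above so the agent picks up the change.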

Leveraging CloudWatch Logs Insights

Once logs are ingested, CloudWatch Logs Insights becomes your primary tool for querying and analyzing them. This is invaluable for debugging specific errors, tracing requests, or identifying performance bottlenecks.

Example Log Insights Queries

Find all HTTP 5xx errors from Nginx in the last hour:

fields @timestamp, @message
| filter @message like /HTTP\/1\.[01]\" 5\d\d/
| sort @timestamp desc
| limit 100

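Count unique client IP addresses per hour over the last 24 hours (set the time range in the query editor; Nginx access logs are not parsed automatically, so extract the address first):

parse @message /^(?<client_ip>\S+)/
| stats count_distinct(client_ip) as uniqueIps by bin(1h) as hour
| sort hour desc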

Analyze application log errors and group by error message:

filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 50
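Before relying on a filter pattern, it is worth testing it against a few real log lines. The sketch below applies the 5xx pattern from the first query to made-up sample lines in the Nginx combined format. It also shows one limitation: requests logged as HTTP/2.0 are not matched. The broader pattern is an assumption about what you might want instead, not something taken from the queries above.

```python
import re

# The pattern used in the first query above.
HTTP1_5XX = re.compile(r'HTTP/1\.[01]" 5\d\d')
# A broader variant (assumption) that also matches HTTP/2.0 request lines.
ANY_HTTP_5XX = re.compile(r'HTTP/\d(?:\.\d)?" 5\d\d')
# The client address is the first whitespace-delimited field.
CLIENT_IP = re.compile(r"^(\S+)")

# Invented sample lines using documentation IP ranges.
SAMPLE_LINES = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /cart HTTP/1.1" 502 157 "-" "curl/8.0"',
    '203.0.113.8 - - [10/Oct/2024:13:55:37 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.0"',
    '203.0.113.7 - - [10/Oct/2024:13:55:38 +0000] "POST /webhooks HTTP/2.0" 500 0 "-" "client"',
]


def count_matches(pattern, lines):
    """Count the lines in which the pattern occurs."""
    return sum(1 for line in lines if pattern.search(line))


def unique_client_ips(lines):
    """Collect the distinct client addresses at the start of each line."""
    ips = set()
    for line in lines:
        match = CLIENT_IP.match(line)
        if match:
            ips.add(match.group(1))
    return ips
```

On the sample lines, the original pattern finds one 5xx response and the broader one finds two, so check which protocol versions your server actually logs.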

Proactive Health Checks and Synthetic Monitoring

While CloudWatch metrics and logs are reactive and diagnostic, proactive health checks ensure your application is not only running but also responding correctly to external requests. AWS offers services like Route 53 Health Checks and CloudWatch Synthetics for this purpose.

Route 53 Health Checks

Configure Route 53 health checks to periodically request a specific endpoint on your application (e.g., /health). When an endpoint fails a configurable number of consecutive checks (the failure threshold), Route 53 marks it unhealthy and, with failover or another health-checked routing policy in place, stops returning it in DNS answers, so users are not sent to broken servers. This complements the EC2 and load balancer health checks that Auto Scaling Groups use to replace unhealthy instances.
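What the /health endpoint returns matters: Route 53 HTTP checks treat 2xx and 3xx status codes as healthy, so the endpoint should answer 503 when a dependency is down. Here is a minimal sketch using only the standard library. The run_checks function is a placeholder that you would replace with real checks (for example a Redis PING); the check names are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(checks):
    """Map a dict of check name -> bool to an HTTP status and JSON body."""
    healthy = all(checks.values())
    status = 200 if healthy else 503
    body = json.dumps(
        {"status": "ok" if healthy else "degraded", "checks": checks}
    ).encode("utf-8")
    return status, body


def run_checks():
    """Placeholder: replace with real dependency checks."""
    return {"app": True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(run_checks())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(port=8080):
    """Serve the health endpoint until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

In a real app you would expose the same logic through your web framework's router; keep the checks cheap, since they run on every probe.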

CloudWatch Synthetics Canaries

Synthetics lets you deploy canaries: scripts, run on Lambda, that simulate user interactions. You can create canaries to:

Make HTTP requests to your application’s critical endpoints.
Verify API responses.
Check if Redis is reachable and responding to PING commands (the canary must run inside the cluster’s VPC).
Test key user flows (e.g., adding an item to the cart).

These Canaries run on a schedule and publish metrics (success/failure, duration) to CloudWatch, which can then trigger alarms. This provides an external perspective on your application’s availability and performance.
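To make the idea concrete, here is a sketch of the core of such a check in plain Python: request a URL, record the status and duration, and decide success or failure. It does not use the Synthetics runtime libraries; the probe function and its result keys are assumptions for illustration. The opener parameter exists only so the function can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Request url once and report status, success, and duration."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        status = exc.code
    except OSError:
        # Covers URLError, timeouts, and refused connections.
        status = None
    duration = time.monotonic() - started
    return {
        "url": url,
        "status": status,
        "success": status is not None and 200 <= status < 400,
        "duration_seconds": duration,
    }
```

A scheduled job could call probe for each critical endpoint and publish the success flag and duration as custom metrics, mirroring what a canary reports.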

Integrating with Alerting and Incident Management

Raw alerts are only useful if they reach the right people at the right time. Amazon Simple Notification Service (SNS) is the glue that connects your CloudWatch alarms to your incident response workflow.

SNS Topic Configuration

Create an SNS topic (e.g., ShopifyApp-Production-Alerts). Configure your CloudWatch alarms to publish notifications to this topic. Then, create subscriptions to this topic:

Email: For direct notifications to engineers.
AWS Lambda: To trigger custom alert processing, enrichment, or routing logic.
Amazon SQS: To queue alerts for a dedicated processing service.
HTTP/S endpoints: To integrate with third-party incident management tools like PagerDuty, Opsgenie, or VictorOps (now Splunk On-Call).

For PagerDuty, the usual route is its Amazon CloudWatch integration: create the integration on a PagerDuty service, then subscribe the integration URL it gives you to the SNS topic as an HTTPS endpoint. The Events API v2 expects its own JSON format, so using it from SNS requires a Lambda subscriber that translates the alarm message. Either way, critical alerts trigger on-call rotations.
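If you route alerts through a Lambda subscription, the function has to translate the CloudWatch alarm message that SNS delivers into an Events API v2 event. The following is a sketch under assumptions: the PAGERDUTY_ROUTING_KEY environment variable is a name invented here, every alarm is mapped to "critical" severity, and error handling and retries are omitted.

```python
import json
import os
import urllib.request

PAGERDUTY_ENQUEUE_URL = "https://events.pagerduty.com/v2/enqueue"


def to_pagerduty_event(sns_message, routing_key):
    """Convert a CloudWatch alarm SNS message (JSON string) to an Events API v2 event."""
    alarm = json.loads(sns_message)
    action = {"ALARM": "trigger", "OK": "resolve"}.get(alarm["NewStateValue"])
    if action is None:
        # States such as INSUFFICIENT_DATA are ignored in this sketch.
        return None
    return {
        "routing_key": routing_key,
        "event_action": action,
        # Using the alarm name lets the OK message resolve the same incident.
        "dedup_key": alarm["AlarmName"],
        "payload": {
            "summary": f"{alarm['AlarmName']}: {alarm['NewStateReason']}"[:1024],
            "source": alarm.get("Region", "aws"),
            "severity": "critical",  # assumption: one severity for all alarms
        },
    }


def send(event):
    """POST one event to PagerDuty and return the HTTP status."""
    request = urllib.request.Request(
        PAGERDUTY_ENQUEUE_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def handler(event, context):
    """Lambda entry point for SNS-triggered invocations."""
    routing_key = os.environ["PAGERDUTY_ROUTING_KEY"]  # hypothetical variable name
    sent = 0
    for record in event["Records"]:
        pd_event = to_pagerduty_event(record["Sns"]["Message"], routing_key)
        if pd_event is not None:
            send(pd_event)
            sent += 1
    return {"sent": sent}
```

Keeping the translation in its own function means the mapping can be tested with recorded alarm messages before the Lambda is deployed.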

Continuous Improvement and Review

Monitoring is not a set-it-and-forget-it discipline. Regularly review your:

Alert thresholds: Are they too sensitive, producing noise that gets ignored? Or too lax, leading to missed incidents?
Key metrics: Are there new metrics that would provide better insight into application health?
Log queries: Can you automate common diagnostic queries?
Incident response: After an incident, analyze what monitoring gaps existed and how to address them.

By implementing these practices, you build a resilient monitoring system that keeps your Shopify app and its critical Redis infrastructure running smoothly on AWS, minimizing downtime and ensuring a positive experience for your users.