Server Monitoring Best Practices: Keeping Your Magento 2 App and Redis Clusters Alive on AWS

Proactive Health Checks for Magento 2 on AWS EC2

Maintaining a high-availability Magento 2 instance on AWS requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to focus on application-specific metrics and external dependencies. This section details essential checks for your EC2 instances hosting Magento 2.

EC2 Instance Metrics: Beyond the Basics

Amazon CloudWatch provides the fundamental EC2 metrics. CPUUtilization, disk I/O, and memory usage are crucial, but EC2 does not report memory on its own (the CloudWatch agent publishes it as mem_used_percent), and all three are often lagging indicators. We need to supplement them with more granular, real-time checks.

Essential CloudWatch Alarms Configuration

Configure alarms for the following metrics. Set thresholds that reflect your application’s performance baseline, aiming for proactive alerts rather than reactive firefighting.

CPUUtilization: Alarm if sustained above 80% for 5 minutes.
NetworkIn/NetworkOut: Alarm if exceeding expected traffic patterns (e.g., sudden spikes indicating potential DDoS or runaway processes).
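VolumeQueueLength: Alarm if consistently above 2 for any attached EBS volume. This metric is published per volume in the AWS/EBS namespace and indicates I/O bottlenecks.
StatusCheckFailed: Alarm immediately if StatusCheckFailed_System or StatusCheckFailed_Instance is nonzero. A failed system check points to the underlying AWS host or infrastructure; a failed instance check points to the instance itself (for example, a kernel, network configuration, or exhausted-memory problem).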

For memory, ensure the CloudWatch agent is installed and configured to collect memory metrics; by default it publishes mem_used_percent in the CWAgent namespace. A common threshold is alarming if mem_used_percent exceeds 90% for 5 minutes.
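
As a rough illustration of the CPUUtilization rule above, the following sketch wraps `aws cloudwatch put-metric-alarm` in a shell function. The alarm name prefix, the instance ID, and the SNS topic ARN are hypothetical placeholders, and the function assumes the AWS CLI already has credentials and a default region.

```shell
# Sketch: alarm when average CPUUtilization is above 80% over a 5-minute period.
# $1 = EC2 instance ID, $2 = SNS topic ARN to notify (both supplied by the caller).
create_cpu_alarm() {
    instance_id="$1"
    topic_arn="$2"
    # Basic monitoring reports every 5 minutes, so one 300-second period is used.
    # With detailed monitoring, --period 60 with --evaluation-periods 5 is an option.
    aws cloudwatch put-metric-alarm \
        --alarm-name "magento-cpu-high-${instance_id}" \
        --namespace "AWS/EC2" \
        --metric-name "CPUUtilization" \
        --dimensions "Name=InstanceId,Value=${instance_id}" \
        --statistic Average \
        --period 300 \
        --evaluation-periods 1 \
        --threshold 80 \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions "${topic_arn}"
}
```

It would be called as `create_cpu_alarm <instance-id> <sns-topic-arn>`. The same shape works for the other metrics in the list by changing the namespace, metric name, dimensions, and threshold.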

Application-Level Monitoring with CloudWatch Agent and Custom Metrics

The CloudWatch agent is indispensable for collecting logs and custom metrics from your EC2 instances. This allows us to monitor Magento 2’s specific health.

Configuring CloudWatch Agent for Magento 2 Logs

The agent can tail Magento 2’s log files and send them to CloudWatch Logs. This is vital for debugging and identifying application errors.

Create or edit the CloudWatch agent configuration file (e.g., /opt/aws/amazon-cloudwatch-agent/bin/config.json).

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "cwagent"
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/www/html/magento2/var/log/system.log",
                        "log_group_name": "/aws/magento2/ec2/system",
                        "log_stream_name": "{instance_id}/system.log"
                    },
                    {
                        "file_path": "/var/www/html/magento2/var/log/exception.log",
                        "log_group_name": "/aws/magento2/ec2/exception",
                        "log_stream_name": "{instance_id}/exception.log"
                    },
                    {
                        "file_path": "/var/www/html/magento2/var/log/debug.log",
                        "log_group_name": "/aws/magento2/ec2/debug",
                        "log_stream_name": "{instance_id}/debug.log"
                    }
                ]
            }
        }
    },
    "metrics": {
        "metrics_collected": {
            "cpu": {
                "measurement": [
                    "cpu_usage_idle",
                    "cpu_usage_user",
                    "cpu_usage_system",
                    "cpu_usage_iowait",
                    "cpu_usage_steal"
                ],
                "metrics_collection_interval": 60,
                "totalcpu": true
            },
            "disk": {
                "measurement": [
                    "used_percent",
                    "inodes_free"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "/"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "net": {
                "measurement": [
                    "bytes_sent",
                    "bytes_recv",
                    "packets_sent",
                    "packets_recv"
                ],
                "metrics_collection_interval": 60
            }
        }
    }
}

After saving the configuration, restart the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

Custom Metrics for Magento 2 Processes

We can also push custom metrics, such as the status of critical Magento 2 processes (e.g., PHP-FPM, Nginx). A simple script can check process status and report it to CloudWatch through the AWS CLI.

Create a script (e.g., /opt/scripts/check_magento_processes.sh):

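#!/bin/bash

# Define processes to monitor. pgrep -x matches the exact process name, so
# adjust "php-fpm" if your distribution names it differently (e.g., php-fpm8.2).
PROCESSES=("php-fpm" "nginx")
NAMESPACE="Magento2/EC2/Processes"

# Look up the instance ID once, using IMDSv2 (token-based metadata requests)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)

for PROC in "${PROCESSES[@]}"; do
    if pgrep -x "$PROC" > /dev/null; then
        STATUS=1 # Running
    else
        STATUS=0 # Not Running
    fi
    # Publish metric to CloudWatch (assumes the AWS CLI has credentials and a default region)
    aws cloudwatch put-metric-data --metric-name "${PROC}_status" --namespace "$NAMESPACE" --value "$STATUS" --dimensions "InstanceId=$INSTANCE_ID" --unit Count
done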

Make the script executable:

chmod +x /opt/scripts/check_magento_processes.sh

Schedule this script to run periodically using cron (e.g., every minute). Entries in /etc/cron.d need a user field after the schedule, here root:

echo "* * * * * root /opt/scripts/check_magento_processes.sh" | sudo tee /etc/cron.d/check_magento_processes

Alternatively, the CloudWatch agent can count these processes itself through its procstat plugin, without a separate script. To use it, add the following to your config.json under metrics_collected in the metrics section:

            "procstat": [
                {
                    "pattern": "php-fpm",
                    "measurement": [
                        "pid_count"
                    ],
                    "metrics_collection_interval": 60
                },
                {
                    "pattern": "nginx",
                    "measurement": [
                        "pid_count"
                    ],
                    "metrics_collection_interval": 60
                }
            ]

Note: The `aws cloudwatch put-metric-data` command is a simpler way to push custom metrics directly. The `procstat` plugin in the agent is more for collecting process-level statistics like CPU/memory usage per process. For a simple “is it running?” check, the `aws cli` method is often sufficient and easier to manage.

Web Server (Nginx) Monitoring

Nginx is the gateway for your Magento 2 application. Monitoring its health and performance is paramount.

Nginx Status and Performance Metrics

Enable the Nginx stub_status module to expose basic metrics. In your Nginx configuration (e.g., /etc/nginx/nginx.conf or a site-specific conf file):

http {
    # ... other http configurations ...

    server {
        listen 80;
        server_name your_domain.com;

        location /nginx_status {
            stub_status;
            allow 127.0.0.1; # Restrict access to localhost
            deny all;
        }

        # ... other locations ...
    }
}

Reload Nginx to apply changes:

sudo systemctl reload nginx

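You can then fetch these metrics from the instance itself. Using 127.0.0.1 rather than localhost avoids resolving to ::1, which the allow rule above does not cover:

curl http://127.0.0.1/nginx_status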

The output will look like:

Active connections: 123
server accepts handled requests
 1667890 1667890 12345678
Reading: 1 Writing: 6 Waiting: 116

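Use a script (similar to the process check script) to parse these values and send them as custom CloudWatch metrics (e.g., ActiveConnections, RequestsPerSecond, ReadingConnections, WritingConnections, WaitingConnections). The three counters on the third line are cumulative since Nginx started: the difference in requests between two samples, divided by the sampling interval, gives requests per second, and a gap between accepts and handled means connections are being dropped.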
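
A minimal sketch of that parsing step is shown below. The function names and the key=value output format are illustrative choices, not part of Nginx or the AWS CLI; the parser reads stub_status output on standard input, and the rate helper takes two samples of the cumulative requests counter.

```shell
# Turn stub_status output (on stdin) into key=value lines.
parse_stub_status() {
    awk '
        /^Active connections:/          { print "active=" $3 }
        /^ *[0-9]+ +[0-9]+ +[0-9]+/     { print "accepts=" $1; print "handled=" $2; print "requests=" $3 }
        /^Reading:/                     { print "reading=" $2; print "writing=" $4; print "waiting=" $6 }
    '
}

# Requests per second from two samples of the cumulative "requests" counter.
# $1 = previous value, $2 = current value, $3 = seconds between the samples.
requests_per_second() {
    awk -v prev="$1" -v cur="$2" -v secs="$3" 'BEGIN {
        # A smaller current value means Nginx restarted and the counter reset.
        if (secs <= 0 || cur < prev) { print 0; exit }
        printf "%.2f\n", (cur - prev) / secs
    }'
}
```

For example, `curl -s http://127.0.0.1/nginx_status | parse_stub_status` produces lines such as active=123 that a wrapper script can pass to `aws cloudwatch put-metric-data`. Storing the previous requests value in a small state file between cron runs is one way to feed `requests_per_second`.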

Nginx Error Log Monitoring

Nginx error logs (e.g., /var/log/nginx/error.log) are not part of the agent configuration shown earlier, so add an entry for them to collect_list in the same way as the Magento logs, and make sure the cwagent user can read the file. Then set up CloudWatch Logs metric filters to alert on specific error patterns (e.g., “access forbidden by rule”, “upstream timed out”).
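
One way to create such a filter is with `aws logs put-metric-filter`. In the sketch below, the filter name, metric name, and Magento2/Nginx namespace are hypothetical, and the log group passed in is assumed to be whichever group receives the Nginx error log.

```shell
# Sketch: count "upstream timed out" lines in an Nginx error log group.
# $1 = CloudWatch Logs group that receives /var/log/nginx/error.log
create_nginx_error_filter() {
    log_group="$1"
    # The quoted filter pattern matches the exact phrase anywhere in a log event.
    aws logs put-metric-filter \
        --log-group-name "${log_group}" \
        --filter-name "nginx-upstream-timeout" \
        --filter-pattern '"upstream timed out"' \
        --metric-transformations \
            "metricName=NginxUpstreamTimeouts,metricNamespace=Magento2/Nginx,metricValue=1,defaultValue=0"
}
```

The resulting metric can then be alarmed on like any other CloudWatch metric, for instance when more than a handful of timeouts occur within five minutes.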

Redis Cluster Monitoring on AWS ElastiCache

For Magento 2, Redis is critical for caching. Using AWS ElastiCache for Redis simplifies management, but robust monitoring is still essential.

ElastiCache Metrics in CloudWatch

ElastiCache automatically publishes a rich set of metrics to CloudWatch. Key metrics to monitor include:

CacheHits/CacheMisses: High miss rates indicate potential performance degradation or insufficient cache capacity.
Evictions: High eviction rates mean Redis is discarding data due to memory pressure. This can lead to increased cache misses.
CurrConnections: Monitor for unexpected spikes or drops.
EngineCPUUtilization: For Redis, this is a critical indicator of load.
ReplicationLag: For Redis (Cluster Mode Disabled) or Redis (Cluster Mode Enabled) with read replicas, monitor replication lag to ensure data consistency across nodes.
NewConnections: Monitor for excessive connection churn.
BytesUsedForCache: Track memory usage to prevent OOM errors.

Setting Up ElastiCache Alarms

Configure CloudWatch alarms for your ElastiCache Redis cluster:

Evictions: Alarm if Evictions exceeds a low threshold (e.g., 5 per minute) for 5 minutes. This suggests memory pressure.
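CacheMisses: Alarm if the miss rate climbs significantly (e.g., miss rate > 50% for 10 minutes). CloudWatch alarms compare a single value to a threshold, so compute the rate with a metric math expression such as CacheMisses / (CacheHits + CacheMisses) rather than alarming on the raw CacheMisses count.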
EngineCPUUtilization: Alarm if consistently above 80% for 5 minutes.
ReplicationLag: Alarm if ReplicationLag exceeds a few seconds (e.g., 5 seconds) for 2 minutes.
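Memory usage: Alarm if DatabaseMemoryUsagePercentage exceeds 90% for 5 minutes. BytesUsedForCache reports absolute bytes, so alarming on it directly means converting 90% of the node's maxmemory into a byte threshold first.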
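
To make the miss-rate threshold concrete, here is a small sketch of the arithmetic. The function name is illustrative, and the inputs are assumed to be the Sum of CacheHits and CacheMisses over the same period (for example, retrieved with `aws cloudwatch get-metric-statistics`).

```shell
# Miss rate in percent: 100 * misses / (hits + misses).
# $1 = CacheHits sum, $2 = CacheMisses sum for the same time window.
cache_miss_rate() {
    awk -v hits="$1" -v misses="$2" 'BEGIN {
        total = hits + misses
        # No lookups in the window: report 0 rather than dividing by zero.
        if (total == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * misses / total
    }'
}
```

With 750 hits and 250 misses the rate is 25.0, well under the 50% example threshold. The same expression can be used in a CloudWatch metric math alarm so that the comparison happens inside CloudWatch instead of in a script.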

Redis Performance Tuning and Monitoring Commands

While ElastiCache manages the infrastructure, you can still connect to your Redis instances (using `redis-cli`) to gather more granular insights, especially during troubleshooting.

Connecting to ElastiCache Redis

You’ll need the Redis endpoint and port from your ElastiCache console. Use `redis-cli` with TLS enabled if your cluster requires it.

# If TLS is enabled
redis-cli -h your-redis-endpoint.cache.amazonaws.com -p 6379 --tls

# If TLS is not enabled
redis-cli -h your-redis-endpoint.cache.amazonaws.com -p 6379

Key `redis-cli` Commands for Monitoring

Once connected, these commands provide real-time status:

INFO memory: Detailed memory usage statistics. Look at used_memory, used_memory_rss, mem_fragmentation_ratio. A fragmentation ratio significantly above 1.5 can indicate memory inefficiency.
INFO stats: Provides the keyspace_hits and keyspace_misses counters (from which you can derive the hit ratio), commands processed, connection counts, etc.
INFO persistence: Relevant if RDB or AOF is enabled (less common with ElastiCache unless specifically configured).
MONITOR: (Use with extreme caution in production!) Streams all commands being executed. Useful for identifying slow or unexpected queries, but can heavily impact performance.
SLOWLOG GET 10: Retrieves the 10 most recent entries from the slow log, i.e., commands whose execution time exceeded the slowlog-log-slower-than threshold. Essential for identifying performance bottlenecks within Redis itself.
CLIENT LIST: Lists all connected clients, their state, and idle time. Helps identify stale or problematic connections.

You can script these commands to run periodically and push custom metrics to CloudWatch if ElastiCache’s built-in metrics aren’t sufficient. For example, a script could run redis-cli INFO memory, parse the output, and use `aws cloudwatch put-metric-data`.
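
A sketch of such a script follows. The function names, the Magento2/Redis namespace, the MemFragmentationRatio metric name, and the Endpoint dimension are all hypothetical; the endpoint is passed in by the caller, and a TLS-enabled cluster would also need `--tls` on the `redis-cli` call.

```shell
# Extract one field from `redis-cli INFO` output read on stdin.
# INFO lines look like "used_memory:1048576" and end with CRLF.
redis_info_field() {
    tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2; exit }'
}

# Publish mem_fragmentation_ratio for one endpoint as a custom metric.
# $1 = ElastiCache endpoint hostname
publish_redis_fragmentation() {
    endpoint="$1"
    ratio=$(redis-cli -h "${endpoint}" -p 6379 INFO memory | redis_info_field mem_fragmentation_ratio)
    # Skip publishing if the field could not be read (connection failure, etc.).
    [ -n "${ratio}" ] || return 1
    aws cloudwatch put-metric-data \
        --namespace "Magento2/Redis" \
        --metric-name "MemFragmentationRatio" \
        --dimensions "Endpoint=${endpoint}" \
        --value "${ratio}" \
        --unit None
}
```

Keeping the parser separate from the publishing step makes it reusable for other INFO fields, such as used_memory or keyspace_misses.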

Application Performance Monitoring (APM) for Magento 2

While server and infrastructure monitoring are crucial, understanding application-level performance bottlenecks within Magento 2 itself requires specialized tools.

Integrating APM Tools

Consider integrating an Application Performance Monitoring (APM) solution. Popular choices include:

New Relic
Datadog APM
Dynatrace
AWS X-Ray (for distributed tracing, can be integrated with other APM tools)

These tools provide deep insights into:

Transaction tracing: Identifying slow page loads, API calls, and background tasks.
Database query performance: Pinpointing inefficient SQL queries.
External service calls: Monitoring latency and errors from third-party integrations.
Code-level profiling: Pinpointing specific functions or methods causing performance issues.

For Magento 2, APM is invaluable for diagnosing issues that don’t manifest as simple CPU or memory spikes, such as slow database queries, inefficient third-party API calls, or poorly optimized Magento modules.

Alerting Strategy and Incident Response

Effective monitoring is only half the battle; a well-defined alerting and incident response strategy is critical.

Consolidating Alerts

Use Amazon Simple Notification Service (SNS) to consolidate alerts from CloudWatch alarms. Point each alarm at an SNS topic, then subscribe the appropriate destinations to that topic:

Email: For less critical alerts or initial notifications.
PagerDuty/Opsgenie: For critical alerts requiring immediate attention and on-call rotation.
Slack/Microsoft Teams: For team-wide visibility and discussion.
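
As a sketch of the email route, the function below creates a topic and subscribes one address to it using `aws sns create-topic` and `aws sns subscribe`. The topic name and email address are placeholders supplied by the caller; PagerDuty, Opsgenie, and chat integrations typically subscribe through their own HTTPS endpoints instead.

```shell
# Sketch: create an SNS topic for alarms and subscribe an email address.
# $1 = topic name, $2 = email address; prints the topic ARN on success.
create_alert_topic() {
    topic_name="$1"
    email="$2"
    # create-topic is idempotent: it returns the existing ARN if the topic exists.
    topic_arn=$(aws sns create-topic --name "${topic_name}" --query TopicArn --output text) || return 1
    aws sns subscribe \
        --topic-arn "${topic_arn}" \
        --protocol email \
        --notification-endpoint "${email}" > /dev/null || return 1
    printf '%s\n' "${topic_arn}"
}
```

The recipient has to confirm the subscription from the email SNS sends before any alarm notifications are delivered. The printed ARN is what CloudWatch alarms take in their alarm actions.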

Alerting Best Practices

Implement the following:

Actionable Alerts: Each alert should clearly state the problem, the affected service, and ideally, suggest remediation steps.
Avoid Alert Fatigue: Tune thresholds carefully. Use CloudWatch composite alarms to combine conditions (e.g., CPU high AND error rate high) and reduce noise. Implement alert silencing during planned maintenance.
Severity Levels: Differentiate between critical (e.g., site down, Redis unavailable) and warning (e.g., high cache miss rate) alerts.
Runbooks: Maintain runbooks for common alert types, detailing step-by-step procedures for diagnosis and resolution.
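
The "CPU high AND error rate high" idea can be sketched with `aws cloudwatch put-composite-alarm`, which takes a rule expression over existing alarms. The helper names below are illustrative, and the child alarm names are assumed to exist already.

```shell
# Build a composite rule such as: ALARM("cpu-high") AND ALARM("error-rate-high")
composite_alarm_rule() {
    rule=""
    for name in "$@"; do
        if [ -n "${rule}" ]; then
            rule="${rule} AND "
        fi
        rule="${rule}ALARM(\"${name}\")"
    done
    printf '%s\n' "${rule}"
}

# Sketch: fire only when every listed child alarm is in ALARM state.
# $1 = composite alarm name, $2 = SNS topic ARN, remaining args = child alarm names
create_composite_alarm() {
    alarm_name="$1"
    topic_arn="$2"
    shift 2
    aws cloudwatch put-composite-alarm \
        --alarm-name "${alarm_name}" \
        --alarm-rule "$(composite_alarm_rule "$@")" \
        --alarm-actions "${topic_arn}"
}
```

Attaching notifications only to the composite alarm, and leaving the child alarms without actions, is one way to keep single-signal blips from paging anyone.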

Conclusion

A comprehensive monitoring strategy for Magento 2 on AWS involves a blend of infrastructure metrics, application-specific logs and custom metrics, web server performance, and robust Redis cluster health checks. By leveraging AWS CloudWatch, the CloudWatch Agent, and potentially APM tools, coupled with a well-defined alerting strategy, you can ensure the stability, performance, and availability of your Magento 2 application.