Server Monitoring Best Practices: Keeping Your Magento 2 App and Redis Clusters Alive on DigitalOcean

Proactive Redis Cluster Health Checks with `redis-cli`

Redis health directly affects Magento 2 performance, since Magento typically relies on Redis for cache and session storage. Beyond basic connectivity, monitor the metrics that signal bottlenecks or impending failures. This involves using `redis-cli` for direct inspection and feeding the results into a monitoring system such as Prometheus or Nagios.

A fundamental check is the cluster’s overall status. For Redis Sentinel, this means querying the state of the master and its replicas. For Redis Cluster, it means confirming that `cluster_state` is `ok`, that no node is flagged as failing, and that all 16384 hash slots are covered.

Sentinel Cluster Status

Connect to a Sentinel instance and query the master’s status. This command will return the current master’s IP and port, and importantly, the number of Sentinels monitoring it and the number of replicas it has.

redis-cli -h <sentinel-host> -p <sentinel-port> SENTINEL master <master-name>

The output will look something like this (abridged; the exact field list varies by Redis version):

1) "name"
2) "<master-name>"
3) "ip"
4) "10.10.0.5"
5) "port"
6) "6379"
7) "runid"
8) "..."
9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ping-reply"
16) "0"
17) "down-after-milliseconds"
18) "5000"
19) "num-slaves"
20) "2"
21) "num-other-sentinels"
22) "2"
23) "failover-timeout"
24) "10000"
25) "parallel-syncs"
26) "1"

Key metrics to monitor from this output for alerting:

flags: Should be exactly master. If s_down or o_down appears, Sentinel considers the master subjectively or objectively down.
num-slaves: Should match your expected number of replicas. A decrease indicates a replica has gone offline or is failing to connect.
num-other-sentinels: Should match your expected number of Sentinels minus one (the instance you are querying). A decrease means a Sentinel has been removed or is unreachable.
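
To feed these fields into a Nagios check or a custom exporter, the flat field/value reply has to be turned into something a script can test. The following is a minimal Python sketch, not a definitive implementation: the function names and the expected counts passed in are assumptions made for illustration, and the parser simply pairs up alternating lines of `redis-cli` output.

```python
import re


def parse_sentinel_reply(text):
    """Turn the flat field/value reply of SENTINEL master into a dict.

    Accepts the numbered, quoted form redis-cli prints on a terminal as
    well as the bare one-item-per-line form it prints when piped.
    """
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        line = re.sub(r"^\d+\)\s*", "", line)  # drop a leading "12) " prefix
        items.append(line.strip('"'))
    # Fields and values alternate: name, <value>, ip, <value>, ...
    return dict(zip(items[0::2], items[1::2]))


def sentinel_problems(info, expected_replicas, expected_other_sentinels):
    """Return human-readable problems; an empty list means healthy."""
    problems = []
    if info.get("flags") != "master":
        problems.append(f"unexpected master flags: {info.get('flags')}")
    replicas = int(info.get("num-slaves", 0))
    if replicas < expected_replicas:
        problems.append(f"only {replicas} of {expected_replicas} replicas known")
    others = int(info.get("num-other-sentinels", 0))
    if others < expected_other_sentinels:
        problems.append(f"only {others} of {expected_other_sentinels} other Sentinels known")
    return problems
```

In practice the text would come from running the `redis-cli` command above (for example through `subprocess.run(..., capture_output=True, text=True)`), and a non-empty result would map to a warning or critical exit code in your check.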

To check the health of individual replicas, use the SENTINEL replicas <master-name> command.

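redis-cli -h <sentinel-host> -p <sentinel-port> SENTINEL replicas <master-name>

This lists each replica with its flags, replication offset, and link state. For each replica, alert when master-link-status is anything other than ok, when master-link-down-time is non-zero, or when flags contains s_down or disconnected.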

Redis Cluster Node Status

For Redis Cluster, the primary tool is redis-cli --cluster check. This command connects to all nodes in the cluster and verifies their state, including hash slot distribution and connectivity.

redis-cli --cluster check <node-host>:<node-port>

A healthy cluster will report:

...
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

Any `[ERR]` or `[WARNING]` line indicates a problem. For example, “[ERR] Not all 16384 slots are covered by nodes.” means keys that hash to the uncovered slots are unreachable, and “[ERR] Nodes don't agree about configuration!” means nodes hold conflicting views of slot ownership. This check does not verify that replicas are in sync, so confirm replication separately with CLUSTER NODES and with `INFO replication` on each replica.

To get a quick overview of all nodes and their roles, use CLUSTER NODES.

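redis-cli -h <node-host> -p <node-port> CLUSTER NODES

Each line shows a node’s flags (such as `master`, `slave`, `myself`, `fail?`, or `fail`) and its link state (`connected` or `disconnected`). Ensure every master has at least one replica and that no node carries a `fail` or `fail?` flag or a `disconnected` link state.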
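
The same rules can be applied automatically to the CLUSTER NODES output. The sketch below is one possible approach in Python, assuming the standard space-separated line layout (id, address, flags, master id, ping, pong, epoch, link state, slots); the function name `check_cluster_nodes` is hypothetical.

```python
def check_cluster_nodes(text, total_slots=16384):
    """Return a list of problems found in CLUSTER NODES output."""
    problems = []
    master_slots = {}      # master node id -> number of slots it owns
    healthy_replicas = {}  # master node id -> count of healthy replicas
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # not a node line
        node_id, addr, flag_field, master_id = parts[0], parts[1], parts[2], parts[3]
        link_state = parts[7]
        flags = flag_field.split(",")
        unhealthy = "fail" in flags or "fail?" in flags or link_state != "connected"
        if unhealthy:
            problems.append(f"{addr}: flags={flag_field} link={link_state}")
        if "master" in flags:
            owned = 0
            for slot in parts[8:]:
                if slot.startswith("["):
                    continue  # slot currently being migrated or imported
                start, _, end = slot.partition("-")
                owned += int(end or start) - int(start) + 1
            master_slots[node_id] = owned
        elif "slave" in flags and not unhealthy:
            healthy_replicas[master_id] = healthy_replicas.get(master_id, 0) + 1
    for node_id, owned in master_slots.items():
        # Only masters that own slots need a replica for failover.
        if owned and healthy_replicas.get(node_id, 0) == 0:
            problems.append(f"master {node_id} has no healthy replica")
    covered = sum(master_slots.values())
    if covered != total_slots:
        problems.append(f"only {covered} of {total_slots} slots covered")
    return problems
```

An empty list means every slot-owning master has a healthy replica, no node is flagged as failing, and the slot ranges add up to 16384.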

Magento 2 Application Performance Monitoring (APM) with New Relic

While Redis health is crucial, understanding how your Magento 2 application interacts with it and other services is equally important. New Relic is a powerful APM tool that provides deep insights into transaction traces, database queries, external service calls, and error rates.

Key Magento 2 Metrics to Track

Transaction Traces: Identify slow-loading pages, problematic API endpoints, or inefficient cron jobs. Look for transactions exceeding your SLOs (Service Level Objectives).
Database Queries: Pinpoint slow SQL queries, especially those executed within Magento’s EAV model or complex catalog operations.
External Services: Monitor latency and error rates for calls to third-party APIs (payment gateways, shipping providers, etc.) and internal services like Redis.
Error Rates: Track PHP errors, exceptions, and HTTP 5xx errors. Set up alerts for spikes in error frequency.
Throughput: Monitor requests per minute (RPM) to understand traffic patterns and identify potential load issues.
Response Time: Track average and percentile response times (e.g., p95, p99) to gauge user experience.
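
To see why percentiles matter alongside the average, consider a small worked example. This Python sketch uses the nearest-rank method, which is an assumption made for illustration (New Relic computes its own percentile estimates), and the sample durations are invented.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 90 fast requests (0.2 s) and 10 slow ones (4.0 s).
durations = [0.2] * 90 + [4.0] * 10
average = sum(durations) / len(durations)
p95 = percentile(durations, 95)
```

Here the average stays well under a 2-second target while the p95 sits at the slow value, so one request in ten is slow even though an average-based alert would stay quiet.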

Configuring New Relic for Magento 2

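The New Relic PHP agent is typically installed from New Relic’s apt or yum repository (the `newrelic-php5` package, which covers current PHP versions despite its name) or by downloading the agent archive and running the bundled `newrelic-install` script. For DigitalOcean droplets, this usually means SSHing into the server and following the official New Relic installation guide.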

After installation, you’ll need to configure the `newrelic.ini` file. This file is usually located in your PHP configuration directory (e.g., `/etc/php/8.1/cli/conf.d/` or `/etc/php/8.1/fpm/conf.d/`).

[newrelic]
; Required: Your New Relic license key
newrelic.license = "YOUR_LICENSE_KEY"

; Required: The application name that will appear in the New Relic UI
newrelic.appname = "Magento2-Production-Web"

; Optional: Enable/disable specific features
; newrelic.enabled = true
; newrelic.high_security = false
; newrelic.transaction_tracer.enabled = true
; Record a trace for transactions slower than this (default: "apdex_f")
; newrelic.transaction_tracer.threshold = "apdex_f"
; newrelic.error_collector.enabled = true
; newrelic.loglevel = "info"
; newrelic.logfile = "/var/log/newrelic/php_agent.log"

; The agent normally detects Magento 2 automatically.
; Force the framework only if detection fails.
; newrelic.framework = "magento2"

After modifying `newrelic.ini`, restart your PHP-FPM service and web server (e.g., Nginx or Apache) for the changes to take effect.

sudo systemctl restart php8.1-fpm
sudo systemctl restart nginx

Alerting on Key APM Metrics

Within the New Relic UI, navigate to the “Alerts & AI” section. Create NRQL (New Relic Query Language) alerts for critical Magento 2 performance indicators.

Example NRQL for High Error Rate:

SELECT count(*) FROM TransactionError WHERE appName = 'Magento2-Production-Web' SINCE 5 minutes ago

Set a threshold (e.g., > 10 errors in 5 minutes) to trigger an alert.

Example NRQL for Slow Transactions:

SELECT average(duration) FROM Transaction WHERE appName = 'Magento2-Production-Web' SINCE 5 minutes ago

Alert if the average duration exceeds your SLO (e.g., > 2 seconds).

Example NRQL for Redis Latency (if Redis is instrumented):

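SELECT average(newrelic.timeslice.value) FROM Metric WHERE appName = 'Magento2-Production-Web' AND metricTimesliceName = 'Datastore/operation/Redis/get' SINCE 5 minutes ago

Monitor the average time spent on Redis GET operations. High latency here directly impacts page load times. The exact metric name depends on how the agent reports your Redis client, so confirm it in the New Relic metric explorer before building an alert on it.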

DigitalOcean Droplet Resource Monitoring with `node_exporter` and Prometheus

While APM tools focus on application-level performance, it’s crucial to monitor the underlying infrastructure. For DigitalOcean droplets, this means tracking CPU, memory, disk I/O, and network traffic. Prometheus, coupled with `node_exporter`, is a standard for this type of metric collection.

Setting up `node_exporter`

Download the latest `node_exporter` binary for your droplet’s architecture from the official Prometheus releases page. For example, on an Ubuntu droplet:

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

Create a systemd service file to manage `node_exporter`.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Save this as `/etc/systemd/system/node_exporter.service` and then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Verify that `node_exporter` is running and exposing metrics on port 9100:

curl http://localhost:9100/metrics
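
Beyond eyeballing the `curl` output, a script can confirm that the metrics the alert rules depend on are actually exposed. This is a rough Python sketch under stated assumptions: the helper names are hypothetical, the parser only extracts metric names from the text exposition format, and the default list of required metrics is simply the set used by the example alert rules in this article.

```python
import urllib.request


def fetch_metrics(url="http://localhost:9100/metrics", timeout=5):
    """Download the raw text exposition from an exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(text):
    """Collect the metric names present in a text exposition."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample looks like: name{label="x"} 1.0   or   name 1.0
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(text, required=(
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_memory_MemTotal_bytes",
)):
    """Return the required metric names that the exporter did not expose."""
    return sorted(set(required) - metric_names(text))
```

Calling `missing_metrics(fetch_metrics())` on the droplet should return an empty list; anything else points to a disabled collector or an exporter that is not running.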

Configuring Prometheus to Scrape Droplets

In your Prometheus configuration file (e.g., `/etc/prometheus/prometheus.yml`), add a scrape job for your Magento 2 droplets.

scrape_configs:
  - job_name: 'magento_droplets'
    static_configs:
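      # One target per group so each instance label stays unique
      - targets: ['<web-01-ip>:9100']
        labels:
          instance: 'magento-web-01'
      - targets: ['<web-02-ip>:9100']
        labels:
          instance: 'magento-web-02'
      - targets: ['<web-03-ip>:9100']
        labels:
          instance: 'magento-web-03'
      - targets: ['<redis-master-ip>:9100']
        labels:
          instance: 'magento-redis-master-01'
      - targets: ['<redis-replica-ip>:9100']
        labels:
          instance: 'magento-redis-replica-01'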

Reload Prometheus configuration for the changes to take effect.

sudo systemctl reload prometheus

Key Droplet Metrics for Alerting

Define alerting rules in Prometheus based on these metrics, and use Alertmanager to route the resulting alerts:

CPU Usage: High CPU can indicate inefficient code, traffic spikes, or resource contention.
Memory Usage: Running out of memory leads to OOM killer events and application instability.
Disk I/O Wait: High I/O wait times suggest storage bottlenecks, often exacerbated by slow disk performance or excessive database activity.
Network Traffic: Sudden spikes or drops can indicate network issues or unusual traffic patterns.
Filesystem Usage: Ensure logs and temporary directories don’t fill up the disk.

Example alerting rules (in a Prometheus rules file referenced by `rule_files` in `prometheus.yml`):

groups:
- name: magento_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage on {{ $labels.instance }} is above 90% for the last 10 minutes."

  - alert: LowMemoryAvailable
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low memory available on {{ $labels.instance }}"
      description: "Only {{ $value | printf \"%.2f\" }}% of memory is available on {{ $labels.instance }}."

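  # node_disk_io_time_seconds_total measures how long a disk was busy (utilization).
  # DigitalOcean boot disks usually appear as vda; attached volumes as sda, sdb, ...
  - alert: HighDiskUtilization
    expr: avg by (instance) (rate(node_disk_io_time_seconds_total{device=~"(sd|vd)[a-z]+"}[5m])) > 0.8
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High disk utilization on {{ $labels.instance }}"
      description: "Disks on {{ $labels.instance }} have been busy more than 80% of the time for 15 minutes."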
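
The CPU expression above is easier to reason about with the arithmetic spelled out. The Python sketch below reproduces it for a single instance under simplifying assumptions: it takes two readings of each core's idle counter and divides by the window length, whereas Prometheus's `rate()` also handles counter resets and extrapolation. The function name and sample numbers are made up for illustration.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(rate(idle counter)) * 100 for one instance.

    idle_start / idle_end map a core name to its idle counter (in seconds)
    at the beginning and end of the window.
    """
    idle_fractions = [
        (idle_end[core] - idle_start[core]) / window_seconds
        for core in idle_start
    ]
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - avg_idle * 100


# Two cores over a 5-minute (300 s) window: idle for 30 s and 15 s respectively.
usage = cpu_usage_percent(
    {"cpu0": 1000.0, "cpu1": 2000.0},
    {"cpu0": 1030.0, "cpu1": 2015.0},
    300,
)
```

With these numbers the cores were idle 10% and 5% of the time, so usage works out to roughly 92.5%, which would satisfy the `> 90` condition if it held for the full `for: 10m` period.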

Integrating Redis Metrics into Prometheus

To get Redis-specific metrics into Prometheus, we can use the widely adopted community Redis Exporter (`oliver006/redis_exporter`) or wrap `redis-cli` in a custom exporter script. The Redis Exporter is generally preferred for its comprehensive metric set.

Using Redis Exporter

Download and run the Redis Exporter binary. It can be configured to connect to your Redis master, replicas, or Sentinel instances.

# Download and extract (example for Linux AMD64)
wget https://github.com/oliver006/redis_exporter/releases/download/v1.50.0/redis_exporter-v1.50.0.linux-amd64.tar.gz
tar xvfz redis_exporter-v1.50.0.linux-amd64.tar.gz
sudo mv redis_exporter-v1.50.0.linux-amd64/redis_exporter /usr/local/bin/

# Create systemd service
cat <<EOF | sudo tee /etc/systemd/system/redis_exporter.service
[Unit]
Description=Redis Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/redis_exporter --redis.addr=redis://<redis-host>:6379 --redis.password=<redis-password>
# To scrape a Sentinel process instead, point --redis.addr at the Sentinel port:
# ExecStart=/usr/local/bin/redis_exporter --redis.addr=redis://<sentinel-host>:26379

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable redis_exporter
sudo systemctl start redis_exporter

Add a scrape job to your Prometheus configuration:

scrape_configs:
  - job_name: 'redis_cluster'
    static_configs:
      - targets: ['<redis-exporter-host>:9121'] # Default port for redis_exporter
        labels:
          instance: 'magento-redis-master'
  # Add similar jobs for replicas if using separate exporters or a single exporter configured for them

Key Redis Metrics for Alerting

redis_up: Should be 1. If 0, the exporter can’t connect to Redis.
redis_connected_clients: Monitor for excessive client connections.
redis_instantaneous_ops_per_sec: Track command throughput.
redis_memory_used_bytes: Monitor memory consumption against the configured limit (redis_memory_max_bytes).
redis_evicted_keys_total: A rising rate indicates memory pressure. Evictions are expected for a pure cache but mean lost data if the instance also stores sessions.
redis_connected_slave_lag_seconds: Reported by the master for each connected replica; crucial for confirming replicas are in sync. Metric names vary slightly between exporter versions, so check the exporter’s /metrics output.
redis_commands_processed_total: Total commands processed by the server.
redis_keyspace_hits_total and redis_keyspace_misses_total: Calculate hit rate (hits / (hits + misses)). A low hit rate might indicate insufficient memory or inefficient caching.
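
The hit-rate formula is worth computing over a time window rather than from the raw totals, because both counters accumulate from the moment Redis starts and a long uptime hides recent changes. A minimal Python sketch of that calculation follows; the function name and sample counter values are assumptions for illustration.

```python
def hit_rate(hits_start, hits_end, misses_start, misses_end):
    """Cache hit rate over a window, from two readings of each counter.

    Returns None when there were no lookups in the window, since the
    ratio is undefined in that case.
    """
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# 900 hits and 100 misses during the window.
rate = hit_rate(1000, 1900, 100, 200)
```

In PromQL the same idea is usually expressed by applying `rate()` to both counters before dividing.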

Example Alert Rule for Redis Replication Lag:

groups:
- name: redis_alerts
  rules:
  - alert: RedisReplicationLagging
    expr: redis_connected_slave_lag_seconds{job="redis_cluster"} > 60 # Replica lagging by more than 60 seconds
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis replication lag on {{ $labels.instance }}"
      description: "A replica of {{ $labels.instance }} is lagging by {{ $value | printf \"%.2f\" }} seconds."

  - alert: HighRedisMemoryUsage
    # Compare against the configured maxmemory; instances without a limit (0) are skipped
    expr: redis_memory_used_bytes{job="redis_cluster"} / (redis_memory_max_bytes{job="redis_cluster"} > 0) > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High Redis memory usage on {{ $labels.instance }}"
      description: "Redis on {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its maxmemory limit."

Centralized Logging with ELK Stack (Elasticsearch, Logstash, Kibana)

Aggregating logs from all your Magento 2 application servers, Redis nodes, and load balancers into a central location is essential for debugging and auditing. The ELK stack is a robust solution for this.

Log Shipping with Filebeat

Filebeat is a lightweight shipper that forwards log files from your servers to Logstash or Elasticsearch. Install Filebeat on each server (Magento app, Redis, Nginx, etc.).

# Example for Ubuntu/Debian
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.10.4-amd64.deb
sudo dpkg -i filebeat-8.10.4-amd64.deb

Configure Filebeat to collect Magento, Nginx, and Redis logs. Edit `/etc/filebeat/filebeat.yml`:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/*.log
  fields_under_root: true
  fields:
    log_type: nginx

- type: log
  enabled: true
  paths:
    - /var/www/html/magento2/var/log/*.log # Adjust path as needed
  fields_under_root: true
  fields:
    log_type: magento

- type: log
  enabled: true
  paths:
    - /var/log/redis/redis-server.log # Adjust path if using a different log location
  fields_under_root: true
  fields:
    log_type: redis

output.logstash:
  hosts: ["<logstash-host>:5044"] # Or Elasticsearch output directly

# If sending directly to Elasticsearch:
# output.elasticsearch:
#   hosts: ["<elasticsearch-host>:9200"]
#   index: "filebeat-%{[agent.version]}-%{+yyyy.MM.dd}"

Enable and start Filebeat:

sudo systemctl enable filebeat
sudo systemctl start filebeat

Log Processing with Logstash

Logstash will receive logs from Filebeat, parse them, enrich them, and send them to Elasticsearch. Create a Logstash pipeline configuration (e.g., `/etc/logstash/conf.d/02-magento-pipeline.conf`):

input {
  beats {
    port => 5044
  }
}

filter {
  # Filebeat sets fields_under_root: true, so log_type is a top-level field.
  if [log_type] == "nginx" {
    grok {
      # Keep legacy field names (clientip, response, bytes) on Logstash 8
      ecs_compatibility => disabled
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      ecs_compatibility => disabled
      source => "clientip"
    }
    mutate {
      convert => {
        "response" => "integer"
        "bytes" => "integer"
        "response_time" => "float" # only set if your pattern captures it
      }
    }
  }

  if [log_type] == "magento" {
    # Default Monolog line: [timestamp] channel.LEVEL: message context extra
    # Use the JSON filter instead if your logs are structured.
    grok {
      match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{DATA:channel}\.%{LOGLEVEL:log_level}: %{GREEDYDATA:magento_message}" }
    }
    date {
      match => [ "timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss" ]
    }
  }

  if [log_type] == "redis" {
    # Redis line: pid:role dd Mon yyyy HH:mm:ss.SSS <level char> message
    grok {
      match => { "message" => "%{POSINT:process_id}:%{WORD:role} (?<timestamp>%{MONTHDAY} %{MONTH} %{YEAR} %{TIME}) %{NOTSPACE:log_level} %{GREEDYDATA:redis_message}" }
    }
    date {
      match => [ "timestamp", "dd MMM yyyy HH:mm:ss.SSS" ]
    }
  }

  # Common fields for all log types
  mutate {
    rename => { "message" => "original_message" }
  }
}

output {
  elasticsearch {
    hosts => ["<elasticsearch-host>:9200"]
    index => "%{[log_type]}-%{+yyyy.MM.dd}"
  }
}
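
Parsing patterns are easy to get subtly wrong, so it can help to test your assumptions about line shapes against real samples before deploying a pipeline. The Python sketch below does this with plain regular expressions. It assumes Magento writes the default Monolog format (`[timestamp] channel.LEVEL: message`) and that Redis writes its usual `pid:role date level message` lines; the pattern names and the sample lines are illustrative, not taken from a real system.

```python
import re

# Assumed default Monolog layout used by Magento 2 log files.
MAGENTO_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] (?P<channel>[\w-]+)\.(?P<level>[A-Z]+): (?P<message>.*)$"
)

# Assumed Redis server log layout: pid:role, date, one-character level, message.
REDIS_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[A-Z]) "
    r"(?P<timestamp>\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<message>.*)$"
)

PATTERNS = {"magento": MAGENTO_LINE, "redis": REDIS_LINE}


def parse_line(log_type, line):
    """Return the captured fields as a dict, or None if the line does not match."""
    pattern = PATTERNS.get(log_type)
    if pattern is None:
        return None
    match = pattern.match(line)
    return match.groupdict() if match else None
```

Running a few hundred real lines through `parse_line` and counting the `None` results gives a quick estimate of how many events would end up tagged with a parse failure in Logstash.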

Log Analysis with Kibana

Use Kibana to visualize and search your logs. Create index patterns (called data views in Kibana 8) for each log type (e.g., `nginx-*`, `magento-*`, `redis-*`).

Key Kibana Dashboards/Visualizations:

Nginx Access Logs: Visualize traffic volume, top IP addresses, response codes (especially 4xx and 5xx), and slow requests.
Magento Logs: Filter for `ERROR` or `FATAL` log levels to quickly identify application issues. Search for specific exceptions or error messages.
Redis Logs: Monitor for warnings, errors, or specific events like `eviction` or `failover`.
Correlated Views: Create dashboards that combine Nginx, Magento, and Redis logs, allowing you to trace a request from the web server through the application and to the cache layer.

By layering these monitoring strategies, from infrastructure metrics to application traces and centralized logging, you can build a resilient and observable Magento 2 environment on DigitalOcean and keep your e-commerce platform available and fast.