Server Monitoring Best Practices: Keeping Your Magento 2 App and Redis Clusters Alive on Google Cloud

Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts

Maintaining the health of your Redis clusters, especially in a distributed Magento 2 setup on Google Cloud, is paramount. Relying solely on basic CPU/memory metrics is insufficient. We need to actively probe Redis for its internal state, replication status, and potential bottlenecks. This involves leveraging `redis-cli` for immediate diagnostics and building custom scripts for automated, proactive checks.

A fundamental check is to ensure all nodes in a Redis cluster are reachable and functioning. We can achieve this by iterating through our known cluster nodes and executing a simple `PING` command. For a more robust check, we’ll also inspect the cluster’s overall state and the status of individual shards and replicas.
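The reachability check described above can be sketched with nothing but the standard library. This is a minimal sketch, assuming the nodes accept unauthenticated, non-TLS connections and answer Redis's inline `PING` command; the function names and the two-second timeout are illustrative choices, not part of any library.

```python
import socket


def ping_node(host, port, timeout=2.0):
    """Return True if the node answers an inline PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


def find_unreachable(nodes, ping=ping_node):
    """Return the (host, port) pairs that failed the PING check."""
    return [(host, port) for host, port in nodes if not ping(host, port)]
```

Calling `find_unreachable([("10.0.0.11", 6379), ("10.0.0.12", 6379)])` with your own node addresses (the ones shown are placeholders) returns the nodes that did not respond. The `ping` parameter is injectable so the loop can be exercised without a live cluster.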

Cluster State Verification

The `CLUSTER INFO` command provides a wealth of information about the Redis cluster’s health. Key metrics include:

cluster_state: Should be ok.
cluster_slots_assigned, cluster_slots_ok, cluster_slots_pfail, cluster_slots_fail: In a healthy cluster, assigned and ok both equal 16384 (the full hash slot range), and pfail and fail are zero.
cluster_known_nodes: Should match the expected number of nodes.
cluster_size: The number of master nodes serving at least one hash slot.
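For quick diagnostics, the same fields can be checked against the text that `redis-cli cluster info` prints. The following is a rough sketch of that evaluation; the function names and the optional `expected_nodes` parameter are illustrative assumptions.

```python
def parse_cluster_info_text(text):
    """Parse the "key:value" lines printed by `redis-cli cluster info`."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info


def cluster_info_problems(info, expected_nodes=None):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    state = info.get("cluster_state")
    if state != "ok":
        problems.append(f"cluster_state is {state!r}")
    assigned = int(info.get("cluster_slots_assigned", 0))
    ok = int(info.get("cluster_slots_ok", 0))
    if assigned != ok:
        problems.append(f"{assigned} slots assigned but only {ok} ok")
    for key in ("cluster_slots_pfail", "cluster_slots_fail"):
        count = int(info.get(key, 0))
        if count > 0:
            problems.append(f"{key} is {count}")
    known = int(info.get("cluster_known_nodes", 0))
    if expected_nodes is not None and known != expected_nodes:
        problems.append(f"{known} known nodes, expected {expected_nodes}")
    return problems
```

Returning a list of problems rather than a single boolean makes it easier to put every failed indicator into one alert message.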

We can script this check to run periodically. Here’s a Python script that connects to a specified master node and executes `CLUSTER INFO`, then parses the output for critical indicators.

First, ensure you have the `redis-py` library installed: `pip install redis`.

Python Script for Cluster Health

This script connects to a given Redis master and checks its cluster status. It can be extended to iterate over all masters in a larger setup.

import redis
import sys

def check_redis_cluster_health(host='localhost', port=6379, password=None):
    try:
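        # socket_timeout lets a hung node surface as a TimeoutError instead of blocking forever
        r = redis.Redis(host=host, port=port, password=password, decode_responses=True, socket_timeout=5)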
        
        # Basic connectivity check
        r.ping()
        print(f"Successfully connected to Redis at {host}:{port}")

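        # Cluster info check. INFO's "cluster" section only reports cluster_enabled,
        # so query CLUSTER INFO for the state and slot counters.
        cluster_info = r.execute_command('CLUSTER INFO')
        if not isinstance(cluster_info, dict):
            # Fall back to parsing the raw "key:value" text reply
            cluster_info = dict(
                line.split(':', 1) for line in cluster_info.splitlines() if ':' in line
            )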
        
        state = cluster_info.get('cluster_state')
        if state != 'ok':
            print(f"CRITICAL: Redis cluster state is '{state}' on {host}:{port}")
            return False

        slots_assigned = int(cluster_info.get('cluster_slots_assigned', 0))
        slots_ok = int(cluster_info.get('cluster_slots_ok', 0))
        slots_pfail = int(cluster_info.get('cluster_slots_pfail', 0))
        slots_fail = int(cluster_info.get('cluster_slots_fail', 0))
        known_nodes = int(cluster_info.get('cluster_known_nodes', 0))
        cluster_size = int(cluster_info.get('cluster_size', 0))

        print(f"Cluster State: {state}")
        print(f"Slots Assigned: {slots_assigned}, OK: {slots_ok}, PFAIL: {slots_pfail}, FAIL: {slots_fail}")
        print(f"Known Nodes: {known_nodes}, Cluster Size (Masters): {cluster_size}")

        if slots_pfail > 0 or slots_fail > 0:
            print(f"WARNING: {slots_pfail} slots in PFAIL state, {slots_fail} slots in FAIL state on {host}:{port}")
            # Depending on policy, this might be a warning or critical
            # return False 

        if slots_assigned != slots_ok:
            print(f"CRITICAL: Mismatch in assigned vs OK slots ({slots_assigned} vs {slots_ok}) on {host}:{port}")
            return False

        # Further checks could include:
        # - Replication status for each master (e.g., using CLUSTER NODES and checking slaveof/master_id)
        # - Latency checks (e.g., using SLOWLOG GET or measuring command execution time)

        return True

    except redis.exceptions.ConnectionError as e:
        print(f"ERROR: Could not connect to Redis at {host}:{port} - {e}")
        return False
    except redis.exceptions.TimeoutError as e:
        print(f"ERROR: Redis command timed out at {host}:{port} - {e}")
        return False
    except Exception as e:
        print(f"ERROR: An unexpected error occurred for {host}:{port} - {e}")
        return False

if __name__ == "__main__":
    # Example usage: Replace with your cluster's master node details
    # For a multi-master setup, you'd loop through a list of masters.
    redis_host = 'your-redis-master-0.your-redis-cluster.your-gcp-project.redis.googleusercontent.com' # Placeholder; Memorystore is typically reached via a private IP or discovery endpoint
    redis_port = 6379
    redis_password = None # Set if your Redis instance requires a password

    if not check_redis_cluster_health(redis_host, redis_port, redis_password):
        sys.exit(1) # Exit with a non-zero status code to indicate failure
    else:
        sys.exit(0) # Exit with zero status code for success

Replication Status Monitoring

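In Redis Cluster deployments with replicas, healthy replication is what makes failover possible. The `CLUSTER NODES` command lists every node in the cluster with its role, the master it replicates from, and the state of its cluster bus link. (Sentinel-managed deployments do not support `CLUSTER NODES`; query `INFO replication` on each node instead.)

A Python script can parse the output of `CLUSTER NODES` to verify that each master has at least one replica and that each replica's link is connected. Note that link state describes the cluster bus connection as seen from the queried node; confirming that a replica is actually synchronized requires `INFO replication`.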

import redis
import sys

def check_redis_replication_health(host='localhost', port=6379, password=None):
    try:
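        r = redis.Redis(host=host, port=port, password=password, decode_responses=True)
        r.ping() # Ensure connection

        # redis-py parses the CLUSTER NODES reply into a dict by default; install a
        # pass-through callback so the raw text can be split line by line below.
        r.set_response_callback('CLUSTER NODES', lambda resp, **kwargs: resp)
        nodes_info = r.execute_command('CLUSTER NODES')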
        
        masters = {}
        replicas = {}
        
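        for line in nodes_info.strip().split('\n'):
            parts = line.split()
            if len(parts) < 8:
                continue  # Skip blank or malformed lines
            node_id = parts[0]
            ip_port = parts[1]      # ip:port@cport (may also carry a hostname)
            flags = parts[2]        # Comma-separated, e.g. "myself,master"
            master_id = parts[3]    # "-" for masters
            link_state = parts[7]   # "connected" or "disconnected"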
            
            node_data = {
                'id': node_id,
                'ip_port': ip_port,
                'flags': flags,
                'master_id': master_id,
                'link_state': link_state
            }

            if 'master' in flags:
                masters[node_id] = node_data
            elif 'slave' in flags:
                replicas[node_id] = node_data

        print(f"Found {len(masters)} masters and {len(replicas)} replicas.")

        all_healthy = True

        # Check if each master has at least one replica
        for master_id, master_data in masters.items():
            has_replica = False
            for replica_id, replica_data in replicas.items():
                if replica_data['master_id'] == master_id:
                    has_replica = True
                    # Link state is verified for every replica in the loop below.
                    # More advanced: check replication lag if possible via INFO replication
                    break
            if not has_replica:
                print(f"CRITICAL: Master {master_data['ip_port']} (ID: {master_id}) has no replicas.")
                all_healthy = False
        
        # Check if all replicas are connected to a master
        for replica_id, replica_data in replicas.items():
            if replica_data['master_id'] not in masters:
                print(f"CRITICAL: Replica {replica_data['ip_port']} (ID: {replica_id}) is pointing to an unknown master (ID: {replica_data['master_id']}).")
                all_healthy = False
            if replica_data['link_state'] != 'connected':
                print(f"WARNING: Replica {replica_data['ip_port']} (ID: {replica_id}) is not connected (Link State: {replica_data['link_state']}).")
                all_healthy = False

        if all_healthy:
            print("Redis replication status appears healthy.")
            return True
        else:
            print("Redis replication health issues detected.")
            return False

    except redis.exceptions.ConnectionError as e:
        print(f"ERROR: Could not connect to Redis at {host}:{port} - {e}")
        return False
    except Exception as e:
        print(f"ERROR: An unexpected error occurred for {host}:{port} - {e}")
        return False

if __name__ == "__main__":
    # Example usage: Replace with your cluster's master node details
    redis_host = 'your-redis-master-0.your-redis-cluster.your-gcp-project.redis.googleusercontent.com' # Placeholder; Memorystore is typically reached via a private IP or discovery endpoint
    redis_port = 6379
    redis_password = None # Set if your Redis instance requires a password

    if not check_redis_replication_health(redis_host, redis_port, redis_password):
        sys.exit(1)
    else:
        sys.exit(0)
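
The replication-lag check mentioned in the script's comments could look roughly like the sketch below. It assumes a dict shaped like the one redis-py returns from `r.info('replication')` on a master, where each `slaveN` entry is itself a dict with `ip`, `port`, `state`, and `offset` keys. The function name and the `max_lag_bytes` threshold are hypothetical and should be tuned to your write volume.

```python
def replication_lag_report(info, max_lag_bytes=1_000_000):
    """Summarize how far each replica trails the master's replication offset.

    `info` is assumed to look like redis-py's r.info('replication') on a master.
    """
    master_offset = int(info.get("master_repl_offset", 0))
    report = []
    for key, replica in info.items():
        # Replica entries are keyed slave0, slave1, ... and hold nested dicts
        if not (key.startswith("slave") and isinstance(replica, dict)):
            continue
        behind = master_offset - int(replica.get("offset", 0))
        healthy = replica.get("state") == "online" and behind <= max_lag_bytes
        report.append({
            "replica": f"{replica.get('ip')}:{replica.get('port')}",
            "bytes_behind": behind,
            "healthy": healthy,
        })
    return report
```

Any entry with `healthy` set to False is a candidate for a warning-level alert.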

Magento 2 Application Monitoring: Beyond Basic Metrics

Magento 2 applications are complex, with many moving parts. Effective monitoring requires looking beyond simple HTTP 200 status codes and CPU utilization. We need to monitor application-specific metrics, error rates, and performance indicators.

Error Tracking and Logging

Centralized logging is non-negotiable. Tools like Cloud Logging (formerly Stackdriver) on Google Cloud are essential. However, simply collecting logs isn’t enough; we need to parse them for specific Magento errors and set up alerts.

Magento 2 logs errors to var/log/system.log and var/log/exception.log. We should configure Cloud Logging agents (like the Ops Agent) to collect these files and then create log-based metrics and alerts within Cloud Monitoring.

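For example, to alert on critical failures, create a log-based metric in Cloud Logging that counts lines containing `main.CRITICAL` in exception.log, or `PHP Fatal error:` and `PHP Parse error:` in the PHP error log (PHP usually writes fatal errors to its own error log, so collect that file as well). Then attach an alert policy to the metric.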
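
Before creating a log-based metric, it can help to test the match patterns locally against a sample of log lines. The sketch below counts the PHP error markers named above; the `main.CRITICAL` pattern is an additional assumption based on the Monolog format Magento 2 uses by default, and the function and dictionary names are illustrative.

```python
import re

# Patterns to count; adjust to the strings your own logs actually contain
PATTERNS = {
    "php_fatal": re.compile(r"PHP (Fatal|Parse) error:"),
    "magento_critical": re.compile(r"\bmain\.CRITICAL:"),
}


def count_critical_lines(lines):
    """Count how many lines match each pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running this over a recent copy of your logs shows whether a pattern is too broad or too narrow before it starts driving alerts.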

Application Performance Monitoring (APM)

For deep insights into request latency, database query times, and external service calls, an APM solution is invaluable. Google Cloud’s operations suite offers APM capabilities, or you can integrate third-party tools like New Relic, Datadog, or Sentry.

Key metrics to track include:

Average Request Latency (overall and per endpoint)
Error Rate (HTTP 5xx, 4xx)
Database Query Performance (average time, slow queries)
External Service Call Latency and Error Rates
Cache Hit/Miss Ratios (for Magento’s internal cache and Redis)

Custom Application Metrics with Prometheus/OpenMetrics

You can expose custom metrics directly from your Magento application using libraries that adhere to the OpenMetrics standard, which Prometheus scrapes. This allows you to monitor business-specific KPIs or application states not covered by standard APM tools.

A common approach is to use a PHP library like `prometheus_client_php` (available through Composer as promphp/prometheus_client_php). You would instrument your code to increment counters or record gauges for specific events.

Example: Tracking Redis Cache Operations

Let’s say you want to track Redis cache hits and misses directly within your Magento application. You’d modify your cache retrieval logic.

<?php
require 'vendor/autoload.php'; // Assuming you installed prometheus_client_php via Composer

use Prometheus\CollectorRegistry;
use Prometheus\Render\RenderText;
use Prometheus\Storage\InMemory;

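// Initialize registry and storage. InMemory storage lives only for the current
// PHP process, so counters reset on every request; use the library's Redis or
// APCu storage adapters in production.
$registry = new CollectorRegistry(new InMemory());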

// Define metrics
$cache_hits = $registry->registerCounter('magento', 'cache_hits', 'Number of cache hits', ['type']);
$cache_misses = $registry->registerCounter('magento', 'cache_misses', 'Number of cache misses', ['type']);

// --- Your Magento Cache Retrieval Logic ---
function get_from_redis_cache($key, $cache_type = 'default') {
    global $redis_client, $cache_hits, $cache_misses;

    // $redis_client is assumed to be a connected phpredis client created in global scope,
    // e.g. $redis_client = new Redis(); $redis_client->connect(...);

    $value = $redis_client->get($key);

    if ($value !== false) {
        // Cache Hit (label values are passed positionally, matching ['type'])
        $cache_hits->inc([$cache_type]);
        return $value;
    } else {
        // Cache Miss
        $cache_misses->inc([$cache_type]);
        return null; // Or trigger cache population
    }
}

// --- Endpoint to expose metrics ---
// This would typically be a separate script or a dedicated route in your framework
if (($_SERVER['REQUEST_URI'] ?? '') === '/metrics') {
    header('Content-Type: ' . RenderText::MIME_TYPE);
    $renderer = new RenderText();
    echo $renderer->render($registry->getMetricFamilySamples());
    exit;
}

// --- Example Usage within Magento (simplified) ---
// $cached_data = get_from_redis_cache('my_product_data_123', 'page_cache');
// if ($cached_data === null) {
//     // Populate cache...
//     $redis_client->set('my_product_data_123', $new_data, 3600); // Cache for 1 hour
// }

// --- To run this example locally for testing ---
// echo "Simulating cache operations...\n";
// get_from_redis_cache('test_key_1', 'data'); // Miss
// get_from_redis_cache('test_key_1', 'data'); // Hit
// get_from_redis_cache('test_key_2', 'data'); // Miss
// get_from_redis_cache('test_key_2', 'data'); // Hit
// get_from_redis_cache('test_key_2', 'data'); // Hit

// echo "\n--- Metrics ---\n";
// $renderer = new RenderText();
// echo $renderer->render($registry->getMetricFamilySamples());
?>

You would then configure Prometheus (or Google Cloud’s Managed Service for Prometheus) to scrape the /metrics endpoint of your Magento application. This provides granular visibility into application-level performance.
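
To sanity-check the endpoint before wiring up a scraper, the exposition text can be parsed directly. This sketch assumes the counters are rendered as `magento_cache_hits{type="..."}` and `magento_cache_misses{type="..."}`, which is how the namespace and names registered above would normally appear. The function name is hypothetical.

```python
import re

# Matches sample lines such as: magento_cache_hits{type="data"} 3
SAMPLE_RE = re.compile(
    r'^magento_cache_(hits|misses)\{type="([^"]*)"\}\s+([0-9.eE+-]+)$'
)


def cache_hit_ratios(exposition_text):
    """Return {cache_type: hits / (hits + misses)} from /metrics output."""
    totals = {}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # Skips # HELP / # TYPE comments and unrelated metrics
        kind, cache_type, value = match.groups()
        bucket = totals.setdefault(cache_type, {"hits": 0.0, "misses": 0.0})
        bucket[kind] += float(value)
    return {
        cache_type: c["hits"] / (c["hits"] + c["misses"])
        for cache_type, c in totals.items()
        if c["hits"] + c["misses"] > 0
    }
```

In a Prometheus setup the same ratio would usually be computed with a query over the scraped counters; the parser here is only a local check.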

Google Cloud Infrastructure Monitoring

Leveraging Google Cloud’s native monitoring tools is essential for understanding the health of your underlying infrastructure.

Compute Engine (GCE) and GKE Monitoring

For GCE instances running your Magento app, monitor key metrics:

compute.googleapis.com/instance/cpu/utilization: CPU usage.
agent.googleapis.com/memory/percent_used: Memory usage (reported by the Ops Agent; Compute Engine does not expose guest memory usage without an agent).
compute.googleapis.com/instance/network/received_bytes_count and sent_bytes_count: Network traffic.
compute.googleapis.com/instance/disk/read_ops_count and write_ops_count: Disk I/O operations.
compute.googleapis.com/instance/disk/read_bytes_count and write_bytes_count: Disk throughput.

For Google Kubernetes Engine (GKE), monitor container-, pod-, and node-level metrics:

kubernetes.io/container/cpu/limit_utilization
kubernetes.io/container/memory/limit_utilization
kubernetes.io/pod/network/received_bytes_count
kubernetes.io/node/cpu/allocatable_utilization
kubernetes.io/node/memory/allocatable_utilization

Set up alerting policies in Cloud Monitoring for thresholds on these metrics. For instance, trigger an alert if CPU utilization consistently exceeds 85% for 15 minutes, or if memory usage approaches critical levels.
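
The "exceeds a threshold for a sustained period" condition can be illustrated with a small evaluator. This is a sketch of the idea only, not how Cloud Monitoring implements alert policies; the function name and the sample format (a time-sorted list of `(timestamp_seconds, value)` pairs) are assumptions.

```python
def sustained_breach(samples, threshold=0.85, duration_s=900):
    """Return True if the most recent run of samples above `threshold`
    has lasted at least `duration_s` seconds.

    `samples` is a list of (timestamp_seconds, value) sorted by time.
    """
    if not samples:
        return False
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # A new run of breaching samples begins
        else:
            breach_start = None  # Any dip below the threshold resets the run
    latest = samples[-1][0]
    return breach_start is not None and latest - breach_start >= duration_s
```

Requiring the whole window to breach, rather than a single sample, is what keeps short CPU spikes from paging anyone.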

Cloud SQL / Memorystore Monitoring

For managed database services like Cloud SQL (if used for Magento’s primary DB) or Memorystore for Redis, Cloud Monitoring provides pre-built dashboards and metrics.

Key Memorystore for Redis metrics to watch (Memorystore for Redis Cluster publishes its equivalents under the redis.googleapis.com/cluster/ prefix):

redis.googleapis.com/clients/connected: Number of active client connections.
redis.googleapis.com/commands/calls: Number of calls per command.
redis.googleapis.com/stats/evicted_keys: Number of keys evicted due to memory limits.
redis.googleapis.com/stats/keyspace_hits and keyspace_misses: Cache hit/miss ratio.
redis.googleapis.com/stats/memory/usage_ratio: Memory usage as a fraction of the instance's maximum memory.

For Cloud SQL, monitor metrics like CPU utilization, memory utilization, disk I/O, network traffic, active connections, and query latency.

Load Balancer Monitoring

Google Cloud Load Balancers (HTTP(S), TCP/SSL Proxy) are critical for distributing traffic. Monitor their performance (the metrics below are for external HTTP(S) load balancers):

loadbalancing.googleapis.com/https/request_count (group by the response_code_class label to break counts down into 2xx, 3xx, 4xx, 5xx)
loadbalancing.googleapis.com/https/backend_request_count
loadbalancing.googleapis.com/https/backend_latencies
loadbalancing.googleapis.com/https/total_latencies

Pay close attention to the 5xx response code count and backend latencies. Spikes here often indicate issues with your backend Magento application instances or Redis.
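
As a simple illustration, the 5xx share of traffic can be derived from request counts grouped by response class. The dictionary keys and function name below are illustrative assumptions; in practice the counts would come from your monitoring backend.

```python
def error_ratio(counts_by_class, error_class="5xx"):
    """Return the fraction of requests in `error_class`.

    `counts_by_class` maps a response class to a request count,
    e.g. {"2xx": 9500, "3xx": 100, "4xx": 300, "5xx": 100}.
    """
    total = sum(counts_by_class.values())
    if total == 0:
        return 0.0  # No traffic in the window, so nothing to report
    return counts_by_class.get(error_class, 0) / total
```

Alerting on the ratio rather than the raw 5xx count keeps the threshold meaningful across both quiet and busy periods.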

Alerting Strategy and Best Practices

A robust alerting strategy is the culmination of effective monitoring. It should be actionable, minimize noise, and prioritize critical issues.

Define Alerting Tiers

Categorize alerts based on severity:

Critical: Immediate action required. System outage, major data loss risk, severe performance degradation impacting all users. (e.g., Redis cluster down, Magento 5xx error rate > 10%).
Warning: Investigate soon. Potential for future issues, minor performance impact, non-critical component failure. (e.g., Redis node in PFAIL state, high but not critical error rate, low disk space).
Informational: For awareness. Routine events, capacity planning insights. (e.g., High traffic periods, cache warming).
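
One way to make the tiers concrete is to encode them as a small classification function. This is a sketch under stated assumptions: the 10% critical threshold comes from the example above, while the 2% warning threshold, the function name, and its inputs are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = 1
    WARNING = 2
    CRITICAL = 3


def classify(cluster_state, pfail_nodes, error_rate_5xx):
    """Map a few health signals onto an alert tier."""
    # Critical: cluster down or a severe 5xx error rate
    if cluster_state != "ok" or error_rate_5xx > 0.10:
        return Severity.CRITICAL
    # Warning: a node suspected of failing, or an elevated error rate
    # (the 2% threshold is an assumed value, not taken from the text)
    if pfail_nodes > 0 or error_rate_5xx > 0.02:
        return Severity.WARNING
    return Severity.INFORMATIONAL
```

Keeping the thresholds in one function makes them easy to review when alerts are tuned.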

Actionable Alerting

Each alert should have a clear owner and a documented runbook or escalation procedure. Avoid alerts that simply state a metric is high without context or a clear next step. For example, an alert for “Redis memory usage > 90%” should link to a runbook detailing how to investigate memory leaks, scale Redis, or clear cache.

Leverage Cloud Monitoring Notification Channels

Configure Google Cloud Monitoring to send notifications via:

Email
PagerDuty
Slack
Webhooks and Pub/Sub (for custom integrations)

For critical alerts, PagerDuty or similar on-call management tools are essential. For less critical issues, Slack notifications can be sufficient.
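
The routing rule above, combined with the requirement that every alert has an owner and a runbook, can be sketched as follows. The channel names are plain labels rather than Cloud Monitoring API identifiers, and the `Alert` fields and `ROUTES` mapping are hypothetical.

```python
from dataclasses import dataclass

# Which channels each severity tier notifies (illustrative mapping)
ROUTES = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack"],
    "informational": ["email"],
}


@dataclass
class Alert:
    name: str
    severity: str
    owner: str
    runbook_url: str


def channels_for(alert):
    """Return the channels an alert should notify, rejecting unactionable alerts."""
    if not alert.owner or not alert.runbook_url:
        raise ValueError(f"alert {alert.name!r} needs an owner and a runbook")
    return ROUTES[alert.severity]
```

Rejecting alerts that lack an owner or runbook at definition time is one way to keep unactionable alerts out of the on-call rotation.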

Regular Review and Tuning

Monitoring and alerting systems are not “set and forget.” Regularly review alert thresholds, false positives, and the effectiveness of your runbooks. As your Magento application evolves and scales, your monitoring strategy must adapt.