Server Monitoring Best Practices: Keeping Your Magento 2 App and Redis Clusters Alive on Google Cloud
Proactive Redis Cluster Health Checks with `redis-cli` and Custom Scripts
Maintaining the health of your Redis clusters, especially in a distributed Magento 2 setup on Google Cloud, is paramount. Relying solely on basic CPU/memory metrics is insufficient. We need to actively probe Redis for its internal state, replication status, and potential bottlenecks. This involves leveraging `redis-cli` for immediate diagnostics and building custom scripts for automated, proactive checks.
A fundamental check is to ensure all nodes in a Redis cluster are reachable and functioning. We can achieve this by iterating through our known cluster nodes and executing a simple `PING` command. For a more robust check, we’ll also inspect the cluster’s overall state and the status of individual shards and replicas.
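The node-by-node reachability check described above can be sketched as follows. This is a minimal example rather than a definitive implementation: `ping_nodes`, `redis_ping`, `unreachable`, and the node addresses are hypothetical names, and the probe is injectable so the loop can be exercised without a live cluster.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Node = Tuple[str, int]


def redis_ping(host: str, port: int) -> bool:
    """Probe one node with PING via redis-py (imported lazily)."""
    import redis  # requires `pip install redis`

    client = redis.Redis(host=host, port=port, socket_timeout=2)
    return bool(client.ping())


def ping_nodes(nodes: Iterable[Node],
               probe: Callable[[str, int], bool] = redis_ping) -> Dict[Node, bool]:
    """Return a reachability map; any exception from the probe counts as down."""
    results: Dict[Node, bool] = {}
    for host, port in nodes:
        try:
            results[(host, port)] = bool(probe(host, port))
        except Exception:
            results[(host, port)] = False
    return results


def unreachable(results: Dict[Node, bool]) -> List[Node]:
    """List the nodes that failed the probe, sorted for stable output."""
    return sorted(node for node, ok in results.items() if not ok)
```

Treating every probe exception as "down" keeps one bad node from aborting the whole sweep, which matters when the check runs unattended.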
Cluster State Verification
The `CLUSTER INFO` command provides a wealth of information about the Redis cluster’s health. Key metrics include:
- `cluster_state`: Should be `ok`.
- `cluster_slots_assigned` and `cluster_slots_ok`: Should be equal (16384 when the full slot range is covered).
- `cluster_slots_pfail` and `cluster_slots_fail`: Should both be zero.
- `cluster_known_nodes`: Should match the expected number of nodes.
- `cluster_size`: The number of master nodes serving at least one slot.
We can script this check to run periodically. Here’s a Python script that connects to a specified master node and executes `CLUSTER INFO`, then parses the output for critical indicators.
First, ensure you have the `redis-py` library installed: `pip install redis`.
Python Script for Cluster Health
This script connects to a given Redis master and checks its cluster status. It can be extended to iterate over all masters in a larger setup.
import sys

import redis


def check_redis_cluster_health(host='localhost', port=6379, password=None):
    """Return True if CLUSTER INFO on the given node reports a healthy cluster."""
    try:
        # socket_timeout keeps the check from hanging on an unresponsive node
        r = redis.Redis(host=host, port=port, password=password,
                        decode_responses=True, socket_timeout=5)

        # Basic connectivity check
        r.ping()
        print(f"Successfully connected to Redis at {host}:{port}")

        # Cluster info check. The INFO command does not carry these fields,
        # so issue CLUSTER INFO directly.
        cluster_info = r.execute_command('CLUSTER INFO')
        if isinstance(cluster_info, str):
            # Parse manually if the client hands back the raw reply
            cluster_info = dict(
                line.split(':', 1) for line in cluster_info.splitlines() if ':' in line
            )

        state = cluster_info.get('cluster_state')
        if state != 'ok':
            print(f"CRITICAL: Redis cluster state is '{state}' on {host}:{port}")
            return False

        slots_assigned = int(cluster_info.get('cluster_slots_assigned', 0))
        slots_ok = int(cluster_info.get('cluster_slots_ok', 0))
        slots_pfail = int(cluster_info.get('cluster_slots_pfail', 0))
        slots_fail = int(cluster_info.get('cluster_slots_fail', 0))
        known_nodes = int(cluster_info.get('cluster_known_nodes', 0))
        cluster_size = int(cluster_info.get('cluster_size', 0))

        print(f"Cluster State: {state}")
        print(f"Slots Assigned: {slots_assigned}, OK: {slots_ok}, PFAIL: {slots_pfail}, FAIL: {slots_fail}")
        print(f"Known Nodes: {known_nodes}, Cluster Size (Masters): {cluster_size}")

        if slots_pfail > 0 or slots_fail > 0:
            print(f"WARNING: {slots_pfail} slots in PFAIL state, {slots_fail} slots in FAIL state on {host}:{port}")
            # Depending on policy, this might be a warning or critical
            # return False

        if slots_assigned != slots_ok:
            print(f"CRITICAL: Mismatch in assigned vs OK slots ({slots_assigned} vs {slots_ok}) on {host}:{port}")
            return False

        # Further checks could include:
        # - Replication status for each master (e.g., using CLUSTER NODES and checking slaveof/master_id)
        # - Latency checks (e.g., using SLOWLOG GET or measuring command execution time)

        return True

    except redis.exceptions.ConnectionError as e:
        print(f"ERROR: Could not connect to Redis at {host}:{port} - {e}")
        return False
    except redis.exceptions.TimeoutError as e:
        print(f"ERROR: Redis command timed out at {host}:{port} - {e}")
        return False
    except Exception as e:
        print(f"ERROR: An unexpected error occurred for {host}:{port} - {e}")
        return False


if __name__ == "__main__":
    # Example usage: Replace with your cluster's master node details
    # For a multi-master setup, you'd loop through a list of masters.
    redis_host = 'your-redis-master-0.your-redis-cluster.your-gcp-project.redis.googleusercontent.com'  # Example for Memorystore
    redis_port = 6379
    redis_password = None  # Set if your Redis instance requires a password

    healthy = check_redis_cluster_health(redis_host, redis_port, redis_password)
    # Exit non-zero on failure so cron jobs or monitoring wrappers can react
    sys.exit(0 if healthy else 1)
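One of the follow-up checks named in the script's comments, measuring command execution time, might look like the sketch below. `measure_latency_ms` and `summarize` are hypothetical helpers; the command is passed in as a callable (for example a bound `ping` method of a connected client) so the timing logic stays independent of any particular Redis client.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(command: Callable[[], object], samples: int = 20) -> List[float]:
    """Time repeated calls of `command` and return each duration in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        command()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings: List[float]) -> dict:
    """Reduce raw timings to the median and the worst observed round trip."""
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}
```

The median is less sensitive to a single slow round trip than the mean, while the maximum still surfaces occasional stalls worth investigating with `SLOWLOG GET`.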
Replication Status Monitoring
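For Redis Cluster deployments with replicas, healthy replication is crucial for high availability and failover. The `CLUSTER NODES` command lists every node in the cluster along with its role, the master it replicates, and its link state. It is only available in cluster mode; for a Sentinel-managed setup, `INFO replication` is the equivalent starting point.
A Python script can parse the output of `CLUSTER NODES` to verify that each master has at least one replica and that every replica's link is reported as connected. Keep in mind that the link state describes the cluster-bus connection as seen from the queried node, so it does not by itself prove that a replica is fully synchronized.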
import sys

import redis


def check_redis_replication_health(host='localhost', port=6379, password=None):
    """Return True if every master has a replica and all replica links are connected."""
    try:
        r = redis.Redis(host=host, port=port, password=password,
                        decode_responses=True, socket_timeout=5)
        r.ping()  # Ensure connection

        nodes_info = r.execute_command('CLUSTER NODES')

        # Depending on the redis-py version, the reply is either the raw text
        # or a dict already parsed per node address; normalize both shapes.
        if isinstance(nodes_info, dict):
            entries = [
                (info['node_id'], addr, info['flags'], info['master_id'],
                 'connected' if info['connected'] else 'disconnected')
                for addr, info in nodes_info.items()
            ]
        else:
            entries = []
            for line in nodes_info.strip().split('\n'):
                parts = line.split()
                if len(parts) < 8:
                    continue  # Skip blank or malformed lines
                # Fields: id, ip:port, flags, master id, ping-sent,
                # pong-recv, config-epoch, link-state, then slots
                entries.append((parts[0], parts[1], parts[2], parts[3], parts[7]))

        masters = {}
        replicas = {}
        for node_id, ip_port, flags, master_id, link_state in entries:
            node_data = {
                'id': node_id,
                'ip_port': ip_port,
                'flags': flags,
                'master_id': master_id,
                'link_state': link_state
            }
            flag_set = set(flags.split(','))
            if 'master' in flag_set:
                masters[node_id] = node_data
            elif 'slave' in flag_set:
                replicas[node_id] = node_data

        print(f"Found {len(masters)} masters and {len(replicas)} replicas.")

        all_healthy = True

        # Check if each master has at least one replica
        for master_id, master_data in masters.items():
            has_replica = any(
                replica_data['master_id'] == master_id
                for replica_data in replicas.values()
            )
            # More advanced: check replication lag if possible via INFO replication
            if not has_replica:
                print(f"CRITICAL: Master {master_data['ip_port']} (ID: {master_id}) has no replicas.")
                all_healthy = False

        # Check that every replica points at a known master and is connected
        for replica_id, replica_data in replicas.items():
            if replica_data['master_id'] not in masters:
                print(f"CRITICAL: Replica {replica_data['ip_port']} (ID: {replica_id}) is pointing to an unknown master (ID: {replica_data['master_id']}).")
                all_healthy = False
            if replica_data['link_state'] != 'connected':
                print(f"WARNING: Replica {replica_data['ip_port']} (ID: {replica_id}) is not connected (Link State: {replica_data['link_state']}).")
                all_healthy = False

        if all_healthy:
            print("Redis replication status appears healthy.")
            return True
        else:
            print("Redis replication health issues detected.")
            return False

    except redis.exceptions.ConnectionError as e:
        print(f"ERROR: Could not connect to Redis at {host}:{port} - {e}")
        return False
    except Exception as e:
        print(f"ERROR: An unexpected error occurred for {host}:{port} - {e}")
        return False


if __name__ == "__main__":
    # Example usage: Replace with your cluster's master node details
    redis_host = 'your-redis-master-0.your-redis-cluster.your-gcp-project.redis.googleusercontent.com'  # Example for Memorystore
    redis_port = 6379
    redis_password = None  # Set if your Redis instance requires a password

    healthy = check_redis_replication_health(redis_host, redis_port, redis_password)
    sys.exit(0 if healthy else 1)
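The replication-lag check hinted at in the script's comments could be sketched like this. It assumes the input is the dictionary redis-py returns for `INFO replication` on a master, with each replica exposed as a nested dict under `slave0`, `slave1`, and so on; `find_lagging_replicas` and both thresholds are hypothetical and should be tuned to your workload.

```python
from typing import List


def find_lagging_replicas(info: dict, max_lag_seconds: int = 10,
                          max_offset_gap: int = 1_000_000) -> List[str]:
    """Return readable problems for replicas listed in a master's INFO replication."""
    problems = []
    master_offset = int(info.get("master_repl_offset", 0))
    for key, replica in info.items():
        # Only the per-replica entries (slave0, slave1, ...) are nested dicts
        if not key.startswith("slave") or not isinstance(replica, dict):
            continue
        addr = f"{replica.get('ip')}:{replica.get('port')}"
        if replica.get("state") != "online":
            problems.append(f"{addr} state is {replica.get('state')}")
            continue
        if int(replica.get("lag", 0)) > max_lag_seconds:
            problems.append(f"{addr} last acknowledged {replica['lag']}s ago")
        if master_offset - int(replica.get("offset", 0)) > max_offset_gap:
            problems.append(f"{addr} is behind by more than {max_offset_gap} bytes")
    return problems
```

An empty list means no replica exceeded either threshold; anything else can be printed or forwarded to the alerting pipeline.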
Magento 2 Application Monitoring: Beyond Basic Metrics
Magento 2 applications are complex, with many moving parts. Effective monitoring requires looking beyond simple HTTP 200 status codes and CPU utilization. We need to monitor application-specific metrics, error rates, and performance indicators.
Error Tracking and Logging
Centralized logging is non-negotiable. Tools like Cloud Logging (formerly Stackdriver) on Google Cloud are essential. However, simply collecting logs isn’t enough; we need to parse them for specific Magento errors and set up alerts.
Magento 2 logs errors to var/log/system.log and var/log/exception.log. We should configure Cloud Logging agents (like the Ops Agent) to collect these files and then create log-based metrics and alerts within Cloud Monitoring.
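For example, to alert on critical PHP errors, create a log-based metric in Cloud Logging that counts lines containing `PHP Fatal error:` or `PHP Parse error:`, then attach an alert policy to that metric. These strings normally appear in the PHP-FPM or web server error log rather than in Magento's own log files, so make sure the agent collects that log as well.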
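Before creating the log-based metric, it can help to estimate how often the filter would fire by running the same match against a local copy of the log. The sketch below is a rough local equivalent, not the Cloud Logging filter itself; `count_php_errors` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Same substrings the log-based metric would match
PHP_ERROR_PATTERN = re.compile(r"PHP (Fatal|Parse) error:")


def count_php_errors(lines: Iterable[str]) -> Counter:
    """Count log lines per error kind ('Fatal' or 'Parse')."""
    counts: Counter = Counter()
    for line in lines:
        match = PHP_ERROR_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle works because the function only iterates over lines; a very high count suggests alerting on a rate rather than on every occurrence.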
Application Performance Monitoring (APM)
For deep insights into request latency, database query times, and external service calls, an APM solution is invaluable. Google Cloud’s operations suite offers APM capabilities, or you can integrate third-party tools like New Relic, Datadog, or Sentry.
Key metrics to track include:
- Average Request Latency (overall and per endpoint)
- Error Rate (HTTP 5xx, 4xx)
- Database Query Performance (average time, slow queries)
- External Service Call Latency and Error Rates
- Cache Hit/Miss Ratios (for Magento’s internal cache and Redis)
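To make the per-endpoint latency metric concrete, here is a small sketch that groups request timings by endpoint and reports the average and 95th percentile. It assumes request records are available as `(endpoint, latency_ms)` pairs, which is an illustrative format rather than what any particular APM tool exports; `percentile` and `latency_by_endpoint` are hypothetical names.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_by_endpoint(requests: Iterable[Tuple[str, float]]) -> Dict[str, dict]:
    """Group (endpoint, latency_ms) pairs and report average and p95 per endpoint."""
    grouped = defaultdict(list)
    for endpoint, latency_ms in requests:
        grouped[endpoint].append(latency_ms)
    return {
        endpoint: {"avg_ms": sum(v) / len(v), "p95_ms": percentile(v, 95)}
        for endpoint, v in grouped.items()
    }
```

Tracking a high percentile alongside the average matters because a slow checkout for a small share of users can hide behind a healthy-looking mean.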
Custom Application Metrics with Prometheus/OpenMetrics
You can expose custom metrics directly from your Magento application using libraries that adhere to the OpenMetrics standard, which Prometheus scrapes. This allows you to monitor business-specific KPIs or application states not covered by standard APM tools.
A common approach is to use a PHP library like prometheus_client_php. You would instrument your code to increment counters or record gauges for specific events.
Example: Tracking Redis Cache Operations
Let’s say you want to track Redis cache hits and misses directly within your Magento application. You’d modify your cache retrieval logic.
<?php
require 'vendor/autoload.php'; // Assuming you installed prometheus_client_php via Composer

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\InMemory;

// Initialize registry and storage. InMemory does not outlive the current PHP
// process, so use the Redis or APCu adapter for production persistence.
$registry = new CollectorRegistry(new InMemory());

// Define metrics
$cache_hits = $registry->registerCounter('magento', 'cache_hits', 'Number of cache hits', ['type']);
$cache_misses = $registry->registerCounter('magento', 'cache_misses', 'Number of cache misses', ['type']);

// --- Your Magento Cache Retrieval Logic ---
function get_from_redis_cache($key, $cache_type = 'default') {
    // $redis_client is assumed to be a connected Redis client created elsewhere,
    // e.g. $redis_client = new Redis(); $redis_client->connect(...);
    global $redis_client, $cache_hits, $cache_misses;

    $value = $redis_client->get($key);

    if ($value !== false) {
        // Cache hit: label values are passed positionally, in the declared order
        $cache_hits->inc([$cache_type]);
        return $value;
    }

    // Cache miss
    $cache_misses->inc([$cache_type]);
    return null; // Or trigger cache population
}

// --- Endpoint to expose metrics ---
// This would typically be a separate script or a dedicated route in your framework
if (($_SERVER['REQUEST_URI'] ?? '') === '/metrics') {
    header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
    $renderer = new RenderTextFormat();
    echo $renderer->render($registry->getMetricFamilySamples());
    exit;
}

// --- Example Usage within Magento (simplified) ---
// $cached_data = get_from_redis_cache('my_product_data_123', 'page_cache');
// if ($cached_data === null) {
//     // Populate cache...
//     $redis_client->set('my_product_data_123', $new_data, 3600); // Cache for 1 hour
// }

// --- To run this example locally for testing ---
// echo "Simulating cache operations...\n";
// get_from_redis_cache('test_key_1', 'data'); // Miss
// get_from_redis_cache('test_key_1', 'data'); // Hit
// get_from_redis_cache('test_key_2', 'data'); // Miss
// get_from_redis_cache('test_key_2', 'data'); // Hit
// get_from_redis_cache('test_key_2', 'data'); // Hit
// echo "\n--- Metrics ---\n";
// $renderer = new RenderTextFormat();
// echo $renderer->render($registry->getMetricFamilySamples());
?>
You would then configure Prometheus (or Google Cloud’s Managed Service for Prometheus) to scrape the /metrics endpoint of your Magento application. This provides granular visibility into application-level performance.
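Once the endpoint is being scraped, the hit ratio per cache type is usually computed in the monitoring backend, but the arithmetic is easy to check by hand. The sketch below parses a text-format scrape and derives the ratio; it assumes the counters are exposed as `magento_cache_hits` and `magento_cache_misses` with a single `type` label, which depends on how the client library renders names, and `hit_ratio_by_type` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches lines such as: magento_cache_hits{type="page_cache"} 3
SAMPLE_LINE = re.compile(
    r'^(magento_cache_(?:hits|misses))\{type="([^"]*)"\}\s+([0-9.eE+-]+)$'
)


def hit_ratio_by_type(exposition: str) -> Dict[str, float]:
    """Compute hits / (hits + misses) per cache type from scraped text."""
    totals = defaultdict(lambda: {"magento_cache_hits": 0.0, "magento_cache_misses": 0.0})
    for line in exposition.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:
            name, cache_type, value = match.groups()
            totals[cache_type][name] += float(value)
    ratios = {}
    for cache_type, counts in totals.items():
        lookups = counts["magento_cache_hits"] + counts["magento_cache_misses"]
        if lookups:  # Skip types with no lookups to avoid dividing by zero
            ratios[cache_type] = counts["magento_cache_hits"] / lookups
    return ratios
```

Because counters only ever increase, a production dashboard would normally apply the same division to rates over a time window rather than to lifetime totals.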
Google Cloud Infrastructure Monitoring
Leveraging Google Cloud’s native monitoring tools is essential for understanding the health of your underlying infrastructure.
Compute Engine (GCE) and GKE Monitoring
For GCE instances running your Magento app, monitor key metrics:
- `compute.googleapis.com/instance/cpu/utilization`: CPU usage.
- `agent.googleapis.com/memory/percent_used`: Memory usage (requires the Ops Agent or custom metric collection).
- `compute.googleapis.com/instance/network/received_bytes_count` and `sent_bytes_count`: Network traffic.
- `compute.googleapis.com/instance/disk/read_ops_count` and `write_ops_count`: Disk I/O operations.
- `compute.googleapis.com/instance/disk/read_bytes_count` and `write_bytes_count`: Disk throughput.
For Google Kubernetes Engine (GKE), monitor cluster-level and node-level metrics:
- `kubernetes.io/container/cpu/limit_utilization`
- `kubernetes.io/container/memory/used_bytes`
- `kubernetes.io/pod/network/received_bytes_count`
- `kubernetes.io/node/cpu/allocatable_utilization`
- `kubernetes.io/node/memory/allocatable_utilization`
Set up alerting policies in Cloud Monitoring for thresholds on these metrics. For instance, trigger an alert if CPU utilization consistently exceeds 85% for 15 minutes, or if memory usage approaches critical levels.
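The "exceeds 85% for 15 minutes" condition is worth spelling out, because it differs from alerting on a single high sample. The sketch below evaluates it over `(timestamp_seconds, value)` pairs; `sustained_breach` is a hypothetical helper that illustrates the logic and is not how Cloud Monitoring evaluates policies internally.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float = 0.85,
                     duration_seconds: float = 15 * 60) -> bool:
    """True if values stay above `threshold` for at least `duration_seconds`."""
    breach_start = None
    for timestamp, value in sorted(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # First sample of a new breach window
            if timestamp - breach_start >= duration_seconds:
                return True
        else:
            breach_start = None  # Any dip below the threshold resets the window
    return False
```

Resetting the window on every dip is what filters out short spikes, at the cost of reacting later to a genuine sustained problem.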
Cloud SQL / Memorystore Monitoring
For managed database services like Cloud SQL (if used for Magento’s primary DB) or Memorystore for Redis, Cloud Monitoring provides pre-built dashboards and metrics.
Key Memorystore metrics to watch:
- `redis.googleapis.com/clients/connected`: Number of active client connections.
- `redis.googleapis.com/commands/calls`: Number of commands processed.
- `redis.googleapis.com/stats/evicted_keys`: Number of keys evicted due to memory limits.
- `redis.googleapis.com/stats/keyspace_hits` and `keyspace_misses`: Inputs for the cache hit/miss ratio.
- `redis.googleapis.com/stats/memory/usage_ratio`: Memory usage as a fraction of the instance limit.
For Cloud SQL, monitor metrics like CPU utilization, memory utilization, disk I/O, network traffic, active connections, and query latency.
Load Balancer Monitoring
Google Cloud Load Balancers (HTTP(S), TCP/SSL Proxy) are critical for distributing traffic. Monitor their performance:
- `loadbalancing.googleapis.com/https/request_count` (break it down by response code class: 2xx, 3xx, 4xx, 5xx)
- `loadbalancing.googleapis.com/https/backend_latencies`
- `loadbalancing.googleapis.com/https/total_latencies`
- `loadbalancing.googleapis.com/https/backend_request_count`
Pay close attention to the 5xx response code count and backend latencies. Spikes here often indicate issues with your backend Magento application instances or Redis.
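A spike in 5xx responses is easier to reason about as a share of total traffic than as a raw count. The sketch below computes that share and compares it with a baseline window; the dictionary keys (`"2xx"`, `"5xx"`, and so on), the function names, and both thresholds are assumptions chosen for illustration.

```python
from typing import Mapping


def server_error_ratio(counts_by_class: Mapping[str, int]) -> float:
    """Share of requests that ended in a 5xx response."""
    total = sum(counts_by_class.values())
    if total == 0:
        return 0.0
    return counts_by_class.get("5xx", 0) / total


def is_5xx_spike(current: Mapping[str, int], baseline: Mapping[str, int],
                 factor: float = 3.0, floor: float = 0.01) -> bool:
    """Flag a spike when the current ratio is both non-trivial and well above baseline."""
    current_ratio = server_error_ratio(current)
    return current_ratio >= floor and current_ratio > factor * server_error_ratio(baseline)
```

The `floor` keeps low-traffic periods from paging anyone over a handful of errors, while `factor` ties the alert to a change relative to normal behaviour.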
Alerting Strategy and Best Practices
A robust alerting strategy is the culmination of effective monitoring. It should be actionable, minimize noise, and prioritize critical issues.
Define Alerting Tiers
Categorize alerts based on severity:
- Critical: Immediate action required. System outage, major data loss risk, severe performance degradation impacting all users. (e.g., Redis cluster down, Magento 5xx error rate > 10%).
- Warning: Investigate soon. Potential for future issues, minor performance impact, non-critical component failure. (e.g., Redis node in PFAIL state, high but not critical error rate, low disk space).
- Informational: For awareness. Routine events, capacity planning insights. (e.g., High traffic periods, cache warming).
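The tiers above can be encoded so that every check maps to a severity in one place. In this sketch the 10% error-rate threshold comes from the example given for the critical tier; the 2% warning threshold, the 15% free-disk threshold, and the names `Severity` and `classify` are assumptions for illustration.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "informational"


def classify(cluster_state_ok: bool, error_rate_5xx: float,
             pfail_nodes: int, disk_free_ratio: float) -> Severity:
    """Map a snapshot of health signals to the highest applicable tier."""
    if not cluster_state_ok or error_rate_5xx > 0.10:
        return Severity.CRITICAL
    if pfail_nodes > 0 or error_rate_5xx > 0.02 or disk_free_ratio < 0.15:
        return Severity.WARNING
    return Severity.INFO
```

Checking the critical conditions first ensures that a cluster outage is never downgraded just because a warning-level condition also happens to match.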
Actionable Alerting
Each alert should have a clear owner and a documented runbook or escalation procedure. Avoid alerts that simply state a metric is high without context or a clear next step. For example, an alert for “Redis memory usage > 90%” should link to a runbook detailing how to investigate memory leaks, scale Redis, or clear cache.
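One lightweight way to enforce the owner-and-runbook rule is to make it impossible to define an alert without them. The sketch below assumes alert definitions are kept in code or generated from it; `AlertRule`, its fields, and the example URL are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    condition: str
    owner: str
    runbook_url: str

    def __post_init__(self):
        # Reject rules that nobody owns or that give responders no next step
        missing = [field for field in ("owner", "runbook_url")
                   if not getattr(self, field).strip()]
        if missing:
            raise ValueError(f"alert '{self.name}' is missing: {', '.join(missing)}")


# Example definition (owner and URL are placeholders)
redis_memory = AlertRule(
    name="Redis memory usage > 90%",
    condition="memory usage ratio above 0.9",
    owner="platform-oncall",
    runbook_url="https://example.com/runbooks/redis-memory",
)
```

Running this validation in CI catches alerts without an owner before they ever reach an on-call rotation.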
Leverage Cloud Monitoring Notification Channels
Configure Google Cloud Monitoring to send notifications via:
- PagerDuty
- Slack
- Pub/Sub (for custom webhook integrations)
For critical alerts, PagerDuty or similar on-call management tools are essential. For less critical issues, Slack notifications can be sufficient.
Regular Review and Tuning
Monitoring and alerting systems are not “set and forget.” Regularly review alert thresholds, false positives, and the effectiveness of your runbooks. As your Magento application evolves and scales, your monitoring strategy must adapt.