Server Monitoring Best Practices: Keeping Your Shopify App and MySQL Clusters Alive on Google Cloud
Proactive MySQL Cluster Health Checks with `pt-heartbeat`
Keeping MySQL clusters healthy, especially those behind critical Shopify applications, takes more than basic CPU and memory monitoring. In high-availability setups that rely on replication, replication lag is the key signal. Percona Toolkit’s `pt-heartbeat` measures it directly: it writes a timestamp to a dedicated table on the primary at a fixed interval, and on each replica it compares the replicated timestamp with the current time to show how far the replica is behind.
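The principle behind a heartbeat-based lag measurement can be sketched in a few lines of Python. This is only an illustration of the idea, not how `pt-heartbeat` is implemented; the function name and the sample timestamps are made up for the example.

```python
from datetime import datetime, timezone


def heartbeat_lag_seconds(heartbeat_ts: datetime, now: datetime) -> float:
    """Lag is the age of the most recently replicated heartbeat."""
    lag = (now - heartbeat_ts).total_seconds()
    # A small negative value means the clocks disagree; clamp it to zero.
    return max(lag, 0.0)


# Hypothetical values: the primary wrote the heartbeat at 12:00:00,
# and the replica reads its copy of the row at 12:00:02.5.
written_on_primary = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
read_on_replica = datetime(2024, 1, 1, 12, 0, 2, 500000, tzinfo=timezone.utc)

print(heartbeat_lag_seconds(written_on_primary, read_on_replica))  # 2.5
```

Because the result depends on two wall clocks, a measurement like this is only as good as the time synchronization between the hosts.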
First, ensure you have Percona Toolkit installed on your MySQL primary and replica nodes. On Debian/Ubuntu systems, this is typically:
sudo apt-get update
sudo apt-get install percona-toolkit
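On your primary MySQL server, create a dedicated table to store the heartbeat timestamp. The database must be included in replication, and the account that `pt-heartbeat` connects with needs write access to the table on the primary and read access on the replicas.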
CREATE DATABASE IF NOT EXISTS monitoring;
USE monitoring;
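-- Column layout follows the heartbeat table described in the pt-heartbeat documentation
CREATE TABLE IF NOT EXISTS heartbeat (
    ts                    VARCHAR(26) NOT NULL,
    server_id             INT UNSIGNED NOT NULL PRIMARY KEY,
    file                  VARCHAR(255) DEFAULT NULL,
    position              BIGINT UNSIGNED DEFAULT NULL,
    relay_master_log_file VARCHAR(255) DEFAULT NULL,
    exec_master_log_pos   BIGINT UNSIGNED DEFAULT NULL
) ENGINE=InnoDB;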
Now run `pt-heartbeat` in update mode on the primary. With `--update`, it refreshes the `ts` column of the row for the primary’s `server_id` once per `--interval`. The examples here assume the primary’s `server_id` is `1`.
pt-heartbeat --host=YOUR_PRIMARY_HOST --user=REPLICATION_USER --password=REPLICATION_PASSWORD --database=monitoring --table=heartbeat --update --interval=1
On each replica, run `pt-heartbeat` in monitor mode. The heartbeat row reaches the replica through replication, so the tool reads the replica’s own copy of the table and compares the replicated timestamp with the current time; the difference is the replication lag. Because the comparison uses wall-clock time, keep the clocks on all nodes synchronized (for example with NTP).
pt-heartbeat --host=YOUR_REPLICA_HOST --user=REPLICATION_USER --password=REPLICATION_PASSWORD --database=monitoring --table=heartbeat --monitor --master-server-id=1 --interval=1
The `--monitor` flag is key here: `pt-heartbeat` prints the current replication lag in seconds, followed by moving averages, once per interval. To get this into a monitoring system such as Prometheus or Datadog, convert the output into a metric, for example with a small custom exporter or the node_exporter textfile collector.
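As a rough sketch of what such a custom exporter has to do, the Python below parses one line of monitor output and renders it in the Prometheus text format. It assumes each line looks like `0.25s [  0.10s,  0.05s,  0.02s ]` (current lag first, then moving averages); verify this against the output of your own `pt-heartbeat` version. The metric name `mysql_heartbeat_lag_seconds` and the `replica` label are illustrative choices, not names defined by any tool.

```python
import re
from typing import Optional

# Assumed shape of one `pt-heartbeat --monitor` line:
#   "0.25s [  0.10s,  0.05s,  0.02s ]"
MONITOR_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)s\s*\[")


def parse_lag(line: str) -> Optional[float]:
    """Return the current lag in seconds, or None if the line does not match."""
    match = MONITOR_LINE.match(line)
    return float(match.group(1)) if match else None


def to_prometheus(lag: float, replica: str) -> str:
    """Render the lag as a gauge in the Prometheus text exposition format."""
    name = "mysql_heartbeat_lag_seconds"  # hypothetical metric name
    return (
        f"# HELP {name} Replication lag reported by pt-heartbeat.\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{replica="{replica}"}} {lag}\n'
    )


sample = "0.25s [  0.10s,  0.05s,  0.02s ]"
print(to_prometheus(parse_lag(sample), "replica-1"), end="")
```

The rendered text could be written to a file picked up by a textfile collector or served from a small HTTP endpoint, depending on how your metrics are scraped.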
Application-Level Health Checks for Shopify Apps
Beyond database health, your Shopify application’s responsiveness and internal state are critical. For a PHP-based Shopify app, this often involves checking external API dependencies, cache health, and internal service availability.
Implement a dedicated health check endpoint in your application. This endpoint should perform a series of checks and return a standardized response, typically JSON, indicating the overall health status and details of any failing components.
<?php
// healthcheck.php
header('Content-Type: application/json');
$response = [
    'status' => 'ok',
    'checks' => [],
];
// 1. Check database connection
try {
    // Connection settings come from the environment; the variable names are examples
    $pdo = new PDO(getenv('DB_DSN'), getenv('DB_USER'), getenv('DB_PASSWORD'), [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_TIMEOUT => 5,
    ]);
    $pdo->query('SELECT 1');
    $response['checks']['database'] = ['status' => 'ok'];
} catch (PDOException $e) {
    // Log details server-side instead of exposing driver messages to callers
    error_log('Health check database failure: ' . $e->getMessage());
    $response['status'] = 'error';
    $response['checks']['database'] = ['status' => 'error', 'message' => 'Database check failed.'];
}
// 2. Check external API (e.g., Shopify API)
// Example endpoint; use an API version that Shopify currently supports
$shopifyApiUrl = 'https://your-shop-domain.myshopify.com/admin/api/2023-10/products.json?limit=1';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $shopifyApiUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 5); // 5-second timeout
// Add necessary authentication headers here
// curl_setopt($ch, CURLOPT_HTTPHEADER, ['X-Shopify-Access-Token: YOUR_ACCESS_TOKEN']);
$output = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if the request never completed
curl_close($ch);
if ($output !== false && $httpCode >= 200 && $httpCode < 400) {
    $response['checks']['shopify_api'] = ['status' => 'ok', 'http_code' => $httpCode];
} else {
    $response['status'] = 'error';
    $response['checks']['shopify_api'] = ['status' => 'error', 'http_code' => $httpCode, 'message' => 'Failed to connect or received error from Shopify API.'];
}
// 3. Check cache (e.g., Redis)
// Assuming you have a Redis client object $redis
// if ($redis->ping()) {
//     $response['checks']['redis'] = ['status' => 'ok'];
// } else {
//     $response['status'] = 'error';
//     $response['checks']['redis'] = ['status' => 'error', 'message' => 'Redis connection failed.'];
// }
// Set HTTP status code based on overall health
http_response_code($response['status'] === 'ok' ? 200 : 503);
echo json_encode($response);
exit;
?>
This endpoint should be accessible by your load balancer or monitoring probes. For a Google Cloud environment, you can leverage Cloud Monitoring’s uptime checks or configure your load balancer health checks to point to this endpoint.
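To illustrate how a probe might interpret the endpoint’s response, here is a minimal Python sketch. It assumes the JSON shape produced by the health check above (a top-level `status` plus a `checks` map); the function name `evaluate_health` is hypothetical, and a real load balancer would normally look only at the HTTP status code.

```python
import json
from typing import List, Tuple


def evaluate_health(status_code: int, body: str) -> Tuple[bool, List[str]]:
    """Return (healthy, failing_components) for one probe response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, ["unparseable_response"]
    if not isinstance(payload, dict):
        return False, ["unparseable_response"]
    checks = payload.get("checks", {})
    # Any component whose status is not "ok" counts as failing.
    failing = sorted(name for name, c in checks.items() if c.get("status") != "ok")
    healthy = status_code == 200 and payload.get("status") == "ok" and not failing
    return healthy, failing


body = json.dumps({
    "status": "error",
    "checks": {
        "database": {"status": "ok"},
        "shopify_api": {"status": "error", "http_code": 0},
    },
})
print(evaluate_health(503, body))  # (False, ['shopify_api'])
```

Treating an unparseable body as unhealthy is a deliberate choice here: an error page served with status 200 should not pass for a healthy application.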
Nginx Configuration for Health Check Endpoints
To ensure your health check endpoint is correctly routed and accessible, configure your Nginx web server. This involves creating a specific `location` block that bypasses any application logic and directly serves the health check script.
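server {
    listen 80;
    server_name your-app.com;
    root /var/www/your-app/public;
    index index.php index.html index.htm;
    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }
    # Health check endpoint
    location = /healthcheck.php {
        # Restrict access to known probe sources; adjust to your environment
        allow 127.0.0.1;
        allow 35.191.0.0/16;   # Google Cloud load balancer health checks
        allow 130.211.0.0/22;  # Google Cloud load balancer health checks
        deny all;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust to your PHP-FPM socket
    }
    # ... other PHP configurations ...
}
The `allow` and `deny` directives limit who can reach the health check script. Avoid the `internal` directive for this purpose: it makes Nginx return 404 for every request that arrives from outside, including load balancer and uptime check probes, so the endpoint would never report healthy. Treat the ranges above as a starting point. They are the documented source ranges for Google Cloud load balancer health checks; Cloud Monitoring uptime checks originate from a separate, published set of IP addresses that you would need to allow as well if you use them.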
Google Cloud Monitoring Integration
Google Cloud Monitoring (formerly Stackdriver) provides robust tools for collecting, visualizing, and alerting on metrics. For your MySQL clusters and Shopify app, you’ll want to integrate the metrics gathered from `pt-heartbeat` and your application’s health checks.
Ingesting `pt-heartbeat` Metrics:
- Ops Agent: Install the Ops Agent (the successor to the legacy Cloud Monitoring agent) on your GCE instances. `pt-heartbeat --monitor` does not expose metrics itself, so you need a small custom exporter that reads its output and publishes it in a Prometheus-compatible format, which the agent’s Prometheus receiver can then scrape.
- Prometheus Integration: If you already run Prometheus, have it scrape that exporter, then use Google Cloud Managed Service for Prometheus to bring the metrics into Cloud Monitoring.
Ingesting Application Health Checks:
- Uptime Checks: Configure Google Cloud Uptime Checks to periodically hit your application’s `/healthcheck.php` endpoint. These checks can verify both the availability (HTTP 200 status) and the response content. Alerts can be configured directly within Cloud Monitoring based on uptime check failures.
- Custom Metrics: If your health check endpoint returns detailed JSON, you can write a small script (e.g., Python) that runs periodically, calls the health check endpoint, parses the JSON, and pushes custom metrics (e.g., database status, API status) to Cloud Monitoring using the Cloud Monitoring API client libraries.
Example Python script to push custom metrics:
from google.cloud import monitoring_v3
import requests
import time
# --- Configuration ---
PROJECT_ID = "your-gcp-project-id"
HEALTH_CHECK_URL = "http://your-app.com/healthcheck.php"  # Or internal IP if probing from GCE
METRIC_PREFIX = "custom.googleapis.com"  # Prefix for user-defined metric types
# Monitored resource the metrics are attached to ('gce_instance', 'global', etc.)
RESOURCE = {
    "type": "gce_instance",
    "labels": {
        "project_id": PROJECT_ID,
        "instance_id": "your-instance-id",
        "zone": "your-instance-zone",
    },
}
# --- Authentication ---
# The client picks up Application Default Credentials automatically
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"
def build_series(metric_name, labels, value, end_time):
    """Build one time series holding a single int64 point."""
    return {
        "metric": {"type": f"{METRIC_PREFIX}/{metric_name}", "labels": labels},
        "resource": RESOURCE,
        "points": [{"interval": {"end_time": end_time}, "value": {"int64_value": value}}],
    }
# --- Health Check and Metric Push ---
def push_health_metrics():
    try:
        response = requests.get(HEALTH_CHECK_URL, timeout=10)
        # Do not call raise_for_status(): the endpoint answers 503 when a check
        # fails, and that response is exactly what should be recorded as 0.
        data = response.json()
        now = time.time()
        seconds = int(now)
        nanos = int((now - seconds) * 10**9)
        end_time = {"seconds": seconds, "nanos": nanos}
        # Overall status metric
        series = [build_series(
            "app_health_status", {"environment": "production"},
            1 if data.get("status") == "ok" else 0, end_time)]
        # Individual check status metrics
        for check_name, check_data in data.get("checks", {}).items():
            value = 1 if check_data.get("status") == "ok" else 0
            series.append(build_series(
                "app_component_status",
                {"component": check_name, "environment": "production"},
                value, end_time))
        client.create_time_series(name=project_name, time_series=series)
        print("Successfully pushed health metrics.")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching health check: {e}")
        # Optionally push a metric indicating the health check itself failed
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
if __name__ == "__main__":
    push_health_metrics()
Schedule this script to run periodically (e.g., every minute) using `cron` on a dedicated monitoring instance or within a Kubernetes CronJob if your application is containerized.
Alerting Strategies
Effective alerting is the culmination of robust monitoring. Configure alerts in Cloud Monitoring for critical conditions:
- High Replication Lag: Set an alert when `pt-heartbeat` reports a replication lag exceeding a defined threshold (e.g., 60 seconds) for a sustained period.
- Application Unavailability: Trigger an alert when Cloud Monitoring Uptime Checks fail for a specific duration.
- Component Failures: Create alerts for individual component failures reported by your application’s health check endpoint (e.g., database connection lost, external API unresponsive).
- Resource Saturation: Monitor standard GCE metrics like CPU utilization, memory usage, disk I/O, and network traffic. Set alerts for sustained high utilization that could indicate impending performance issues.
Ensure your alert notification channels are configured correctly (e.g., email, PagerDuty, Slack) and that alert policies have appropriate thresholds and durations to minimize noise while ensuring critical issues are addressed promptly.
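The effect of pairing a threshold with a duration can be illustrated with a short Python sketch. Cloud Monitoring evaluates this for you through the alert policy’s own settings; the function below is only a hypothetical model of the idea, with made-up sample data, to show why a brief spike should not page anyone while a sustained breach should.

```python
from typing import Optional, Sequence, Tuple


def sustained_breach(samples: Sequence[Tuple[float, float]],
                     threshold: float, duration: float) -> bool:
    """samples holds (unix_time, lag_seconds) pairs in ascending time order.

    Returns True if the lag has stayed above the threshold for at least
    `duration` seconds, measured up to the most recent sample.
    """
    breach_start: Optional[float] = None
    for ts, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # lag recovered, so the breach is over
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration


# One sample per minute: a short spike versus a lag that keeps growing.
spike = [(0, 5), (60, 90), (120, 4), (180, 3)]
outage = [(0, 5), (60, 90), (120, 120), (180, 150), (240, 200), (300, 260), (360, 300)]

print(sustained_breach(spike, threshold=60, duration=300))   # False
print(sustained_breach(outage, threshold=60, duration=300))  # True
```

A longer duration reduces noise from transient spikes at the cost of slower detection, which is the trade-off to tune for each alert.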