Server Monitoring Best Practices: Keeping Your WordPress App and MySQL Clusters Alive on Linode
Proactive MySQL Replication Lag Detection
For any WordPress deployment relying on a MySQL cluster, particularly with read replicas, replication lag is a silent killer. Unchecked lag can lead to stale data being served to users, broken cron jobs, and a general degradation of application performance. We need a robust, automated mechanism to detect and alert on this condition before it impacts users.
A common and effective method is to query the `Seconds_Behind_Master` status variable on each replica. However, simply checking this value periodically isn’t enough. We need to establish a baseline, account for transient network blips, and ensure our monitoring system can reliably connect to the replicas.
Implementing a MySQL Replication Lag Check Script (Python)
We’ll craft a Python script that connects to each replica, fetches `Seconds_Behind_Master`, and triggers an alert if the lag exceeds a configurable threshold for a sustained period. This script can be run via cron or a dedicated monitoring agent.
First, ensure you have the necessary Python MySQL connector installed:
pip install mysql-connector-python
Here’s the Python script:
import mysql.connector
import time
import smtplib
from email.mime.text import MIMEText
# --- Configuration ---
MYSQL_REPLICAS = [
{'host': 'replica1.yourdomain.com', 'user': 'monitor_user', 'password': 'your_monitor_password', 'database': 'wordpress_db'},
{'host': 'replica2.yourdomain.com', 'user': 'monitor_user', 'password': 'your_monitor_password', 'database': 'wordpress_db'},
# Add more replicas as needed
]
LAG_THRESHOLD_SECONDS = 60 # Alert if lag exceeds 60 seconds
CONSISTENT_CHECKS_FOR_ALERT = 3 # Number of consecutive checks exceeding threshold to trigger alert
CHECK_INTERVAL_SECONDS = 30 # How often to run this check
ALERT_EMAIL_FROM = '[email protected]'
ALERT_EMAIL_TO = '[email protected]'
SMTP_SERVER = 'smtp.yourdomain.com'
SMTP_PORT = 587
SMTP_USER = '[email protected]'
SMTP_PASSWORD = 'your_smtp_password'
# --- End Configuration ---
# In-memory state to track consistent lag
lag_counts = {replica['host']: 0 for replica in MYSQL_REPLICAS}
def send_alert(subject, body):
'''Sends an email alert.'''
msg = MIMEText(body)
msg['Subject'] = subject
msg['From'] = ALERT_EMAIL_FROM
msg['To'] = ALERT_EMAIL_TO
try:
with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as server:
server.starttls()
server.login(SMTP_USER, SMTP_PASSWORD)
server.sendmail(ALERT_EMAIL_FROM, ALERT_EMAIL_TO, msg.as_string())
print(f"Alert sent: {subject}")
except Exception as e:
print(f"Failed to send alert: {e}")
def check_replication_lag():
'''Checks replication lag for all configured replicas.'''
global lag_counts
all_replicas_ok = True
for replica in MYSQL_REPLICAS:
host = replica['host']
user = replica['user']
password = replica['password']
database = replica['database']
try:
conn = mysql.connector.connect(
host=host,
user=user,
password=password,
database=database,
connect_timeout=10 # Short timeout for connection
)
cursor = conn.cursor()
cursor.execute("SHOW SLAVE STATUS")
status = cursor.fetchone()
if status:
# Find the index for Seconds_Behind_Master. This is brittle if SHOW SLAVE STATUS output changes.
# A more robust approach would be to fetch column names.
# For simplicity here, we assume a standard output order.
# A better way:
# cursor.execute("SHOW COLUMNS FROM mysql.slave_master_info")
# column_names = [col[0] for col in cursor.fetchall()]
# try:
# lag_index = column_names.index('Seconds_Behind_Master')
# except ValueError:
# print(f"Error: 'Seconds_Behind_Master' column not found on {host}")
# lag_index = -1 # Indicate error
# Assuming standard output order for simplicity in this example
# The actual index might vary slightly based on MySQL version/configuration.
# It's typically around index 15 or 16. Let's try a common one.
# A safer bet is to query `SHOW GLOBAL STATUS LIKE 'Seconds_Behind_Master';` on the replica itself.
# However, `SHOW SLAVE STATUS` is more direct for replication.
# Let's assume the output of `SHOW SLAVE STATUS` is a tuple and we need to find the index.
# A more robust way to get the index:
cursor.execute("SHOW SLAVE STATUS")
column_names = [desc[0] for desc in cursor.description]
try:
lag_index = column_names.index('Seconds_Behind_Master')
seconds_behind_master = status[lag_index]
except ValueError:
print(f"Warning: 'Seconds_Behind_Master' not found in SHOW SLAVE STATUS output for {host}. Skipping lag check for this replica.")
seconds_behind_master = None # Treat as OK if not found
if seconds_behind_master is not None:
if seconds_behind_master == 'NULL': # Replication not running or not configured
print(f"Info: Replication not running or configured on {host}. Seconds_Behind_Master is NULL.")
lag_counts[host] = 0 # Reset count if replication stops
elif int(seconds_behind_master) > LAG_THRESHOLD_SECONDS:
lag_counts[host] += 1
print(f"Warning: High replication lag on {host}: {seconds_behind_master}s (Threshold: {LAG_THRESHOLD_SECONDS}s). Count: {lag_counts[host]}/{CONSISTENT_CHECKS_FOR_ALERT}")
if lag_counts[host] >= CONSISTENT_CHECKS_FOR_ALERT:
all_replicas_ok = False
subject = f"ALERT: High MySQL Replication Lag on {host}"
body = f"MySQL replica {host} is experiencing high replication lag.\n\n" \
f"Seconds Behind Master: {seconds_behind_master}s\n" \
f"Threshold: {LAG_THRESHOLD_SECONDS}s\n" \
f"Consecutive checks exceeding threshold: {lag_counts[host]}\n\n" \
f"Please investigate immediately."
send_alert(subject, body)
# Optionally reset count after alert to avoid spamming, or keep it to indicate persistent issue
# lag_counts[host] = 0
else:
# Lag is within acceptable limits, reset the counter
if lag_counts[host] > 0:
print(f"Info: Replication lag on {host} is back to normal ({seconds_behind_master}s). Resetting lag count.")
lag_counts[host] = 0
else:
print(f"Warning: Could not retrieve SHOW SLAVE STATUS for {host}.")
all_replicas_ok = False # Treat as an error if status can't be retrieved
cursor.close()
conn.close()
except mysql.connector.Error as err:
print(f"Error connecting to or querying MySQL on {host}: {err}")
# If connection fails, we can't determine lag. Treat as a potential issue.
# We might want to increment a separate counter for connection errors.
# For now, let's assume a connection error means we can't confirm it's OK.
all_replicas_ok = False
# Reset lag count on connection error to avoid false positives if it recovers quickly
lag_counts[host] = 0
except Exception as e:
print(f"An unexpected error occurred for {host}: {e}")
all_replicas_ok = False
lag_counts[host] = 0
if all_replicas_ok and any(count >= CONSISTENT_CHECKS_FOR_ALERT for count in lag_counts.values()):
# This case handles if multiple replicas triggered alerts and we want a consolidated alert
# or if we reset counts after alert and now all are OK but some were previously bad.
# For simplicity, we alert individually above. This block could be used for a summary alert.
pass
if __name__ == "__main__":
print("Starting MySQL replication lag monitoring...")
while True:
check_replication_lag()
print(f"Sleeping for {CHECK_INTERVAL_SECONDS} seconds...")
time.sleep(CHECK_INTERVAL_SECONDS)
Key considerations for this script:
- `MYSQL_REPLICAS`: This list must contain the connection details for each MySQL replica. Ensure the `monitor_user` has at least `REPLICATION CLIENT` privileges.
- `LAG_THRESHOLD_SECONDS`: The maximum acceptable lag. Tune this based on your application’s tolerance for stale data.
- `CONSISTENT_CHECKS_FOR_ALERT`: To prevent flapping alerts from transient network issues, we only alert after the lag has been consistently high for this many checks.
- `CHECK_INTERVAL_SECONDS`: How frequently the script runs.
- Email Configuration: Fill in your SMTP server details for sending alerts.
- Error Handling: The script includes basic error handling for connection issues and missing `Seconds_Behind_Master`.
- `SHOW SLAVE STATUS` parsing: The script attempts to find `Seconds_Behind_Master` by inspecting column names. This is more robust than assuming a fixed index.
WordPress Application Health Checks
Beyond the database, the WordPress application itself needs constant monitoring. This includes checking if the web server is responding, if PHP is processing requests correctly, and if core WordPress functionalities are operational.
HTTP Health Check Endpoint
The simplest form of application health check is an HTTP endpoint that returns a 200 OK status if the application is alive. For WordPress, we can create a simple plugin or a dedicated PHP file.
Create a file named healthcheck.php in your WordPress root directory (or a dedicated plugin):
<?php
/**
* Simple WordPress Health Check Endpoint
*
* Place this file in your WordPress root directory.
* Access it via: https://yourdomain.com/healthcheck.php
*/
// Load WordPress core
require_once('wp-load.php');
// Basic check: Is WordPress loaded?
if (!defined('ABSPATH')) {
header('HTTP/1.1 500 Internal Server Error');
echo 'WordPress not loaded.';
exit;
}
// More advanced checks can be added here:
// 1. Database connection check
global $wpdb;
if ($wpdb->check_connection() === false) {
header('HTTP/1.1 503 Service Unavailable');
echo 'Database connection failed.';
exit;
}
// 2. Check if essential WordPress constants are defined
if (!defined('WP_HOME') || !defined('WP_SITEURL')) {
header('HTTP/1.1 503 Service Unavailable');
echo 'Essential WordPress constants not defined.';
exit;
}
// 3. (Optional) Check for critical errors or recent fatal errors
// This requires more complex logic, potentially checking error logs or WP_DEBUG_LOG.
// For a simple check, we assume if we reach here, the basics are fine.
// If all checks pass
header('HTTP/1.1 200 OK');
echo 'WordPress is healthy.';
exit;
?>
This script performs a basic check by loading WordPress and then verifies the database connection. You can extend this by checking for specific WordPress options, plugin/theme statuses, or even making a dummy post query.
Integrating with Monitoring Tools (e.g., Prometheus/Grafana)
For comprehensive monitoring, integrating these checks into a system like Prometheus and visualizing them in Grafana is ideal. We can use `node_exporter` and `mysqld_exporter` for system and MySQL metrics, and custom exporters or simple `curl` checks for application-level health.
Prometheus Configuration for MySQL Lag
While the Python script above handles alerting, Prometheus can also scrape metrics. `mysqld_exporter` can expose `Seconds_Behind_Master` if configured correctly. Alternatively, you can run a Prometheus *blackbox exporter* that periodically hits your healthcheck.php endpoint or queries MySQL.
Here’s a snippet for Prometheus to scrape `mysqld_exporter` (assuming it’s configured to expose replication status):
scrape_configs:
- job_name: 'mysql_replicas'
static_configs:
- targets: ['replica1.yourdomain.com:9104', 'replica2.yourdomain.com:9104'] # Assuming mysqld_exporter runs on port 9104
metrics_path: /metrics
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '$1'
- source_labels: [__address__]
target_label: __param_scrape_uri
regex: '.*:9104'
replacement: '/metrics/replica' # Example: if mysqld_exporter has a specific endpoint for replicas
- job_name: 'wordpress_app_health'
metrics_path: /probe
params:
module: [http_2xx] # Use the http_2xx module from blackbox exporter
static_configs:
- targets:
- https://yourdomain.com/healthcheck.php
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: 'https?://([^/]+)/.*'
replacement: '$1'
And a corresponding Prometheus alerting rule for MySQL lag:
groups:
- name: mysql_replication_alerts
rules:
- alert: HighMySQLReplicationLag
expr: mysql_slave_status_seconds_behind_master{job="mysql_replicas"} > 60
for: 5m # Alert if lag is high for 5 minutes
labels:
severity: critical
annotations:
summary: "High MySQL replication lag detected on {{ $labels.instance }}"
description: "MySQL replica {{ $labels.instance }} has been lagging by more than 60 seconds for 5 minutes. Current lag: {{ $value }}s."
Server-Level Monitoring on Linode
Linode provides basic infrastructure metrics (CPU, RAM, Disk I/O, Network). However, for deep insights into your WordPress and MySQL performance, you need more granular monitoring at the OS and application levels.
Node Exporter for System Metrics
Install `node_exporter` on each Linode instance to expose OS-level metrics to Prometheus.
# Download the latest release wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz cd node_exporter-1.7.0.linux-amd64 # Run it (for testing) ./node_exporter # For production, set up as a systemd service sudo cp node_exporter /usr/local/bin/ sudo nano /etc/systemd/system/node_exporter.service
Content for /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=nobody
Group=nobody
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.textfile.directory=/var/lib/node_exporter/textfile-collector
[Install]
WantedBy=multi-user.target
Then enable and start the service:
sudo systemctl daemon-reload sudo systemctl enable node_exporter sudo systemctl start node_exporter sudo systemctl status node_exporter
MySQL Exporter Configuration
Similarly, `mysqld_exporter` provides detailed MySQL metrics. Ensure it’s configured with credentials that have sufficient read-only access.
# Download and install mysqld_exporter (similar process to node_exporter) # ... installation steps ... # Create a .my.cnf file for credentials sudo nano /etc/mysqld_exporter.cnf
Content for /etc/mysqld_exporter.cnf:
[client] user=monitor_user password=your_monitor_password host=localhost # Or the specific IP if running remotely
Then set up `mysqld_exporter` as a systemd service, pointing to this configuration file.
Log Analysis and Alerting
Centralized logging is crucial. Tools like Elasticsearch/Logstash/Kibana (ELK stack) or Loki/Promtail/Grafana can aggregate logs from all your Linode instances and MySQL servers. This allows for easier debugging and setting up alerts based on specific log patterns.
Monitoring WordPress Debug Logs
Ensure `WP_DEBUG` and `WP_DEBUG_LOG` are enabled in your wp-config.php on staging/development environments. For production, you might want to log specific errors or use a plugin that tails the debug log. If `WP_DEBUG_LOG` is enabled, monitor the wp-content/debug.log file.
# Example using Promtail to ship debug.log to Loki # In your promtail config (e.g., /etc/promtail/config.yaml): # scrape_configs: # - job_name: wordpress_debug_log # static_configs: # - targets: # - localhost # labels: # job: wordpress # __path__: /var/www/html/wp-content/debug.log
Once logs are in Loki, you can create Grafana alerts based on error messages (e.g., `level=”error”` or specific PHP fatal error patterns).
Conclusion
Maintaining a healthy WordPress application on Linode, especially with a MySQL cluster, requires a multi-layered monitoring strategy. By proactively checking MySQL replication lag, implementing robust application health endpoints, leveraging system-level exporters, and centralizing log analysis, you can significantly improve uptime and performance, ensuring your WordPress site remains responsive and reliable.