Server Monitoring Best Practices: Keeping Your WordPress App and MySQL Clusters Alive on Linode

Proactive MySQL Replication Lag Detection

For any WordPress deployment relying on a MySQL cluster, particularly with read replicas, replication lag is a silent killer. Unchecked lag can lead to stale data being served to users, broken cron jobs, and a general degradation of application performance. We need a robust, automated mechanism to detect and alert on this condition before it impacts users.

A common and effective method is to query the `Seconds_Behind_Master` status variable on each replica. However, simply checking this value periodically isn’t enough. We need to establish a baseline, account for transient network blips, and ensure our monitoring system can reliably connect to the replicas.

Implementing a MySQL Replication Lag Check Script (Python)

We’ll craft a Python script that connects to each replica, fetches `Seconds_Behind_Master`, and triggers an alert if the lag exceeds a configurable threshold for a sustained period. This script can be run via cron or a dedicated monitoring agent.

First, ensure you have the necessary Python MySQL connector installed:

pip install mysql-connector-python

Here’s the Python script:

import mysql.connector
import time
import smtplib
from email.mime.text import MIMEText

# --- Configuration ---
MYSQL_REPLICAS = [
    {'host': 'replica1.yourdomain.com', 'user': 'monitor_user', 'password': 'your_monitor_password', 'database': 'wordpress_db'},
    {'host': 'replica2.yourdomain.com', 'user': 'monitor_user', 'password': 'your_monitor_password', 'database': 'wordpress_db'},
    # Add more replicas as needed
]
LAG_THRESHOLD_SECONDS = 60  # Alert if lag exceeds 60 seconds
CONSISTENT_CHECKS_FOR_ALERT = 3 # Number of consecutive checks exceeding threshold to trigger alert
CHECK_INTERVAL_SECONDS = 30 # How often to run this check

ALERT_EMAIL_FROM = '[email protected]'
ALERT_EMAIL_TO = '[email protected]'
SMTP_SERVER = 'smtp.yourdomain.com'
SMTP_PORT = 587
SMTP_USER = '[email protected]'
SMTP_PASSWORD = 'your_smtp_password'
# --- End Configuration ---

# In-memory state to track consistent lag
lag_counts = {replica['host']: 0 for replica in MYSQL_REPLICAS}

def send_alert(subject, body):
    '''Sends an email alert.'''
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = ALERT_EMAIL_FROM
    msg['To'] = ALERT_EMAIL_TO

    try:
        with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as server:
            server.starttls()
            server.login(SMTP_USER, SMTP_PASSWORD)
            server.sendmail(ALERT_EMAIL_FROM, ALERT_EMAIL_TO, msg.as_string())
        print(f"Alert sent: {subject}")
    except Exception as e:
        print(f"Failed to send alert: {e}")

def check_replication_lag():
    '''Checks replication lag for all configured replicas.'''
    global lag_counts
    all_replicas_ok = True

    for replica in MYSQL_REPLICAS:
        host = replica['host']
        user = replica['user']
        password = replica['password']
        database = replica['database']

        try:
            conn = mysql.connector.connect(
                host=host,
                user=user,
                password=password,
                database=database,
                connect_timeout=10 # Short timeout for connection
            )
            cursor = conn.cursor()
            cursor.execute("SHOW SLAVE STATUS")
            status = cursor.fetchone()

            if status:
                # Find the index for Seconds_Behind_Master. This is brittle if SHOW SLAVE STATUS output changes.
                # A more robust approach would be to fetch column names.
                # For simplicity here, we assume a standard output order.
                # A better way:
                # cursor.execute("SHOW COLUMNS FROM mysql.slave_master_info")
                # column_names = [col[0] for col in cursor.fetchall()]
                # try:
                #     lag_index = column_names.index('Seconds_Behind_Master')
                # except ValueError:
                #     print(f"Error: 'Seconds_Behind_Master' column not found on {host}")
                #     lag_index = -1 # Indicate error

                # Assuming standard output order for simplicity in this example
                # The actual index might vary slightly based on MySQL version/configuration.
                # It's typically around index 15 or 16. Let's try a common one.
                # A safer bet is to query `SHOW GLOBAL STATUS LIKE 'Seconds_Behind_Master';` on the replica itself.
                # However, `SHOW SLAVE STATUS` is more direct for replication.
                # Let's assume the output of `SHOW SLAVE STATUS` is a tuple and we need to find the index.
                # A more robust way to get the index:
                cursor.execute("SHOW SLAVE STATUS")
                column_names = [desc[0] for desc in cursor.description]
                try:
                    lag_index = column_names.index('Seconds_Behind_Master')
                    seconds_behind_master = status[lag_index]
                except ValueError:
                    print(f"Warning: 'Seconds_Behind_Master' not found in SHOW SLAVE STATUS output for {host}. Skipping lag check for this replica.")
                    seconds_behind_master = None # Treat as OK if not found

                if seconds_behind_master is not None:
                    if seconds_behind_master == 'NULL': # Replication not running or not configured
                        print(f"Info: Replication not running or configured on {host}. Seconds_Behind_Master is NULL.")
                        lag_counts[host] = 0 # Reset count if replication stops
                    elif int(seconds_behind_master) > LAG_THRESHOLD_SECONDS:
                        lag_counts[host] += 1
                        print(f"Warning: High replication lag on {host}: {seconds_behind_master}s (Threshold: {LAG_THRESHOLD_SECONDS}s). Count: {lag_counts[host]}/{CONSISTENT_CHECKS_FOR_ALERT}")
                        if lag_counts[host] >= CONSISTENT_CHECKS_FOR_ALERT:
                            all_replicas_ok = False
                            subject = f"ALERT: High MySQL Replication Lag on {host}"
                            body = f"MySQL replica {host} is experiencing high replication lag.\n\n" \
                                   f"Seconds Behind Master: {seconds_behind_master}s\n" \
                                   f"Threshold: {LAG_THRESHOLD_SECONDS}s\n" \
                                   f"Consecutive checks exceeding threshold: {lag_counts[host]}\n\n" \
                                   f"Please investigate immediately."
                            send_alert(subject, body)
                            # Optionally reset count after alert to avoid spamming, or keep it to indicate persistent issue
                            # lag_counts[host] = 0
                    else:
                        # Lag is within acceptable limits, reset the counter
                        if lag_counts[host] > 0:
                            print(f"Info: Replication lag on {host} is back to normal ({seconds_behind_master}s). Resetting lag count.")
                        lag_counts[host] = 0
            else:
                print(f"Warning: Could not retrieve SHOW SLAVE STATUS for {host}.")
                all_replicas_ok = False # Treat as an error if status can't be retrieved

            cursor.close()
            conn.close()

        except mysql.connector.Error as err:
            print(f"Error connecting to or querying MySQL on {host}: {err}")
            # If connection fails, we can't determine lag. Treat as a potential issue.
            # We might want to increment a separate counter for connection errors.
            # For now, let's assume a connection error means we can't confirm it's OK.
            all_replicas_ok = False
            # Reset lag count on connection error to avoid false positives if it recovers quickly
            lag_counts[host] = 0
        except Exception as e:
            print(f"An unexpected error occurred for {host}: {e}")
            all_replicas_ok = False
            lag_counts[host] = 0

    if all_replicas_ok and any(count >= CONSISTENT_CHECKS_FOR_ALERT for count in lag_counts.values()):
        # This case handles if multiple replicas triggered alerts and we want a consolidated alert
        # or if we reset counts after alert and now all are OK but some were previously bad.
        # For simplicity, we alert individually above. This block could be used for a summary alert.
        pass

if __name__ == "__main__":
    print("Starting MySQL replication lag monitoring...")
    while True:
        check_replication_lag()
        print(f"Sleeping for {CHECK_INTERVAL_SECONDS} seconds...")
        time.sleep(CHECK_INTERVAL_SECONDS)

Key considerations for this script:

`MYSQL_REPLICAS`: This list must contain the connection details for each MySQL replica. Ensure the `monitor_user` has at least `REPLICATION CLIENT` privileges.
`LAG_THRESHOLD_SECONDS`: The maximum acceptable lag. Tune this based on your application’s tolerance for stale data.
`CONSISTENT_CHECKS_FOR_ALERT`: To prevent flapping alerts from transient network issues, we only alert after the lag has been consistently high for this many checks.
`CHECK_INTERVAL_SECONDS`: How frequently the script runs.
Email Configuration: Fill in your SMTP server details for sending alerts.
Error Handling: The script includes basic error handling for connection issues and missing `Seconds_Behind_Master`.
`SHOW SLAVE STATUS` parsing: The script attempts to find `Seconds_Behind_Master` by inspecting column names. This is more robust than assuming a fixed index.

WordPress Application Health Checks

Beyond the database, the WordPress application itself needs constant monitoring. This includes checking if the web server is responding, if PHP is processing requests correctly, and if core WordPress functionalities are operational.

HTTP Health Check Endpoint

The simplest form of application health check is an HTTP endpoint that returns a 200 OK status if the application is alive. For WordPress, we can create a simple plugin or a dedicated PHP file.

Create a file named healthcheck.php in your WordPress root directory (or a dedicated plugin):

<?php
/**
 * Simple WordPress Health Check Endpoint
 *
 * Place this file in your WordPress root directory.
 * Access it via: https://yourdomain.com/healthcheck.php
 */

// Load WordPress core
require_once('wp-load.php');

// Basic check: Is WordPress loaded?
if (!defined('ABSPATH')) {
    header('HTTP/1.1 500 Internal Server Error');
    echo 'WordPress not loaded.';
    exit;
}

// More advanced checks can be added here:
// 1. Database connection check
global $wpdb;
if ($wpdb->check_connection() === false) {
    header('HTTP/1.1 503 Service Unavailable');
    echo 'Database connection failed.';
    exit;
}

// 2. Check if essential WordPress constants are defined
if (!defined('WP_HOME') || !defined('WP_SITEURL')) {
    header('HTTP/1.1 503 Service Unavailable');
    echo 'Essential WordPress constants not defined.';
    exit;
}

// 3. (Optional) Check for critical errors or recent fatal errors
// This requires more complex logic, potentially checking error logs or WP_DEBUG_LOG.
// For a simple check, we assume if we reach here, the basics are fine.

// If all checks pass
header('HTTP/1.1 200 OK');
echo 'WordPress is healthy.';
exit;
?>

This script performs a basic check by loading WordPress and then verifies the database connection. You can extend this by checking for specific WordPress options, plugin/theme statuses, or even making a dummy post query.

Integrating with Monitoring Tools (e.g., Prometheus/Grafana)

For comprehensive monitoring, integrating these checks into a system like Prometheus and visualizing them in Grafana is ideal. We can use `node_exporter` and `mysqld_exporter` for system and MySQL metrics, and custom exporters or simple `curl` checks for application-level health.

Prometheus Configuration for MySQL Lag

While the Python script above handles alerting, Prometheus can also scrape metrics. `mysqld_exporter` can expose `Seconds_Behind_Master` if configured correctly. Alternatively, you can run a Prometheus *blackbox exporter* that periodically hits your healthcheck.php endpoint or queries MySQL.

Here’s a snippet for Prometheus to scrape `mysqld_exporter` (assuming it’s configured to expose replication status):

scrape_configs:
  - job_name: 'mysql_replicas'
    static_configs:
      - targets: ['replica1.yourdomain.com:9104', 'replica2.yourdomain.com:9104'] # Assuming mysqld_exporter runs on port 9104
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '$1'
      - source_labels: [__address__]
        target_label: __param_scrape_uri
        regex: '.*:9104'
        replacement: '/metrics/replica' # Example: if mysqld_exporter has a specific endpoint for replicas

  - job_name: 'wordpress_app_health'
    metrics_path: /probe
    params:
      module: [http_2xx] # Use the http_2xx module from blackbox exporter
    static_configs:
      - targets:
          - https://yourdomain.com/healthcheck.php
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: 'https?://([^/]+)/.*'
        replacement: '$1'

And a corresponding Prometheus alerting rule for MySQL lag:

groups:
- name: mysql_replication_alerts
  rules:
  - alert: HighMySQLReplicationLag
    expr: mysql_slave_status_seconds_behind_master{job="mysql_replicas"} > 60
    for: 5m # Alert if lag is high for 5 minutes
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag detected on {{ $labels.instance }}"
      description: "MySQL replica {{ $labels.instance }} has been lagging by more than 60 seconds for 5 minutes. Current lag: {{ $value }}s."

Server-Level Monitoring on Linode

Linode provides basic infrastructure metrics (CPU, RAM, Disk I/O, Network). However, for deep insights into your WordPress and MySQL performance, you need more granular monitoring at the OS and application levels.

Node Exporter for System Metrics

Install `node_exporter` on each Linode instance to expose OS-level metrics to Prometheus.

# Download the latest release
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Run it (for testing)
./node_exporter

# For production, set up as a systemd service
sudo cp node_exporter /usr/local/bin/
sudo nano /etc/systemd/system/node_exporter.service

Content for /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nobody
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/node_exporter/textfile-collector

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

MySQL Exporter Configuration

Similarly, `mysqld_exporter` provides detailed MySQL metrics. Ensure it’s configured with credentials that have sufficient read-only access.

# Download and install mysqld_exporter (similar process to node_exporter)
# ... installation steps ...

# Create a .my.cnf file for credentials
sudo nano /etc/mysqld_exporter.cnf

Content for /etc/mysqld_exporter.cnf:

[client]
user=monitor_user
password=your_monitor_password
host=localhost # Or the specific IP if running remotely

Then set up `mysqld_exporter` as a systemd service, pointing to this configuration file.

Log Analysis and Alerting

Centralized logging is crucial. Tools like Elasticsearch/Logstash/Kibana (ELK stack) or Loki/Promtail/Grafana can aggregate logs from all your Linode instances and MySQL servers. This allows for easier debugging and setting up alerts based on specific log patterns.

Monitoring WordPress Debug Logs

Ensure `WP_DEBUG` and `WP_DEBUG_LOG` are enabled in your wp-config.php on staging/development environments. For production, you might want to log specific errors or use a plugin that tails the debug log. If `WP_DEBUG_LOG` is enabled, monitor the wp-content/debug.log file.

# Example using Promtail to ship debug.log to Loki
# In your promtail config (e.g., /etc/promtail/config.yaml):
# scrape_configs:
#   - job_name: wordpress_debug_log
#     static_configs:
#       - targets:
#           - localhost
#         labels:
#           job: wordpress
#           __path__: /var/www/html/wp-content/debug.log

Once logs are in Loki, you can create Grafana alerts based on error messages (e.g., `level=”error”` or specific PHP fatal error patterns).

Conclusion

Maintaining a healthy WordPress application on Linode, especially with a MySQL cluster, requires a multi-layered monitoring strategy. By proactively checking MySQL replication lag, implementing robust application health endpoints, leveraging system-level exporters, and centralizing log analysis, you can significantly improve uptime and performance, ensuring your WordPress site remains responsive and reliable.