Server Monitoring Best Practices: Keeping Your Shopify App and MySQL Clusters Alive on OVH

Proactive MySQL Replication Lag Monitoring

For a high-availability Shopify app relying on MySQL, replication lag is a critical metric. Unchecked lag can lead to stale data being served to users, impacting the user experience and potentially causing data inconsistencies. We’ll focus on setting up robust monitoring for this using Percona Monitoring and Management (PMM) and custom alerting.

The core of replication lag monitoring lies in querying the `Seconds_Behind_Master` status variable on your read replicas. However, simply checking this value periodically isn’t enough. We need to establish thresholds and trigger alerts when these thresholds are breached for a sustained period.

Leveraging Percona Monitoring and Management (PMM)

PMM is an open-source platform for managing and monitoring MySQL, MongoDB, and PostgreSQL performance. It provides dashboards, query analytics, and alerting capabilities out-of-the-box. For MySQL replication, PMM automatically collects `Seconds_Behind_Master` and exposes it as a Prometheus metric.

Assuming you have PMM installed and your MySQL instances (both master and replicas) are registered with it, you can access the relevant metrics via Prometheus. The metric we’re interested in is typically named something like mysql_slave_status_seconds_behind_master. This metric is available per-instance.

Configuring Prometheus Alerting Rules

Prometheus Alertmanager is the standard way to handle alerts generated by Prometheus. We’ll define alerting rules in a Prometheus configuration file (e.g., /etc/prometheus/rules/mysql_replication.yml).

Here’s a sample Prometheus alerting rule to detect replication lag exceeding 60 seconds for more than 5 minutes:

groups:
- name: mysql_replication_alerts
  rules:
  - alert: MySQLReplicationLagging
    expr: |
      avg_over_time(mysql_slave_status_seconds_behind_master{job="mysql"}[5m]) > 60
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "MySQL replication lag detected on {{ $labels.instance }}"
      description: "Replication lag on {{ $labels.instance }} has been greater than 60 seconds for the last 5 minutes. Current lag: {{ $value }} seconds."

Explanation:

alert: MySQLReplicationLagging: The name of the alert.
expr: The PromQL query. avg_over_time(mysql_slave_status_seconds_behind_master{job="mysql"}[5m]) > 60 calculates the average lag over the last 5 minutes and checks if it exceeds 60 seconds. The {job="mysql"} selector should be adjusted based on your Prometheus scrape configuration for MySQL.
for: 5m: The alert will only fire if the condition in expr is true for at least 5 minutes. This prevents flapping alerts due to transient network issues.
labels: Metadata attached to the alert. severity: critical is a common label for routing alerts.
annotations: Human-readable information about the alert. {{ $labels.instance }} and {{ $value }} are template variables that will be populated with the actual instance name and lag value when the alert fires.

To apply these rules, ensure your Prometheus configuration (prometheus.yml) includes a `rule_files` directive pointing to this file:

global:
  scrape_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  # ... your scrape configurations for MySQL ...

After updating the configuration, reload Prometheus (e.g., kill -HUP <prometheus_pid> or via its HTTP API). Ensure your Alertmanager is configured to receive alerts from Prometheus and route them to your desired notification channels (Slack, PagerDuty, email, etc.).

Custom MySQL Health Checks with `mysqladmin` and `check_mysql_replication`

While PMM is excellent for metrics and alerting, sometimes you need more direct, script-based checks, especially for initial setup or in environments where PMM might not be fully deployed. We can use the `mysqladmin` command-line utility and a simple shell script.

Scripting Replication Status Checks

A common approach is to have a script run on each replica that checks its replication status and reports any issues. This script can then be integrated with a monitoring system like Nagios, Zabbix, or even a simple cron job that sends alerts.

Here’s a Python script that connects to a MySQL replica, checks `Seconds_Behind_Master`, and exits with a non-zero status code if lag exceeds a threshold. This is ideal for integration with systems that monitor exit codes.

import MySQLdb
import sys
import time

# --- Configuration ---
DB_HOST = "localhost"
DB_USER = "monitor_user"
DB_PASSWORD = "your_monitor_password"
DB_NAME = "information_schema" # Or any database the user can connect to
REPLICA_HOST = "localhost" # The host of the replica itself
MAX_LAG_SECONDS = 60
MAX_UNREACHABLE_ATTEMPTS = 3
# ---------------------

def check_replication_lag(host, user, password, db_name, max_lag, max_attempts):
    lag_seconds = -1
    unreachable_count = 0
    connection = None

    try:
        connection = MySQLdb.connect(host=host, user=user, passwd=password, db=db_name)
        cursor = connection.cursor()

        # Check if this server is a replica
        cursor.execute("SHOW SLAVE STATUS")
        status = cursor.fetchone()

        if status is None:
            print(f"ERROR: Server {host} is not configured as a replication replica.")
            return 1 # Critical error, not a replica

        # Find the index for Seconds_Behind_Master
        # This is a bit fragile as column order can change, but common
        column_names = [desc[0] for desc in cursor.description]
        try:
            lag_index = column_names.index("Seconds_Behind_Master")
            lag_value = status[lag_index]
            if lag_value is not None:
                lag_seconds = int(lag_value)
            else:
                # If Seconds_Behind_Master is NULL, it usually means it's caught up or not running
                # We'll treat NULL as 0 for simplicity, but a more robust check might look at Slave_IO_Running and Slave_SQL_Running
                lag_seconds = 0
        except ValueError:
            print(f"ERROR: 'Seconds_Behind_Master' column not found in SHOW SLAVE STATUS output on {host}.")
            return 1 # Critical error, cannot determine lag

        if lag_seconds > max_lag:
            print(f"CRITICAL: Replication lag on {host} is {lag_seconds} seconds (>{max_lag}s).")
            return 2 # Critical alert
        else:
            print(f"OK: Replication lag on {host} is {lag_seconds} seconds (within {max_lag}s threshold).")
            return 0 # OK

    except MySQLdb.OperationalError as e:
        # Handle connection errors, network issues, etc.
        error_code, error_message = e.args
        print(f"ERROR: Could not connect to MySQL on {host}: {error_message} (Code: {error_code})")
        # In a real-world scenario, you'd want to track consecutive failures
        # For simplicity here, we'll just return a non-zero code
        return 1 # Warning/Error

    except Exception as e:
        print(f"ERROR: An unexpected error occurred on {host}: {e}")
        return 1 # General error

    finally:
        if connection:
            connection.close()

if __name__ == "__main__":
    # Ensure the monitor user has at least USAGE and SELECT privileges
    # CREATE USER 'monitor_user'@'%' IDENTIFIED BY 'your_monitor_password';
    # GRANT USAGE, SELECT ON *.* TO 'monitor_user'@'%';
    # FLUSH PRIVILEGES;

    exit_code = check_replication_lag(REPLICA_HOST, DB_USER, DB_PASSWORD, DB_NAME, MAX_LAG_SECONDS, MAX_UNREACHABLE_ATTEMPTS)
    sys.exit(exit_code)

Setup Notes:

Create a dedicated MySQL user (e.g., monitor_user) with minimal privileges: USAGE and SELECT on any database. This user should be able to connect from the monitoring host.
The script assumes the monitoring is run *on* the replica. If running remotely, change REPLICA_HOST to the replica’s IP/hostname.
The script uses MySQLdb (a fork of PyMySQL is often used as a drop-in replacement). Install it: pip install PyMySQL.
Exit codes: 0 for OK, 1 for Warning/Error, 2 for Critical. Adjust these as needed for your monitoring system.
The `SHOW SLAVE STATUS` output parsing is basic. For extreme robustness, consider parsing the output more carefully or using `SHOW SLAVE STATUS\G` and processing the key-value pairs.

OVH Specific Considerations for MySQL

When running MySQL clusters on OVH infrastructure, several factors require specific attention:

Network Latency and Bandwidth

OVH’s network performance can vary. High latency between your application servers and MySQL replicas, or between replicas themselves, can indirectly cause replication lag even if the database servers are healthy. Monitor network metrics (ping times, packet loss, bandwidth utilization) between your application hosts and your database instances.

Tools like mtr (My traceroute) can be invaluable for diagnosing network path issues:

# Run from your application server to a MySQL replica
mtr --report --report-cycles 10 <mysql_replica_ip_or_hostname>

If you observe consistently high latency or packet loss on specific hops within OVH’s network, you might need to contact their support or consider alternative network configurations (e.g., dedicated links if available and cost-effective).

Instance Resource Utilization

OVH offers various instance types. Ensure your MySQL instances (especially replicas) are provisioned with sufficient CPU, RAM, and IOPS. High disk I/O wait times on a replica can significantly slow down its ability to apply the binary log events, leading to lag.

Monitor disk I/O metrics using tools like iostat:

# Monitor disk I/O statistics, focusing on %iowait
iostat -dx 5

If %iowait is consistently high (e.g., > 10-15%), it indicates a disk bottleneck. This might require upgrading to instances with faster storage (e.g., NVMe SSDs) or optimizing MySQL’s configuration (e.g., buffer pool size, innodb_flush_log_at_trx_commit).

OVH Control Panel and API Monitoring

OVH provides a control panel and an API to manage your infrastructure. While not direct database monitoring, it’s crucial to monitor the health of the underlying OVH services themselves. Check for any OVH service status announcements or alerts related to the region or services your databases are running on.

Automating checks against the OVH API for service health or resource alerts can be beneficial. For example, you could use `curl` to query OVH’s status pages or specific API endpoints if they expose health checks for their managed services.

Application-Level Monitoring for Stale Data

Even with perfect replication lag monitoring, your application might still serve stale data if it incorrectly routes read queries or if the lag is acceptable for some operations but not others. Implement application-level checks to detect stale data.

“Heartbeat” Table and Query Timing

A common technique is to use a “heartbeat” table. This table is periodically updated by the primary database. Read replicas can then query this table and compare the timestamp with their own clock or a known reference point. If the heartbeat timestamp is too old, it indicates a problem with the replica applying writes, even if `Seconds_Behind_Master` appears low (e.g., due to clock skew or specific replication issues).

Primary Database (Master):

-- Create the heartbeat table
CREATE TABLE IF NOT EXISTS heartbeat (
    id INT PRIMARY KEY,
    last_heartbeat TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
INSERT IGNORE INTO heartbeat (id) VALUES (1);

-- The `last_heartbeat` column will automatically update.

Read Replica (or Application Querying Replica):

-- Query the heartbeat table on the replica
SELECT last_heartbeat FROM heartbeat WHERE id = 1;

Your application or a dedicated monitoring script can then query this on the replica. If the `last_heartbeat` timestamp is older than a defined threshold (e.g., 10 seconds), it signifies a problem. This check is more resilient to clock skew than just relying on `Seconds_Behind_Master` alone.

Furthermore, instrument your application’s read queries. Log the time taken for queries executed against read replicas. If query latency spikes, it could be an indicator of underlying replication issues or resource contention on the replica.

Conclusion

Maintaining a healthy MySQL cluster for a critical Shopify application on OVH requires a multi-layered monitoring strategy. Combining PMM for comprehensive metrics and alerting, custom scripts for granular checks, and application-level heartbeats ensures you can detect and respond to issues proactively, minimizing downtime and data inconsistencies.