Server Monitoring Best Practices: Keeping Your Magento 2 App and MySQL Clusters Alive on OVH

Proactive MySQL Replication Lag Detection and Alerting

For a high-traffic Magento 2 instance, especially one that relies on a MySQL cluster for scalability and high availability, replication lag is a critical metric. Unchecked lag can lead to stale data being served to users, inconsistent inventory, and ultimately lost revenue. OVH’s managed MySQL services, while robust, still require diligent monitoring. We’ll focus on a practical, plugin-based approach using `check_mysql_replication`, a Nagios-style plugin run from a monitoring host, integrated with a custom alerting mechanism.

The `check_mysql_replication` plugin connects to both the master and replica servers, queries `SHOW REPLICA STATUS` (or `SHOW SLAVE STATUS` on versions before MySQL 8.0.22), and analyzes the reported lag: `Seconds_Behind_Source` in the newer output, `Seconds_Behind_Master` in the older one. We’ll configure it to raise a warning if lag exceeds a configurable threshold (e.g., 60 seconds) and a critical alert if it exceeds a higher limit (e.g., 300 seconds).
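
The threshold logic described above can be sketched in a few lines of Python. This is an illustration of the idea rather than the plugin’s actual implementation: the function name `classify_lag` is hypothetical, and treating a missing lag value as critical is an assumption (MySQL reports `NULL` when the replication threads are not running).

```python
from typing import Optional

# Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def classify_lag(seconds_behind: Optional[int],
                 warning: int = 60,
                 critical: int = 300) -> int:
    """Map a replication lag value (in seconds) to a Nagios-style exit code."""
    # A NULL lag (None here) means replication is not running at all,
    # so this sketch treats it as the worst case.
    if seconds_behind is None:
        return CRITICAL
    if seconds_behind > critical:
        return CRITICAL
    if seconds_behind > warning:
        return WARNING
    return OK
```

With the example thresholds, a lag of 61 seconds maps to a warning and 301 seconds to a critical state.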

Setting up `check_mysql_replication`

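First, ensure the Nagios plugins are installed on a dedicated monitoring server or on one of your application servers with network access to your MySQL cluster. Package contents vary between releases (recent Debian/Ubuntu versions ship the core plugins as `monitoring-plugins`), so after installing, confirm that `check_mysql_replication` exists under `/usr/lib/nagios/plugins/` and review its `--help` output before relying on the options shown below. On Debian/Ubuntu systems: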

sudo apt update
sudo apt install nagios-plugins nagios-plugins-contrib

Next, create a dedicated MySQL user for monitoring. This user needs `REPLICATION CLIENT` privileges. It’s crucial to restrict this user to specific hosts for security. On your MySQL master server:

CREATE USER 'monitor'@'your_monitoring_server_ip' IDENTIFIED BY 'your_secure_password';
GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'your_monitoring_server_ip';
-- FLUSH PRIVILEGES is not required: CREATE USER and GRANT take effect immediately.

Repeat this on each replica server unless the account has already replicated from the master, adjusting the host and, if you prefer distinct credentials per replica, the user name (a single shared monitoring account is usually simpler). Ensure the monitoring user has the same privileges on all replica servers.

Configuring the Check

The command syntax for `check_mysql_replication` is as follows. We’ll use a configuration file for credentials to avoid embedding them directly in the command line, which is a security best practice.

Create a credentials file (e.g., `/etc/nagios/mysql.cnf`) on your monitoring host:

[client]
user=monitor
password=your_secure_password
host=your_mysql_master_ip
port=3306

Secure this file so that only its owner, which should be the account that runs the check, can read it:

sudo chmod 600 /etc/nagios/mysql.cnf

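Now, construct the check command. This example checks replication from a master to a replica, so run it once per replica, pointing each run at that replica’s master. The options are spelled out in full for clarity; if your plugin version reads credentials from the defaults file, omit the `--user` and `--password` options so the password does not show up in the process list.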

/usr/lib/nagios/plugins/check_mysql_replication \
--defaults-file=/etc/nagios/mysql.cnf \
--host=your_mysql_replica_ip \
--user=monitor \
--password=your_secure_password \
--warning=60 \
--critical=300 \
--master-host=your_mysql_master_ip \
--master-user=monitor \
--master-password=your_secure_password \
--master-port=3306

Note: If your MySQL cluster uses different credentials for master and replica connections (e.g., for replication setup), adjust the `--master-*` parameters accordingly. For simplicity, we’re using the same `monitor` user here, assuming it has the necessary permissions on both ends.

To automate this, you would typically integrate this command into a monitoring system like Nagios, Zabbix, or Prometheus (via `node_exporter`’s `textfile_collector`). For a basic setup, you can use cron:

# Example cron job to run every minute
* * * * * /usr/lib/nagios/plugins/check_mysql_replication --defaults-file=/etc/nagios/mysql.cnf --host=your_mysql_replica_ip --user=monitor --password=your_secure_password --warning=60 --critical=300 --master-host=your_mysql_master_ip --master-user=monitor --master-password=your_secure_password --master-port=3306 >> /var/log/mysql_replication_check.log 2>&1
# You would then need a separate script to parse this log and trigger alerts.
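
For the Prometheus route mentioned above, the textfile collector in `node_exporter` reads `*.prom` files from a configured directory. The sketch below shows one way such a file might be produced. The metric name `mysql_replication_lag_seconds`, the `replica` label, and both function names are assumptions made for illustration, and the lag value is expected to come from whatever check you already run.

```python
import os
import tempfile


def render_metric(lag_seconds: float, replica: str) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    name = "mysql_replication_lag_seconds"  # hypothetical metric name
    return (
        f"# HELP {name} Replication lag reported by the replica.\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{replica="{replica}"}} {lag_seconds}\n'
    )


def write_textfile(path: str, content: str) -> None:
    """Write atomically so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    # The temporary file does not end in .prom, so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by owner only
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```

A cron job could then call `write_textfile` with a path inside the directory passed to `node_exporter` via `--collector.textfile.directory`, and alerting rules in Prometheus would take over from log parsing.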

Advanced Alerting with Custom Scripts

Relying solely on cron and log files for alerting is brittle. A more robust approach wraps `check_mysql_replication` in a custom script that inspects the result and sends notifications via Slack, PagerDuty, or email. Such a script can also implement more sophisticated logic, such as de-duplication or escalation.

Here’s a Python script that runs the plugin and sends a Slack notification based on its exit code:

#!/usr/bin/env python3
import subprocess
import sys
import time
import requests
import json
import os

# Configuration
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "YOUR_SLACK_WEBHOOK_URL")
MYSQL_REPLICA_HOST = "your_mysql_replica_ip"
MYSQL_MASTER_HOST = "your_mysql_master_ip"
MYSQL_USER = "monitor"
MYSQL_PASSWORD = "your_secure_password"
MYSQL_DEFAULTS_FILE = "/etc/nagios/mysql.cnf"
WARNING_THRESHOLD = 60
CRITICAL_THRESHOLD = 300

def send_slack_notification(message, level="warning"):
    if not SLACK_WEBHOOK_URL or SLACK_WEBHOOK_URL == "YOUR_SLACK_WEBHOOK_URL":
        print("SLACK_WEBHOOK_URL not configured. Skipping notification.")
        return

    color = "#FFFF00" if level == "warning" else "#FF0000"
    payload = {
        "attachments": [
            {
                "color": color,
                "title": f"Magento MySQL Replication Alert ({level.upper()})",
                "text": message,
                "fields": [
                    {"title": "Replica Host", "value": MYSQL_REPLICA_HOST, "short": True},
                    {"title": "Master Host", "value": MYSQL_MASTER_HOST, "short": True},
                ],
                # Unix timestamp shown alongside the attachment
                "ts": int(time.time())
            }
        ]
    }
    try:
        # A timeout keeps a slow Slack endpoint from stalling the cron job.
        response = requests.post(SLACK_WEBHOOK_URL, data=json.dumps(payload),
                                 headers={'Content-Type': 'application/json'},
                                 timeout=10)
        response.raise_for_status()
        print(f"Slack notification sent successfully: {message}")
    except requests.exceptions.RequestException as e:
        print(f"Error sending Slack notification: {e}", file=sys.stderr)

def check_replication():
    command = [
        "/usr/lib/nagios/plugins/check_mysql_replication",
        f"--defaults-file={MYSQL_DEFAULTS_FILE}",
        f"--host={MYSQL_REPLICA_HOST}",
        f"--user={MYSQL_USER}",
        f"--password={MYSQL_PASSWORD}",
        f"--warning={WARNING_THRESHOLD}",
        f"--critical={CRITICAL_THRESHOLD}",
        f"--master-host={MYSQL_MASTER_HOST}",
        f"--master-user={MYSQL_USER}",
        f"--master-password={MYSQL_PASSWORD}",
        "--master-port=3306"
    ]

    try:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        output = result.stdout.strip()
        stderr = result.stderr.strip()
        return_code = result.returncode

        if return_code == 0:
            print(f"OK: {output}")
            return
        elif return_code == 1:
            print(f"WARNING: {output}")
            send_slack_notification(f"Replication lag detected: {output}", level="warning")
        elif return_code == 2:
            print(f"CRITICAL: {output}")
            send_slack_notification(f"High replication lag detected: {output}", level="critical")
        else:
            print(f"UNKNOWN: {output} | STDERR: {stderr}", file=sys.stderr)
            send_slack_notification(f"MySQL replication check failed: {output}\nSTDERR: {stderr}", level="critical")

    except FileNotFoundError:
        print("Error: check_mysql_replication plugin not found.", file=sys.stderr)
        send_slack_notification("check_mysql_replication plugin not found on monitoring server.", level="critical")
    except Exception as e:
        print(f"An unexpected error occurred: {e}", file=sys.stderr)
        send_slack_notification(f"An unexpected error occurred during MySQL replication check: {e}", level="critical")

if __name__ == "__main__":
    # Ensure SLACK_WEBHOOK_URL is set as an environment variable for security
    if not SLACK_WEBHOOK_URL or SLACK_WEBHOOK_URL == "YOUR_SLACK_WEBHOOK_URL":
        print("Error: SLACK_WEBHOOK_URL environment variable not set.", file=sys.stderr)
        sys.exit(1)
    check_replication()

To use this script:

Save it as `check_mysql_replication_alert.py`.
Make it executable: `chmod +x check_mysql_replication_alert.py`.
Make the `SLACK_WEBHOOK_URL` environment variable available to the job. Cron does not inherit variables exported in an interactive shell, so define it in the crontab itself, on a line above the job: `SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX`.
Schedule it via cron to run every minute: `* * * * * /usr/bin/python3 /path/to/your/script/check_mysql_replication_alert.py`.
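
Run every minute, a script like this would post to Slack once a minute for as long as the lag persists. The de-duplication mentioned earlier could be sketched with a small state file, as below. The function name `should_notify`, the state file layout, and the 15-minute repeat interval are all assumptions for illustration, not part of the script above.

```python
import json
import time
from pathlib import Path


def should_notify(state_file: Path, level: str, now: float,
                  repeat_after: int = 900) -> bool:
    """Return True when an alert for `level` should be sent.

    An alert goes out when the level differs from the last one recorded,
    or when the same level has persisted for `repeat_after` seconds since
    the last notification.
    """
    try:
        state = json.loads(state_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # first run or unreadable state: behave as if nothing was sent

    changed = state.get("level") != level
    stale = now - state.get("sent_at", 0) >= repeat_after
    if changed or stale:
        state_file.write_text(json.dumps({"level": level, "sent_at": now}))
        return True
    return False
```

One way to wire this in would be to guard each `send_slack_notification` call with `should_notify(Path("/var/tmp/replication_alert.json"), level, time.time())`, where the path is again only an example.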

Monitoring Magento 2 Application Health

Beyond the database, the Magento 2 application itself needs robust health checks. This involves monitoring key processes, resource utilization, and application-specific endpoints.

Web Server (Nginx/Apache) Process and Health Checks

Ensure your web server processes are running and responsive. For Nginx, this typically involves checking the `nginx` master and worker processes. For Apache, it’s `apache2` or `httpd`.

# Check Nginx process status
sudo systemctl is-active nginx

# Check Apache process status
sudo systemctl is-active apache2

A more comprehensive check involves sending a request to a specific health check endpoint on your Magento application. This endpoint should perform basic checks like database connectivity and cache status.

Create a simple health check script in your Magento root directory (e.g., `healthcheck.php`):

<?php
// Resolve the bootstrap relative to this file, not the current working directory.
require __DIR__ . '/app/bootstrap.php';

try {
    $bootstrap = \Magento\Framework\App\Bootstrap::create(BP, $_SERVER);
    $objectManager = $bootstrap->getObjectManager();

    // Check database connection
    $resource = $objectManager->get(\Magento\Framework\App\ResourceConnection::class);
    $connection = $resource->getConnection();
    $connection->query('SELECT 1'); // Simple query to test connection

    // Check cache status. isEnabled() expects a cache type ID such as 'config';
    // 'default' is not a cache type and would always be reported as disabled.
    $cacheState = $objectManager->get(\Magento\Framework\App\Cache\StateInterface::class);
    if (!$cacheState->isEnabled('config')) {
        throw new \Exception('Configuration cache is disabled.');
    }

    // Add more checks as needed (e.g., Redis connection, Elasticsearch connection)

    http_response_code(200);
    echo "OK: Magento application is healthy.";
    exit(0);

} catch (\Throwable $e) {
    // \Throwable also covers PHP errors, not just exceptions.
    http_response_code(503);
    echo "ERROR: " . $e->getMessage();
    exit(1);
}

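The script must stay reachable over HTTP for the check to work, so restrict who can call it rather than hiding it: use Nginx or Apache access rules to allow only your monitoring server’s IP address. If your web server’s document root is Magento’s `pub/` directory, a file in the Magento root is not served at all; in that case, place the script under `pub/` and adjust the `require` path accordingly. Then, use `curl` from your monitoring server: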

curl -s -o /dev/null -w "%{http_code}\n" http://your-magento-domain.com/healthcheck.php

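This command prints `200` when the application is healthy and `503` when it is not. Treat any other value as a failure too, including `000`, which `curl` prints when it receives no HTTP response at all. You can integrate this into your monitoring system’s checks.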
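
If the monitoring host runs Python rather than a shell wrapper, the same probe might look like the sketch below. It uses only the standard library. The mapping of status codes to Nagios-style states is an assumption chosen to match the endpoint described above, and the function names are illustrative.

```python
import urllib.error
import urllib.request

# Nagios-style states
OK, CRITICAL, UNKNOWN = 0, 2, 3


def status_to_state(http_status: int) -> int:
    """Translate the health endpoint's HTTP status into a monitoring state."""
    if http_status == 200:
        return OK
    if http_status == 503:
        return CRITICAL
    return UNKNOWN  # e.g. 403 from an access rule, 404 from a wrong path


def probe(url: str, timeout: float = 5.0) -> int:
    """Request the health endpoint and return a monitoring state."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return status_to_state(response.status)
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses; the status code is still available.
        return status_to_state(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout: the site is unreachable.
        return CRITICAL
```

Calling `probe("http://your-magento-domain.com/healthcheck.php")` and passing the result to `sys.exit` would make the script usable as a plugin in its own right.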

Resource Monitoring (CPU, Memory, Disk, Network)

Resource monitoring is essential for any server, and particularly so given how resource-intensive Magento is. Tools like `htop`, `vmstat`, `iostat`, and `netstat` are invaluable for manual inspection. For automated monitoring, Prometheus with `node_exporter` is a standard choice.

CPU and Memory Usage

High CPU or memory usage can cripple a Magento application. Monitor these metrics closely. For instance, using `node_exporter`’s default metrics:

# Example Prometheus query for CPU usage (percent) on a specific instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance="your_magento_server:9100"}[5m])) * 100)
# The rate of the idle counter is the fraction of time each core spent idle;
# averaging across cores and subtracting from 100 gives overall CPU usage.
# Alert when this value stays above a threshold (e.g., 80%).

# Example Prometheus query for high memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Alert when this percentage exceeds a threshold (e.g., 90%).
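
The memory formula can be sanity-checked outside Prometheus. The sketch below, with illustrative helper names, applies the same calculation to the contents of `/proc/meminfo`, which is where `node_exporter` obtains these values on Linux.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: value} mapping."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # the number before the unit
    return values


def memory_used_percent(meminfo: dict) -> float:
    """(MemTotal - MemAvailable) / MemTotal * 100, as in the query above."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total * 100
```

On a Linux host, `memory_used_percent(parse_meminfo(open("/proc/meminfo").read()))` should roughly match the value Prometheus reports for that instance.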

Disk I/O and Space

Magento’s disk I/O can be a bottleneck, especially during indexing or heavy cron jobs. Disk space is also critical to prevent service interruptions.

# Percentage of disk space still available on / (via node_exporter)
node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"} * 100
# Alert when this value drops below a threshold (e.g., 15% free).

# Check disk read/write operations per second (via node_exporter)
rate(node_disk_reads_completed_total{device="sda"}[1m])
rate(node_disk_writes_completed_total{device="sda"}[1m])
# Monitor these for sustained high rates that might indicate a bottleneck.
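
For a quick local cross-check of the free-space figure, the same ratio can be computed in Python. This is only a sketch: `percent_free` and `is_low_on_space` are illustrative names, and the 15% default simply mirrors the example threshold above. On Linux, `shutil.disk_usage` reports the space available to unprivileged users, which should correspond to what `node_filesystem_avail_bytes` measures.

```python
import shutil


def percent_free(free_bytes: int, total_bytes: int) -> float:
    """Share of the filesystem that is still available, in percent."""
    return free_bytes * 100 / total_bytes


def is_low_on_space(path: str = "/", threshold: float = 15.0) -> bool:
    """True when the filesystem holding `path` has less than `threshold`% free."""
    usage = shutil.disk_usage(path)
    return percent_free(usage.free, usage.total) < threshold
```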

Network Traffic

Monitor network ingress and egress to detect unusual traffic patterns or potential DoS attacks.

# Network traffic in bytes per second (via node_exporter)
rate(node_network_receive_bytes_total{device="eth0"}[1m])
rate(node_network_transmit_bytes_total{device="eth0"}[1m])
# Set alerts for abnormally high traffic volumes.

Log Monitoring and Analysis

Centralized log management is crucial for debugging and security. Tools like Elasticsearch, Logstash, and Kibana (ELK stack), or Grafana Loki, are essential for aggregating logs from all your Magento and database servers.

Key Log Files to Monitor

Magento `system.log` and `exception.log` (located in `var/log/`).
Web server access and error logs (e.g., `/var/log/nginx/access.log`, `/var/log/nginx/error.log`).
PHP-FPM error logs.
MySQL error logs.
System logs (`/var/log/syslog`, `/var/log/auth.log`).

Configure your log shipping agent (e.g., Filebeat, Promtail) to tail these files and send them to your central logging system. Set up alerts for specific error patterns (e.g., `5xx` status codes in Nginx logs, specific PHP exceptions, MySQL errors).
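
As an illustration of the `5xx` pattern alert, the sketch below counts server errors in Nginx access log lines. It assumes the default `combined` log format, where the status code follows the quoted request line; a custom `log_format` would need a different pattern. The function name is illustrative, and in production this logic would normally live in your logging system’s alert rules rather than in a script.

```python
import re
from typing import Iterable

# In the "combined" format the status code directly follows the quoted
# request line, e.g.:  "GET /checkout HTTP/1.1" 502 157
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_5xx(lines: Iterable[str]) -> int:
    """Count log lines whose HTTP status code is in the 500 range."""
    count = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1).startswith("5"):
            count += 1
    return count
```

Passing an open file handle for `/var/log/nginx/access.log` works directly, since a file object iterates line by line.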

OVH Specific Considerations

OVH’s infrastructure offers specific tools and considerations:

OVH Control Panel: Regularly check the OVH control panel for any hardware alerts, network issues, or service notifications related to your instances and managed databases.
Managed MySQL: While OVH manages the underlying infrastructure, they often provide basic performance metrics and logs through their portal. Integrate these into your broader monitoring strategy where possible.
Network Monitoring: OVH provides network traffic statistics. Correlate these with your server-level network monitoring to identify potential external issues or internal misconfigurations.
Instance Snapshots: Ensure you have a robust snapshot strategy for your Magento servers. While not strictly monitoring, it’s a critical part of disaster recovery and can be triggered by severe monitoring alerts.

By implementing a layered monitoring strategy that combines application-level checks, database health, resource utilization, and centralized logging, you can ensure the stability and performance of your Magento 2 application on OVH infrastructure.