Server Monitoring Best Practices: Keeping Your Magento 2 App and MySQL Clusters Alive on Google Cloud

Proactive MySQL Replication Lag Detection

For Magento 2, especially with a read-replica setup for scaling, replication lag is a critical failure point. Unchecked, it can lead to stale data being served to users, impacting orders and inventory accuracy. We need a robust, automated mechanism to detect and alert on this lag before it becomes a production issue. This involves querying the replication status directly from MySQL and setting up a threshold.

We’ll use a simple Bash script that runs periodically via cron. This script connects to the replica, checks the `Seconds_Behind_Master` value (reported as `Seconds_Behind_Source` by `SHOW REPLICA STATUS` in MySQL 8.0.22 and later), and triggers an alert if it exceeds a predefined limit. For alerting, we can integrate with tools like PagerDuty, Slack, or simply send an email.

Bash Script for MySQL Replication Lag Monitoring

Create a script, for example, /opt/scripts/check_mysql_replication.sh. Ensure it’s executable (`chmod +x /opt/scripts/check_mysql_replication.sh`).

This script requires MySQL credentials. It’s best practice to use a dedicated monitoring user with minimal privileges (e.g., `REPLICATION CLIENT`). Store these credentials securely, perhaps in a separate configuration file with restricted permissions.

Let’s define the script:

Configuration File (/etc/mysql/monitor.cnf):

[client]
user=monitor_user
password=your_secure_password
host=your_mysql_replica_host
port=3306

Bash Script (/opt/scripts/check_mysql_replication.sh):

#!/bin/bash

# --- Configuration ---
REPLICA_HOST="your_mysql_replica_host" # Overrides the host in the .cnf file; also used in alert text
REPLICA_DB="mysql" # The database to connect to for status
LAG_THRESHOLD_SECONDS=300 # Alert if lag is greater than 5 minutes
ALERT_EMAIL="alerts@example.com" # Placeholder address; replace with your own
ALERT_SUBJECT="CRITICAL: MySQL Replication Lag on ${REPLICA_HOST}"
MYSQL_CONFIG_FILE="/etc/mysql/monitor.cnf" # User and password are read from this file, not stored in the script

# --- Functions ---
send_alert() {
    local message="$1"
    # Basic email alert. Replace with your preferred alerting mechanism (e.g., PagerDuty API, Slack webhook)
    # mail -s sets the subject; printf '%b' expands the \n sequences in the message body
    printf '%b\n' "${message}" | mail -s "${ALERT_SUBJECT}" "${ALERT_EMAIL}"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ALERT: ${message}"
}

# --- Main Logic ---
echo "$(date '+%Y-%m-%d %H:%M:%S') - Checking MySQL replication lag on ${REPLICA_HOST}..."

# Run SHOW REPLICA STATUS with the credentials from the config file.
# --defaults-extra-file must be the first option on the command line.
# On MySQL versions before 8.0.22, use SHOW SLAVE STATUS instead.
REPLICA_STATUS=$(mysql --defaults-extra-file="${MYSQL_CONFIG_FILE}" -h "${REPLICA_HOST}" -D "${REPLICA_DB}" -e 'SHOW REPLICA STATUS\G' 2>&1)
MYSQL_EXIT_CODE=$?

# Check if the command executed successfully
if [ "${MYSQL_EXIT_CODE}" -ne 0 ]; then
    send_alert "Failed to connect to MySQL replica ${REPLICA_HOST} or execute SHOW REPLICA STATUS. Output: ${REPLICA_STATUS}"
    exit 1
fi

# Extract the lag value. SHOW REPLICA STATUS (MySQL 8.0.22+) reports
# Seconds_Behind_Source; older versions report Seconds_Behind_Master.
LAG_SECONDS=$(echo "${REPLICA_STATUS}" | grep -iE "Seconds_Behind_(Source|Master):" | awk '{print $2}')

# The value is empty when the host is not configured as a replica and NULL
# when replication is stopped or broken. Both cases are treated as failures,
# and the numeric test also protects the integer comparison below.
if [[ ! "$LAG_SECONDS" =~ ^[0-9]+$ ]]; then
    send_alert "Could not determine replication lag for ${REPLICA_HOST}. Replication status output: ${REPLICA_STATUS}"
    exit 1
fi

echo "$(date '+%Y-%m-%d %H:%M:%S') - Current replication lag: ${LAG_SECONDS} seconds."

# Compare lag with threshold
if [ "${LAG_SECONDS}" -gt "${LAG_THRESHOLD_SECONDS}" ]; then
    ALERT_MESSAGE="Replication lag on ${REPLICA_HOST} has exceeded the threshold of ${LAG_THRESHOLD_SECONDS} seconds. Current lag: ${LAG_SECONDS} seconds.\n\nReplication Status:\n${REPLICA_STATUS}"
    send_alert "${ALERT_MESSAGE}"
    exit 1
else
    echo "$(date '+%Y-%m-%d %H:%M:%S') - Replication lag is within acceptable limits."
    exit 0
fi

Cron Job Setup:

Edit the crontab for the user that will run this script (e.g., root or a dedicated monitoring user):

sudo crontab -e

Add a line to run the script every minute:

# Check MySQL replication lag every minute
* * * * * /opt/scripts/check_mysql_replication.sh >> /var/log/mysql_replication_check.log 2>&1
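
Because the job runs every minute, a sustained lag would send one email per minute. The following is a minimal sketch of a cooldown helper that could be added to the script; `should_alert`, `STATE_FILE`, and `COOLDOWN_SECONDS` are hypothetical names introduced here for illustration, and the default path and interval are assumptions to adapt.

```shell
#!/bin/bash
# Hypothetical cooldown helper; STATE_FILE and COOLDOWN_SECONDS are assumptions,
# not part of the monitoring script above.
STATE_FILE="${STATE_FILE:-/var/run/mysql_replication_alert.last}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-900}"

# Returns 0 (send the alert) if no alert was recorded within the cooldown
# window, and records the current time. Returns 1 otherwise.
should_alert() {
    local now last
    now=$(date +%s)
    last=0
    if [ -f "${STATE_FILE}" ]; then
        last=$(cat "${STATE_FILE}")
    fi
    # Treat an empty or corrupt state file as "never alerted"
    [[ "${last}" =~ ^[0-9]+$ ]] || last=0
    if [ $((now - last)) -ge "${COOLDOWN_SECONDS}" ]; then
        echo "${now}" > "${STATE_FILE}"
        return 0
    fi
    return 1
}

# Example usage inside the monitoring script:
# if should_alert; then send_alert "${ALERT_MESSAGE}"; fi
```

The check still runs and logs every minute; only the notification is throttled, so the log file keeps a complete record of the lag.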

Security Note: Ensure /etc/mysql/monitor.cnf has strict permissions: chmod 600 /etc/mysql/monitor.cnf and chown root:root /etc/mysql/monitor.cnf (or the user running the cron job).
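
The permission requirement can also be enforced by the script itself before it connects. This is a sketch under the assumption that refusing to run is preferable to using a world-readable credentials file; `check_cnf_permissions` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 only if the credentials file exists and
# its mode is 600 or 400 (readable by the owner only).
check_cnf_permissions() {
    local file="$1" mode
    if [ ! -f "${file}" ]; then
        echo "Missing credentials file: ${file}" >&2
        return 1
    fi
    # GNU stat (Linux) uses -c '%a'; BSD/macOS stat uses -f '%Lp'
    mode=$(stat -c '%a' "${file}" 2>/dev/null || stat -f '%Lp' "${file}")
    case "${mode}" in
        600|400) return 0 ;;
        *)
            echo "Insecure permissions ${mode} on ${file} (expected 600)" >&2
            return 1
            ;;
    esac
}

# Example usage near the top of the monitoring script:
# check_cnf_permissions "/etc/mysql/monitor.cnf" || exit 1
```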

Magento 2 Application Health Checks

Beyond database health, the Magento 2 application itself needs constant monitoring. This includes checking if the web server (Nginx/Apache) is responding, if PHP-FPM is healthy, and if the Magento application is serving valid responses.

Nginx/Apache Health Check

A basic check is to ensure the web server is listening on its port and responding to HTTP requests. We can use curl for this.

Script Snippet (can be integrated into a larger monitoring script):

#!/bin/bash

MAGENTO_URL="https://your-magento-domain.com"
EXPECTED_HTTP_CODE=200
TIMEOUT_SECONDS=10

echo "Checking Magento application health at ${MAGENTO_URL}..."

HTTP_RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout "${TIMEOUT_SECONDS}" --max-time "${TIMEOUT_SECONDS}" "${MAGENTO_URL}")

if [ "$HTTP_RESPONSE" != "${EXPECTED_HTTP_CODE}" ]; then
    echo "ALERT: Magento application at ${MAGENTO_URL} returned HTTP code ${HTTP_RESPONSE} (expected ${EXPECTED_HTTP_CODE})."
    # Add alerting mechanism here
    exit 1
else
    echo "Magento application is responding with HTTP ${HTTP_RESPONSE}."
fi
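
A single failed request can be a transient network blip and not an outage. One way to reduce false alarms, sketched here under that assumption, is to retry the check a few times before alerting. The `retry` and `http_ok` functions are hypothetical helpers, and the attempt count and delay in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: runs a command up to max_attempts times, sleeping
# delay_seconds between attempts. Returns 0 on the first success, 1 if
# every attempt fails.
retry() {
    local max_attempts="$1" delay_seconds="$2"
    shift 2
    local attempt=1
    while [ "${attempt}" -le "${max_attempts}" ]; do
        if "$@"; then
            return 0
        fi
        if [ "${attempt}" -lt "${max_attempts}" ]; then
            sleep "${delay_seconds}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical check: succeeds only when the URL answers with HTTP 200.
http_ok() {
    local code
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$1")
    [ "${code}" = "200" ]
}

# Example usage: alert only after three consecutive failures, 5 seconds apart.
# retry 3 5 http_ok "https://your-magento-domain.com" || echo "ALERT: site is not responding"
```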

PHP-FPM Status Check

PHP-FPM has a built-in status page that can provide valuable insights into its health and performance. To enable this, you need to configure your PHP-FPM pool.

Edit your PHP-FPM pool configuration file (e.g., /etc/php/8.1/fpm/pool.d/www.conf or similar):

; Add or uncomment these lines in your pool configuration
pm.status_path = /fpm_status
; Ensure the socket or port is accessible by your web server
; listen = /run/php/php8.1-fpm.sock
; listen.owner = www-data
; listen.group = www-data
; listen.mode = 0660

Configure Nginx to pass requests for the status path to PHP-FPM:

# Inside your Magento site's Nginx configuration (e.g., /etc/nginx/sites-available/magento)
location ~ ^/fpm_status$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    # Use the correct socket or port for your PHP-FPM pool
    fastcgi_pass unix:/run/php/php8.1-fpm.sock;
    # Or if using TCP/IP:
    # fastcgi_pass 127.0.0.1:9000;
    
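    # Restrict access to localhost and your monitoring host; the status page
    # should not be reachable from the public internet
    allow 127.0.0.1;
    # allow 192.168.1.100; # Example monitoring host IP
    deny all;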
}

After making these changes, reload Nginx and PHP-FPM:

sudo systemctl reload nginx
sudo systemctl reload php8.1-fpm # Adjust version as needed

Now you can use curl, from a host that the access rules permit, to check the status:

curl http://your-magento-domain.com/fpm_status

This will output statistics like the following (values are illustrative):

pool:                 www
process manager:      dynamic
start time:           15/Mar/2023:11:00:00 +0000
accepted conn:        12345678
listen queue:         0
max listen queue:     0
listen queue len:     128
active processes:     5
max active processes: 10
max children reached: 0
slow requests:        0

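You can script checks for specific metrics, such as ensuring active processes is not zero and listen queue is low. A listen queue that stays above zero, or a max children reached counter that keeps increasing, indicates that the pool is running out of workers.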
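
Such a check can be sketched as follows, assuming the plain-text status format shown above. The function names `fpm_metric` and `check_fpm_status` and the default queue limit of 5 are hypothetical choices for illustration.

```shell
#!/bin/bash
# Hypothetical helper: prints the value of one field from PHP-FPM
# plain-text status output read on stdin.
fpm_metric() {
    local name="$1"
    awk -F':[ \t]*' -v key="${name}" '$1 == key { print $2; exit }'
}

# Returns 0 when the pool looks healthy, 1 otherwise.
# $1 = status page text, $2 = maximum acceptable listen queue (default 5)
check_fpm_status() {
    local status="$1" max_queue="${2:-5}"
    local queue active
    queue=$(echo "${status}" | fpm_metric "listen queue")
    active=$(echo "${status}" | fpm_metric "active processes")

    if ! [[ "${queue}" =~ ^[0-9]+$ && "${active}" =~ ^[0-9]+$ ]]; then
        echo "ALERT: could not parse PHP-FPM status output"
        return 1
    fi
    if [ "${active}" -lt 1 ]; then
        echo "ALERT: PHP-FPM reports no active processes"
        return 1
    fi
    if [ "${queue}" -gt "${max_queue}" ]; then
        echo "ALERT: PHP-FPM listen queue is ${queue} (limit ${max_queue})"
        return 1
    fi
    echo "PHP-FPM pool OK (active: ${active}, listen queue: ${queue})"
}

# Example usage:
# check_fpm_status "$(curl -s http://your-magento-domain.com/fpm_status)" 5 || exit 1
```

Matching on the exact field name keeps `listen queue` from being confused with `max listen queue` or `listen queue len`.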

Magento Application-Level Checks

For deeper application health, consider implementing custom health check endpoints within your Magento application. This allows you to verify critical dependencies like database connectivity (beyond just replication status), cache status, and even external service integrations.

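Create a custom module (e.g., Appseconnect/Healthcheck) and define a route for your health check. Like any Magento 2 module, it also needs a registration.php and an etc/module.xml that declares Appseconnect_Healthcheck; those two files are not shown here.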

app/code/Appseconnect/Healthcheck/etc/frontend/routes.xml

<?xml version="1.0"?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:App/etc/routes.xsd">
    <router id="standard">
        <route id="healthcheck" frontName="healthcheck">
            <module name="Appseconnect_Healthcheck" />
        </route>
    </router>
</config>

app/code/Appseconnect/Healthcheck/Controller/Index/Index.php

<?php
namespace Appseconnect\Healthcheck\Controller\Index;

use Magento\Framework\App\Action\Action;
use Magento\Framework\App\Action\Context;
use Magento\Framework\Controller\Result\JsonFactory;
use Magento\Framework\App\ResourceConnection;
use Magento\Framework\App\CacheInterface;

class Index extends Action
{
    /**
     * @var JsonFactory
     */
    protected $resultJsonFactory;

    /**
     * @var ResourceConnection
     */
    protected $resourceConnection;

    /**
     * @var CacheInterface
     */
    protected $cache;

    /**
     * @param Context $context
     * @param JsonFactory $resultJsonFactory
     * @param ResourceConnection $resourceConnection
     * @param CacheInterface $cache
     */
    public function __construct(
        Context $context,
        JsonFactory $resultJsonFactory,
        ResourceConnection $resourceConnection,
        CacheInterface $cache
    ) {
        parent::__construct($context);
        $this->resultJsonFactory = $resultJsonFactory;
        $this->resourceConnection = $resourceConnection;
        $this->cache = $cache;
    }

    public function execute()
    {
        $result = [
            'status' => 'FAIL',
            'checks' => []
        ];

        // 1. Database Connection Check
        try {
            $connection = $this->resourceConnection->getConnection();
            $connection->query('SELECT 1'); // Simple query to test connection
            $result['checks']['database'] = ['status' => 'OK'];
        } catch (\Exception $e) {
            $result['checks']['database'] = ['status' => 'FAIL', 'error' => $e->getMessage()];
        }

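        // 2. Cache Check (the default cache backend, e.g., Redis or files; Varnish is not covered by this check)
        try {
            // save() reports failure through its return value, so verify a full write/read round trip
            $saved = $this->cache->save('healthcheck_test', 'healthcheck_test', ['healthcheck'], 60);
            if (!$saved || $this->cache->load('healthcheck_test') !== 'healthcheck_test') {
                throw new \RuntimeException('Cache write/read round trip failed');
            }
            $this->cache->remove('healthcheck_test');
            $result['checks']['cache'] = ['status' => 'OK'];
        } catch (\Exception $e) {
            $result['checks']['cache'] = ['status' => 'FAIL', 'error' => $e->getMessage()];
        }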

        // Add more checks here:
        // - External API connectivity
        // - Message Queue status
        // - Elasticsearch status

        // Determine overall status
        $allChecksOK = true;
        foreach ($result['checks'] as $check) {
            if ($check['status'] === 'FAIL') {
                $allChecksOK = false;
                break;
            }
        }

        if ($allChecksOK) {
            $result['status'] = 'OK';
            $httpStatusCode = 200;
        } else {
            $result['status'] = 'FAIL';
            $httpStatusCode = 503; // Service Unavailable
        }

        $resultJson = $this->resultJsonFactory->create();
        $resultJson->setData($result);
        $this->getResponse()->setHttpResponseCode($httpStatusCode);
        return $resultJson;
    }
}

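After creating the module and its files, enable the module and apply the changes (in production mode, also run setup:di:compile):

php bin/magento module:enable Appseconnect_Healthcheck
php bin/magento setup:upgrade
php bin/magento cache:flush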

You can then access this endpoint via curl https://your-magento-domain.com/healthcheck. The output is JSON containing an overall status and the status of each check; the endpoint returns HTTP 200 when all checks pass and 503 otherwise.

Monitoring Script Integration:

#!/bin/bash

MAGENTO_URL="https://your-magento-domain.com"
HEALTHCHECK_ENDPOINT="${MAGENTO_URL}/healthcheck"
EXPECTED_HTTP_CODE=200
TIMEOUT_SECONDS=15

echo "Checking Magento application health endpoint at ${HEALTHCHECK_ENDPOINT}..."

# Fetch the body and the status code in a single request.
# curl appends the status code on its own line after the body.
RESPONSE=$(curl -s -w '\n%{http_code}' --connect-timeout "${TIMEOUT_SECONDS}" --max-time "${TIMEOUT_SECONDS}" "${HEALTHCHECK_ENDPOINT}")
HEALTH_RESPONSE="${RESPONSE##*$'\n'}" # Last line: HTTP status code
HEALTH_JSON="${RESPONSE%$'\n'*}"      # Everything before it: JSON body

if [ "$HEALTH_RESPONSE" != "${EXPECTED_HTTP_CODE}" ]; then
    echo "ALERT: Magento healthcheck endpoint returned HTTP code ${HEALTH_RESPONSE} (expected ${EXPECTED_HTTP_CODE}). Details: ${HEALTH_JSON}"
    # Add alerting mechanism here
    exit 1
fi

# Parse the JSON response for a deeper check
APP_STATUS=$(echo "${HEALTH_JSON}" | jq -r '.status') # Requires jq to be installed

if [ "$APP_STATUS" != "OK" ]; then
    echo "ALERT: Magento application reported status: ${APP_STATUS}. Details: ${HEALTH_JSON}"
    # Add alerting mechanism here
    exit 1
else
    echo "Magento application healthcheck passed. Status: OK"
fi

Google Cloud Monitoring Integration

Google Cloud’s operations suite (formerly Stackdriver) provides robust monitoring capabilities. We can leverage it to collect logs, metrics, and set up alerts.

Custom Metrics with Ops Agent

The Ops Agent is the recommended way to collect logs and metrics from your Compute Engine instances. We can configure it to collect custom metrics, such as the output of our health check scripts or specific application logs.

Install the Ops Agent: Follow the official Google Cloud documentation for installation on your VM instances.

Configure for Custom Metrics:

Edit the Ops Agent configuration file (e.g., /etc/google-cloud-ops-agent/config.yaml).

logging:
  receivers:
    mysql_replication_log:
      type: files
      include_paths:
        - /var/log/mysql_replication_check.log # Log from our script
  processors:
    # Extract the lag from the "Current replication lag: N seconds." line
    # that the script writes on every run
    extract_lag:
      type: parse_regex
      field: message
      regex: '^(?<timestamp>\S+ \S+) - Current replication lag: (?<lag_seconds>\d+) seconds\.$'
      # Adjust the regex if you change the script's log format
  service:
    pipelines:
      # A receiver is only active once a pipeline references it
      mysql_replication:
        receivers: [mysql_replication_log]
        processors: [extract_lag]

metrics:
  # The Ops Agent collects its built-in host metrics by default.
  # It does not run arbitrary scripts as a metrics source. For simple cases,
  # parsing logs is often sufficient. To collect custom metrics directly,
  # expose them in Prometheus format (for example through an exporter) and
  # scrape them with the prometheus receiver:
  # receivers:
  #   my_app_metrics:
  #     type: prometheus
  #     config:
  #       scrape_configs:
  #         - job_name: magento_health
  #           scrape_interval: 60s
  #           static_configs:
  #             - targets: ['localhost:9091'] # Assuming your app exposes metrics here
  # service:
  #   pipelines:
  #     my_app_pipeline:
  #       receivers: [my_app_metrics]
# --- IMPORTANT ---
# After editing config.yaml, restart the Ops Agent:
# sudo systemctl restart google-cloud-ops-agent

Once configured and the agent restarted, logs from /var/log/mysql_replication_check.log will be sent to Cloud Logging. If you configure processors, the extracted fields (like `lag_seconds`) are stored in each matching entry's jsonPayload, and you can build log-based metrics on them for use in Cloud Monitoring.

Setting up Cloud Monitoring Alerts

In the Google Cloud Console, log-based metrics are created under Logging -> Log-based Metrics, and alerting policies are created under Monitoring -> Alerting.

Create Alerting Policies:

For MySQL Replication Lag:
- Go to Logging -> Log-based Metrics and create a new counter metric.
- Use a log filter that matches your replication check logs (e.g., resource.type="gce_instance" AND logName="projects/YOUR_PROJECT_ID/logs/mysql_replication_log" AND textPayload=~"ALERT: Replication lag"). With the Ops Agent, the last segment of logName is the receiver ID from config.yaml, not the log file name.
- The counter metric counts the log entries that match the filter.
- Create an alerting policy based on this new metric (e.g., trigger if count > 0 in 5 minutes).

For PHP-FPM Status:
- If you’ve configured Ops Agent to collect metrics from PHP-FPM’s status page (e.g., using the Prometheus receiver with a PHP-FPM exporter), create an alert based on the active-processes or listen-queue metric. The exact metric names, such as php_fpm_process_manager_active_processes, depend on the exporter you use.
- Alternatively, use a log-based metric if your web server logs errors related to PHP-FPM.

For Magento Application Health:
- If you log the results of the custom health check endpoint or export them in Prometheus format, create a log-based or metric-based alert. For example, alert if a custom gauge such as magento/healthcheck/status (1 for OK, 0 for FAIL) drops to 0.
- Alternatively, a Cloud Monitoring uptime check can poll the /healthcheck URL directly and alert on non-200 responses.

Configure notification channels (Email, PagerDuty, Slack) for these alerting policies.
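
The log-based metric for replication lag can also be created from the command line instead of the console. The sketch below assumes an authenticated gcloud CLI and that the log ID equals the Ops Agent receiver ID (`mysql_replication_log`). The function name `build_lag_alert_filter` and the metric name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the Cloud Logging filter that matches the
# ALERT lines written by the replication check script.
# $1 = project ID, $2 = log ID (assumed to be the Ops Agent receiver ID)
build_lag_alert_filter() {
    local project_id="$1" log_id="${2:-mysql_replication_log}"
    printf 'resource.type="gce_instance" AND logName="projects/%s/logs/%s" AND textPayload=~"ALERT: Replication lag"' \
        "${project_id}" "${log_id}"
}

# Example usage (the metric name is a placeholder):
# gcloud logging metrics create mysql_replication_lag_alerts \
#     --project="YOUR_PROJECT_ID" \
#     --description="Replication lag alerts logged by check_mysql_replication.sh" \
#     --log-filter="$(build_lag_alert_filter YOUR_PROJECT_ID)"
```

Keeping the filter in one function makes it easy to paste the same expression into the Logs Explorer first and confirm that it matches the expected entries.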

Conclusion

A comprehensive monitoring strategy for Magento 2 on Google Cloud involves a multi-layered approach. Proactive detection of MySQL replication lag, continuous health checks of the web server and PHP-FPM, and application-level validation through custom endpoints are crucial. Integrating these checks with Google Cloud’s Ops Agent and Monitoring allows for centralized visibility and automated alerting, ensuring the stability and reliability of your e-commerce platform.