Server Monitoring Best Practices: Keeping Your Ruby App and MySQL Clusters Alive on OVH

Proactive MySQL Cluster Health Checks with `pt-heartbeat`

Maintaining high availability for your Ruby application’s MySQL cluster on OVH requires more than just reactive alerts. Proactive monitoring of replication lag is paramount. We’ll leverage Percona Toolkit’s `pt-heartbeat` to continuously track replication lag and integrate it with a robust alerting system.

The core idea is to have a process on the primary MySQL server that periodically writes a timestamp to a dedicated table. A corresponding process on each replica then reads this timestamp and calculates the difference between the current time and the timestamp it read. This difference is our replication lag.

Setting up `pt-heartbeat` on the Primary

First, ensure Percona Toolkit is installed on your primary MySQL server. If not, you can typically install it via your distribution’s package manager (e.g., `apt-get install percona-toolkit` on Debian/Ubuntu, or `yum install percona-toolkit` on CentOS/RHEL). Create a dedicated user for `pt-heartbeat` with minimal privileges.

On your primary MySQL instance:

Create a database and table for heartbeats:

CREATE DATABASE IF NOT EXISTS heartbeat;
USE heartbeat;
CREATE TABLE IF NOT EXISTS ping (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  server_id INT NOT NULL,
  ts DATETIME NOT NULL
);

Create a user for `pt-heartbeat`:

CREATE USER 'heartbeat'@'localhost' IDENTIFIED BY 'your_strong_password';
GRANT INSERT, SELECT ON heartbeat.ping TO 'heartbeat'@'localhost';
FLUSH PRIVILEGES;

Now, configure `pt-heartbeat` to run as a daemon on the primary. This script will periodically insert a new row into the `heartbeat.ping` table.

pt-heartbeat --daemon \
  --interval 1 \
  --host 127.0.0.1 \
  --user heartbeat \
  --password 'your_strong_password' \
  --database heartbeat \
  --table ping \
  --update-primary \
  --pid=/var/run/pt-heartbeat-primary.pid \
  --log=/var/log/pt-heartbeat-primary.log \
  --set-server-id

Explanation of key options:

--daemon: Runs `pt-heartbeat` in the background.
--interval 1: Inserts a heartbeat every 1 second. Adjust based on your tolerance for very small lags.
--update-primary: Automatically determines the primary server ID.
--set-server-id: Sets the `server_id` column in the `ping` table.
--pid and --log: Essential for daemon management and debugging.

Monitoring Replication Lag on Replicas

On each of your MySQL replica servers, you’ll run `pt-heartbeat` in a different mode to read the heartbeats and calculate the lag. You’ll need a dedicated user on the replicas as well.

On each replica MySQL instance:

Create a user for `pt-heartbeat` (can be the same credentials as primary, but ensure it has SELECT access on the heartbeat database):

CREATE USER 'heartbeat'@'localhost' IDENTIFIED BY 'your_strong_password';
GRANT SELECT ON heartbeat.ping TO 'heartbeat'@'localhost';
FLUSH PRIVILEGES;

Now, run `pt-heartbeat` on each replica to monitor the lag. This command will output the calculated lag to standard output, making it easy to parse by monitoring agents.

pt-heartbeat --interval 5 \
  --host 127.0.0.1 \
  --user heartbeat \
  --password 'your_strong_password' \
  --database heartbeat \
  --table ping \
  --pid=/var/run/pt-heartbeat-replica.pid \
  --log=/var/log/pt-heartbeat-replica.log \
  --monitor

Explanation of key options:

--interval 5: Checks for heartbeats every 5 seconds. This is a good balance between responsiveness and reducing load.
--monitor: This is the crucial flag that tells `pt-heartbeat` to calculate and output the replication lag.

Integrating with Prometheus and Alertmanager

The output of `pt-heartbeat –monitor` is a simple string representing the lag in seconds. We can use `node_exporter`’s textfile collector to expose this metric to Prometheus.

On each replica server:

Create a script that runs `pt-heartbeat` and writes its output to a file in the `node_exporter`’s textfile collector directory (e.g., /var/lib/node_exporter/textfile_collector/).

#!/bin/bash

# Configuration
HEARTBEAT_CMD="pt-heartbeat --interval 5 --host 127.0.0.1 --user heartbeat --password 'your_strong_password' --database heartbeat --table ping --monitor"
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/mysql_replication_lag.prom"
LOG_FILE="/var/log/mysql_replication_lag_monitor.log"
ERROR_LOG_FILE="/var/log/mysql_replication_lag_monitor_error.log"

# Ensure the output directory exists
mkdir -p $(dirname "$OUTPUT_FILE")

# Run pt-heartbeat and capture output
LAG_SECONDS=$($HEARTBEAT_CMD 2>> "$ERROR_LOG_FILE")
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
    # Format for Prometheus
    echo "# HELP mysql_replication_lag_seconds MySQL replication lag in seconds." > "$OUTPUT_FILE"
    echo "# TYPE mysql_replication_lag_seconds gauge" >> "$OUTPUT_FILE"
    echo "mysql_replication_lag_seconds{replica_host=\"$(hostname -f)\"} $LAG_SECONDS" >> "$OUTPUT_FILE"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - Lag: ${LAG_SECONDS}s" >> "$LOG_FILE"
else
    echo "$(date '+%Y-%m-%d %H:%M:%S') - Error running pt-heartbeat. Exit code: $EXIT_CODE" >> "$LOG_FILE"
    # Optionally, write a specific error metric or leave the file as is
    echo "# HELP mysql_replication_lag_error Error status of replication lag check." > "$OUTPUT_FILE"
    echo "# TYPE mysql_replication_lag_error gauge" >> "$OUTPUT_FILE"
    echo "mysql_replication_lag_error{replica_host=\"$(hostname -f)\"} 1" >> "$OUTPUT_FILE"
fi

exit 0

Make this script executable and set up a cron job to run it every minute (or more frequently if needed).

chmod +x /usr/local/bin/monitor_mysql_lag.sh
crontab -l | { cat; echo "*/1 * * * * /usr/local/bin/monitor_mysql_lag.sh"; } | crontab -

Ensure your `node_exporter` is configured to read from the textfile collector directory. In Prometheus, you’ll then scrape these metrics.

Prometheus and Alertmanager Configuration

In your Prometheus configuration (prometheus.yml), ensure your `node_exporter` targets are correctly set up to scrape the metrics exposed by the textfile collector.

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['replica1.yourdomain.com:9100', 'replica2.yourdomain.com:9100', 'replica3.yourdomain.com:9100']

Now, define alerting rules in Prometheus (e.g., in a file like alerts.yml) that will be sent to Alertmanager.

groups:
- name: mysql_replication_alerts
  rules:
  - alert: HighMySQLReplicationLag
    expr: mysql_replication_lag_seconds > 60  # Alert if lag is greater than 60 seconds
    for: 5m  # Continue to fire for 5 minutes before sending
    labels:
      severity: critical
    annotations:
      summary: "High MySQL replication lag on {{ $labels.replica_host }}"
      description: "MySQL replication lag on {{ $labels.replica_host }} has exceeded 60 seconds for 5 minutes. Current lag: {{ $value }}s."

  - alert: MySQLReplicationLagging
    expr: mysql_replication_lag_seconds > 30  # Alert if lag is greater than 30 seconds
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "MySQL replication lag detected on {{ $labels.replica_host }}"
      description: "MySQL replication lag on {{ $labels.replica_host }} has exceeded 30 seconds for 2 minutes. Current lag: {{ $value }}s."

  - alert: MySQLReplicationCheckFailed
    expr: mysql_replication_lag_error == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MySQL replication lag check failed on {{ $labels.replica_host }}"
      description: "The script to check MySQL replication lag on {{ $labels.replica_host }} failed to execute. Please investigate."

Ensure your Alertmanager configuration is set up to receive these alerts and route them to your desired notification channels (Slack, PagerDuty, email, etc.).

Ruby Application Health Checks and Load Balancer Integration

For your Ruby application, a simple HTTP health check endpoint is insufficient. We need to ensure the application can actually connect to and query the database. A common pattern is to have an endpoint that performs a lightweight database query.

In a Rails application, you might add a controller action like this:

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  skip_before_action :authenticate_user! # If using Devise or similar

  def show
    begin
      # Perform a simple, fast database query
      # For ActiveRecord, this could be:
      ActiveRecord::Base.connection.execute('SELECT 1')
      render json: { status: 'ok', database: 'connected' }, status: :ok
    rescue ActiveRecord::StatementInvalid => e
      Rails.logger.error "Database connection check failed: #{e.message}"
      render json: { status: 'error', database: 'disconnected', message: e.message }, status: :service_unavailable
    rescue StandardError => e
      Rails.logger.error "General health check error: #{e.message}"
      render json: { status: 'error', message: e.message }, status: :internal_server_error
    end
  end
end

And the corresponding route:

# config/routes.rb
Rails.application.routes.draw do
  get 'health', to: 'health#show'
  # ... other routes
end

This endpoint should be configured in your load balancer (e.g., HAProxy, Nginx, or OVH’s load balancing service) to perform active health checks. If the endpoint returns a non-2xx status code, the load balancer should stop sending traffic to that instance.

HAProxy Health Check Configuration Example

If you’re using HAProxy, your frontend and backend configuration might look something like this:

frontend http_in
    bind *:80
    mode http
    default_backend app_servers

backend app_servers
    mode http
    balance roundrobin
    option httpchk GET /health HTTP/1.1\r\nHost:\ localhost
    http-check expect status 200
    server app1 192.168.1.10:3000 check
    server app2 192.168.1.11:3000 check
    server app3 192.168.1.12:3000 check

The option httpchk directive tells HAProxy to send a GET request to /health. The http-check expect status 200 ensures that only responses with a 200 OK status are considered healthy. If the Ruby app returns a 5xx error (due to database issues), HAProxy will mark that server as down.

OVH Specific Considerations

When operating on OVH, pay close attention to:

Network Latency: Monitor latency between your application servers and your MySQL cluster. Use tools like ping and mtr from your app servers to your database hosts. High latency can indirectly cause replication lag or application timeouts.
Firewall Rules: Ensure that your OVH firewall rules (both at the network level and within your instances) allow necessary traffic between your application servers and MySQL replicas/primary.
Resource Utilization: Keep an eye on CPU, memory, and disk I/O on both your application servers and MySQL instances. OVH’s control panel and monitoring tools can provide insights. High resource contention can lead to performance degradation and affect health check reliability.
Automated Backups: While not strictly monitoring, ensure your automated backup strategy for MySQL is robust and regularly tested. This is your last line of defense.

By combining proactive MySQL replication monitoring with application-level health checks integrated into your load balancer, you create a resilient system capable of self-healing and alerting you to potential issues before they impact your users.

Server Monitoring Best Practices: Keeping Your Ruby App and MySQL Clusters Alive on OVH

Proactive MySQL Cluster Health Checks with `pt-heartbeat`

Setting up `pt-heartbeat` on the Primary

Monitoring Replication Lag on Replicas

Integrating with Prometheus and Alertmanager

Prometheus and Alertmanager Configuration

Ruby Application Health Checks and Load Balancer Integration

HAProxy Health Check Configuration Example

OVH Specific Considerations

Recent Posts

Top Categories

Our Products

Our Services