Server Monitoring Best Practices: Keeping Your Ruby App and MySQL Clusters Alive on OVH
Proactive MySQL Cluster Health Checks with `pt-heartbeat`
Maintaining high availability for your Ruby application’s MySQL cluster on OVH requires more than just reactive alerts. Proactive monitoring of replication lag is paramount. We’ll leverage Percona Toolkit’s `pt-heartbeat` to continuously track replication lag and integrate it with a robust alerting system.
The core idea is to have a process on the primary MySQL server that periodically writes a timestamp to a dedicated table. A corresponding process on each replica then reads this timestamp and calculates the difference between the current time and the timestamp it read. This difference is our replication lag.
Setting up `pt-heartbeat` on the Primary
First, ensure Percona Toolkit is installed on your primary MySQL server. If not, you can typically install it via your distribution’s package manager (e.g., `apt-get install percona-toolkit` on Debian/Ubuntu, or `yum install percona-toolkit` on CentOS/RHEL). Create a dedicated user for `pt-heartbeat` with minimal privileges.
On your primary MySQL instance:
- Create a database and table for heartbeats:
CREATE DATABASE IF NOT EXISTS heartbeat; USE heartbeat; CREATE TABLE IF NOT EXISTS ping ( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, server_id INT NOT NULL, ts DATETIME NOT NULL );
- Create a user for `pt-heartbeat`:
CREATE USER 'heartbeat'@'localhost' IDENTIFIED BY 'your_strong_password'; GRANT INSERT, SELECT ON heartbeat.ping TO 'heartbeat'@'localhost'; FLUSH PRIVILEGES;
Now, configure `pt-heartbeat` to run as a daemon on the primary. This script will periodically insert a new row into the `heartbeat.ping` table.
pt-heartbeat --daemon \ --interval 1 \ --host 127.0.0.1 \ --user heartbeat \ --password 'your_strong_password' \ --database heartbeat \ --table ping \ --update-primary \ --pid=/var/run/pt-heartbeat-primary.pid \ --log=/var/log/pt-heartbeat-primary.log \ --set-server-id
Explanation of key options:
--daemon: Runs `pt-heartbeat` in the background.--interval 1: Inserts a heartbeat every 1 second. Adjust based on your tolerance for very small lags.--update-primary: Automatically determines the primary server ID.--set-server-id: Sets the `server_id` column in the `ping` table.--pidand--log: Essential for daemon management and debugging.
Monitoring Replication Lag on Replicas
On each of your MySQL replica servers, you’ll run `pt-heartbeat` in a different mode to read the heartbeats and calculate the lag. You’ll need a dedicated user on the replicas as well.
On each replica MySQL instance:
- Create a user for `pt-heartbeat` (can be the same credentials as primary, but ensure it has SELECT access on the heartbeat database):
CREATE USER 'heartbeat'@'localhost' IDENTIFIED BY 'your_strong_password'; GRANT SELECT ON heartbeat.ping TO 'heartbeat'@'localhost'; FLUSH PRIVILEGES;
Now, run `pt-heartbeat` on each replica to monitor the lag. This command will output the calculated lag to standard output, making it easy to parse by monitoring agents.
pt-heartbeat --interval 5 \ --host 127.0.0.1 \ --user heartbeat \ --password 'your_strong_password' \ --database heartbeat \ --table ping \ --pid=/var/run/pt-heartbeat-replica.pid \ --log=/var/log/pt-heartbeat-replica.log \ --monitor
Explanation of key options:
--interval 5: Checks for heartbeats every 5 seconds. This is a good balance between responsiveness and reducing load.--monitor: This is the crucial flag that tells `pt-heartbeat` to calculate and output the replication lag.
Integrating with Prometheus and Alertmanager
The output of `pt-heartbeat –monitor` is a simple string representing the lag in seconds. We can use `node_exporter`’s textfile collector to expose this metric to Prometheus.
On each replica server:
- Create a script that runs `pt-heartbeat` and writes its output to a file in the `node_exporter`’s textfile collector directory (e.g.,
/var/lib/node_exporter/textfile_collector/).
#!/bin/bash
# Configuration
HEARTBEAT_CMD="pt-heartbeat --interval 5 --host 127.0.0.1 --user heartbeat --password 'your_strong_password' --database heartbeat --table ping --monitor"
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/mysql_replication_lag.prom"
LOG_FILE="/var/log/mysql_replication_lag_monitor.log"
ERROR_LOG_FILE="/var/log/mysql_replication_lag_monitor_error.log"
# Ensure the output directory exists
mkdir -p $(dirname "$OUTPUT_FILE")
# Run pt-heartbeat and capture output
LAG_SECONDS=$($HEARTBEAT_CMD 2>> "$ERROR_LOG_FILE")
EXIT_CODE=$?
if [ $EXIT_CODE -eq 0 ]; then
# Format for Prometheus
echo "# HELP mysql_replication_lag_seconds MySQL replication lag in seconds." > "$OUTPUT_FILE"
echo "# TYPE mysql_replication_lag_seconds gauge" >> "$OUTPUT_FILE"
echo "mysql_replication_lag_seconds{replica_host=\"$(hostname -f)\"} $LAG_SECONDS" >> "$OUTPUT_FILE"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Lag: ${LAG_SECONDS}s" >> "$LOG_FILE"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') - Error running pt-heartbeat. Exit code: $EXIT_CODE" >> "$LOG_FILE"
# Optionally, write a specific error metric or leave the file as is
echo "# HELP mysql_replication_lag_error Error status of replication lag check." > "$OUTPUT_FILE"
echo "# TYPE mysql_replication_lag_error gauge" >> "$OUTPUT_FILE"
echo "mysql_replication_lag_error{replica_host=\"$(hostname -f)\"} 1" >> "$OUTPUT_FILE"
fi
exit 0
Make this script executable and set up a cron job to run it every minute (or more frequently if needed).
chmod +x /usr/local/bin/monitor_mysql_lag.sh
crontab -l | { cat; echo "*/1 * * * * /usr/local/bin/monitor_mysql_lag.sh"; } | crontab -
Ensure your `node_exporter` is configured to read from the textfile collector directory. In Prometheus, you’ll then scrape these metrics.
Prometheus and Alertmanager Configuration
In your Prometheus configuration (prometheus.yml), ensure your `node_exporter` targets are correctly set up to scrape the metrics exposed by the textfile collector.
scrape_configs:
- job_name: 'node_exporter'
static_configs:
- targets: ['replica1.yourdomain.com:9100', 'replica2.yourdomain.com:9100', 'replica3.yourdomain.com:9100']
Now, define alerting rules in Prometheus (e.g., in a file like alerts.yml) that will be sent to Alertmanager.
groups:
- name: mysql_replication_alerts
rules:
- alert: HighMySQLReplicationLag
expr: mysql_replication_lag_seconds > 60 # Alert if lag is greater than 60 seconds
for: 5m # Continue to fire for 5 minutes before sending
labels:
severity: critical
annotations:
summary: "High MySQL replication lag on {{ $labels.replica_host }}"
description: "MySQL replication lag on {{ $labels.replica_host }} has exceeded 60 seconds for 5 minutes. Current lag: {{ $value }}s."
- alert: MySQLReplicationLagging
expr: mysql_replication_lag_seconds > 30 # Alert if lag is greater than 30 seconds
for: 2m
labels:
severity: warning
annotations:
summary: "MySQL replication lag detected on {{ $labels.replica_host }}"
description: "MySQL replication lag on {{ $labels.replica_host }} has exceeded 30 seconds for 2 minutes. Current lag: {{ $value }}s."
- alert: MySQLReplicationCheckFailed
expr: mysql_replication_lag_error == 1
for: 1m
labels:
severity: critical
annotations:
summary: "MySQL replication lag check failed on {{ $labels.replica_host }}"
description: "The script to check MySQL replication lag on {{ $labels.replica_host }} failed to execute. Please investigate."
Ensure your Alertmanager configuration is set up to receive these alerts and route them to your desired notification channels (Slack, PagerDuty, email, etc.).
Ruby Application Health Checks and Load Balancer Integration
For your Ruby application, a simple HTTP health check endpoint is insufficient. We need to ensure the application can actually connect to and query the database. A common pattern is to have an endpoint that performs a lightweight database query.
In a Rails application, you might add a controller action like this:
# app/controllers/health_controller.rb
class HealthController < ApplicationController
skip_before_action :authenticate_user! # If using Devise or similar
def show
begin
# Perform a simple, fast database query
# For ActiveRecord, this could be:
ActiveRecord::Base.connection.execute('SELECT 1')
render json: { status: 'ok', database: 'connected' }, status: :ok
rescue ActiveRecord::StatementInvalid => e
Rails.logger.error "Database connection check failed: #{e.message}"
render json: { status: 'error', database: 'disconnected', message: e.message }, status: :service_unavailable
rescue StandardError => e
Rails.logger.error "General health check error: #{e.message}"
render json: { status: 'error', message: e.message }, status: :internal_server_error
end
end
end
And the corresponding route:
# config/routes.rb Rails.application.routes.draw do get 'health', to: 'health#show' # ... other routes end
This endpoint should be configured in your load balancer (e.g., HAProxy, Nginx, or OVH’s load balancing service) to perform active health checks. If the endpoint returns a non-2xx status code, the load balancer should stop sending traffic to that instance.
HAProxy Health Check Configuration Example
If you’re using HAProxy, your frontend and backend configuration might look something like this:
frontend http_in
bind *:80
mode http
default_backend app_servers
backend app_servers
mode http
balance roundrobin
option httpchk GET /health HTTP/1.1\r\nHost:\ localhost
http-check expect status 200
server app1 192.168.1.10:3000 check
server app2 192.168.1.11:3000 check
server app3 192.168.1.12:3000 check
The option httpchk directive tells HAProxy to send a GET request to /health. The http-check expect status 200 ensures that only responses with a 200 OK status are considered healthy. If the Ruby app returns a 5xx error (due to database issues), HAProxy will mark that server as down.
OVH Specific Considerations
When operating on OVH, pay close attention to:
- Network Latency: Monitor latency between your application servers and your MySQL cluster. Use tools like
pingandmtrfrom your app servers to your database hosts. High latency can indirectly cause replication lag or application timeouts. - Firewall Rules: Ensure that your OVH firewall rules (both at the network level and within your instances) allow necessary traffic between your application servers and MySQL replicas/primary.
- Resource Utilization: Keep an eye on CPU, memory, and disk I/O on both your application servers and MySQL instances. OVH’s control panel and monitoring tools can provide insights. High resource contention can lead to performance degradation and affect health check reliability.
- Automated Backups: While not strictly monitoring, ensure your automated backup strategy for MySQL is robust and regularly tested. This is your last line of defense.
By combining proactive MySQL replication monitoring with application-level health checks integrated into your load balancer, you create a resilient system capable of self-healing and alerting you to potential issues before they impact your users.