Disaster Recovery 101: Architecting Auto-Failovers for MySQL and Ruby Deployments on DigitalOcean

Establishing a Highly Available MySQL Cluster with DigitalOcean Managed Databases

Automated failover for a critical MySQL database on DigitalOcean requires a multi-node architecture. DigitalOcean’s Managed Databases for MySQL take care of much of the operational work. We’ll focus on a read-replica setup in which the replica can be promoted to primary if the original primary fails, minimizing downtime.

The core principle here is to have a primary database instance and at least one read replica. In a disaster scenario, the read replica is promoted to become the new primary. This process can be automated using external monitoring and orchestration tools.

Provisioning DigitalOcean Managed MySQL

First, provision a Managed MySQL cluster. For high availability, select a cluster with at least two nodes (one primary, one standby). Standby nodes are failed over by DigitalOcean itself; the read-only replica that you will promote later is a separate node that you add to the cluster after it is created. The choice of node size depends on your application’s read/write load and data volume. Select a region close to your application’s deployment to reduce latency.

You can achieve this via the DigitalOcean Control Panel or programmatically using the `doctl` CLI or the DigitalOcean API. Here’s an example using `doctl` to create a 2-node MySQL cluster:

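doctl databases create my-ha-mysql --engine mysql --version 8 --size db-s-2vcpu-4gb --region nyc3 --num-nodes 2

Replace db-s-2vcpu-4gb with your desired node size and nyc3 with your preferred region. The cluster name is a positional argument, and DigitalOcean generates the admin password, which you can read from the cluster’s connection details afterwards. Inbound access is configured separately as trusted sources, for example with doctl databases firewalls append <cluster-id> --rule ip_addr:<your-app-ip>. In production, restrict this to your application’s IP addresses or Droplets rather than allowing 0.0.0.0/0.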
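
The same cluster can also be requested through the DigitalOcean API. The following is a minimal sketch using only the Python standard library. The helper names build_cluster_request and send are illustrative, and the request fields (name, engine, version, region, size, num_nodes) reflect our reading of the POST /v2/databases endpoint, so check them against the current API reference before relying on them.

```python
import json
import urllib.request

DO_API_URL = "https://api.digitalocean.com/v2"


def build_cluster_request(name, token, region="nyc3", size="db-s-2vcpu-4gb",
                          num_nodes=2, version="8"):
    """Return an unsent request asking for a managed MySQL cluster."""
    if num_nodes < 2:
        raise ValueError("a highly available cluster needs at least two nodes")
    payload = {
        "name": name,
        "engine": "mysql",
        "version": version,
        "region": region,
        "size": size,
        "num_nodes": num_nodes,
    }
    return urllib.request.Request(
        f"{DO_API_URL}/databases",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before anything is provisioned. To create the cluster, pass the result of build_cluster_request("my-ha-mysql", token) to send.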

Configuring Application Connection Strings for Failover

Your Ruby application needs to be aware of the primary and replica endpoints. DigitalOcean Managed Databases provide distinct connection strings for the primary and read-only replicas. The application’s database adapter (e.g., ActiveRecord) must be configured to use these endpoints.

A common strategy is to configure your application to connect to the primary endpoint for writes and a separate read-only endpoint for reads. For failover, the application will attempt to connect to the primary. If that fails, it should be able to gracefully switch to the promoted replica.

In a typical Rails application, this would be managed within config/database.yml. For automated failover, you’ll need a mechanism to update the primary connection string when a failover occurs.

Here’s a conceptual example of how you might structure your database.yml to support multiple replicas (though for automated failover, we’ll focus on a single primary and a single replica that can be promoted):

production:
  primary:
    adapter: mysql2
    encoding: utf8mb4
    pool: 5
    # Host and database come from the environment so they can change on failover
    host: <%= ENV['DB_PRIMARY_HOST'] %>
    # DB_PORT is an assumed variable name; DigitalOcean managed MySQL listens on 25060
    port: <%= ENV.fetch('DB_PORT', 25060) %>
    username: <%= ENV['DB_USERNAME'] %>
    password: <%= ENV['DB_PASSWORD'] %>
    database: <%= ENV['DB_NAME'] %>

  primary_replica:
    adapter: mysql2
    encoding: utf8mb4
    pool: 5
    host: <%= ENV['DB_REPLICA_HOST'] %>
    port: <%= ENV.fetch('DB_PORT', 25060) %>
    username: <%= ENV['DB_REPLICA_USERNAME'] %>
    password: <%= ENV['DB_REPLICA_PASSWORD'] %>
    database: <%= ENV['DB_REPLICA_NAME'] %>
    replica: true # Rails 6+ treats this entry as read-only

The key challenge is dynamically updating the host and database values for the primary connection when a failover occurs. This is where external orchestration comes in.
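
One way to make the primary host updatable is to keep it in a dotenv-style file that the application reads at boot. The sketch below assumes such a file exists on each application server and that the application reads DB_PRIMARY_HOST from it (the same variable name the monitoring script later in this article uses). The function name set_env_value is hypothetical.

```python
import os
import tempfile


def set_env_value(path, key, value):
    """Set key=value in a dotenv-style file, replacing the file atomically."""
    lines = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()

    updated, found = [], False
    for line in lines:
        # Compare only the part before the first "=" so values may contain "="
        if line.split("=", 1)[0].strip() == key:
            updated.append(f"{key}={value}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"{key}={value}")

    # Write to a temp file in the same directory, then swap it into place,
    # so a reader never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(updated) + "\n")
    os.replace(tmp_path, path)
```

The application still has to be restarted (or its connection pool re-established) to pick up the new value. Note that the replacement file is created with the restrictive permissions that mkstemp applies, which may differ from the original file’s mode.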

Automating Failover with External Orchestration

DigitalOcean Managed Databases do not reconfigure your application’s connection strings when a read replica is promoted, and the promoted replica is reached through its own endpoint rather than the old primary’s. You need an external system to detect the primary database failure and then update your application’s configuration or trigger a re-deployment with the new primary endpoint.

A common approach involves a monitoring service that periodically checks the health of the primary database. If the primary becomes unresponsive, the monitoring service initiates a failover process.

Monitoring the Primary Database

We can use a simple script running on a separate DigitalOcean Droplet or a dedicated monitoring service. This script will attempt to connect to the primary database and execute a simple query. If the query fails or times out, it signals a potential outage.

Here’s a Python script using `mysql.connector` to check the primary’s health:

import mysql.connector
import os
import time
import requests

PRIMARY_HOST = os.environ.get("DB_PRIMARY_HOST")
PRIMARY_USER = os.environ.get("DB_USER")
PRIMARY_PASSWORD = os.environ.get("DB_PASSWORD")
PRIMARY_DB = os.environ.get("DB_NAME")
REPLICA_HOST = os.environ.get("DB_REPLICA_HOST") # To be promoted
REPLICA_USER = os.environ.get("DB_REPLICA_USER")
REPLICA_PASSWORD = os.environ.get("DB_REPLICA_PASSWORD")
REPLICA_DB = os.environ.get("DB_REPLICA_NAME")
FAILOVER_TRIGGER_URL = os.environ.get("FAILOVER_TRIGGER_URL") # Endpoint to signal failover

def check_primary_health():
    try:
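        conn = mysql.connector.connect(
            host=PRIMARY_HOST,
            # DigitalOcean managed MySQL listens on 25060, not 3306; DB_PORT is an optional override
            port=int(os.environ.get("DB_PORT", "25060")),
            user=PRIMARY_USER,
            password=PRIMARY_PASSWORD,
            database=PRIMARY_DB,
            connection_timeout=5
        )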
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        cursor.close()
        conn.close()
        print("Primary database is healthy.")
        return True
    except mysql.connector.Error as err:
        print(f"Error connecting to primary database: {err}")
        return False

if __name__ == "__main__":
    if not all([PRIMARY_HOST, PRIMARY_USER, PRIMARY_PASSWORD, PRIMARY_DB, REPLICA_HOST, REPLICA_USER, REPLICA_PASSWORD, REPLICA_DB, FAILOVER_TRIGGER_URL]):
        print("Error: Missing environment variables for database connection or failover trigger.")
        raise SystemExit(1)

    # Assumption: require 3 consecutive failed checks before acting, so that a brief
    # network blip does not trigger a promotion. Tune this for your environment.
    failure_threshold = 3
    consecutive_failures = 0

    while True:
        if check_primary_health():
            consecutive_failures = 0
        else:
            consecutive_failures += 1

        if consecutive_failures >= failure_threshold:
            print("Primary database is down. Initiating failover process...")
            try:
                # Signal an external service to perform the failover
                response = requests.post(FAILOVER_TRIGGER_URL, json={
                    "old_primary_host": PRIMARY_HOST,
                    "new_primary_host": REPLICA_HOST, # This is the host that *will be* promoted
                    "replica_credentials": {
                        "user": REPLICA_USER,
                        "password": REPLICA_PASSWORD,
                        "database": REPLICA_DB
                    }
                }, timeout=10) # Do not hang forever if the trigger service is unreachable
                response.raise_for_status() # Raise an exception for bad status codes
                print("Failover trigger signal sent successfully.")
                # In a real scenario, you might want to exit or wait for confirmation
                break
            except requests.exceptions.RequestException as e:
                print(f"Failed to send failover trigger signal: {e}")
                # Implement retry logic or alerting here
                time.sleep(60) # Wait before retrying to send signal
        time.sleep(30) # Check every 30 seconds

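Because this script loops indefinitely, run it as a long-lived process such as a systemd service on a separate Droplet rather than as a cron job. The script itself does not call the DigitalOcean API; it only signals the failover trigger service described next, which holds the API token and performs the actual database promotion.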

Triggering Database Promotion via DigitalOcean API

When the monitoring script detects a failure, the failover trigger service it signals needs to call the DigitalOcean API to promote the read replica to become the new primary. This requires a DigitalOcean API token with sufficient permissions.

You can create a Personal Access Token in your DigitalOcean account settings under “API” -> “Tokens/Keys”. Store this token securely (e.g., in environment variables or a secrets manager).

The API endpoint for promoting a read replica is PUT /v2/databases/{database_cluster_uuid}/replicas/{replica_name}/promote. You’ll need the UUID of your database cluster and the name of the replica you wish to promote.
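
As a small illustration of that call, the sketch below builds the promotion request with the standard library. It assumes the endpoint is invoked with PUT and addressed by the replica’s name, which is how we read the public API reference; confirm both against the current documentation. The function names build_promote_request and promote are hypothetical.

```python
import urllib.parse
import urllib.request

DO_API_URL = "https://api.digitalocean.com/v2"


def build_promote_request(cluster_uuid, replica_name, token):
    """Return an unsent PUT request that promotes the given read replica."""
    path = "/databases/{}/replicas/{}/promote".format(
        urllib.parse.quote(cluster_uuid, safe=""),
        urllib.parse.quote(replica_name, safe=""),
    )
    return urllib.request.Request(
        DO_API_URL + path,
        headers={"Authorization": f"Bearer {token}"},
        method="PUT",
    )


def promote(request, timeout=30):
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    returned value means the API accepted the promotion request.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Promotion is not something you can undo with a second API call, so it is worth logging the exact URL from build_promote_request before passing it to promote.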

Here’s a Python Flask application snippet that acts as the FAILOVER_TRIGGER_URL endpoint. This service receives the signal from the monitoring script and interacts with the DigitalOcean API.

from flask import Flask, request, jsonify
import requests
import os

app = Flask(__name__)

DO_API_TOKEN = os.environ.get("DO_API_TOKEN")
DATABASE_CLUSTER_UUID = os.environ.get("DATABASE_CLUSTER_UUID")
REPLICA_TO_PROMOTE_UUID = os.environ.get("REPLICA_TO_PROMOTE_UUID") # Identifier of the read replica used in the promote URL (the API path takes the replica's name)

DO_API_URL = "https://api.digitalocean.com/v2"

@app.route('/trigger-failover', methods=['POST'])
def trigger_failover():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True) or {}
    old_primary_host = data.get('old_primary_host')
    new_primary_host = data.get('new_primary_host') # This is the host that *will be* promoted
    replica_credentials = data.get('replica_credentials')

    if not all([old_primary_host, new_primary_host, replica_credentials, DO_API_TOKEN, DATABASE_CLUSTER_UUID, REPLICA_TO_PROMOTE_UUID]):
        return jsonify({"error": "Missing required parameters or configuration"}), 400

    print(f"Received failover request: Old Primary={old_primary_host}, Target New Primary={new_primary_host}")

    # 1. Promote the replica to primary
    promote_url = f"{DO_API_URL}/databases/{DATABASE_CLUSTER_UUID}/replicas/{REPLICA_TO_PROMOTE_UUID}/promote"
    headers = {
        "Authorization": f"Bearer {DO_API_TOKEN}",
        "Content-Type": "application/json"
    }

    try:
        # The promote endpoint is a PUT and returns no response body on success
        response = requests.put(promote_url, headers=headers, timeout=30)
        response.raise_for_status()
        print(f"Successfully sent promote request to DigitalOcean API. Status: {response.status_code}")

        # 2. Update application configuration (e.g., re-deploy or update config files)
        # This is a placeholder. In a real system, you'd integrate with your deployment pipeline.
        print("Initiating application configuration update...")
        update_application_config(new_primary_host, replica_credentials)

        return jsonify({"message": "Failover initiated successfully"}), 200

    except requests.exceptions.RequestException as e:
        print(f"Error calling DigitalOcean API: {e}")
        return jsonify({"error": f"Failed to promote replica: {e}"}), 500

def update_application_config(new_primary_host, replica_credentials):
    # This function needs to be implemented based on your deployment strategy.
    # Options include:
    # - Triggering a CI/CD pipeline (e.g., Jenkins, GitLab CI, GitHub Actions) to re-deploy.
    # - SSHing into application servers and updating configuration files/environment variables.
    # - Using an orchestration tool like Ansible or Chef.
    print(f"Placeholder: Updating application config to use new primary host: {new_primary_host}")
    # Never write credentials to logs; only note that they were received
    print("Replica credentials received (not logged).")
    # Example: Triggering a deployment script
    # os.system("deploy_script.sh")

if __name__ == '__main__':
    # For development, run with: python app.py
    # For production, use a WSGI server like Gunicorn: gunicorn -w 4 app:app
    app.run(host='0.0.0.0', port=5000)

This Flask service needs to be deployed on a Droplet reachable by your monitoring script but not exposed to the public internet, since any client that can reach it can start a promotion. Ensure the environment variables (DO_API_TOKEN, DATABASE_CLUSTER_UUID, REPLICA_TO_PROMOTE_UUID) are set correctly.
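
Since a request to this endpoint starts a promotion, it is worth authenticating the caller. One possible approach, sketched here, is an HMAC signature over the request body computed with a secret shared between the monitoring script and the trigger service. The function names, and the idea of carrying the signature in a header such as X-Failover-Signature, are assumptions for illustration and not part of the original scripts.

```python
import hashlib
import hmac
import json


def sign_payload(secret, payload):
    """Serialize the payload deterministically and return (body, signature)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret, body, signature):
    """Return True if the signature matches the raw request body."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The sender would post the returned body unchanged along with the signature, and the Flask handler would call verify_signature on the raw bytes from request.get_data() before parsing the JSON, rejecting the request if the check fails.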

Updating Application Configuration Post-Failover

The most critical part of automated failover is updating your application’s database connection details. The update_application_config function in the Flask example is a placeholder. Here are common strategies:

CI/CD Pipeline Integration: The failover trigger service can signal your CI/CD pipeline (e.g., Jenkins, GitLab CI, GitHub Actions) to re-deploy the application with updated configuration. This is often the cleanest approach. The pipeline would fetch the new primary endpoint from the DigitalOcean API or a central configuration store.
Configuration Management Tools: Use tools like Ansible, Chef, or Puppet. The failover trigger service could execute an Ansible playbook that updates configuration files (e.g., database.yml, .env files) on your application servers and restarts the application processes.
Direct Configuration Update: For simpler setups, the failover trigger service could SSH into application servers, update configuration files, and restart the application. This is less robust and harder to manage at scale.

When updating your application’s configuration, ensure you are updating the connection details for the primary database. If your application uses read replicas, you might also need to update those connection strings if the old primary was also serving read traffic.
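
Whichever strategy you choose, the new endpoint has to come from somewhere. If it is fetched from the DigitalOcean API, a helper along the following lines can pull the host and port out of the response. This is a sketch under the assumption that the database or replica object in the response carries connection and private_connection objects with host and port fields; the function name extract_connection is hypothetical, and the field names should be checked against the API reference.

```python
def extract_connection(api_response, prefer_private=False):
    """Return {"host": ..., "port": ...} from a database or replica API response."""
    resource = api_response.get("database") or api_response.get("replica") or {}
    key = "private_connection" if prefer_private else "connection"
    # Fall back to the public connection if no private one is present
    connection = resource.get(key) or resource.get("connection")
    if not connection or not connection.get("host"):
        raise ValueError("response does not contain connection details")
    # 25060 is used as a default port when the response omits one
    return {"host": connection["host"], "port": int(connection.get("port", 25060))}
```

Raising on a missing host is deliberate: writing an empty hostname into the application’s configuration would turn a database outage into a configuration outage as well.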

Testing Your Failover Strategy

Thorough testing is paramount. Simulate failures to ensure your automated process works as expected:

Simulate Primary Failure: Manually stop the primary database instance (if possible through the DO API/CLI for testing) or block network access to it from your monitoring script’s perspective. Verify that the monitoring script detects the failure and triggers the failover process.
Verify Promotion: Check the DigitalOcean control panel or API to confirm that the replica has been promoted to primary.
Application Reconfiguration: Ensure your application successfully updates its configuration and reconnects to the new primary database. Monitor application logs for connection errors and successful operations.
Data Consistency: After failover, perform checks to ensure data consistency between the old primary (if it comes back online) and the new primary.
Rollback Scenario: Test what happens if the failover trigger fails or if the application reconfiguration fails. Have manual fallback procedures in place.
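
During a drill it helps to record how long the application was without a usable primary. The helper below is a minimal sketch: it polls a health check until it succeeds and reports the elapsed time. The name measure_recovery is hypothetical, and check can be any zero-argument callable that returns True when the database is reachable, such as a health check pointed at the new primary.

```python
import time


def measure_recovery(check, interval=1.0, timeout=600.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True.

    Returns the seconds elapsed until the first success, or None if the
    timeout passes without one. clock and sleep are injectable for testing.
    """
    started = clock()
    while True:
        if check():
            return clock() - started
        if clock() - started >= timeout:
            return None
        sleep(interval)
```

Start the measurement at the moment you simulate the failure, and compare the result against your recovery time objective across several runs rather than trusting a single drill.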

Considerations for Ruby Deployments

For Ruby on Rails applications, managing database credentials and triggering re-deployments are key. Tools like Capistrano can be leveraged to automate configuration updates and application restarts after a failover signal is received.

Ensure your application’s database adapter (e.g., mysql2 gem) is configured to handle connection drops gracefully and retry connections to the new primary. ActiveRecord’s connection pooling and retry mechanisms can help, but they are not a substitute for a robust failover strategy.

When updating config/database.yml, consider using environment variables that are dynamically set during deployment. For example, your CI/CD pipeline can fetch the current primary endpoint from DigitalOcean and inject it as an environment variable before deploying.

Advanced Strategies and Alternatives

While the above describes a common automated failover pattern, consider these advanced points:

Multi-Region Failover: For even higher availability, consider replicating your database across different DigitalOcean regions. This is more complex and involves asynchronous replication, which can lead to data staleness.
Managed Kubernetes with Operators: If you’re using Kubernetes, database operators (such as the Percona XtraDB Cluster Operator or the Vitess Operator) can manage high availability and failover more natively within the Kubernetes ecosystem.
External Load Balancers: For read traffic, you can use DigitalOcean Load Balancers to distribute traffic across multiple read replicas. In a failover scenario, you might need to update the load balancer’s backend pool.
Database Proxies: Tools like ProxySQL can sit in front of your MySQL cluster, intelligently routing traffic and handling failover detection and redirection. This can abstract some of the complexity from the application.

Architecting for automated failover is an ongoing process. Regularly review your monitoring, alerting, and failover procedures to ensure they remain effective as your application and infrastructure evolve.