Disaster Recovery 101: Architecting Auto-Failovers for MySQL and C++ Deployments on AWS

Designing for High Availability: MySQL Replication and C++ Client Failover on AWS

Achieving true disaster recovery for critical applications hinges on robust, automated failover mechanisms. This post details a production-ready architecture for MySQL databases and their associated C++ clients deployed on AWS, focusing on automatic failover to minimize downtime.

MySQL Replication with AWS RDS Aurora and ProxySQL

For MySQL, the replication topology determines how quickly a failed writer can be replaced. Traditional single-primary (master-slave) setups are common, but without automation the primary remains a single point of failure until a replica is promoted. AWS RDS Multi-AZ offers a degree of high availability, but for rapid automated failover combined with read scaling, a more sophisticated approach is needed. Note that a standard Aurora MySQL cluster has a single writer, so what follows is automated promotion rather than true multi-master replication. We’ll leverage RDS Aurora for its built-in fault tolerance and then layer ProxySQL for intelligent connection routing and failover detection.

AWS RDS Aurora Configuration

Aurora’s distributed, fault-tolerant storage system inherently provides high availability. However, application-level failover logic is still required to direct traffic to the correct endpoint when an issue arises. We’ll assume an Aurora cluster with one writer instance and at least one reader (Aurora Replica) in a different Availability Zone. Which instance acts as the writer is dynamic and managed by Aurora itself: on failure it promotes a replica, and the cluster (writer) endpoint then resolves to the new writer. For application logic, we still need to identify a primary endpoint to connect to.

ProxySQL Deployment and Configuration

ProxySQL acts as a high-performance, high-availability proxy for MySQL. It can monitor backend servers, route traffic, and perform automatic failover. We’ll deploy ProxySQL in an active-passive or active-active configuration. Because a lone proxy would itself be a single point of failure, the proxy layer is made highly available with AWS’s own services (e.g., multiple EC2 instances behind an ELB). Topology managers such as Orchestrator target self-managed MySQL replication and are not needed for promotion here, since Aurora promotes replicas itself.

ProxySQL Configuration File (`proxysql.cnf`)

The core of ProxySQL’s functionality lies in its configuration. We define hostgroups, servers within those hostgroups, and rules for traffic management.

# proxysql.cnf uses libconfig syntax (name=value, groups in braces), not INI sections.
# General settings
datadir="/var/lib/proxysql"
# The log directory must already exist.
errorlog="/var/lib/proxysql/log/proxysql.log"

admin_variables=
{
    # Default credentials; change them before using this beyond a test setup.
    admin_credentials="admin:admin"
    mysql_ifaces="127.0.0.1:6032"
}

mysql_variables=
{
    interfaces="0.0.0.0:3306"
    # Backend account used by the monitor module for health checks (placeholder values).
    monitor_username="monitor"
    monitor_password="monitor"
    # Check the read-only status of each backend every 5 seconds (milliseconds).
    monitor_read_only_interval=5000
    query_digests_max_digest_length=1024
}

# Hostgroup 10 for writers, 20 for readers.
# Replace with your actual Aurora endpoints.
mysql_servers=
(
    { address="aurora-cluster-writer.cluster-xxxx.us-east-1.rds.amazonaws.com", port=3306, hostgroup=10, max_connections=100 },
    { address="aurora-cluster-reader.cluster-xxxx.us-east-1.rds.amazonaws.com", port=3306, hostgroup=20, max_connections=100 },
    # A replica that Aurora may promote. It serves reads until then.
    { address="aurora-cluster-writer-replica.cluster-xxxx.us-east-1.rds.amazonaws.com", port=3306, hostgroup=20, max_connections=100 }
)

# Pair the writer and reader hostgroups. The monitor moves a server between them
# based on its read-only status. Aurora replicas report innodb_read_only, hence the
# check_type (available in ProxySQL 2.x; verify against your version).
mysql_replication_hostgroups=
(
    { writer_hostgroup=10, reader_hostgroup=20, check_type="innodb_read_only", comment="aurora cluster" }
)

# Users that clients authenticate as. The same credentials must exist on the
# backend MySQL servers with appropriate privileges.
mysql_users=
(
    { username="your_user", password="your_password", default_hostgroup=10, active=1 }
)

# Rule to direct all traffic to hostgroup 10 (writers).
# A rule without match criteria applies to every query.
mysql_query_rules=
(
    {
        rule_id=10
        active=1
        destination_hostgroup=10
        apply=1
        comment="Route all traffic to writer hostgroup"
    }
)

# Add more sophisticated rules as needed for read/write splitting.

ProxySQL reads proxysql.cnf only on its first start, when its internal database (proxysql.db in the datadir) does not exist yet. After that, the database takes precedence, so a plain restart will not pick up later edits to the file. Apply subsequent changes through the admin interface, load them to runtime, and save them to disk:

# First start only: ProxySQL bootstraps its internal database from proxysql.cnf.
sudo systemctl start proxysql
# Later changes, via the ProxySQL admin interface:
# PROXYSQL_ADMIN_USER="admin"
# PROXYSQL_ADMIN_PASS="admin"
# PROXYSQL_ADMIN_HOST="127.0.0.1"
# PROXYSQL_ADMIN_PORT="6032"
# mysql -h $PROXYSQL_ADMIN_HOST -P $PROXYSQL_ADMIN_PORT -u $PROXYSQL_ADMIN_USER -p$PROXYSQL_ADMIN_PASS < /etc/proxysql/mysql_real_servers.sql
# mysql -h $PROXYSQL_ADMIN_HOST -P $PROXYSQL_ADMIN_PORT -u $PROXYSQL_ADMIN_USER -p$PROXYSQL_ADMIN_PASS < /etc/proxysql/mysql_hostgroups.sql
# mysql -h $PROXYSQL_ADMIN_HOST -P $PROXYSQL_ADMIN_PORT -u $PROXYSQL_ADMIN_USER -p$PROXYSQL_ADMIN_PASS < /etc/proxysql/mysql_query_rules.sql
# There is no separate "HOSTGROUPS" command: LOAD/SAVE MYSQL SERVERS also covers the replication hostgroups.
# mysql -h $PROXYSQL_ADMIN_HOST -P $PROXYSQL_ADMIN_PORT -u $PROXYSQL_ADMIN_USER -p$PROXYSQL_ADMIN_PASS -e "LOAD MYSQL SERVERS TO RUNTIME; LOAD MYSQL USERS TO RUNTIME; LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL SERVERS TO DISK; SAVE MYSQL USERS TO DISK; SAVE MYSQL QUERY RULES TO DISK;"

Automated Failover with ProxySQL and Health Checks

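ProxySQL's monitor module continuously checks the health of its backend servers (connect, ping, and read-only checks). When the current writer becomes unavailable, ProxySQL marks it as SHUNNED and stops routing traffic to it. Once Aurora has promoted a replica and that instance no longer reports itself as read-only, ProxySQL moves it into the writer hostgroup and directs writes there. This behavior is driven by the server and hostgroup definitions in the configuration. The live state is not set by hand; it is maintained by ProxySQL and can be inspected in the runtime_mysql_servers table on the admin interface.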

C++ Client Auto-Failover Logic

The C++ application needs to be aware of the ProxySQL endpoint and handle connection errors gracefully. Instead of connecting directly to a single MySQL instance, it connects to the ProxySQL instance. During a backend failover, connections that were attached to the old writer are dropped, so the C++ client will see connection errors for a short window. The client must therefore implement retry logic, and switch to a secondary ProxySQL endpoint if the ProxySQL instance itself appears to be down.
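The retry logic mentioned above could look roughly like the following. This is a minimal sketch, not part of the pool class below: retry_with_backoff and its parameters are hypothetical names, and it assumes the wrapped operation reports failure by throwing an exception derived from std::exception (as sql::SQLException does).

```cpp
#include <algorithm>
#include <chrono>
#include <exception>
#include <stdexcept>
#include <thread>

// Hypothetical helper: calls `op` until it succeeds or `max_attempts` is used up.
// The pause between attempts doubles after each failure, capped at `max_delay`.
template <typename Op>
auto retry_with_backoff(Op op,
                        int max_attempts,
                        std::chrono::milliseconds initial_delay,
                        std::chrono::milliseconds max_delay) -> decltype(op()) {
    if (max_attempts < 1) {
        throw std::invalid_argument("max_attempts must be at least 1");
    }
    std::chrono::milliseconds delay = initial_delay;
    for (int attempt = 1;; ++attempt) {
        try {
            return op();
        } catch (const std::exception&) {
            if (attempt >= max_attempts) {
                throw;  // Out of attempts: surface the last error to the caller
            }
        }
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max_delay);
    }
}
```

A caller would wrap a whole unit of work, for example retry_with_backoff([&] { pool.execute_query(sql); }, 5, std::chrono::milliseconds(200), std::chrono::seconds(5)). Only retry work that is safe to repeat: if the connection drops after the server has applied a non-idempotent INSERT, a blind retry can apply it twice.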

Connection Pooling and Error Handling

A robust C++ MySQL connector (like MySQL Connector/C++ or a custom solution) should be used. Implement a connection pool that can detect and handle broken connections.

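#include <iostream>
#include <vector>
#include <mysql_connection.h>
#include <mysql_driver.h>
#include <cppconn/exception.h>
#include <cppconn/resultset.h>
#include <cppconn/statement.h>
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>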

// Assume a simple connection pool structure
class MySQLConnectionPool {
private:
    sql::mysql::MySQL_Driver* driver;
    std::vector<sql::Connection*> connections;
    std::string host;
    std::string user;
    std::string password;
    int port;
    size_t pool_size;
    std::mutex pool_mutex;
    std::condition_variable pool_cv;
    std::vector<bool> in_use;

    // Fallback ProxySQL endpoints
    std::vector<std::pair<std::string, int>> fallback_proxysql_endpoints;
    int current_endpoint_index = 0;

public:
    MySQLConnectionPool(const std::string& h, int p, const std::string& u, const std::string& pw, size_t size, const std::vector<std::pair<std::string, int>>& fallback_endpoints)
        : host(h), port(p), user(u), password(pw), pool_size(size), fallback_proxysql_endpoints(fallback_endpoints) {
        driver = sql::mysql::get_mysql_driver_instance();
        in_use.resize(pool_size, false);
        for (size_t i = 0; i < pool_size; ++i) {
            connections.push_back(nullptr);
        }
        initialize_pool();
    }

    ~MySQLConnectionPool() {
        for (sql::Connection* conn : connections) {
            if (conn) {
                delete conn;
            }
        }
    }

    void initialize_pool() {
        for (size_t i = 0; i < pool_size; ++i) {
            try {
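                // Build a full URI so that the configured port is actually used
                connections[i] = driver->connect("tcp://" + host + ":" + std::to_string(port), user, password);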
                connections[i]->setSchema("your_database_name"); // Set your database name
                std::cout << "Connection " << i << " established." << std::endl;
            } catch (sql::SQLException& e) {
                std::cerr << "Error initializing connection " << i << ": " << e.what() << std::endl;
                // Handle initialization failure, perhaps log and exit or try fallback
            }
        }
    }

    sql::Connection* get_connection() {
        std::unique_lock<std::mutex> lock(pool_mutex);
        // Wait with a timeout: if the endpoint stays unreachable and no other thread
        // releases a connection, nothing would ever notify the condition variable.
        bool available = pool_cv.wait_for(lock, std::chrono::seconds(5), [this] {
            for (size_t i = 0; i < pool_size; ++i) {
                if (!in_use[i] && connections[i] != nullptr && connections[i]->isValid()) {
                    return true;
                }
            }
            // No idle, valid connection: try to re-establish a broken one or switch endpoint
            return try_reconnect_or_switch();
        });
        if (!available) {
            return nullptr; // Timed out; the caller decides whether to retry or give up
        }

        for (size_t i = 0; i < pool_size; ++i) {
            if (!in_use[i] && connections[i] != nullptr && connections[i]->isValid()) {
                in_use[i] = true;
                return connections[i];
            }
        }
        return nullptr; // Should not reach here if wait condition is correct
    }

    void release_connection(sql::Connection* conn) {
        std::unique_lock<std::mutex> lock(pool_mutex);
        for (size_t i = 0; i < pool_size; ++i) {
            if (connections[i] == conn) {
                in_use[i] = false;
                break;
            }
        }
        pool_cv.notify_one();
    }

    // Caller must hold pool_mutex. Returns true if an idle, valid connection exists afterwards.
    bool try_reconnect_or_switch() {
        bool found_broken = false;
        // Repair idle slots that are empty or broken; slots lent out to callers are left alone
        for (size_t i = 0; i < pool_size; ++i) {
            if (in_use[i]) {
                continue;
            }
            if (connections[i] != nullptr && connections[i]->isValid()) {
                continue;
            }
            found_broken = true;
            std::cerr << "Connection " << i << " is broken. Attempting to reconnect..." << std::endl;
            delete connections[i]; // Deleting nullptr is a no-op
            connections[i] = nullptr;
            try {
                connections[i] = driver->connect("tcp://" + host + ":" + std::to_string(port), user, password);
                connections[i]->setSchema("your_database_name");
                std::cout << "Connection " << i << " re-established." << std::endl;
                return true; // A connection is now available
            } catch (sql::SQLException& e) {
                std::cerr << "Failed to reconnect connection " << i << ": " << e.what() << std::endl;
            }
        }
        // Every idle slot failed to reconnect, so the current endpoint is probably down.
        // An exhausted but healthy pool (found_broken == false) must not trigger a switch.
        if (found_broken && !fallback_proxysql_endpoints.empty()) {
            switch_to_next_endpoint();
            for (size_t i = 0; i < pool_size; ++i) {
                if (!in_use[i] && connections[i] != nullptr && connections[i]->isValid()) {
                    return true; // The fallback endpoint produced a usable connection
                }
            }
        }
        return false; // Pool is exhausted (wait for a release) or no endpoint is reachable
    }

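    void switch_to_next_endpoint() {
        // Use the entry at the cursor first, then advance, so the first fallback is not skipped
        const auto next = fallback_proxysql_endpoints[current_endpoint_index];
        current_endpoint_index = (current_endpoint_index + 1) % fallback_proxysql_endpoints.size();
        host = next.first;
        port = next.second;
        std::cerr << "Switching to fallback ProxySQL endpoint: " << host << ":" << port << std::endl;

        // Rebuild only idle connections against the new endpoint. Connections that are
        // currently lent out are left alone: deleting them here would leave the borrower
        // with a dangling pointer. They are replaced after release, once found invalid.
        for (size_t i = 0; i < pool_size; ++i) {
            if (in_use[i]) {
                continue;
            }
            delete connections[i];
            connections[i] = nullptr;
            try {
                connections[i] = driver->connect("tcp://" + host + ":" + std::to_string(port), user, password);
                connections[i]->setSchema("your_database_name");
            } catch (sql::SQLException& e) {
                std::cerr << "Error connecting slot " << i << " to " << host << ": " << e.what() << std::endl;
            }
        }
    }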

    // Example of executing a query
    void execute_query(const std::string& query) {
        // C++ has no `finally`. A unique_ptr with a custom deleter returns the
        // connection to the pool on every exit path, including exceptions.
        auto release = [this](sql::Connection* c) { release_connection(c); };
        std::unique_ptr<sql::Connection, decltype(release)> conn(get_connection(), release);
        if (!conn) {
            std::cerr << "Runtime Error: Failed to get database connection." << std::endl;
            throw std::runtime_error("Failed to get database connection.");
        }

        try {
            std::unique_ptr<sql::Statement> stmt(conn->createStatement());
            stmt->execute(query);
            std::cout << "Query executed successfully." << std::endl;
        } catch (sql::SQLException& e) {
            std::cerr << "SQL Error: " << e.what() << std::endl;
            // This is a simplification: the connection goes back to the pool as-is.
            // get_connection() checks isValid() before reuse, and
            // try_reconnect_or_switch() replaces the connection if it is broken.
            throw; // Re-throw to allow higher-level handling
        }
    }
};

int main() {
    // Primary ProxySQL endpoint
    std::string primary_host = "your_proxysql_host";
    int primary_port = 3306;

    // List of fallback ProxySQL endpoints for failover
    std::vector<std::pair<std::string, int>> fallback_endpoints = {
        {"your_proxysql_host_2", 3306},
        {"your_proxysql_host_3", 3306}
    };

    MySQLConnectionPool pool(primary_host, primary_port, "your_user", "your_password", 5, fallback_endpoints);

    try {
        pool.execute_query("INSERT INTO example_table (data) VALUES ('test_data')");
        // ... perform other database operations
    } catch (const std::exception& e) {
        std::cerr << "Application error: " << e.what() << std::endl;
        // Implement application-level retry or alert mechanisms here
    }

    return 0;
}

The MySQLConnectionPool class demonstrates:

Connection pooling to manage a set of database connections efficiently.
A get_connection() method that waits for an available, valid connection.
try_reconnect_or_switch() logic to detect broken connections, attempt re-establishment, and crucially, switch to a fallback ProxySQL endpoint if the primary becomes unresponsive.
The switch_to_next_endpoint() method invalidates existing connections and re-initializes the pool with the new endpoint.
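
The pool above treats every sql::SQLException the same way. A possible refinement is to classify the numeric error code (available from e.getErrorCode() in Connector/C++) before deciding whether to reconnect. The sketch below is an illustration under assumptions: FailureKind, classify_mysql_error and should_reconnect are hypothetical names, and the list of codes is a reasonable starting set rather than an exhaustive one.

```cpp
// Hypothetical classification of MySQL/ProxySQL error codes for failover handling.
enum class FailureKind { ConnectionLost, ReadOnlyBackend, Other };

inline FailureKind classify_mysql_error(int error_code) {
    switch (error_code) {
        case 2002:  // CR_CONNECTION_ERROR: cannot connect through the local socket
        case 2003:  // CR_CONN_HOST_ERROR: cannot connect to the server host
        case 2006:  // CR_SERVER_GONE_ERROR: server has gone away
        case 2013:  // CR_SERVER_LOST: lost connection during a query
        case 1053:  // ER_SERVER_SHUTDOWN: server shutdown in progress
        case 9001:  // ProxySQL: max connect timeout reached for the hostgroup
            return FailureKind::ConnectionLost;
        case 1290:  // ER_OPTION_PREVENTS_STATEMENT: server runs with --read-only
        case 1836:  // ER_READ_ONLY_MODE: running in read-only mode
            return FailureKind::ReadOnlyBackend;
        default:
            return FailureKind::Other;
    }
}

// Both kinds suggest the writer moved: drop the connection and reconnect.
// Anything else (syntax errors, constraint violations) should not trigger failover.
inline bool should_reconnect(FailureKind kind) {
    return kind == FailureKind::ConnectionLost || kind == FailureKind::ReadOnlyBackend;
}
```

The read-only case is worth handling explicitly: right after a failover, a pooled connection can still be attached to the demoted writer, which accepts the connection but rejects writes.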

External Monitoring and Orchestration

While ProxySQL handles database-level failover and the C++ client handles connection-level errors, a robust DR strategy often requires external orchestration for the proxy layer as well. Custom AWS Lambda functions triggered by CloudWatch alarms can monitor the health of the ProxySQL instances themselves and, if an instance fails, initiate failover for the proxy layer so that the C++ clients can connect to a healthy ProxySQL instance. Tools like Orchestrator or MHA fill the equivalent role for self-managed MySQL replication topologies; with Aurora, writer promotion is handled by the service.
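
Whatever performs the monitoring, it should not fail over on a single missed probe. A common pattern is to require several consecutive failures before declaring an instance unhealthy, and several consecutive successes before trusting it again. The class below is a minimal sketch of that idea; HealthTracker and its thresholds are hypothetical and not tied to any AWS or ProxySQL API.

```cpp
// Hypothetical tracker that turns raw probe results into a stable health verdict.
class HealthTracker {
public:
    HealthTracker(int failure_threshold, int recovery_threshold)
        : failure_threshold_(failure_threshold),
          recovery_threshold_(recovery_threshold) {}

    // Records one probe result and returns the current verdict.
    bool record(bool probe_succeeded) {
        if (probe_succeeded) {
            consecutive_failures_ = 0;
            ++consecutive_successes_;
            if (!healthy_ && consecutive_successes_ >= recovery_threshold_) {
                healthy_ = true;   // Enough good probes in a row: trust it again
            }
        } else {
            consecutive_successes_ = 0;
            ++consecutive_failures_;
            if (healthy_ && consecutive_failures_ >= failure_threshold_) {
                healthy_ = false;  // Enough bad probes in a row: fail over
            }
        }
        return healthy_;
    }

    bool healthy() const { return healthy_; }

private:
    int failure_threshold_;
    int recovery_threshold_;
    int consecutive_failures_ = 0;
    int consecutive_successes_ = 0;
    bool healthy_ = true;  // Instances are assumed healthy until proven otherwise
};
```

With HealthTracker tracker(3, 2), the verdict flips to unhealthy on the third consecutive failed probe and back to healthy on the second consecutive success. A lower failure threshold shortens detection time at the cost of more false failovers.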

AWS Infrastructure Considerations

Deploying this architecture on AWS involves several key components:

AWS RDS Aurora: For managed, highly available MySQL. Ensure your cluster has at least one Aurora Replica in a different Availability Zone from the writer (a Multi-AZ deployment) and uses appropriate instance types.
EC2 Instances for ProxySQL: Deploy ProxySQL on EC2 instances. For high availability of ProxySQL itself, consider deploying multiple instances behind an AWS Network Load Balancer (NLB). An Application Load Balancer (ALB) is not an option here, because it only handles HTTP/HTTPS traffic, whereas the MySQL protocol needs a TCP (layer 4) load balancer.
Auto Scaling Groups: Use Auto Scaling Groups for your ProxySQL instances to ensure that if an instance fails, a new one is automatically launched.
CloudWatch Alarms: Configure CloudWatch alarms to monitor the health of your Aurora instances and ProxySQL instances. These alarms can trigger actions like Lambda functions to manage failover or notify operators.
IAM Roles: Use IAM roles to grant each component only the permissions it needs to interact with other AWS services: instance profiles for the ProxySQL EC2 instances, and execution roles for Lambda functions (e.g., to modify Route 53 records if using DNS-based failover).

DNS-Based Failover (Alternative/Complementary)

Instead of relying solely on the client's endpoint switching, you could use AWS Route 53 with health checks. Route 53 can monitor the health of your primary ProxySQL endpoint (or even the Aurora writer endpoint directly, though ProxySQL adds valuable abstraction). If the primary endpoint fails health checks, Route 53 automatically starts answering DNS queries with the secondary ProxySQL endpoint. Your C++ clients would then connect to a stable DNS name, and Route 53 handles the underlying IP address change. Note that Route 53 health checkers run outside your VPC, so for private endpoints the health check typically has to be based on a CloudWatch alarm rather than a direct TCP probe.

# Example Route 53 Health Check Configuration (Conceptual)
# This would typically be managed via AWS Console, CLI, or IaC tools like Terraform/CloudFormation.

# Health Check for Primary ProxySQL Endpoint
# Protocol: TCP
# Port: 3306
# Request Interval: 10 seconds
# Failure Threshold: 3

# Health Check for Secondary ProxySQL Endpoint
# ... similar configuration ...

# DNS Record Set (e.g., mysql.yourdomain.com)
# Type: A
# Alias: Yes
# Alias Target: Primary ProxySQL NLB/ELB
# Health Check ID: [ID of Primary ProxySQL Health Check]
# Set Failover: Yes
# Secondary Record Set:
# Alias Target: Secondary ProxySQL NLB/ELB
# Health Check ID: [ID of Secondary ProxySQL Health Check]

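This approach simplifies client configuration but slows failover. Detection takes roughly the request interval times the failure threshold (about 30 seconds with the settings above), and clients may keep using the old address until cached DNS answers expire. Long-lived pooled connections also stay attached to the old endpoint until they break and reconnect. It's often used in conjunction with ProxySQL for multi-layered resilience.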
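
To get a feel for the delay DNS-based failover adds, the worst case can be approximated as the health check detection time plus the record's TTL. The sketch below encodes that back-of-the-envelope estimate. The struct and function names are hypothetical, and the 60-second TTL used in the usage note is an assumption; real behavior also depends on resolver caching and on when the client next resolves the name.

```cpp
#include <chrono>

// Hypothetical settings mirroring the conceptual health check above.
struct DnsFailoverSettings {
    std::chrono::seconds health_check_interval;  // Time between health check requests
    int failure_threshold;                       // Consecutive failures before "unhealthy"
    std::chrono::seconds record_ttl;             // How long resolvers may cache the answer
};

// Rough upper bound: time to detect the failure plus time for cached answers to expire.
inline std::chrono::seconds estimate_dns_failover_window(const DnsFailoverSettings& s) {
    return s.health_check_interval * s.failure_threshold + s.record_ttl;
}
```

For the 10-second interval and failure threshold of 3 shown above, combined with an assumed 60-second TTL, estimate_dns_failover_window returns 90 seconds. That is considerably slower than a client that switches to a fallback endpoint after a failed reconnect, which is one reason to combine both mechanisms.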

Conclusion

Architecting for automated failover requires a layered approach. By combining AWS RDS Aurora's inherent availability, ProxySQL's intelligent routing and health monitoring, and resilient C++ client logic with fallback endpoint support, you can build a highly available MySQL deployment. Augmenting this with external orchestration and potentially DNS-based failover provides a robust disaster recovery solution suitable for mission-critical applications.