Disaster Recovery 101: Architecting Auto-Failovers for MySQL and C Deployments on OVH

Establishing a High-Availability MySQL Cluster with Automatic Failover

Disaster recovery for a critical MySQL database depends on automated failover; relying on manual intervention during an outage extends downtime and its business impact. This section details the architecture and implementation of a highly available MySQL cluster on OVH, using Galera-based synchronous replication for the data layer and ProxySQL to route clients away from failed nodes.

We will focus on a Percona XtraDB Cluster (PXC) deployment. PXC is built on Galera Cluster, a multi-master replication technology that is virtually synchronous: a transaction is certified by every node before the commit returns, although each node applies it locally afterwards. This provides strong consistency and automatic node provisioning, and it removes the need for manual failover scripts because every node already holds the committed data.

OVH Infrastructure Considerations

For optimal performance and resilience, place each PXC node in a different Availability Zone (AZ) of an OVH region, where the region offers several. With three nodes in three AZs, losing an entire data center still leaves a two-node majority. Network latency between AZs is a critical factor for synchronous replication because every commit waits for a round trip to the other nodes; measure it on OVH’s internal network before committing to a layout.

Each node should have dedicated block storage (e.g., OVH’s SSD Block Storage) for optimal I/O performance. Network configuration should ensure high bandwidth and low latency between nodes. Consider using OVH’s private network capabilities for inter-node communication to enhance security and performance.

Percona XtraDB Cluster Deployment and Configuration

A typical PXC cluster consists of at least three nodes to maintain quorum. We’ll outline the configuration for a single node, which can then be replicated across the cluster.
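The three-node minimum follows from majority voting: a group of nodes keeps serving only while it holds more than half of the cluster's votes. The sketch below works through that arithmetic under the assumption that every node carries equal weight; the function names are illustrative and not part of any PXC tooling.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can be lost at once while a majority survives."""
    return total_nodes - quorum_size(total_nodes)


# Compare common cluster sizes: nodes, quorum, simultaneous failures tolerated.
for n in (2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

A two-node cluster tolerates no failure at all, and a fourth node adds no tolerance over three, which is why odd cluster sizes are the usual choice.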

Prerequisites:

A Linux distribution (e.g., Ubuntu 20.04 LTS) on each OVH instance.
Root or sudo access.
Firewall rules allowing MySQL client traffic (3306), Galera group communication (4567, TCP and UDP), IST (4568), and SST (4444) between the nodes.

Installation (on each node):

First, add the Percona repository and enable the PXC 8.0 packages:

wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
sudo percona-release setup pxc-80
sudo apt-get update

Install Percona XtraDB Cluster:

sudo apt-get install percona-xtradb-cluster

Configuration (on each node):

The primary configuration file is /etc/mysql/my.cnf. We need to configure it for Galera replication. Below is a sample configuration for the first node. Note the use of private IP addresses for inter-node communication.

[mysqld]
server-id = 1
datadir=/var/lib/mysql
socket=/var/run/mysqld/mysqld.sock
log-error=/var/log/mysql/error.log
pid-file=/var/run/mysqld/mysqld.pid

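# PXC specific settings (paths and options below assume PXC 8.0)
wsrep_provider=/usr/lib/galera4/libgalera_smm.so  # /usr/lib/galera3/libgalera_smm.so on PXC 5.7
wsrep_cluster_name="my_pxc_cluster"
wsrep_cluster_address="gcomm://10.0.0.1,10.0.0.2,10.0.0.3"  # Replace with private IPs of all nodes
wsrep_node_name="pxc-node-1"
wsrep_node_address="10.0.0.1"  # This node's private IP
wsrep_sst_method=xtrabackup-v2
# wsrep_sst_auth="sstuser:your_sst_password"  # PXC 5.7 only; the option was removed in 8.0
# PXC 8.0 encrypts cluster traffic by default (pxc_encrypt_cluster_traffic=ON),
# so every node needs the same SSL certificates before it can join.
binlog_format=ROW  # Galera replicates row events only
default_storage_engine=InnoDB

# InnoDB settings
innodb_autoinc_lock_mode=2
# 0 favors throughput: a crashed node can lose up to about a second of local
# commits and resyncs from its peers. Use 1 if all nodes could fail together.
innodb_flush_log_at_trx_commit=0
innodb_buffer_pool_size=4G  # Adjust based on instance RAM

# Other MySQL settings
bind-address=0.0.0.0  # Prefer the private IP if clients only use the private network
skip-name-resolve
max_connections=200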

For subsequent nodes, increment server-id and update wsrep_node_name and wsrep_node_address accordingly. The wsrep_cluster_address should list all nodes in the cluster.
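The per-node differences can be derived mechanically from an ordered list of private IPs. The following is a minimal sketch of that idea, not a replacement for a real templating tool; the function name and the `pxc-node-` prefix are assumptions taken from the sample configuration above.

```python
def node_settings(nodes: list[str], index: int, name_prefix: str = "pxc-node-") -> dict[str, str]:
    """Return the settings that identify nodes[index] within the cluster."""
    if not 0 <= index < len(nodes):
        raise IndexError("index must refer to an entry in nodes")
    return {
        "server-id": str(index + 1),
        "wsrep_node_name": f"{name_prefix}{index + 1}",
        "wsrep_node_address": nodes[index],
        # Identical on every node: the full member list.
        "wsrep_cluster_address": "gcomm://" + ",".join(nodes),
    }


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
for i in range(len(nodes)):
    for key, value in node_settings(nodes, i).items():
        print(f"{key}={value}")
    print()
```

Only the first three values change from node to node; keeping the member list in one place avoids the common mistake of a node that lists an outdated set of peers.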

Initial Cluster Bootstrap:

Start the first node with the bootstrap option:

sudo systemctl start mysql@bootstrap.service

Once the first node is up and running, start the remaining nodes normally, one at a time, waiting for each to finish its state transfer before starting the next:

sudo systemctl start mysql.service

Verify cluster status:

SHOW STATUS LIKE 'wsrep_cluster_size';

This should show the total number of nodes in the cluster (3 once every node has joined).
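Cluster size alone does not prove a node is usable, so a health check usually looks at a few more wsrep status variables. The sketch below assumes the rows of SHOW STATUS LIKE 'wsrep_%' have already been fetched into a dictionary; the function name is illustrative, and the rules are one reasonable interpretation rather than an official checklist.

```python
def cluster_problems(status: dict[str, str], expected_size: int) -> list[str]:
    """Return human-readable problems found in one node's wsrep status."""
    problems = []
    size = int(status.get("wsrep_cluster_size", "0"))
    if size != expected_size:
        problems.append(f"cluster size is {size}, expected {expected_size}")
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not accepting queries")
    return problems


healthy = {
    "wsrep_cluster_size": "3",
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
}
print(cluster_problems(healthy, expected_size=3))
```

An empty list means the node passed every check. Note that a node acting as an SST donor reports a different local state and would be flagged by this strict version.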

Implementing Automatic Failover with ProxySQL

While PXC provides high availability at the database layer, applications need a robust way to connect to a healthy node and automatically switch if a node fails. ProxySQL is an excellent choice for this. It’s a high-performance, high-availability, and scalable MySQL proxy that sits between your application and your database cluster.

ProxySQL can monitor the health of PXC nodes and route traffic accordingly. It also supports read/write splitting and query caching.

ProxySQL Installation and Configuration

Install ProxySQL on separate instances, ideally co-located with your application servers or in a dedicated proxy tier. For high availability of the proxy layer itself, deploy at least two ProxySQL instances.

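# On Ubuntu/Debian, with the ProxySQL APT repository configured
# (Percona's repository packages ProxySQL 2.x as proxysql2 instead)
sudo apt-get install proxysql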
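With two or more ProxySQL instances, the application (or a virtual IP in front of the proxies) still has to choose a live one. The sketch below shows the simplest client-side approach, trying each proxy in order; the function name and the example proxy addresses are hypothetical, and production drivers often offer their own multi-host support.

```python
import socket
from typing import Callable, Optional

Endpoint = tuple[str, int]


def first_reachable(
    endpoints: list[Endpoint],
    timeout: float = 1.0,
    connect: Callable[..., socket.socket] = socket.create_connection,
) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a TCP connection, else None."""
    for endpoint in endpoints:
        try:
            conn = connect(endpoint, timeout=timeout)
        except OSError:
            continue  # refused, unreachable, or timed out: try the next proxy
        conn.close()
        return endpoint
    return None


# Example call (hypothetical proxy addresses on the private network):
# first_reachable([("10.0.1.10", 6033), ("10.0.1.11", 6033)])
```

The `connect` parameter exists only so the function can be exercised without a network; a TCP connect confirms that the proxy is listening, not that it has healthy backends.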

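ProxySQL Configuration (/etc/proxysql.cnf):

The file uses libconfig syntax, and ProxySQL reads it only when its internal database (proxysql.db) does not exist yet; later changes go through the admin interface. Because PXC is a Galera cluster, node roles are defined in mysql_galera_hostgroups rather than in the replication hostgroups used for asynchronous replication.

datadir="/var/lib/proxysql"
errorlog="/var/lib/proxysql/proxysql.log"

admin_variables =
{
    admin_credentials="admin:admin"  # Change before production use
    mysql_ifaces="127.0.0.1:6032"
}

mysql_variables =
{
    interfaces="0.0.0.0:6033"
    monitor_username="monitor"  # This account must also exist on the PXC nodes
    monitor_password="monitor_password"
}

mysql_servers =
(
    # All nodes start in the writer hostgroup; the Galera monitor moves them as needed
    { address="10.0.0.1", port=3306, hostgroup=1, weight=1000 },
    { address="10.0.0.2", port=3306, hostgroup=1, weight=100 },
    { address="10.0.0.3", port=3306, hostgroup=1, weight=100 }
)

mysql_galera_hostgroups =
(
    {
        writer_hostgroup=1
        reader_hostgroup=2
        backup_writer_hostgroup=3
        offline_hostgroup=4
        max_writers=1
        writer_is_also_reader=1
        max_transactions_behind=100
        active=1
    }
)

mysql_users =
(
    # Application user; default_hostgroup=1 sends its queries to the writer
    { username="app_user", password="app_password", default_hostgroup=1, max_connections=1000, active=1 }
)

mysql_query_rules =
(
    # Queries that match no rule go to the user's default_hostgroup (the writer).
    # Example: route SELECT queries to the reader hostgroup
    # { rule_id=1, active=1, match_digest="^SELECT", destination_hostgroup=2, apply=1 }
)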

ProxySQL Administration and Health Checks:

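ProxySQL uses its admin interface (default port 6032) to manage its configuration and monitor nodes. Monitoring requires a MySQL account on the PXC nodes whose credentials match mysql-monitor_username and mysql-monitor_password. You can connect to the admin interface using the mysql client (admin/admin are the default credentials and should be changed):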

mysql -u admin -padmin -h 127.0.0.1 -P 6032

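Load the configuration into the runtime, set the monitoring intervals (in milliseconds), and persist everything to disk:

LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
UPDATE global_variables SET variable_value='1000'
 WHERE variable_name IN ('mysql-monitor_ping_interval',
                         'mysql-monitor_connect_interval',
                         'mysql-monitor_galera_healthcheck_interval');
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL USERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
SAVE MYSQL VARIABLES TO DISK;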

ProxySQL will now monitor the health of the PXC nodes. If a node becomes unhealthy (e.g., fails to respond to pings or connections, or leaves the Synced state), ProxySQL will automatically remove it from the active pool and direct traffic to the remaining healthy nodes. When the failed node recovers, it rejoins the Galera cluster through IST or SST, and ProxySQL returns it to the pool once its checks pass again.
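The rerouting described above can be pictured as a small classification step that runs after every round of health checks. The code below is a simplified model of a single-writer policy that prefers the healthy node with the highest weight; it is not ProxySQL's implementation, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Node:
    address: str
    weight: int
    healthy: bool


def assign_hostgroups(nodes: list[Node], max_writers: int = 1) -> dict[str, list[str]]:
    """Split nodes into writer, backup-writer, and offline groups."""
    # Highest weight first; sorted() is stable, so ties keep their listed order.
    healthy = sorted((n for n in nodes if n.healthy), key=lambda n: -n.weight)
    return {
        "writer": [n.address for n in healthy[:max_writers]],
        "backup_writer": [n.address for n in healthy[max_writers:]],
        "offline": [n.address for n in nodes if not n.healthy],
    }


cluster = [
    Node("10.0.0.1", weight=1000, healthy=False),  # the preferred writer has failed
    Node("10.0.0.2", weight=100, healthy=True),
    Node("10.0.0.3", weight=100, healthy=True),
]
print(assign_hostgroups(cluster))
```

In this run the failed preferred writer moves to the offline group and the next healthy node takes over writes. Keeping a single writer avoids certification conflicts that can occur when several Galera nodes accept writes to the same rows.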

Automating Failover Orchestration

While ProxySQL handles the immediate failover by rerouting traffic, a complete disaster recovery strategy might involve more complex orchestration, especially if an entire AZ becomes unavailable. For this, we can leverage tools like Ansible or custom scripts that monitor cluster health and can trigger actions.
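A custom monitoring script should not act on a single failed check, since one dropped packet would then trigger a failover. One common pattern is to require several consecutive failures before firing an action once. The sketch below illustrates that pattern with injected callables, so the actual check (for example a wsrep status query) and the action (for example launching a playbook) are left as assumptions.

```python
from typing import Callable


class FailureDetector:
    """Fire on_failure once after `threshold` consecutive failed checks."""

    def __init__(self, check: Callable[[], bool], on_failure: Callable[[], None], threshold: int = 3):
        self.check = check
        self.on_failure = on_failure
        self.threshold = threshold
        self.consecutive_failures = 0
        self.triggered = False

    def poll(self) -> None:
        """Run one health check and update the failure streak."""
        if self.check():
            self.consecutive_failures = 0
            self.triggered = False  # re-arm after recovery
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.triggered:
            self.triggered = True
            self.on_failure()


# Simulated check results: one success, then a sustained outage.
results = iter([True, False, False, False, False])
events: list[str] = []
detector = FailureDetector(lambda: next(results), lambda: events.append("failover"))
for _ in range(5):
    detector.poll()
print(events)
```

In a real script, poll() would be called on a timer, and the threshold multiplied by the polling interval gives the minimum outage length that triggers recovery.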

Ansible Playbook for PXC Node Management

An Ansible playbook can automate the process of starting, stopping, and checking the status of PXC nodes. It can also be used to reconfigure nodes or perform SSTs if needed.

---
- name: Manage Percona XtraDB Cluster Nodes
  hosts: pxc_cluster
  become: yes
  serial: 1  # One node at a time, so a config change never restarts the whole cluster at once
  vars:
    pxc_cluster_nodes:
      - "10.0.0.1"
      - "10.0.0.2"
      - "10.0.0.3"
    pxc_cluster_name: "my_pxc_cluster"
    pxc_sst_user: "sstuser"  # SST credentials apply to PXC 5.7 only
    pxc_sst_password: "your_sst_password"  # Keep real secrets in Ansible Vault

  tasks:
    - name: Ensure Percona XtraDB Cluster is installed
      apt:
        name: percona-xtradb-cluster
        state: present
        update_cache: yes

    - name: Configure PXC node
      template:
        src: templates/my.cnf.j2
        dest: /etc/mysql/my.cnf
      notify: Restart MySQL

    - name: Ensure MySQL service is running and enabled
      systemd:
        name: mysql
        state: started
        enabled: yes

  handlers:
    - name: Restart MySQL
      systemd:
        name: mysql
        state: restarted

The templates/my.cnf.j2 file would dynamically generate the my.cnf based on the node’s IP and its position in the cluster.

Disaster Recovery Scenario: AZ Failure

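If an entire Availability Zone fails, PXC will detect the loss of its nodes, and ProxySQL will automatically stop sending traffic to them. The surviving nodes continue to operate only if they still form a majority of the cluster; if the failed AZ held two of three nodes, the remaining node drops out of the primary component and rejects queries until quorum is restored. Because a transaction is replicated to every node before its commit returns, transactions acknowledged to clients are already present on the surviving nodes.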
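Whether the cluster survives an AZ outage depends on how the nodes are placed, which can be checked before deployment. The following sketch assumes equal node weights and uses hypothetical node and AZ names; it reports every AZ whose loss would leave the survivors without a strict majority.

```python
from collections import Counter


def az_losses_that_break_quorum(placement: dict[str, str]) -> list[str]:
    """Return the AZs whose loss leaves no strict majority of nodes."""
    total = len(placement)
    per_az = Counter(placement.values())
    # Survivors need more than half of the original node count.
    return sorted(az for az, count in per_az.items() if (total - count) * 2 <= total)


spread = {"pxc-node-1": "az-a", "pxc-node-2": "az-b", "pxc-node-3": "az-c"}
stacked = {"pxc-node-1": "az-a", "pxc-node-2": "az-a", "pxc-node-3": "az-b"}
print(az_losses_that_break_quorum(spread))
print(az_losses_that_break_quorum(stacked))
```

The first layout tolerates the loss of any single AZ, while the second loses quorum if az-a fails. With only two AZs available, a lightweight arbitrator in a third location is a common way to break the tie.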

To fully restore the cluster to its original state (or a new desired state), you would typically:

Provision new instances in the affected AZ.
Use Ansible to deploy PXC on the new instances, configuring them to join the existing cluster. ProxySQL picks the nodes up automatically once they are healthy if they reuse addresses already listed in mysql_servers; nodes with new addresses must be added to ProxySQL first.
If the original AZ is permanently lost, you might need to reconfigure ProxySQL to reflect the new topology and potentially rebalance read traffic.

For more advanced scenarios, consider integrating with OVH’s API to automate the provisioning of new instances in case of an AZ failure, further reducing recovery time.

Architecting Resilient C Deployments with Kubernetes and Helm

For stateless applications, particularly those written in C and compiled for high performance, achieving resilience and automated recovery is often best managed within a container orchestration platform like Kubernetes. This section outlines how to deploy and manage C applications on OVH’s Kubernetes Service (K8s) with a focus on automated failover and self-healing capabilities.

OVH Kubernetes Service (K8s) Setup

OVH’s Managed Kubernetes service simplifies the deployment and management of Kubernetes clusters. When setting up your cluster, ensure you distribute your worker nodes across multiple Availability Zones for high availability. This is crucial for ensuring that your application pods can be rescheduled to healthy nodes if a node or an entire AZ fails.

Key Considerations:

Node Pools: Create multiple node pools, each spanning different AZs. This allows Kubernetes to schedule pods across these zones.
Resource Limits: Define appropriate CPU and memory requests and limits for your C application pods to ensure predictable performance and prevent resource starvation.
Networking: Understand OVH’s CNI (Container Network Interface) implementation and how it integrates with your application’s networking requirements.

Containerizing C Applications

The first step is to create a Dockerfile for your C application. For performance-critical C applications, minimizing the runtime image size and compiling efficiently are key.

# Build stage: the official GCC image provides the compiler toolchain
FROM gcc:11 AS builder

WORKDIR /app

# Copy source code
COPY . /app

# Compile the C application
# -O3 for aggressive optimization
# -march=x86-64-v2 targets a fixed baseline; -march=native would tie the binary to the
#   build host's CPU and can crash with an illegal instruction on other nodes
# -static to create a statically linked binary, reducing runtime dependencies
RUN gcc -O3 -march=x86-64-v2 -static my_app.c -o my_app

# Use a minimal runtime image; pin a specific version for reproducible builds
FROM alpine:3.19

WORKDIR /app

# Copy the compiled binary from the builder stage
COPY --from=builder /app/my_app /app/my_app

# Expose any necessary ports
EXPOSE 8080

# Command to run the application
CMD ["/app/my_app"]

Build and push this Docker image to a container registry accessible by your OVH K8s cluster (e.g., OVH’s Container Registry or Docker Hub).

Deploying with Helm for Resilience

Helm is a package manager for Kubernetes that simplifies the deployment and management of applications. We’ll create a Helm chart to deploy our C application with high availability and self-healing capabilities.

Helm Chart Structure

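A basic Helm chart structure (the templates below call the my-c-app.fullname, my-c-app.labels, and my-c-app.selectorLabels helpers, which helm create generates in _helpers.tpl):

my-c-app/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── _helpers.tpl
    ├── deployment.yaml
    └── service.yaml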

Chart.yaml

apiVersion: v2
name: my-c-app
description: A Helm chart for deploying a C application
version: 0.1.0
appVersion: "1.0.0"

values.yaml

replicaCount: 3

image:
  repository: your-registry/my-c-app
  pullPolicy: IfNotPresent
  tag: "1.0.0"  # Pin a version; with IfNotPresent, a moving tag such as latest may never be re-pulled

service:
  type: ClusterIP
  port: 8080

resources: {}
  # limits:
  #   cpu: 500m
  #   memory: 512Mi
  # requests:
  #   cpu: 250m
  #   memory: 256Mi

nodeSelector: {}

tolerations: []

affinity: {}

templates/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "my-c-app.fullname" . }}
  labels:
    {{- include "my-c-app.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "my-c-app.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "my-c-app.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      nodeSelector: {{ .Values.nodeSelector | toYaml | nindent 8 }}
      tolerations: {{ .Values.tolerations | toYaml | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz # Assuming your app has a /healthz endpoint
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz # Assuming your app has a /readyz endpoint
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}

templates/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-c-app.fullname" . }}
  labels:
    {{- include "my-c-app.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
  selector:
    {{- include "my-c-app.selectorLabels" . | nindent 4 }}

Key Resilience Features in the Deployment:

replicaCount: 3: Asks Kubernetes to keep three instances of your application running; the Deployment controller replaces any pod that disappears.
livenessProbe and readinessProbe: These probes allow Kubernetes to detect unhealthy application instances. If a liveness probe fails failureThreshold times in a row, the kubelet restarts the container. If a readiness probe fails, Kubernetes stops sending Service traffic to that pod until it becomes ready again.
Pod Spreading (Best Effort by Default): With nodes deployed across multiple AZs, the default scheduler tries to spread replicas across nodes and zones but does not guarantee it. For explicit control, add pod anti-affinity rules or topologySpreadConstraints keyed on topology.kubernetes.io/zone (through the affinity value or directly in deployment.yaml) so replicas land on different nodes or AZs.
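The probe settings determine how quickly a broken instance is noticed, which is worth estimating when tuning failover time. The sketch below is an approximation that ignores scheduling jitter and assumes failing probes run until their timeout in the worst case; the function name is illustrative.

```python
def detection_window(period_s: int, failure_threshold: int, timeout_s: int) -> tuple[int, int]:
    """Approximate (best, worst) seconds from a failure to the probe's action."""
    # Best case: the app breaks just before a probe, and each probe fails instantly.
    best = (failure_threshold - 1) * period_s
    # Worst case: the app breaks just after a successful probe, and probes time out.
    worst = failure_threshold * period_s + timeout_s
    return best, worst


# Values taken from the deployment.yaml above.
print("liveness:", detection_window(period_s=20, failure_threshold=3, timeout_s=5))
print("readiness:", detection_window(period_s=10, failure_threshold=3, timeout_s=3))
```

With the values above, a hung container is restarted after roughly 40 to 65 seconds, while traffic stops flowing to an unready pod after roughly 20 to 33 seconds. Shorter periods detect failures sooner at the cost of more probe traffic and a higher risk of restarting a pod that is merely slow.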

Deploying and Managing

Install Helm if you haven’t already. Then, package your chart and deploy it to your OVH K8s cluster:

# Navigate to your chart directory
cd my-c-app

# Package the chart
helm package .

# Install the chart
helm install my-release ./my-c-app-0.1.0.tgz --namespace my-app-ns --create-namespace

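Kubernetes will now ensure that 3 replicas of your C application are running. If a node fails, Kubernetes marks it NotReady and, once the pods’ toleration for an unreachable node expires (300 seconds by default), evicts them and reschedules replacements onto a healthy node, potentially in a different AZ. The replicas on other nodes keep serving traffic in the meantime.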

Advanced Failover Strategies

For more sophisticated failover scenarios, consider:

Custom Controllers/Operators: Develop Kubernetes Operators to manage the lifecycle of your C application, including complex failover logic or integration with external systems.
Service Mesh (e.g., Istio, Linkerd): Implement a service mesh to gain advanced traffic management capabilities, including automatic retries, circuit breaking, and fine-grained control over traffic routing during failures.
External Load Balancers: Use OVH’s Load Balancer service in front of your Kubernetes cluster for external traffic, configured to health-check your application’s ingress or service endpoints.

By leveraging Kubernetes’ built-in self-healing mechanisms and Helm for declarative deployments, you can achieve robust automated failover for your C applications, ensuring high availability and resilience on OVH’s infrastructure.