Disaster Recovery 101: Architecting Auto-Failovers for MySQL and C Deployments on OVH
Establishing a High-Availability MySQL Cluster with Automatic Failover
Achieving true disaster recovery for critical databases like MySQL necessitates an automated failover strategy. Relying on manual intervention during an outage is a recipe for extended downtime and significant business impact. This section details the architecture and implementation of a highly available MySQL cluster on OVH, leveraging synchronous replication and an intelligent orchestrator for seamless failover.
We will focus on a Percona XtraDB Cluster (PXC) deployment. PXC is built on Galera, a virtually synchronous multi-master replication technology that provides strong consistency and automatic provisioning of joining nodes. This avoids the replication lag of asynchronous setups and removes the need for hand-written failover scripts.
OVH Infrastructure Considerations
For resilience, deploy the PXC nodes in different Availability Zones (AZs) within an OVH region, so that the failure of a single data center does not take down the whole cluster. Network latency between AZs is a critical factor for synchronous replication, because every commit waits on the other nodes. Latency inside a single OVH region is generally low enough to make this feasible, but measure the round-trip time between your zones before relying on it.
Each node should have dedicated block storage (e.g., OVH’s SSD Block Storage) for optimal I/O performance. Network configuration should ensure high bandwidth and low latency between nodes. Consider using OVH’s private network capabilities for inter-node communication to enhance security and performance.
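To make the latency point concrete, here is a rough back-of-the-envelope sketch. It assumes that each commit issued on a single connection waits for at least one network round trip to the other nodes, so the round-trip time caps how many commits that one connection can perform per second. The function name is illustrative, and real throughput also depends on parallel connections and transaction size.

```python
def max_serial_commits_per_second(rtt_ms):
    """Rough upper bound on commits per second for ONE connection,
    assuming every commit costs at least one round trip of rtt_ms."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms
```

Under this assumption, a 2 ms round trip between AZs limits a single connection to about 500 commits per second, while many connections committing in parallel can exceed that in aggregate.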
Percona XtraDB Cluster Deployment and Configuration
A typical PXC cluster consists of at least three nodes, so that a majority (quorum) survives the loss of any single node. We’ll outline the configuration for one node; the other nodes use the same file with their own node-specific values.
Prerequisites:
- A Linux distribution (e.g., Ubuntu 20.04 LTS) on each OVH instance.
- Root or sudo access.
- Firewall rules allowing MySQL client traffic (3306), Galera group communication (4567), IST (4568), and SST (4444) between the nodes.
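Before installing anything, it can help to confirm that the firewall rules actually let the nodes reach each other. The following is a minimal sketch using only the Python standard library; the function name and the port descriptions are illustrative, not part of any Percona tooling.

```python
import socket

# Ports from the prerequisites list, with their usual roles in a PXC cluster.
PXC_PORTS = {
    3306: "MySQL client traffic",
    4444: "SST (State Snapshot Transfer)",
    4567: "Galera group communication",
    4568: "IST (Incremental State Transfer)",
}

def check_ports(host, ports=PXC_PORTS, timeout=2.0):
    """Return {port: True/False} for TCP reachability of each port on host."""
    results = {}
    for port in ports:
        try:
            # A successful TCP handshake means something is listening
            # and no firewall is blocking the path.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Calling check_ports("10.0.0.2") from another node returns a dictionary keyed by port. The SST and IST ports are typically only open while a transfer is in progress, so a False result for 4444 or 4568 on an idle node does not by itself indicate a firewall problem.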
Installation (on each node):
First, add the Percona repository:
wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
sudo apt-get update
Install Percona XtraDB Cluster:
sudo apt-get install percona-xtradb-cluster
Configuration (on each node):
The primary configuration file is /etc/mysql/my.cnf. We need to configure it for Galera replication. Below is a sample configuration for the first node. Note the use of private IP addresses for inter-node communication.
[mysqld]
server-id = 1
datadir=/var/lib/mysql
socket=/var/run/mysqld/mysqld.sock
log-error=/var/log/mysql/error.log
pid-file=/var/run/mysqld/mysqld.pid

# PXC specific settings
# The provider path varies between PXC versions; check where your package installed it
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="my_pxc_cluster"
# Replace with the private IPs of all nodes
wsrep_cluster_address="gcomm://10.0.0.1,10.0.0.2,10.0.0.3"
wsrep_node_name="pxc-node-1"
# This node's private IP
wsrep_node_address="10.0.0.1"
wsrep_sst_method=xtrabackup-v2
# Used by PXC 5.7; PXC 8.0 manages the SST user internally and no longer uses this option
wsrep_sst_auth="sstuser:your_sst_password"

# Galera requires row-based binary logging
binlog_format=ROW

# InnoDB settings
innodb_autoinc_lock_mode=2
# 0 trades local crash durability for speed and relies on the other nodes holding the data
innodb_flush_log_at_trx_commit=0
# Adjust based on instance RAM
innodb_buffer_pool_size=4G

# Other MySQL settings
bind-address=0.0.0.0
skip-name-resolve
max_connections=200
For subsequent nodes, increment server-id and update wsrep_node_name and wsrep_node_address accordingly. The wsrep_cluster_address should list all nodes in the cluster.
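Since only a handful of values differ between nodes, they are easy to generate rather than edit by hand. Here is a minimal sketch that renders the node-specific lines for a given position in the node list. The function name and the pxc-node-N naming scheme are assumptions chosen to match the sample configuration above.

```python
def render_wsrep_settings(node_ips, index, cluster_name="my_pxc_cluster"):
    """Render the node-specific my.cnf lines for the node at position index.

    node_ips is the ordered list of private IPs of all cluster members;
    index is zero-based, so index 0 produces the settings for pxc-node-1.
    """
    if not 0 <= index < len(node_ips):
        raise IndexError("index must refer to an entry in node_ips")
    lines = [
        f"server-id = {index + 1}",
        f'wsrep_cluster_name="{cluster_name}"',
        # Every node lists all members, including itself.
        f'wsrep_cluster_address="gcomm://{",".join(node_ips)}"',
        f'wsrep_node_name="pxc-node-{index + 1}"',
        f'wsrep_node_address="{node_ips[index]}"',
    ]
    return "\n".join(lines)
```

For example, render_wsrep_settings(["10.0.0.1", "10.0.0.2", "10.0.0.3"], 1) produces the lines for the second node, with server-id 2 and node address 10.0.0.2, while the cluster address stays identical on every node.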
Initial Cluster Bootstrap:
Start the first node with the bootstrap option:
sudo systemctl start mysql@bootstrap.service
Once the first node is up and running, start the remaining nodes normally:
sudo systemctl start mysql.service
Verify cluster status:
SHOW STATUS LIKE 'wsrep_cluster_size';
This should show the total number of nodes in the cluster.
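Cluster size alone does not tell you whether a node is usable. Galera also exposes the status variables wsrep_cluster_status, wsrep_local_state_comment, and wsrep_ready. The sketch below assumes you have already fetched the rows of SHOW STATUS LIKE 'wsrep_%' with the client library of your choice; the function names are illustrative.

```python
def parse_status(rows):
    """Turn (Variable_name, Value) rows from SHOW STATUS into a dict."""
    return {name: value for name, value in rows}

def node_problems(status, expected_size):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced with the cluster")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    if int(status.get("wsrep_cluster_size", 0)) != expected_size:
        problems.append("cluster size differs from the expected node count")
    return problems
```

A monitoring script could call node_problems(parse_status(rows), 3) on each node and alert whenever the returned list is not empty.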
Implementing Automatic Failover with ProxySQL
While PXC provides high availability at the database layer, applications need a robust way to connect to a healthy node and automatically switch if a node fails. ProxySQL is an excellent choice for this. It’s a high-performance, high-availability, and scalable MySQL proxy that sits between your application and your database cluster.
ProxySQL can monitor the health of PXC nodes and route traffic accordingly. It also supports read/write splitting and query caching.
ProxySQL Installation and Configuration
Install ProxySQL on separate instances, ideally co-located with your application servers or in a dedicated proxy tier. For high availability of the proxy layer itself, deploy at least two ProxySQL instances.
# On Ubuntu/Debian
sudo apt-get install proxysql
ProxySQL Configuration (/etc/proxysql.cnf):
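# ProxySQL reads this file only on first start; afterwards it uses its internal database
datadir="/var/lib/proxysql"

admin_variables=
{
    # Keep the admin interface on localhost and change the default credentials
    admin_credentials="admin:admin"
    mysql_ifaces="127.0.0.1:6032"
}

mysql_variables=
{
    interfaces="0.0.0.0:6033"
    # Monitoring account; it must also exist on the PXC nodes
    monitor_username="monitor"
    monitor_password="monitor_password"
}

mysql_servers=
(
    # Hostgroup 1 = writer, hostgroup 2 = readers
    { address="10.0.0.1", port=3306, hostgroup=1 },
    { address="10.0.0.2", port=3306, hostgroup=2 },
    { address="10.0.0.3", port=3306, hostgroup=2 }
)

mysql_users=
(
    # Application user; queries go to the writer hostgroup unless a rule says otherwise
    { username="app_user", password="app_password", default_hostgroup=1, max_connections=1000 }
)

mysql_galera_hostgroups=
(
    # Lets the monitor move nodes between hostgroups based on Galera state (ProxySQL 2.x)
    { writer_hostgroup=1, backup_writer_hostgroup=3, reader_hostgroup=2, offline_hostgroup=4, active=1, max_writers=1, writer_is_also_reader=0, max_transactions_behind=100 }
)

mysql_query_rules=
(
    # Example: route SELECT queries to the reader hostgroup
    # { rule_id=1, active=1, match_digest="^SELECT", destination_hostgroup=2, apply=1 }
)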
ProxySQL Administration and Health Checks:
ProxySQL uses its admin interface (default 6032) to manage its configuration and monitor nodes. You can connect to it using the mysql client:
mysql -u admin -padmin -h 127.0.0.1 -P 6032
Load the configuration and start the monitoring:
UPDATE global_variables SET variable_value = '1000' WHERE variable_name IN ('mysql-monitor_ping_interval', 'mysql-monitor_connect_interval', 'mysql-monitor_read_only_interval');
LOAD MYSQL VARIABLES TO RUNTIME;
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL USERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
ProxySQL will now monitor the health of the PXC nodes. If a node becomes unhealthy (e.g., fails to respond to pings or connections), ProxySQL removes it from the active pool and directs traffic to the remaining healthy nodes. When the failed node recovers, ProxySQL detects it and returns it to the pool.
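The removal and re-admission behavior can be pictured with a small model. This is an illustrative sketch of the general technique (evicting a backend after a number of consecutive failed checks), not ProxySQL's actual implementation. The class name is hypothetical, and max_failures only mirrors the idea behind ProxySQL's mysql-monitor_ping_max_failures setting.

```python
class BackendPool:
    """Toy model of a proxy pool that evicts backends after repeated failures."""

    def __init__(self, backends, max_failures=3):
        self.max_failures = max_failures
        # Consecutive failed health checks per backend.
        self.failures = {backend: 0 for backend in backends}

    def record_check(self, backend, ok):
        """Record one health-check result; a success resets the counter."""
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def healthy(self):
        """Backends that are still eligible to receive traffic."""
        return [b for b, n in self.failures.items() if n < self.max_failures]
```

With the default threshold, three failed checks in a row remove a backend from healthy(), and a single successful check brings it back. Requiring consecutive failures keeps one dropped packet from triggering a failover.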
Automating Failover Orchestration
While ProxySQL handles the immediate failover by rerouting traffic, a complete disaster recovery strategy might involve more complex orchestration, especially if an entire AZ becomes unavailable. For this, we can leverage tools like Ansible or custom scripts that monitor cluster health and can trigger actions.
Ansible Playbook for PXC Node Management
An Ansible playbook can automate the process of starting, stopping, and checking the status of PXC nodes. It can also be used to reconfigure nodes or perform SSTs if needed.
---
- name: Manage Percona XtraDB Cluster Nodes
  hosts: pxc_cluster
  become: yes
  # Process one node at a time so a restart never takes down the whole cluster
  serial: 1
  vars:
    pxc_cluster_nodes:
      - "10.0.0.1"
      - "10.0.0.2"
      - "10.0.0.3"
    pxc_cluster_name: "my_pxc_cluster"
    pxc_sst_user: "sstuser"
    pxc_sst_password: "your_sst_password"
  tasks:
    - name: Ensure Percona XtraDB Cluster is installed
      apt:
        name: percona-xtradb-cluster
        state: present
        update_cache: yes
    - name: Configure PXC node
      template:
        src: templates/my.cnf.j2
        dest: /etc/mysql/my.cnf
      notify: Restart MySQL
    - name: Ensure MySQL service is running and enabled
      systemd:
        name: mysql
        state: started
        enabled: yes
  handlers:
    - name: Restart MySQL
      systemd:
        name: mysql
        state: restarted
The templates/my.cnf.j2 file would dynamically generate the my.cnf based on the node’s IP and its position in the cluster.
Disaster Recovery Scenario: AZ Failure
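If an entire Availability Zone fails, PXC detects the loss of the nodes in it, and ProxySQL stops sending traffic to them. The nodes in the remaining AZs continue to operate as long as they still form a majority of the cluster (quorum); with three nodes spread over three AZs, losing one AZ leaves two of three. Because replication is synchronous, transactions committed before the failure are already present on the surviving nodes.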
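Whether the surviving nodes keep serving traffic after an AZ failure depends on quorum: they must form a strict majority of the cluster. The sketch below checks this for a given node placement. It assumes every node has equal weight and that the AZ is lost abruptly rather than shut down gracefully; the function names and AZ labels are illustrative.

```python
def has_quorum(total_nodes, surviving_nodes):
    """A partition keeps quorum only with a strict majority of all nodes."""
    return surviving_nodes * 2 > total_nodes

def survives_az_loss(nodes_per_az):
    """Map each AZ name to whether the cluster keeps quorum if that AZ is lost."""
    total = sum(nodes_per_az.values())
    return {
        az: has_quorum(total, total - count)
        for az, count in nodes_per_az.items()
    }
```

With one node in each of three AZs, survives_az_loss({"az1": 1, "az2": 1, "az3": 1}) reports True for every AZ. Placing two of three nodes in the same AZ, as in {"az1": 2, "az2": 1}, means that losing az1 leaves a single node without quorum, which is why the nodes should be spread evenly.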
To fully restore the cluster to its original state (or a new desired state), you would typically:
- Provision new instances in the affected AZ.
- Use Ansible to deploy PXC on the new instances, configuring them to join the existing cluster. If the replacement nodes reuse the original addresses, ProxySQL starts routing to them once they are healthy; if they have new addresses, add them to ProxySQL's server list first.
- If the original AZ is permanently lost, you might need to reconfigure ProxySQL to reflect the new topology and potentially rebalance read traffic.
For more advanced scenarios, consider integrating with OVH’s API to automate the provisioning of new instances in case of an AZ failure, further reducing recovery time.
Architecting Resilient C Deployments with Kubernetes and Helm
For stateless applications, particularly those written in C and compiled for high performance, achieving resilience and automated recovery is often best managed within a container orchestration platform like Kubernetes. This section outlines how to deploy and manage C applications on OVH’s Kubernetes Service (K8s) with a focus on automated failover and self-healing capabilities.
OVH Kubernetes Service (K8s) Setup
OVH’s Managed Kubernetes service simplifies the deployment and management of Kubernetes clusters. When setting up your cluster, ensure you distribute your worker nodes across multiple Availability Zones for high availability. This is crucial for ensuring that your application pods can be rescheduled to healthy nodes if a node or an entire AZ fails.
Key Considerations:
- Node Pools: Create multiple node pools, each spanning different AZs. This allows Kubernetes to schedule pods across these zones.
- Resource Limits: Define appropriate CPU and memory requests and limits for your C application pods to ensure predictable performance and prevent resource starvation.
- Networking: Understand OVH’s CNI (Container Network Interface) implementation and how it integrates with your application’s networking requirements.
Containerizing C Applications
The first step is to create a Dockerfile for your C application. For performance-critical C applications, minimizing the base image size and ensuring efficient compilation is key.
# Build stage: the official GCC image provides the compiler toolchain
FROM gcc:11 AS builder
WORKDIR /app
# Copy source code
COPY . /app
# Compile the C application
# -O3 for aggressive optimization
# -static to create a statically linked binary, reducing runtime dependencies
# -march=native is deliberately omitted: it tunes the binary for the build
# host's CPU, which may differ from the CPUs of the cluster's worker nodes
RUN gcc -O3 -static my_app.c -o my_app

# Use a minimal runtime image (e.g., alpine)
FROM alpine:latest
WORKDIR /app
# Copy the compiled binary from the builder stage
COPY --from=builder /app/my_app /app/my_app
# Expose any necessary ports
EXPOSE 8080
# Command to run the application
CMD ["/app/my_app"]
Build and push this Docker image to a container registry accessible by your OVH K8s cluster (e.g., OVH’s Container Registry or Docker Hub).
Deploying with Helm for Resilience
Helm is a package manager for Kubernetes that simplifies the deployment and management of applications. We’ll create a Helm chart to deploy our C application with high availability and self-healing capabilities.
Helm Chart Structure
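A basic Helm chart structure (templates/_helpers.tpl, as generated by helm create, defines the my-c-app.fullname, my-c-app.labels, and my-c-app.selectorLabels templates used in the manifests):
my-c-app/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── _helpers.tpl
    ├── deployment.yaml
    └── service.yaml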
Chart.yaml
apiVersion: v2
name: my-c-app
description: A Helm chart for deploying a C application
version: 0.1.0
appVersion: "1.0.0"
values.yaml
replicaCount: 3
image:
  repository: your-registry/my-c-app
  pullPolicy: IfNotPresent
  tag: "latest"
service:
  type: ClusterIP
  port: 8080
resources: {}
  # limits:
  #   cpu: 500m
  #   memory: 512Mi
  # requests:
  #   cpu: 250m
  #   memory: 256Mi
nodeSelector: {}
tolerations: []
affinity: {}
templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "my-c-app.fullname" . }}
  labels:
    {{- include "my-c-app.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "my-c-app.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "my-c-app.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      nodeSelector: {{ .Values.nodeSelector | toYaml | nindent 8 }}
      tolerations: {{ .Values.tolerations | toYaml | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz # Assuming your app has a /healthz endpoint
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz # Assuming your app has a /readyz endpoint
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-c-app.fullname" . }}
  labels:
    {{- include "my-c-app.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
  selector:
    {{- include "my-c-app.selectorLabels" . | nindent 4 }}
Key Resilience Features in the Deployment:
- replicaCount: 3: Keeps three instances of your application running. Kubernetes replaces any pod that disappears in order to maintain this count.
- livenessProbe and readinessProbe: These probes allow Kubernetes to detect unhealthy application instances. If a liveness probe fails, Kubernetes restarts the container. If a readiness probe fails, Kubernetes stops sending traffic to that pod until it becomes ready again.
- Pod Anti-Affinity: With nodes deployed across multiple AZs, the default scheduler tends to spread replicas, but it does not guarantee it. For explicit control, add pod anti-affinity rules (through the affinity value consumed by deployment.yaml) to ensure replicas are spread across different nodes or even AZs.
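The probe settings determine how quickly Kubernetes reacts to a failure. As a rough estimate, a fault that occurs right after a successful probe is noticed after failureThreshold further probes, each one period apart, with the last one allowed to run until its timeout. The sketch below encodes that arithmetic. It ignores initialDelaySeconds and scheduling jitter, and the class and method names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def worst_case_detection_seconds(self):
        """Approximate upper bound on the time from fault to probe failure."""
        return self.period_seconds * self.failure_threshold + self.timeout_seconds

# Values taken from the liveness and readiness probes in deployment.yaml.
liveness = Probe(period_seconds=20, timeout_seconds=5, failure_threshold=3)
readiness = Probe(period_seconds=10, timeout_seconds=3, failure_threshold=3)
```

With the chart's values, this gives about 65 seconds before a hung container is restarted and about 33 seconds before an unready pod stops receiving traffic. Lowering periodSeconds shortens both, at the cost of more probe traffic.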
Deploying and Managing
Install Helm if you haven’t already. Then, package your chart and deploy it to your OVH K8s cluster:
# Navigate to your chart directory
cd my-c-app
# Package the chart
helm package .
# Install the chart
helm install my-release ./my-c-app-0.1.0.tgz --namespace my-app-ns --create-namespace
Kubernetes will now ensure that 3 replicas of your C application are running. If a node fails, Kubernetes will detect the pod’s unavailability and reschedule it onto a healthy node, potentially in a different AZ if configured.
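To check that the replicas really are spread over the zones, you can compare how many pods run in each one. The sketch below computes the skew, meaning the difference between the fullest and the emptiest zone, which is the same notion that the maxSkew field of Kubernetes topology spread constraints uses. It assumes you have already collected each pod's zone, for instance from the topology.kubernetes.io/zone label of its node; the function name is illustrative.

```python
from collections import Counter

def zone_skew(pod_zones, all_zones):
    """Difference between the most and least populated zone.

    pod_zones lists the zone of each running pod; all_zones lists every
    zone that has worker nodes, including zones that run no pod at all.
    """
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)
```

A skew of 0 or 1 means the replicas are evenly distributed. A skew equal to the replica count means every pod sits in one zone, so a single AZ failure would take the whole application offline until the pods are rescheduled.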
Advanced Failover Strategies
For more sophisticated failover scenarios, consider:
- Custom Controllers/Operators: Develop Kubernetes Operators to manage the lifecycle of your C application, including complex failover logic or integration with external systems.
- Service Mesh (e.g., Istio, Linkerd): Implement a service mesh to gain advanced traffic management capabilities, including automatic retries, circuit breaking, and fine-grained control over traffic routing during failures.
- External Load Balancers: Use OVH’s Load Balancer service in front of your Kubernetes cluster for external traffic, configured to health-check your application’s ingress or service endpoints.
By leveraging Kubernetes’ built-in self-healing mechanisms and Helm for declarative deployments, you can achieve robust automated failover for your C applications, ensuring high availability and resilience on OVH’s infrastructure.