Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and C Deployments on OVH
Establishing a High-Availability PostgreSQL Cluster with Automatic Failover
Achieving robust disaster recovery for PostgreSQL deployments hinges on implementing automatic failover mechanisms. This section details a production-ready architecture leveraging Patroni, a template for HA PostgreSQL, and Consul for distributed consensus, all orchestrated within the OVHcloud ecosystem. We’ll focus on a multi-region setup to ensure resilience against datacenter-level failures.
Core Components and Architecture Overview
Our strategy involves a primary PostgreSQL instance, one or more replica instances, and a distributed consensus system. Patroni manages the PostgreSQL lifecycle, including initialization, configuration, and failover. Consul provides the distributed key-value store and service discovery necessary for Patroni to elect a leader and coordinate state changes. For this example, we assume three OVHcloud Availability Zones (AZs) within a single region for initial demonstration, with the understanding that a true multi-region DR strategy would extend this across distinct OVHcloud geographic regions.
Deploying Consul for Distributed Consensus
Consul is the backbone of our automatic failover. It provides a reliable mechanism for Patroni to store cluster state and elect a leader. We’ll deploy a Consul cluster with an odd number of servers (typically 3 or 5) for quorum. For simplicity, we’ll outline the deployment of a single Consul server, but a production setup requires a clustered configuration.
First, install Consul on your chosen OVHcloud instances (e.g., Ubuntu 22.04 LTS). Ensure these instances have static IP addresses or are resolvable via DNS within your private network.
Consul Server Installation and Configuration
Download the latest Consul binary:
wget https://releases.hashicorp.com/consul/1.18.0/consul_1.18.0_linux_amd64.zip unzip consul_1.18.0_linux_amd64.zip sudo mv consul /usr/local/bin/
Create a systemd service file for Consul:
[Unit] Description=Consul Documentation=https://www.consul.io/docs After=network.target [Service] ExecStart=/usr/local/bin/consul agent -server -bootstrap-expect=1 -data-dir=/opt/consul -node=consul-server-1 -bind=10.0.0.10 -client=0.0.0.0 ExecStop=/usr/local/bin/consul leave Restart=on-failure LimitNOFILE=65536 [Install] WantedBy=multi-user.target
Create the data directory and start the service:
sudo mkdir -p /opt/consul sudo systemctl daemon-reload sudo systemctl enable consul sudo systemctl start consul
For a production cluster, you would configure multiple servers with appropriate `bootstrap-expect` and `retry-join` parameters to form a quorum. The `-bind` address should be the instance’s private IP, and `-client` should be `0.0.0.0` to allow agents to connect from other nodes.
Deploying Patroni for PostgreSQL HA
Patroni simplifies the management of HA PostgreSQL clusters. It uses the `etcd`, `Consul`, or `Zookeeper` for leader election and configuration storage. We’ll use Consul as our backend.
Patroni Installation
Install Patroni and its dependencies (e.g., `python3-pip`, `python3-dev`, `build-essential`).
sudo apt update sudo apt install -y python3-pip python3-dev build-essential sudo pip3 install 'patroni[consul]' psycopg2-binary
Patroni Configuration
Create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`). This configuration will be used by all Patroni instances in the cluster.
scope: my_pg_cluster
namespace: /service/patroni/my_pg_cluster
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.0.21:8008 # IP of this Patroni node
etcd: # Using Consul as the backend
host: 10.0.0.10:8500 # Consul server address
protocol: http
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.0.21:5432 # IP of this Patroni node
data_dir: /var/lib/postgresql/15/main
bin_dir: /usr/lib/postgresql/15/bin
pg_hba:
- host replication replicator 0.0.0.0/0 md5
- host all all 0.0.0.0/0 md5
replication:
username: replicator
password: your_replication_password
parameters:
max_connections: 100
shared_buffers: 128MB
wal_level: replica
hot_standby: "on"
max_wal_senders: 5
max_replication_slots: 5
tags:
nofailover: false
clonefrom: false
# For the primary node only, set clonefrom to true
# For replica nodes, clonefrom should be false or omitted
# Example for a replica node configuration (adjust scope and namespace if needed)
# scope: my_pg_cluster
# namespace: /service/patroni/my_pg_cluster
# ... other settings ...
# postgresql:
# ...
# clonefrom: false
# tags:
# nofailover: false
# clonefrom: false
Important Notes on Configuration:
- Replace `10.0.0.10` with your Consul server’s IP and `10.0.0.21` with the specific Patroni node’s IP.
- The `scope` defines the cluster name.
- The `namespace` is the prefix in Consul where Patroni stores its data.
- `restapi.connect_address` and `postgresql.connect_address` must be the IP address that other nodes can reach this Patroni instance on.
- Ensure the `replication` username and password are set and match what you’ll configure in PostgreSQL.
- The `pg_hba` entries are crucial for allowing replication and client connections. Adjust `0.0.0.0/0` to your specific network ranges for security.
- For the initial primary node, `clonefrom` can be `true` or omitted. For replicas, it should be `false`. Patroni handles the cloning process automatically.
Patroni Systemd Service
Create a systemd service file for Patroni (e.g., `/etc/systemd/system/patroni.service`):
[Unit] Description=Patroni High-Availability PostgreSQL After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml ExecStop=/usr/local/bin/patronictl -c /etc/patroni/patroni.yml stop Restart=on-failure RestartSec=5 [Install] WantedBy=multi-user.target
Ensure the `postgres` user has ownership of the PostgreSQL data directory and Patroni configuration.
sudo chown -R postgres:postgres /var/lib/postgresql/15/main sudo chown -R postgres:postgres /etc/patroni
Start and enable the Patroni service:
sudo systemctl daemon-reload sudo systemctl enable patroni sudo systemctl start patroni
Orchestrating the PostgreSQL Cluster
Deploy Patroni on at least three nodes. The first node to start will initialize PostgreSQL and become the primary. Subsequent nodes will detect the existing cluster in Consul, configure themselves as replicas, and attempt to clone data from the primary if necessary.
Initializing the Primary Node
When the first Patroni instance starts, it will:
- Check Consul for an existing leader.
- If none exists, it will attempt to acquire the leader lock.
- Initialize a new PostgreSQL cluster in its `data_dir`.
- Configure PostgreSQL for replication.
- Start PostgreSQL as the primary.
- Register itself as the primary in Consul.
You can verify the status using `patronictl`:
sudo patronictl -c /etc/patroni/patroni.yml list
The output should show one node as `P` (Primary).
Adding Replica Nodes
On the other nodes, after installing Patroni and configuring `patroni.yml` with their respective IPs, starting the `patroni` service will result in:
- Patroni instances detecting the primary in Consul.
- They will configure themselves as replicas.
- If `clonefrom` is `true` (or omitted on replicas), Patroni will trigger a `pg_basebackup` from the primary.
- Once the data is cloned, PostgreSQL will start as a replica.
- They will register themselves as replicas in Consul.
The `patronictl list` command should now show multiple nodes, with one `P` and the rest `R` (Replica).
Simulating Failover and Testing
Automatic failover is triggered when the primary node becomes unreachable or Patroni detects a failure. Consul plays a key role here by detecting the loss of the primary’s health check or its absence from the consensus group.
Manual Failover Trigger
To test, you can manually stop the Patroni service on the primary node:
sudo systemctl stop patroni
Within seconds, Patroni on the replica nodes will detect the primary’s unavailability. A new leader election will occur, and one of the replicas will be promoted to primary. You can observe this process in the Patroni logs (`journalctl -u patroni -f`).
After the failover, run `patronictl list` again. You should see a new primary (`P`) and the old primary (if it comes back online) will reconfigure itself as a replica (`R`).
Testing Client Connectivity
Client applications should not connect directly to a specific PostgreSQL instance. Instead, they should use a load balancer or a DNS entry that points to the *current* primary. Patroni’s REST API can be queried to determine the current primary’s IP address. A common pattern is to use HAProxy or a similar tool configured to poll Patroni’s API.
Integrating with OVHcloud Load Balancers
For seamless client access, integrate with OVHcloud’s Load Balancer service. Configure the load balancer to poll the Patroni REST API on each node to dynamically identify the primary. The load balancer’s health checks should target the PostgreSQL port (5432) on the IP address reported by Patroni as the primary.
HAProxy Configuration Example
Here’s a simplified HAProxy configuration that uses Patroni’s API to find the primary. This requires a script or a dedicated service to query Patroni and update HAProxy’s backend dynamically, or a more advanced HAProxy setup with Lua scripting.
# This is a conceptual example. A real-world implementation would involve
# dynamic backend updates or a dedicated service.
frontend pg_frontend
bind *:5432
mode tcp
default_backend pg_backend
backend pg_backend
mode tcp
balance roundrobin
# This is where dynamic configuration is needed.
# Ideally, this backend would only contain the IP of the current primary.
# Example:
# server pg_primary 10.0.0.21:5432 check port 8008 # Check Patroni API health
# server pg_replica1 10.0.0.22:5432
# server pg_replica2 10.0.0.23:5432
# A script would periodically query http://:8008/primary
# and update the HAProxy configuration or use HAProxy's runtime API.
Alternatively, you can use DNS-based service discovery if your Consul setup integrates with a DNS server (like CoreDNS or Consul DNS) that can resolve a service name to the primary’s IP.
Multi-Region Disaster Recovery Strategy
For true disaster recovery, extend this architecture across multiple OVHcloud regions. This involves:
- Deploying separate Consul clusters in each target region.
- Configuring Patroni to use the Consul cluster in its respective region.
- Implementing cross-region replication for PostgreSQL. This can be achieved using Patroni’s `replication.synchronous_mode` and `replication.synchronous_node_names` parameters, or by setting up logical replication between primaries in different regions.
- A global traffic manager (e.g., OVHcloud’s Global Load Balancing or a third-party DNS-based solution) to direct traffic to the active region’s load balancer.
In a multi-region setup, the primary in Region A handles writes. Replicas in Region A and Region B stream replication. If Region A fails, the global traffic manager redirects traffic to Region B. A Patroni instance in Region B will then be promoted to primary, potentially using a standby that was already streaming from Region A’s primary.
Security Considerations
Secure all communication channels:
- Use TLS for Consul and PostgreSQL connections.
- Restrict network access using OVHcloud Security Groups or firewall rules to only allow necessary traffic between cluster nodes and clients.
- Manage database credentials securely, avoiding hardcoding them in configuration files. Use environment variables or a secrets management system.
- Regularly audit logs for suspicious activity.
Conclusion
By combining Patroni, Consul, and a robust networking strategy within OVHcloud, you can architect a highly available PostgreSQL deployment with automatic failover. This setup significantly reduces downtime and ensures business continuity in the face of hardware failures or datacenter disruptions. The key is meticulous configuration, thorough testing, and a clear understanding of how each component contributes to the overall resilience of the system.