Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and C Deployments on OVH

Establishing a High-Availability PostgreSQL Cluster with Automatic Failover

Achieving robust disaster recovery for PostgreSQL deployments hinges on implementing automatic failover mechanisms. This section details a production-ready architecture leveraging Patroni, a template for HA PostgreSQL, and Consul for distributed consensus, all orchestrated within the OVHcloud ecosystem. We’ll focus on a multi-region setup to ensure resilience against datacenter-level failures.

Core Components and Architecture Overview

Our strategy involves a primary PostgreSQL instance, one or more replica instances, and a distributed consensus system. Patroni manages the PostgreSQL lifecycle, including initialization, configuration, and failover. Consul provides the distributed key-value store and service discovery necessary for Patroni to elect a leader and coordinate state changes. For this example, we assume three OVHcloud Availability Zones (AZs) within a single region for initial demonstration, with the understanding that a true multi-region DR strategy would extend this across distinct OVHcloud geographic regions.

Deploying Consul for Distributed Consensus

Consul is the backbone of our automatic failover. It provides a reliable mechanism for Patroni to store cluster state and elect a leader. We’ll deploy a Consul cluster with an odd number of servers (typically 3 or 5) for quorum. For simplicity, we’ll outline the deployment of a single Consul server, but a production setup requires a clustered configuration.

First, install Consul on your chosen OVHcloud instances (e.g., Ubuntu 22.04 LTS). Ensure these instances have static IP addresses or are resolvable via DNS within your private network.

Consul Server Installation and Configuration

Download the latest Consul binary:

wget https://releases.hashicorp.com/consul/1.18.0/consul_1.18.0_linux_amd64.zip
unzip consul_1.18.0_linux_amd64.zip
sudo mv consul /usr/local/bin/

Create a systemd service file for Consul:

[Unit]
Description=Consul
Documentation=https://www.consul.io/docs
After=network.target

[Service]
ExecStart=/usr/local/bin/consul agent -server -bootstrap-expect=1 -data-dir=/opt/consul -node=consul-server-1 -bind=10.0.0.10 -client=0.0.0.0
ExecStop=/usr/local/bin/consul leave
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Create the data directory and start the service:

sudo mkdir -p /opt/consul
sudo systemctl daemon-reload
sudo systemctl enable consul
sudo systemctl start consul

For a production cluster, you would configure multiple servers with appropriate `bootstrap-expect` and `retry-join` parameters to form a quorum. The `-bind` address should be the instance’s private IP, and `-client` should be `0.0.0.0` to allow agents to connect from other nodes.

Deploying Patroni for PostgreSQL HA

Patroni simplifies the management of HA PostgreSQL clusters. It uses the `etcd`, `Consul`, or `Zookeeper` for leader election and configuration storage. We’ll use Consul as our backend.

Patroni Installation

Install Patroni and its dependencies (e.g., `python3-pip`, `python3-dev`, `build-essential`).

sudo apt update
sudo apt install -y python3-pip python3-dev build-essential
sudo pip3 install 'patroni[consul]' psycopg2-binary

Patroni Configuration

Create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`). This configuration will be used by all Patroni instances in the cluster.

scope: my_pg_cluster
namespace: /service/patroni/my_pg_cluster

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.21:8008 # IP of this Patroni node

etcd: # Using Consul as the backend
  host: 10.0.0.10:8500 # Consul server address
  protocol: http

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.21:5432 # IP of this Patroni node
  data_dir: /var/lib/postgresql/15/main
  bin_dir: /usr/lib/postgresql/15/bin
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
  replication:
    username: replicator
    password: your_replication_password
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    max_replication_slots: 5

tags:
  nofailover: false
  clonefrom: false

# For the primary node only, set clonefrom to true
# For replica nodes, clonefrom should be false or omitted

# Example for a replica node configuration (adjust scope and namespace if needed)
# scope: my_pg_cluster
# namespace: /service/patroni/my_pg_cluster
# ... other settings ...
# postgresql:
#   ...
#   clonefrom: false
# tags:
#   nofailover: false
#   clonefrom: false

Important Notes on Configuration:

Replace `10.0.0.10` with your Consul server’s IP and `10.0.0.21` with the specific Patroni node’s IP.
The `scope` defines the cluster name.
The `namespace` is the prefix in Consul where Patroni stores its data.
`restapi.connect_address` and `postgresql.connect_address` must be the IP address that other nodes can reach this Patroni instance on.
Ensure the `replication` username and password are set and match what you’ll configure in PostgreSQL.
The `pg_hba` entries are crucial for allowing replication and client connections. Adjust `0.0.0.0/0` to your specific network ranges for security.
For the initial primary node, `clonefrom` can be `true` or omitted. For replicas, it should be `false`. Patroni handles the cloning process automatically.

Patroni Systemd Service

Create a systemd service file for Patroni (e.g., `/etc/systemd/system/patroni.service`):

[Unit]
Description=Patroni High-Availability PostgreSQL
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecStop=/usr/local/bin/patronictl -c /etc/patroni/patroni.yml stop
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Ensure the `postgres` user has ownership of the PostgreSQL data directory and Patroni configuration.

sudo chown -R postgres:postgres /var/lib/postgresql/15/main
sudo chown -R postgres:postgres /etc/patroni

Start and enable the Patroni service:

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

Orchestrating the PostgreSQL Cluster

Deploy Patroni on at least three nodes. The first node to start will initialize PostgreSQL and become the primary. Subsequent nodes will detect the existing cluster in Consul, configure themselves as replicas, and attempt to clone data from the primary if necessary.

Initializing the Primary Node

When the first Patroni instance starts, it will:

Check Consul for an existing leader.
If none exists, it will attempt to acquire the leader lock.
Initialize a new PostgreSQL cluster in its `data_dir`.
Configure PostgreSQL for replication.
Start PostgreSQL as the primary.
Register itself as the primary in Consul.

You can verify the status using `patronictl`:

sudo patronictl -c /etc/patroni/patroni.yml list

The output should show one node as `P` (Primary).

Adding Replica Nodes

On the other nodes, after installing Patroni and configuring `patroni.yml` with their respective IPs, starting the `patroni` service will result in:

Patroni instances detecting the primary in Consul.
They will configure themselves as replicas.
If `clonefrom` is `true` (or omitted on replicas), Patroni will trigger a `pg_basebackup` from the primary.
Once the data is cloned, PostgreSQL will start as a replica.
They will register themselves as replicas in Consul.

The `patronictl list` command should now show multiple nodes, with one `P` and the rest `R` (Replica).

Simulating Failover and Testing

Automatic failover is triggered when the primary node becomes unreachable or Patroni detects a failure. Consul plays a key role here by detecting the loss of the primary’s health check or its absence from the consensus group.

Manual Failover Trigger

To test, you can manually stop the Patroni service on the primary node:

sudo systemctl stop patroni

Within seconds, Patroni on the replica nodes will detect the primary’s unavailability. A new leader election will occur, and one of the replicas will be promoted to primary. You can observe this process in the Patroni logs (`journalctl -u patroni -f`).

After the failover, run `patronictl list` again. You should see a new primary (`P`) and the old primary (if it comes back online) will reconfigure itself as a replica (`R`).

Testing Client Connectivity

Client applications should not connect directly to a specific PostgreSQL instance. Instead, they should use a load balancer or a DNS entry that points to the *current* primary. Patroni’s REST API can be queried to determine the current primary’s IP address. A common pattern is to use HAProxy or a similar tool configured to poll Patroni’s API.

Integrating with OVHcloud Load Balancers

For seamless client access, integrate with OVHcloud’s Load Balancer service. Configure the load balancer to poll the Patroni REST API on each node to dynamically identify the primary. The load balancer’s health checks should target the PostgreSQL port (5432) on the IP address reported by Patroni as the primary.

HAProxy Configuration Example

Here’s a simplified HAProxy configuration that uses Patroni’s API to find the primary. This requires a script or a dedicated service to query Patroni and update HAProxy’s backend dynamically, or a more advanced HAProxy setup with Lua scripting.

# This is a conceptual example. A real-world implementation would involve
# dynamic backend updates or a dedicated service.

frontend pg_frontend
    bind *:5432
    mode tcp
    default_backend pg_backend

backend pg_backend
    mode tcp
    balance roundrobin
    # This is where dynamic configuration is needed.
    # Ideally, this backend would only contain the IP of the current primary.
    # Example:
    # server pg_primary 10.0.0.21:5432 check port 8008 # Check Patroni API health
    # server pg_replica1 10.0.0.22:5432
    # server pg_replica2 10.0.0.23:5432

# A script would periodically query http://:8008/primary
# and update the HAProxy configuration or use HAProxy's runtime API.

Alternatively, you can use DNS-based service discovery if your Consul setup integrates with a DNS server (like CoreDNS or Consul DNS) that can resolve a service name to the primary’s IP.

Multi-Region Disaster Recovery Strategy

For true disaster recovery, extend this architecture across multiple OVHcloud regions. This involves:

Deploying separate Consul clusters in each target region.
Configuring Patroni to use the Consul cluster in its respective region.
Implementing cross-region replication for PostgreSQL. This can be achieved using Patroni’s `replication.synchronous_mode` and `replication.synchronous_node_names` parameters, or by setting up logical replication between primaries in different regions.
A global traffic manager (e.g., OVHcloud’s Global Load Balancing or a third-party DNS-based solution) to direct traffic to the active region’s load balancer.

In a multi-region setup, the primary in Region A handles writes. Replicas in Region A and Region B stream replication. If Region A fails, the global traffic manager redirects traffic to Region B. A Patroni instance in Region B will then be promoted to primary, potentially using a standby that was already streaming from Region A’s primary.

Security Considerations

Secure all communication channels:

Use TLS for Consul and PostgreSQL connections.
Restrict network access using OVHcloud Security Groups or firewall rules to only allow necessary traffic between cluster nodes and clients.
Manage database credentials securely, avoiding hardcoding them in configuration files. Use environment variables or a secrets management system.
Regularly audit logs for suspicious activity.

Conclusion

By combining Patroni, Consul, and a robust networking strategy within OVHcloud, you can architect a highly available PostgreSQL deployment with automatic failover. This setup significantly reduces downtime and ensures business continuity in the face of hardware failures or datacenter disruptions. The key is meticulous configuration, thorough testing, and a clear understanding of how each component contributes to the overall resilience of the system.