Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Perl Deployments on OVH

Leveraging Patroni for PostgreSQL High Availability and Automated Failover

Achieving robust disaster recovery for PostgreSQL, especially in a cloud environment like OVH, necessitates an automated failover strategy. Manual intervention during an outage is a recipe for extended downtime and potential data loss. For this, we’ll architect a solution using Patroni, a template for PostgreSQL high availability with automated failover. Patroni orchestrates PostgreSQL instances, manages their lifecycle, and ensures a standby instance is promoted to primary when the current primary becomes unavailable.

Our setup will involve a distributed configuration store, such as etcd or Consul, to maintain the cluster state. For this example, we’ll use etcd due to its widespread adoption and straightforward integration. We’ll deploy three PostgreSQL nodes, one primary and two standbys, managed by Patroni. A load balancer or application-level connection routing will direct traffic to the current primary.

Prerequisites and Setup

Before diving into Patroni configuration, ensure you have:

Three OVH cloud instances (e.g., Ubuntu 22.04 LTS) provisioned with sufficient resources.
Network connectivity between these instances.
SSH access to all instances.
sudo privileges on all instances.
A running etcd cluster (at least 3 nodes recommended for HA) accessible from the PostgreSQL instances. For simplicity, we’ll assume a single etcd instance for this example, but a clustered setup is production-ready.

Install PostgreSQL and Patroni on each PostgreSQL node. We’ll use PostgreSQL 15 for this example.

Installing PostgreSQL and Patroni

On each PostgreSQL node:

sudo apt update
sudo apt install -y postgresql postgresql-contrib
sudo apt install -y python3-pip python3-venv
python3 -m venv /opt/patroni/venv
source /opt/patroni/venv/bin/activate
pip install --upgrade pip
pip install "patroni[etcd]"
deactivate

Ensure PostgreSQL is configured to listen on all interfaces (or at least the internal network interface) and that pg_hba.conf allows connections from other nodes and Patroni. For simplicity, we’ll modify postgresql.conf and pg_hba.conf later in the Patroni configuration.

Patroni Configuration for Automated Failover

Patroni’s configuration is typically provided in a YAML file. We’ll create a `patroni.yml` on each PostgreSQL node. The configuration needs to be identical across all nodes, with the exception of the `scope` parameter if you were managing multiple clusters on the same nodes (not recommended for production). The `scope` uniquely identifies the PostgreSQL cluster.

`patroni.yml` Example

Create /etc/patroni/patroni.yml on each PostgreSQL node:

# /etc/patroni/patroni.yml
scope: my_pg_cluster
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
  connect_address: &connect_address <IP_ADDRESS_OF_THIS_NODE>:8008 # Replace with actual IP

etcd:
  host: &etcd_host <ETCD_HOST_IP>:2379 # Replace with your etcd host IP
  protocol: http

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  connect_address: *connect_address
  data_dir: /var/lib/postgresql/15/main
  pg_hba:
    - host replication replicator <REPLICATION_NETWORK>/24 md5 # Replace with your replication network
    - host all all <APPLICATION_NETWORK>/24 md5 # Replace with your application network
    - host all all <INTERNAL_NETWORK>/24 md5 # Replace with your internal network
  replication:
    username: replicator
    password: &replication_password <REPLICATION_PASSWORD> # Define a strong password
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10
    archive_mode: "on"
    archive_command: "cp %p /var/lib/postgresql/wal_archive/%f" # Ensure this directory exists and is writable

# Patroni specific settings
tags:
  nofailover: false
  clonefrom: false

# Replication settings
replication_mode: async # Or sync for synchronous replication, requires more setup

# Bootstrap configuration (only for the very first node)
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
    postgresql:
      use_pg_rewind: true
      recovery_conf:
        standby_mode: "on"
        primary_conninfo: "host=%s port=5432 user=replicator password=%s dbname=postgres" # Patroni will fill %s
        # Optional: primary_slot_name: "patroni_slot_name" # If using replication slots

# HA configuration
# This section is for the primary node to manage its own health.
# For standbys, Patroni will manage their state based on the DCS.
# The following is a simplified example. For production, consider more robust checks.
# healthcheck:
#   loop_wait: 10
#   retries: 5
#   timeout: 5
#   mode: "pg_isready"
#   pg_isready_query: "SELECT 1;"

# Logging
log:
  level: INFO
  dir: /var/log/patroni
  file: patroni.log

Important Notes on the Configuration:

Replace placeholders like <IP_ADDRESS_OF_THIS_NODE>, <ETCD_HOST_IP>, <REPLICATION_PASSWORD>, and network CIDRs with your actual values.
The scope must be the same on all nodes for a single cluster.
namespace is the etcd path prefix for this cluster.
restapi.connect_address should be the IP of the node Patroni is running on.
etcd.host points to your etcd server.
postgresql.pg_hba entries are crucial. Ensure your application servers and replication users can connect.
postgresql.replication.password must be a strong, unique password.
postgresql.parameters are basic tuning parameters. Adjust based on your workload.
bootstrap.postgresql.recovery_conf.primary_conninfo is a template. Patroni will substitute the primary’s connection details.
archive_command is essential for PITR and recovery. Ensure the directory /var/lib/postgresql/wal_archive/ exists and is writable by the PostgreSQL user.
replication_mode: async is simpler but offers less data durability. sync provides higher durability but requires careful tuning and can impact performance.

Initializing the PostgreSQL Cluster

The cluster initialization is handled by Patroni. You only need to run the bootstrap process on one node initially. Patroni will start PostgreSQL, configure it as a primary, and register itself in etcd.

Bootstraping the Primary Node

On the node designated as the initial primary, ensure the patroni.yml is correctly configured. Then, start Patroni:

sudo /opt/patroni/venv/bin/patroni /etc/patroni/patroni.yml

Check the Patroni logs (/var/log/patroni/patroni.log) for successful initialization. You should see messages indicating PostgreSQL starting, becoming primary, and registering in etcd. You can verify this by querying etcd:

ETCDCTL_API=3 etcdctl --endpoints=<ETCD_HOST_IP>:2379 get /service/my_pg_cluster/leader --keys-only

If this command returns a key, Patroni has registered the leader. You can also check the Patroni REST API:

curl http://<IP_ADDRESS_OF_THIS_NODE>:8008/patroni

This should return JSON detailing the cluster state, including the leader.

Starting Standby Nodes

On the other two nodes, ensure their patroni.yml files are identical (except for the restapi.connect_address which should point to their own IP). Then, start Patroni:

sudo /opt/patroni/venv/bin/patroni /etc/patroni/patroni.yml

Patroni on the standby nodes will detect the existing leader in etcd, configure PostgreSQL to stream replication from the primary, and start the PostgreSQL instance in a standby role. Monitor the logs to confirm successful replication setup.

Automated Failover in Action

Patroni continuously monitors the health of the primary PostgreSQL instance. If the primary becomes unresponsive (e.g., due to network failure, process crash, or instance shutdown), Patroni will:

Detect the primary’s unavailability (via etcd’s leader lock expiration or health checks).
Initiate a leader election among the remaining healthy standby nodes.
The winning standby will be promoted to primary.
Patroni will reconfigure other standbys to replicate from the new primary.
The cluster state in etcd will be updated to reflect the new primary.

Simulating a Failover

To test the failover mechanism, you can gracefully shut down the primary PostgreSQL process or the Patroni process on the primary node. For a more abrupt test, you could stop the PostgreSQL service or even reboot the primary instance.

# On the current primary node:
sudo systemctl stop patroni
# OR
sudo systemctl stop postgresql

Observe the logs on the other nodes. Within a short period (defined by etcd’s TTL and Patroni’s loop_wait), one of the standbys should be promoted. You can verify the new primary using the Patroni REST API or etcdctl.

Integrating with Your Application (Perl)

Directly connecting your Perl application to a single PostgreSQL IP address is problematic during failovers. You need a mechanism to discover the current primary. Several strategies exist:

Strategy 1: DNS-based Service Discovery

Use a dynamic DNS service or a custom script that updates a DNS record (e.g., db.yourdomain.com) to point to the IP address of the current PostgreSQL primary. Your Perl application connects to this DNS name.

A simple approach involves a cron job on one of your nodes (or a separate management server) that periodically queries the Patroni API for the leader and updates DNS. For OVH, you might use their API to manage DNS records.

Strategy 2: Patroni REST API Polling (Perl Script)

A more direct method is to have your Perl application (or a dedicated connection manager) poll the Patroni REST API of any of the nodes to determine the current primary. Once identified, it establishes a connection to that primary’s IP and port.

Here’s a simplified Perl script demonstrating how to find the primary:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON;

my @patroni_nodes = (
    "http://<NODE1_IP>:8008",
    "http://<NODE2_IP>:8008",
    "http://<NODE3_IP>:8008",
);

my $ua = LWP::UserAgent->new;
$ua->timeout(5); # Timeout for API calls

my $primary_host = undef;
my $primary_port = 5432; # Default PostgreSQL port

foreach my $node_url (@patroni_nodes) {
    my $response = $ua->get("$node_url/patroni");
    if ($response->is_success) {
        my $data = decode_json($response->decoded_content);
        if ($data->{state} eq "running" && $data->{role} eq "master") {
            $primary_host = $data->{host};
            $primary_port = $data->{port};
            print "Found primary: $primary_host:$primary_port\n";
            last; # Found the primary, exit loop
        }
    } else {
        warn "Failed to get status from $node_url: " . $response->status_line;
    }
}

if (!defined $primary_host) {
    die "Could not determine PostgreSQL primary node.\n";
}

# Now you can use $primary_host and $primary_port to connect your database
# Example using DBI:
# use DBI;
# my $dsn = "dbi:Pg:host=$primary_host;port=$primary_port;database=your_db";
# my $dbh = DBI->connect($dsn, "your_user", "your_password", { RaiseError => 1 });
# ... your database operations ...
# $dbh->disconnect;

This script iterates through the known Patroni nodes, queries their REST API, and identifies the node reporting its role as “master”. This information is then used to establish a database connection. For production, you’d integrate this logic into your application’s connection management layer, perhaps using a connection pool that can re-query the primary if a connection fails.

Strategy 3: Dedicated Load Balancer/Proxy

Deploy a load balancer like HAProxy or a PostgreSQL-aware proxy like PgBouncer (though PgBouncer is more of a connection pooler and less of a direct failover manager in this context) in front of your PostgreSQL cluster. Configure the load balancer to health-check the Patroni API endpoints or PostgreSQL itself to determine the primary. HAProxy can be configured to dynamically update its backend based on Patroni’s state.

OVH Specific Considerations

When deploying on OVH, consider:

Network Segmentation: Utilize OVH’s private network capabilities to ensure secure and low-latency communication between PostgreSQL nodes, etcd, and your application servers.
Instance Sizing: Choose instance types that provide adequate CPU, RAM, and I/O for your PostgreSQL workload.
Monitoring: Integrate OVH’s monitoring tools with your Patroni and PostgreSQL metrics. Set up alerts for failover events, replication lag, and resource utilization.
Backup Strategy: Complement your HA setup with regular, off-instance backups (e.g., using pg_dump or filesystem-level snapshots) stored in a separate location.
Replication Slots: For critical applications, consider using PostgreSQL replication slots. Patroni can manage these. Replication slots prevent the primary from discarding WAL segments needed by a standby that is temporarily disconnected, ensuring no data loss upon reconnection. This requires careful management to avoid filling up disk space if a standby remains down for an extended period.

Securing Your PostgreSQL Deployment

Production deployments demand robust security. Ensure:

Firewall Rules: Restrict access to PostgreSQL (port 5432), Patroni API (port 8008), and etcd (port 2379) to only authorized IP addresses/networks.
Strong Passwords: Use strong, unique passwords for the replication user and any application users.
TLS Encryption: Configure PostgreSQL and Patroni to use TLS for encrypted communication, especially if traffic traverses public networks.
etcd Security: Secure your etcd cluster with TLS and authentication.

Conclusion

By implementing Patroni for PostgreSQL high availability and automated failover on OVH, you can significantly reduce downtime and improve the resilience of your data layer. The key is a well-configured Patroni setup, a reliable distributed configuration store like etcd, and an application-level strategy (like dynamic DNS or API polling) to ensure your Perl applications always connect to the active PostgreSQL primary. This architectural pattern provides a solid foundation for mission-critical database deployments.