Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and PHP Deployments on OVH

Automated PostgreSQL Failover with Patroni and OVH Public Cloud

Achieving true high availability for PostgreSQL databases, especially in a cloud environment like OVH Public Cloud, necessitates an automated failover strategy. Manual intervention during an outage is a non-starter for production systems. Patroni emerges as a robust, open-source solution for automating PostgreSQL high availability and failover. It leverages distributed consensus systems like etcd, Consul, or ZooKeeper to manage cluster state and orchestrate failover.

This guide details the architecture and implementation of an automated failover system for PostgreSQL using Patroni on OVH Public Cloud, focusing on a three-node cluster for quorum and resilience. We’ll cover the setup of Patroni, its integration with PostgreSQL, and considerations for application-level failover.

Patroni Architecture and OVH Deployment Strategy

Our strategy involves deploying three PostgreSQL instances, each running Patroni. One instance will act as the primary, while the others will be replicas. Patroni monitors the health of the primary. If it detects a failure, it initiates a consensus-based election process among the remaining nodes to promote one of the replicas to become the new primary. This process is critical for minimizing downtime.

For the distributed consensus system, we’ll opt for etcd. OVH Public Cloud allows for the deployment of multiple virtual machines (VMs) within a private network, which is ideal for hosting both the PostgreSQL cluster and the etcd cluster. We’ll deploy etcd on separate VMs to ensure its availability is independent of the PostgreSQL nodes.

Setting up the etcd Cluster

A minimum of three etcd nodes is recommended for a fault-tolerant cluster. These nodes should reside on the same OVH private network as the PostgreSQL instances.

1. Install etcd:

On each etcd node (e.g., etcd-01, etcd-02, etcd-03), download and install etcd. The exact method may vary based on your chosen OS (e.g., Ubuntu, Debian). Using package managers is generally preferred.

For example, on Ubuntu:

sudo apt update && sudo apt install etcd-server

Configuring etcd Members

Each etcd node needs a configuration file, typically located at /etc/etcd/etcd.conf.yml. Ensure the --initial-cluster and --listen-peer-urls/--listen-client-urls are correctly set to point to the private IPs of your etcd nodes.

Example /etc/etcd/etcd.conf.yml for etcd-01 (replace IPs with actual private IPs):

name: etcd-01
data-dir: /var/lib/etcd
listen-peer-urls: http://10.10.0.1:2380
listen-client-urls: http://10.10.0.1:2379,http://127.0.0.1:2379
advertise-client-urls: http://10.10.0.1:2379
initial-advertise-peer-urls: http://10.10.0.1:2380
initial-cluster: etcd-01=http://10.10.0.1:2380,etcd-02=http://10.10.0.2:2380,etcd-03=http://10.10.0.3:2380
initial-cluster-state: new
proxy: false

Repeat this configuration for etcd-02 and etcd-03, adjusting the name and IP addresses accordingly.

Starting and Verifying etcd

Start the etcd service and enable it to start on boot:

sudo systemctl start etcd
sudo systemctl enable etcd

Verify the cluster status. On any etcd node:

ETCDCTL_API=3 etcdctl member list

You should see all three members listed. If not, check network connectivity between etcd nodes and review the configuration files and logs (journalctl -u etcd).

Setting up the PostgreSQL Cluster with Patroni

We’ll deploy three PostgreSQL instances (e.g., pg-01, pg-02, pg-03) on separate OVH VMs. Each VM will run a PostgreSQL instance and Patroni.

1. Install PostgreSQL and Patroni:

On each PostgreSQL node:

# Install PostgreSQL (e.g., PostgreSQL 14)
sudo apt update && sudo apt install postgresql postgresql-contrib

# Install Patroni and dependencies (e.g., python3-pip, python3-dev, build-essential)
sudo apt install python3-pip python3-dev build-essential
sudo pip3 install --upgrade pip
sudo pip3 install "patroni[etcd]" psycopg2-binary

Configuring Patroni

Patroni’s configuration is typically stored in a YAML file, e.g., /etc/patroni/patroni.yml. This file defines the PostgreSQL settings, etcd connection details, and replication parameters.

Example /etc/patroni/patroni.yml for pg-01 (replace IPs with actual private IPs):

# Global settings
scope: my_pg_cluster
namespace: /service/
name: pg-01 # Unique name for this node

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/14/main
  bin_dir: /usr/lib/postgresql/14/bin
  config_dir: /etc/postgresql/14/main
  pg_hba: /etc/postgresql/14/main/pg_hba.conf
  authentication:
    replication:
      username: replicator
      password: YourReplicationPassword
    superuser:
      username: postgres
      password: YourPostgresPassword
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10

# etcd configuration
etcd:
  host: 10.10.0.1:2379,10.10.0.2:2379,10.10.0.3:2379
  protocol: http

# Replication configuration
replication:
  synchronous_mode: false # Set to true for synchronous replication if latency permits
  synchronous_node_count: 1 # If synchronous_mode is true

# Tags for node identification
tags:
  nofailover: false
  clonefrom: false

Important Notes:

Replace placeholder IPs with the actual private IPs of your etcd and PostgreSQL nodes.
Ensure the scope is identical across all PostgreSQL nodes in the cluster.
The name must be unique for each Patroni instance.
Configure PostgreSQL’s pg_hba.conf to allow replication and superuser access from other nodes in the cluster and potentially from your application servers.
The replication user and password must be set up in PostgreSQL.
synchronous_mode: false is recommended for initial setup to avoid potential blocking during network partitions. Consider enabling it later if your network guarantees low latency and high reliability.

Create a similar configuration file for pg-02 and pg-03, changing only the name parameter to pg-02 and pg-03 respectively. Ensure the etcd.host list includes all etcd nodes.

Configuring PostgreSQL for Patroni

Patroni manages PostgreSQL configuration files. However, you need to ensure PostgreSQL is set up to be managed by Patroni. This often involves creating the necessary directories and ensuring the PostgreSQL service can be started and stopped by the user running Patroni.

1. Create PostgreSQL Data Directory:

sudo mkdir -p /var/lib/postgresql/14/main
sudo chown -R postgres:postgres /var/lib/postgresql/14/main

2. Configure pg_hba.conf:

Edit /etc/postgresql/14/main/pg_hba.conf on each node. Allow replication and superuser access from your private network range. For example:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
# Allow replication connections from any node in the cluster
host    replication     replicator      10.10.0.0/24            md5
# Allow superuser access from any node in the cluster
host    all             postgres        10.10.0.0/24            md5
# Allow application access (adjust as needed)
host    all             all             10.10.0.0/24            md5

3. Create Replication User:

On one of the nodes (it doesn’t matter which one initially, as Patroni will manage it), connect as the postgres user and create the replication user:

sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'YourReplicationPassword';"

Starting Patroni and PostgreSQL

Patroni can manage the PostgreSQL service. Create a systemd service file for Patroni, e.g., /etc/systemd/system/patroni.service:

[Unit]
Description=Patroni - Highly Available PostgreSQL
After=syslog.target network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecStop=/bin/kill -HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Ensure the ExecStart path points to your Patroni executable. Reload systemd, start, and enable Patroni:

sudo systemctl daemon-reload
sudo systemctl start patroni
sudo systemctl enable patroni

Patroni will now attempt to start PostgreSQL. The first node to acquire a lock in etcd will become the initial primary. Subsequent nodes will start as replicas and connect to the primary for replication.

Verifying the PostgreSQL Cluster

Check the status of the Patroni service on each node:

sudo systemctl status patroni

You can also query etcd directly to see the cluster state:

ETCDCTL_API=3 etcdctl get /service/my_pg_cluster --prefix --keys-only

This should show keys like /service/my_pg_cluster/leader and keys for each node. The node listed as the leader is the current primary.

Connect to the PostgreSQL primary and verify replication status:

SELECT * FROM pg_stat_replication;

On the replica nodes, check the replication lag:

SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();

Simulating a Failover

To test the failover mechanism, you can stop the Patroni service on the current primary node. Patroni on the other nodes will detect the primary’s unavailability and initiate a new election.

On the current primary node (e.g., pg-01):

sudo systemctl stop patroni

Monitor the logs of the other Patroni instances (journalctl -u patroni -f). Within seconds, one of the replicas should be promoted, and the cluster should elect a new leader.

After the failover, connect to the new primary and verify that it’s serving read/write traffic and that the other nodes are replicating from it.

Application-Level Failover with PHP

While Patroni handles database failover, your PHP application needs to be aware of the current primary. Simply pointing your application to a single PostgreSQL IP address will result in downtime during a failover. Here are common strategies:

1. DNS-Based Failover

Maintain a DNS record (e.g., db.yourdomain.com) that always points to the IP address of the current PostgreSQL primary. When a failover occurs, you need a mechanism to update this DNS record.

Mechanism: A script or a dedicated service monitors the Patroni cluster state (e.g., by querying etcd or Patroni’s REST API) and updates the DNS record via your DNS provider’s API (e.g., OVH’s API, Cloudflare API). This requires careful consideration of DNS TTL to minimize propagation delays.

PHP Implementation: Your PHP application connects to db.yourdomain.com. No code changes are strictly necessary if DNS propagation is fast enough.

2. Load Balancer / Proxy Layer

Deploy a load balancer or proxy (e.g., HAProxy, Nginx with TCP proxying) in front of your PostgreSQL cluster. The load balancer’s health checks can be configured to monitor the PostgreSQL primary. When the primary fails, the load balancer stops sending traffic to it and directs it to the new primary.

Mechanism: HAProxy can be configured to perform health checks against PostgreSQL. Patroni can expose an API that allows external systems to query the current primary. A custom script could update HAProxy’s configuration or use its runtime API to reconfigure it upon failover.

PHP Implementation: Your PHP application connects to the load balancer’s IP address. Similar to DNS, no application code changes are needed if the load balancer is managed correctly.

3. Application-Level Connection Pooling and Retry Logic

This is often the most practical approach for PHP applications, especially when dealing with shared hosting or environments where external DNS/LB updates are complex. You can maintain a list of all PostgreSQL node IPs in your PHP application’s configuration and implement retry logic.

PHP Configuration Example (e.g., in a config file):

<?php
return [
    'db' => [
        'hosts' => [
            '10.10.0.10', // pg-01
            '10.10.0.11', // pg-02
            '10.10.0.12', // pg-03
        ],
        'port' => 5432,
        'database' => 'your_database',
        'username' => 'app_user',
        'password' => 'YourAppPassword',
        'retry_attempts' => 3,
        'retry_delay_ms' => 1000, // 1 second
    ],
];
?>

PHP Connection Logic with Retry:

<?php
// Assuming you are using PDO
function connect_to_database(array $config) {
    $hosts = $config['db']['hosts'];
    $attempts = $config['db']['retry_attempts'];
    $delay = $config['db']['retry_delay_ms'];

    $last_exception = null;

    for ($i = 0; $i <= $attempts; $i++) {
        shuffle($hosts); // Randomize host order to try different ones
        foreach ($hosts as $host) {
            $dsn = "pgsql:host={$host};port={$config['db']['port']};dbname={$config['db']['database']}";
            try {
                $pdo = new PDO($dsn, $config['db']['username'], $config['db']['password'], [
                    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
                    PDO::ATTR_TIMEOUT => 5, // Connection timeout in seconds
                ]);
                // Optionally, check if it's a read-write connection
                // $pdo->query("SELECT pg_is_in_recovery()"); // This would return false on primary
                return $pdo;
            } catch (PDOException $e) {
                $last_exception = $e;
                // Log the failed attempt to a specific host
                error_log("Database connection failed to {$host}: " . $e->getMessage());
            }
        }
        // If all hosts failed in this iteration, wait before retrying
        if ($i < $attempts) {
            usleep($delay * 1000); // Convert ms to microseconds
        }
    }

    // If all attempts failed
    throw new RuntimeException("Failed to connect to the database after multiple attempts.", 0, $last_exception);
}

// --- Usage ---
$dbConfig = require 'config/database.php'; // Load your config

try {
    $pdo = connect_to_database($dbConfig);
    echo "Successfully connected to the database!\n";
    // Proceed with database operations
} catch (RuntimeException $e) {
    echo "Database connection error: " . $e->getMessage() . "\n";
    // Handle the error, e.g., show a maintenance page
}
?>

This approach allows the application to automatically try connecting to other available PostgreSQL nodes if the current primary is unreachable. The shuffle($hosts) ensures that on each retry, a different host order is attempted, increasing the chance of hitting the new primary quickly. The PDO::ATTR_TIMEOUT is crucial to prevent long hangs.

Monitoring and Maintenance

Continuous monitoring is essential. Key metrics to track include:

Patroni health status (via its REST API or logs).
etcd cluster health.
PostgreSQL replication lag.
Disk space on database nodes.
CPU and memory utilization.
Network connectivity between nodes.

Tools like Prometheus with the `postgres_exporter` and a Patroni exporter, or commercial solutions, can provide comprehensive monitoring. Regularly test your failover procedure (e.g., quarterly) to ensure it functions as expected and to train your operations team.

Consider implementing automated backups using tools like pg_dump or pg_basebackup, storing them off-site or in a separate OVH storage solution. Patroni itself does not handle backups.

Conclusion

Implementing automated failover for PostgreSQL on OVH Public Cloud using Patroni provides a significant boost to your application’s availability. By combining Patroni’s robust cluster management with a well-defined application-level strategy for handling database endpoint changes, you can achieve near-zero downtime during PostgreSQL node failures. Remember that high availability is a layered approach, and while Patroni handles the database, your application’s resilience is equally critical.