Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WordPress Deployments on OVH

Leveraging PostgreSQL Streaming Replication for High Availability

Achieving automated failover for PostgreSQL requires a robust replication strategy. We’ll focus on setting up synchronous streaming replication, which guarantees that a transaction is committed on both the primary and at least one replica before acknowledging the commit to the client. This minimizes data loss during a failover event. Our deployment will be on OVH’s Public Cloud, utilizing their managed PostgreSQL service or self-managed instances.

The core components are the primary PostgreSQL server and one or more replica servers. For automated failover, we’ll need a mechanism to monitor the primary’s health and orchestrate the promotion of a replica.

Configuring PostgreSQL for Streaming Replication

On the primary PostgreSQL server, modify the postgresql.conf and pg_hba.conf files. Ensure WAL (Write-Ahead Logging) is configured for archiving or replication.

`postgresql.conf` (Primary)

wal_level = replica
wal_sync_method = fsync
wal_buffers = 16MB
max_wal_senders = 5
wal_keep_segments = 64 # Adjust based on expected write volume and network latency
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Example: local archive, adjust for network storage

The archive_command is crucial for recovery if a replica falls too far behind or for setting up Point-In-Time Recovery (PITR). For synchronous replication, synchronous_commit and synchronous_standby_names are key.

`postgresql.conf` (Replica)

hot_standby = on
max_standby_streaming_delay = 30s # Adjust based on acceptable lag
wal_receiver_status_interval = 10s
hot_standby_feedback = on

`pg_hba.conf` (Primary)

# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     repl_user       /32         scram-sha-256 # Or md5 if scram is not supported
host    replication     repl_user       /32       scram-sha-256
host    all             all             0.0.0.0/0               scram-sha-256 # For application access, adjust as needed

Create the replication user and grant necessary privileges.

-- On primary
CREATE ROLE repl_user WITH REPLICATION LOGIN PASSWORD 'your_secure_password';
GRANT CONNECT ON DATABASE your_database TO repl_user;
-- If using specific databases for replication, grant access
-- GRANT CONNECT ON DATABASE another_db TO repl_user;

Setting up the Replica

Before starting the replica, it needs to be initialized from a base backup of the primary. Use pg_basebackup.

# Stop PostgreSQL on the replica if it's running
sudo systemctl stop postgresql

# Ensure data directory is empty or removed
sudo rm -rf /var/lib/postgresql/14/main/* # Adjust path for your PostgreSQL version and data directory

# Perform base backup
sudo -u postgres pg_basebackup -h  -p 5432 -U repl_user -D /var/lib/postgresql/14/main -Fp -Xs -P -R

# -h: primary host
# -p: primary port
# -U: replication user
# -D: data directory on replica
# -Fp: plain format
# -Xs: stream WAL
# -P: progress
# -R: create recovery configuration files (standby.signal and postgresql.auto.conf)

The -R option is convenient as it creates standby.signal and appends settings to postgresql.auto.conf for recovery. If not using -R, you’d manually create standby.signal and configure recovery.conf (for older versions) or postgresql.conf (for newer versions) with primary_conninfo and restore_command.

Implementing Synchronous Replication

On the primary, update postgresql.conf to define the synchronous replication setup. The synchronous_standby_names parameter specifies which replicas must confirm a transaction before it’s committed.

`postgresql.conf` (Primary – Synchronous Settings)

synchronous_commit = remote_write # or remote_apply for stricter consistency
synchronous_standby_names = 'replica1,replica2' # Or '1 (replica1, replica2)' for quorum, or 'ANY 1 (replica1, replica2)' for at least one

synchronous_commit = remote_write means the primary waits for the replica to receive the WAL records and flush them to disk. remote_apply means it waits for the replica to apply the transaction. The latter offers stronger consistency but higher latency.

The synchronous_standby_names can be configured in several ways:

'replica1': Only replica1 must confirm.
'1 (replica1, replica2)': A quorum of 1 out of 2 replicas must confirm.
'ANY 1 (replica1, replica2)': At least 1 of the listed replicas must confirm.
'FIRST 2 (replica1, replica2)': The first 2 replicas in the list must confirm.

Restart PostgreSQL on both primary and replica for changes to take effect.

Automated Failover with Patroni

Manual failover is error-prone and slow. Patroni is an open-source template for PostgreSQL high availability that manages replication, failover, and configuration. It uses a distributed configuration store (like etcd, Consul, or ZooKeeper) to maintain cluster state.

Setting up etcd

For this example, we’ll use etcd. Deploy a small etcd cluster (at least 3 nodes for production). On OVH, you can deploy VMs and install etcd.

# Example etcd installation on Ubuntu
sudo apt update
sudo apt install etcd

# Configure etcd (e.g., /etc/etcd/etcd.conf.yml)
# Ensure it's clustered and accessible by Patroni nodes.
# For simplicity, a single node etcd for testing:
ETCD_NAME=etcd0
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://:2380"
ETCD_LISTEN_CLIENT_URLS="http://:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://:2379"
ETCD_INITIAL_CLUSTER="etcd0=http://:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DISCOVERY=""
ETCD_LOG_PACKAGE_LEVEL="info"

# Start etcd
sudo systemctl enable etcd
sudo systemctl start etcd

Installing and Configuring Patroni

Install Patroni and its dependencies (e.g., psycopg2 for Python). We’ll run Patroni on each PostgreSQL node.

# Install Patroni (using pip is common)
sudo apt install python3-pip python3-dev build-essential libpq-dev
sudo pip3 install psycopg2-binary 'patroni[etcd]'

# Create a configuration file for Patroni (e.g., /etc/patroni/patroni.yml)

`patroni.yml` (Example)

scope: my_pg_cluster # Unique identifier for this cluster
namespace: /service/ # Base path in etcd

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/14/main # Ensure this matches your PG data dir
  bin_dir: /usr/lib/postgresql/14/bin # Adjust path
  pg_hba:
    - host    replication     replicator      0.0.0.0/0       scram-sha-256
    - host    all             all             0.0.0.0/0       scram-sha-256
  replication:
    username: replicator
    password: your_replication_password
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    effective_cache_size: 768MB
    maintenance_work_mem: 64MB
    synchronous_commit: remote_write
    synchronous_standby_names: '1 (replica1,replica2)' # Example, adjust as needed

# Distributed configuration store (etcd)
etcd:
  hosts:
    - :2379
    - :2379
    - :2379
  protocol: http # or https if using TLS

# Replication settings for Patroni
replication:
  synchronous_mode: true
  synchronous_node_count: 1 # Number of synchronous replicas required

# REST API for monitoring and control
restapi:
  listen: 0.0.0.0:8008
  connect_address: :8008

# Tags for identifying nodes
tags:
  nofailover: false
  clonefrom: false

Ensure the replicator user exists and has replication privileges. Patroni will manage the pg_hba.conf based on its configuration, so ensure it’s writable by the PostgreSQL user.

Running Patroni as a Service

Create a systemd service file for Patroni.

# /etc/systemd/system/patroni.service
[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable and start the service on all PostgreSQL nodes.

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

WordPress Integration and Load Balancing

WordPress applications need to connect to the *current* primary PostgreSQL instance. This requires a dynamic way to discover the primary’s IP address. A common approach is to use a load balancer or a proxy that can be updated when a failover occurs.

Using HAProxy for Dynamic Backend Updates

HAProxy can monitor the PostgreSQL cluster’s health via Patroni’s REST API and automatically direct traffic to the active primary. We’ll configure HAProxy to check the Patroni API for the primary’s status.

HAProxy Configuration

# /etc/haproxy/haproxy.cfg

global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
daemon

defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http

listen pg_cluster
bind *:5432
mode tcp
option tcp-check
# Use a script to dynamically fetch the primary IP from Patroni
tcp-check connect port 5432
tcp-check send GET_PRIMARY\r\n
tcp-check expect ! rcvfail
# This is a simplified example. A more robust solution would involve a script
# that queries Patroni's API and updates HAProxy's backend list.
# For dynamic backend updates, HAProxy can be reloaded or use its API.
# A common pattern is to have a script that periodically queries Patroni
# and updates a backend list file that HAProxy reads, or uses the HAProxy API.

# Example using a script to update backends (requires HAProxy 1.8+ for stats socket scripting)
# Or, use a separate script that reloads HAProxy configuration.

# For simplicity, let's assume a static list and rely on Patroni's health checks.
# A better approach is to use the HAProxy API or a dynamic configuration generator.

# Example using a script to update the backend list dynamically:
# The script would query Patroni's API, get the primary IP, and then
# use the HAProxy stats socket to add/remove backends.

# For a more direct approach, let's configure health checks that rely on Patroni's API.
# This requires a custom health check script.

# Alternative: Use a dedicated PostgreSQL proxy like pgpool-II or ProxySQL,
# which have built-in support for PostgreSQL HA and can be configured to
# query Patroni or etcd.

# For HAProxy, a common pattern is to have a script that:
# 1. Queries Patroni API for the primary node.
# 2. Generates a new HAProxy configuration snippet with the primary as a backend.
# 3. Reloads HAProxy.

# Example of a script that could be run by cron:
# (See below for script example)

# For demonstration, let's assume a single primary and two replicas,
# and HAProxy will try to connect to them, relying on Patroni to ensure
# only the primary is healthy.
server pg_primary :5432 check port 5432 inter 2s fall 3 rise 2
server pg_replica1 :5432 check port 5432 inter 2s fall 3 rise 2
server pg_replica2 :5432 check port 5432 inter 2s fall 3 rise 2

# Example of a script to dynamically update HAProxy backends
# This script would be run by cron or a similar scheduler.
# It queries Patroni's REST API to find the primary and then updates HAProxy.

# Script: /usr/local/bin/update_pg_haproxy.sh
#!/bin/bash

PATRONI_API_URL="http://:8008/primary"
HAPROXY_CONFIG_FILE="/etc/haproxy/conf.d/pg_backends.cfg"
HAPROXY_RELOAD_CMD="sudo systemctl reload haproxy"

# Get primary IP from Patroni API
PRIMARY_IP=$(curl -s "$PATRONI_API_URL" | jq -r '.host')

if [ -z "$PRIMARY_IP" ]; then
echo "Could not retrieve primary IP from Patroni."
exit 1
fi

# Generate new backend configuration
echo "server pg_primary $PRIMARY_IP:5432 check port 5432 inter 2s fall 3 rise 2" > "$HAPROXY_CONFIG_FILE"
echo "server pg_replica1 :5432 check port 5432 inter 2s fall 3 rise 2" >> "$HAPROXY_CONFIG_FILE"
echo "server pg_replica2 :5432 check port 5432 inter 2s fall 3 rise 2" >> "$HAPROXY_CONFIG_FILE"

echo "Updated HAProxy backends to point to $PRIMARY_IP"

# Reload HAProxy
$HAPROXY_RELOAD_CMD

Ensure the jq utility is installed for JSON parsing. The /etc/haproxy/conf.d/pg_backends.cfg file should be included in your main haproxy.cfg using an include directive.

WordPress Configuration

In your wp-config.php, point the database connection to the HAProxy listener.

// wp-config.php
define( 'DB_HOST', '127.0.0.1:5432' ); // Or the IP of your HAProxy instance if not on the same server
define( 'DB_NAME', 'your_wordpress_db' );
define( 'DB_USER', 'your_wordpress_user' );
define( 'DB_PASSWORD', 'your_wordpress_password' );
define( 'DB_CHARSET', 'utf8' );
define( 'DB_COLLATE', '' );

When a failover occurs, Patroni promotes a replica. The health check script will detect the change in primary status (e.g., by querying /primary endpoint on Patroni) and update HAProxy to point to the new primary. WordPress will then seamlessly connect to the new primary.

Testing the Failover

To test the automated failover:

Gracefully stop the primary PostgreSQL process on the current primary node (e.g., sudo systemctl stop postgresql).
Monitor Patroni’s logs and etcd for failover events.
Observe HAProxy’s logs and the pg_backends.cfg file to confirm it has updated to point to the new primary.
Verify WordPress can still connect and operate normally.
To restore the original primary, you’ll need to reconfigure it as a replica of the new primary and then potentially trigger another failover or re-promote it after it has caught up.

For more advanced scenarios, consider using tools like ProxySQL or pgpool-II, which offer more sophisticated connection pooling, query routing, and HA management features specifically for PostgreSQL.

Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WordPress Deployments on OVH

Leveraging PostgreSQL Streaming Replication for High Availability

Configuring PostgreSQL for Streaming Replication

postgresql.conf (Primary)

postgresql.conf (Replica)

pg_hba.conf (Primary)