Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WordPress Deployments on OVH
Leveraging PostgreSQL Streaming Replication for High Availability
Achieving automated failover for PostgreSQL requires a robust replication strategy. We’ll focus on setting up synchronous streaming replication, which guarantees that a transaction is committed on both the primary and at least one replica before acknowledging the commit to the client. This minimizes data loss during a failover event. Our deployment will be on OVH’s Public Cloud, utilizing their managed PostgreSQL service or self-managed instances.
The core components are the primary PostgreSQL server and one or more replica servers. For automated failover, we’ll need a mechanism to monitor the primary’s health and orchestrate the promotion of a replica.
Configuring PostgreSQL for Streaming Replication
On the primary PostgreSQL server, modify the postgresql.conf and pg_hba.conf files. Ensure WAL (Write-Ahead Logging) is configured for archiving or replication.
postgresql.conf (Primary)
wal_level = replica wal_sync_method = fsync wal_buffers = 16MB max_wal_senders = 5 wal_keep_segments = 64 # Adjust based on expected write volume and network latency archive_mode = on archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Example: local archive, adjust for network storage
The archive_command is crucial for recovery if a replica falls too far behind or for setting up Point-In-Time Recovery (PITR). For synchronous replication, synchronous_commit and synchronous_standby_names are key.
postgresql.conf (Replica)
hot_standby = on max_standby_streaming_delay = 30s # Adjust based on acceptable lag wal_receiver_status_interval = 10s hot_standby_feedback = on
pg_hba.conf (Primary)
# TYPE DATABASE USER ADDRESS METHOD host replication repl_user/32 scram-sha-256 # Or md5 if scram is not supported host replication repl_user /32 scram-sha-256 host all all 0.0.0.0/0 scram-sha-256 # For application access, adjust as needed
Create the replication user and grant necessary privileges.
-- On primary CREATE ROLE repl_user WITH REPLICATION LOGIN PASSWORD 'your_secure_password'; GRANT CONNECT ON DATABASE your_database TO repl_user; -- If using specific databases for replication, grant access -- GRANT CONNECT ON DATABASE another_db TO repl_user;
Setting up the Replica
Before starting the replica, it needs to be initialized from a base backup of the primary. Use pg_basebackup.
# Stop PostgreSQL on the replica if it's running sudo systemctl stop postgresql # Ensure data directory is empty or removed sudo rm -rf /var/lib/postgresql/14/main/* # Adjust path for your PostgreSQL version and data directory # Perform base backup sudo -u postgres pg_basebackup -h-p 5432 -U repl_user -D /var/lib/postgresql/14/main -Fp -Xs -P -R # -h: primary host # -p: primary port # -U: replication user # -D: data directory on replica # -Fp: plain format # -Xs: stream WAL # -P: progress # -R: create recovery configuration files (standby.signal and postgresql.auto.conf)
The -R option is convenient as it creates standby.signal and appends settings to postgresql.auto.conf for recovery. If not using -R, you’d manually create standby.signal and configure recovery.conf (for older versions) or postgresql.conf (for newer versions) with primary_conninfo and restore_command.
Implementing Synchronous Replication
On the primary, update postgresql.conf to define the synchronous replication setup. The synchronous_standby_names parameter specifies which replicas must confirm a transaction before it’s committed.
postgresql.conf (Primary – Synchronous Settings)
synchronous_commit = remote_write # or remote_apply for stricter consistency synchronous_standby_names = 'replica1,replica2' # Or '1 (replica1, replica2)' for quorum, or 'ANY 1 (replica1, replica2)' for at least one
synchronous_commit = remote_write means the primary waits for the replica to receive the WAL records and flush them to disk. remote_apply means it waits for the replica to apply the transaction. The latter offers stronger consistency but higher latency.
The synchronous_standby_names can be configured in several ways:
'replica1': Onlyreplica1must confirm.'1 (replica1, replica2)': A quorum of 1 out of 2 replicas must confirm.'ANY 1 (replica1, replica2)': At least 1 of the listed replicas must confirm.'FIRST 2 (replica1, replica2)': The first 2 replicas in the list must confirm.
Restart PostgreSQL on both primary and replica for changes to take effect.
Automated Failover with Patroni
Manual failover is error-prone and slow. Patroni is an open-source template for PostgreSQL high availability that manages replication, failover, and configuration. It uses a distributed configuration store (like etcd, Consul, or ZooKeeper) to maintain cluster state.
Setting up etcd
For this example, we’ll use etcd. Deploy a small etcd cluster (at least 3 nodes for production). On OVH, you can deploy VMs and install etcd.
# Example etcd installation on Ubuntu sudo apt update sudo apt install etcd # Configure etcd (e.g., /etc/etcd/etcd.conf.yml) # Ensure it's clustered and accessible by Patroni nodes. # For simplicity, a single node etcd for testing: ETCD_NAME=etcd0 ETCD_DATA_DIR="/var/lib/etcd" ETCD_LISTEN_PEER_URLS="http://:2380" ETCD_LISTEN_CLIENT_URLS="http:// :2379,http://127.0.0.1:2379" ETCD_INITIAL_ADVERTISE_PEER_URLS="http:// :2380" ETCD_ADVERTISE_CLIENT_URLS="http:// :2379" ETCD_INITIAL_CLUSTER="etcd0=http:// :2380" ETCD_INITIAL_CLUSTER_STATE="new" ETCD_DISCOVERY="" ETCD_LOG_PACKAGE_LEVEL="info" # Start etcd sudo systemctl enable etcd sudo systemctl start etcd
Installing and Configuring Patroni
Install Patroni and its dependencies (e.g., psycopg2 for Python). We’ll run Patroni on each PostgreSQL node.
# Install Patroni (using pip is common) sudo apt install python3-pip python3-dev build-essential libpq-dev sudo pip3 install psycopg2-binary 'patroni[etcd]' # Create a configuration file for Patroni (e.g., /etc/patroni/patroni.yml)
patroni.yml (Example)
scope: my_pg_cluster # Unique identifier for this cluster
namespace: /service/ # Base path in etcd
# PostgreSQL configuration
postgresql:
listen: 0.0.0.0:5432
data_dir: /var/lib/postgresql/14/main # Ensure this matches your PG data dir
bin_dir: /usr/lib/postgresql/14/bin # Adjust path
pg_hba:
- host replication replicator 0.0.0.0/0 scram-sha-256
- host all all 0.0.0.0/0 scram-sha-256
replication:
username: replicator
password: your_replication_password
parameters:
max_connections: 100
shared_buffers: 256MB
effective_cache_size: 768MB
maintenance_work_mem: 64MB
synchronous_commit: remote_write
synchronous_standby_names: '1 (replica1,replica2)' # Example, adjust as needed
# Distributed configuration store (etcd)
etcd:
hosts:
- :2379
- :2379
- :2379
protocol: http # or https if using TLS
# Replication settings for Patroni
replication:
synchronous_mode: true
synchronous_node_count: 1 # Number of synchronous replicas required
# REST API for monitoring and control
restapi:
listen: 0.0.0.0:8008
connect_address: :8008
# Tags for identifying nodes
tags:
nofailover: false
clonefrom: false
Ensure the replicator user exists and has replication privileges. Patroni will manage the pg_hba.conf based on its configuration, so ensure it’s writable by the PostgreSQL user.
Running Patroni as a Service
Create a systemd service file for Patroni.
# /etc/systemd/system/patroni.service [Unit] Description=Patroni PostgreSQL High-Availability After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure RestartSec=5s [Install] WantedBy=multi-user.target
Enable and start the service on all PostgreSQL nodes.
sudo systemctl daemon-reload sudo systemctl enable patroni sudo systemctl start patroni
WordPress Integration and Load Balancing
WordPress applications need to connect to the *current* primary PostgreSQL instance. This requires a dynamic way to discover the primary’s IP address. A common approach is to use a load balancer or a proxy that can be updated when a failover occurs.
Using HAProxy for Dynamic Backend Updates
HAProxy can monitor the PostgreSQL cluster’s health via Patroni’s REST API and automatically direct traffic to the active primary. We’ll configure HAProxy to check the Patroni API for the primary’s status.
HAProxy Configuration
# /etc/haproxy/haproxy.cfg
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
listen pg_cluster
bind *:5432
mode tcp
option tcp-check
# Use a script to dynamically fetch the primary IP from Patroni
tcp-check connect port 5432
tcp-check send GET_PRIMARY\r\n
tcp-check expect ! rcvfail
# This is a simplified example. A more robust solution would involve a script
# that queries Patroni's API and updates HAProxy's backend list.
# For dynamic backend updates, HAProxy can be reloaded or use its API.
# A common pattern is to have a script that periodically queries Patroni
# and updates a backend list file that HAProxy reads, or uses the HAProxy API.
# Example using a script to update backends (requires HAProxy 1.8+ for stats socket scripting)
# Or, use a separate script that reloads HAProxy configuration.
# For simplicity, let's assume a static list and rely on Patroni's health checks.
# A better approach is to use the HAProxy API or a dynamic configuration generator.
# Example using a script to update the backend list dynamically:
# The script would query Patroni's API, get the primary IP, and then
# use the HAProxy stats socket to add/remove backends.
# For a more direct approach, let's configure health checks that rely on Patroni's API.
# This requires a custom health check script.
# Alternative: Use a dedicated PostgreSQL proxy like pgpool-II or ProxySQL,
# which have built-in support for PostgreSQL HA and can be configured to
# query Patroni or etcd.
# For HAProxy, a common pattern is to have a script that:
# 1. Queries Patroni API for the primary node.
# 2. Generates a new HAProxy configuration snippet with the primary as a backend.
# 3. Reloads HAProxy.
# Example of a script that could be run by cron:
# (See below for script example)
# For demonstration, let's assume a single primary and two replicas,
# and HAProxy will try to connect to them, relying on Patroni to ensure
# only the primary is healthy.
server pg_primary :5432 check port 5432 inter 2s fall 3 rise 2
server pg_replica1 :5432 check port 5432 inter 2s fall 3 rise 2
server pg_replica2 :5432 check port 5432 inter 2s fall 3 rise 2
# Example of a script to dynamically update HAProxy backends
# This script would be run by cron or a similar scheduler.
# It queries Patroni's REST API to find the primary and then updates HAProxy.
# Script: /usr/local/bin/update_pg_haproxy.sh
#!/bin/bash
PATRONI_API_URL="http://:8008/primary"
HAPROXY_CONFIG_FILE="/etc/haproxy/conf.d/pg_backends.cfg"
HAPROXY_RELOAD_CMD="sudo systemctl reload haproxy"
# Get primary IP from Patroni API
PRIMARY_IP=$(curl -s "$PATRONI_API_URL" | jq -r '.host')
if [ -z "$PRIMARY_IP" ]; then
echo "Could not retrieve primary IP from Patroni."
exit 1
fi
# Generate new backend configuration
echo "server pg_primary $PRIMARY_IP:5432 check port 5432 inter 2s fall 3 rise 2" > "$HAPROXY_CONFIG_FILE"
echo "server pg_replica1 :5432 check port 5432 inter 2s fall 3 rise 2" >> "$HAPROXY_CONFIG_FILE"
echo "server pg_replica2 :5432 check port 5432 inter 2s fall 3 rise 2" >> "$HAPROXY_CONFIG_FILE"
echo "Updated HAProxy backends to point to $PRIMARY_IP"
# Reload HAProxy
$HAPROXY_RELOAD_CMD
Ensure the jq utility is installed for JSON parsing. The /etc/haproxy/conf.d/pg_backends.cfg file should be included in your main haproxy.cfg using an include directive.
WordPress Configuration
In your wp-config.php, point the database connection to the HAProxy listener.
// wp-config.php define( 'DB_HOST', '127.0.0.1:5432' ); // Or the IP of your HAProxy instance if not on the same server define( 'DB_NAME', 'your_wordpress_db' ); define( 'DB_USER', 'your_wordpress_user' ); define( 'DB_PASSWORD', 'your_wordpress_password' ); define( 'DB_CHARSET', 'utf8' ); define( 'DB_COLLATE', '' );
When a failover occurs, Patroni promotes a replica. The health check script will detect the change in primary status (e.g., by querying /primary endpoint on Patroni) and update HAProxy to point to the new primary. WordPress will then seamlessly connect to the new primary.
Testing the Failover
To test the automated failover:
- Gracefully stop the primary PostgreSQL process on the current primary node (e.g.,
sudo systemctl stop postgresql). - Monitor Patroni’s logs and etcd for failover events.
- Observe HAProxy’s logs and the
pg_backends.cfgfile to confirm it has updated to point to the new primary. - Verify WordPress can still connect and operate normally.
- To restore the original primary, you’ll need to reconfigure it as a replica of the new primary and then potentially trigger another failover or re-promote it after it has caught up.
For more advanced scenarios, consider using tools like ProxySQL or pgpool-II, which offer more sophisticated connection pooling, query routing, and HA management features specifically for PostgreSQL.