Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WooCommerce Deployments on OVH
Establishing PostgreSQL High Availability with Patroni and etcd
For mission-critical WooCommerce deployments, PostgreSQL’s availability is paramount. A robust disaster recovery strategy hinges on automated failover. We’ll leverage Patroni, a template for PostgreSQL HA, in conjunction with etcd for distributed configuration and leader election. This setup ensures that if the primary PostgreSQL instance fails, a replica is automatically promoted with minimal downtime.
Our architecture will consist of at least three PostgreSQL nodes, each running Patroni. etcd will also be deployed as a cluster (typically 3 or 5 nodes) for resilience. Patroni monitors the health of PostgreSQL instances and uses etcd to manage cluster state, including identifying the current primary and orchestrating failover.
Prerequisites and Setup
Ensure you have a working etcd cluster. For this example, we assume etcd is accessible at etcd.yourdomain.com:2379. Each PostgreSQL node will require Patroni installed. The installation method depends on your OS; for Debian/Ubuntu, apt install patroni postgresql-client is common. Ensure PostgreSQL is configured to allow replication and superuser access for Patroni.
Patroni Configuration (patroni.yml)
Each Patroni instance will share a similar configuration file. Key parameters include connection details for PostgreSQL, etcd, and replication settings.
# patroni.yml on each PostgreSQL node
scope: my_woocommerce_cluster
namespace: /service/
restapi:
listen: 0.0.0.0:8008
connect_address: 192.168.1.10[node_ip]:5432 # Replace with actual node IP
etcd:
host: etcd.yourdomain.com:2379
protocol: http # or https if using TLS
# PostgreSQL configuration
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.1.10[node_ip]:5432 # Replace with actual node IP
data_dir: /var/lib/postgresql/14/main # Adjust path as per your PostgreSQL installation
bin_dir: /usr/lib/postgresql/14/bin # Adjust path as per your PostgreSQL installation
pg_hba:
- host replication all replicator_user/32_bit_mask(e.g., 0/0) trust
- host replication all repl_user_ip_address/32_bit_mask(e.g., 192.168.1.0/24) trust
- host all all 0.0.0.0/0 md5 # For application access
replication:
username: replicator_user
password: your_replication_password
ssl: false # Set to true if using SSL
authentication:
superuser:
username: postgres
password: your_postgres_password
replication:
username: replicator_user
password: your_replication_password
# Tags for node identification
tags:
nofailover: false
clonefrom: false
Important Notes:
- Replace
my_woocommerce_clusterwith a unique name for your cluster. namespacein etcd should be consistent across all Patroni instances.connect_addressmust be the IP address of the specific node Patroni is running on.- Ensure the
replicator_userexists in PostgreSQL and has replication privileges. - The
pg_hba.confentries are crucial for allowing replication and application connections. Adjust IP ranges and user privileges as needed for your network security policy. data_dirandbin_dirmust point to your PostgreSQL installation’s directories.
Starting Patroni and PostgreSQL
Once patroni.yml is configured on each node, start Patroni. Patroni will then manage the lifecycle of the PostgreSQL instances. The first node to start will typically initialize itself as the primary.
# On each PostgreSQL node sudo systemctl start patroni sudo systemctl enable patroni
You can verify the cluster status using Patroni’s REST API or by inspecting etcd keys.
# Example: Check cluster status from any node curl http://localhost:8008/cluster
The output should indicate the current primary and any replicas. If a node fails to start, check Patroni logs (e.g., journalctl -u patroni -f) and PostgreSQL logs for errors.
Configuring WooCommerce for Auto-Failover
WooCommerce, by default, connects directly to a single PostgreSQL instance. To achieve auto-failover, we need a mechanism that abstracts the database endpoint and redirects traffic to the current primary. This is typically achieved using a load balancer or a virtual IP (VIP) managed by a cluster resource manager.
Option 1: Using HAProxy as a Database Proxy
HAProxy can be configured to monitor the Patroni REST API and direct traffic to the active PostgreSQL primary. This provides a single, stable endpoint for WooCommerce.
HAProxy Configuration (haproxy.cfg)
# haproxy.cfg
global
log /dev/log local0
log /dev/log local1 notice
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
listen patroni_api
bind *:8009
mode http
# This is a simplified check. A more robust check would query /primary
# and verify the leader's IP.
option httpchk GET /primary
http-check expect status 200
frontend pgsql_frontend
bind *:5433 # The port WooCommerce will connect to
mode tcp
default_backend pgsql_backend
backend pgsql_backend
mode tcp
balance roundrobin
# Use Patroni's API to discover the primary
# This requires a custom script or a more advanced HAProxy configuration
# For simplicity, we'll list potential primaries and rely on health checks.
# A more dynamic approach is recommended for production.
server pgsql1 192.168.1.101:5432 check port 8008 # Node 1 Patroni API port
server pgsql2 192.168.1.102:5432 check port 8008 # Node 2 Patroni API port
server pgsql3 192.168.1.103:5432 check port 8008 # Node 3 Patroni API port
# Health check for PostgreSQL itself
option pgsql-check user postgres
# If using a dedicated health check user:
# user health_check_user password health_check_password
# A more advanced approach using a script to query Patroni API:
# In HAProxy 2.0+, you can use 'http-request use-tag' with external scripts.
# For older versions, a dedicated service like 'patronictl' or a custom agent
# would be needed to update HAProxy's backend server list dynamically.
Dynamic Backend Configuration (Advanced):
The static backend configuration above is not ideal for true auto-failover. A more robust solution involves a script that periodically queries the Patroni API (e.g., http://localhost:8008/primary) to get the current primary’s IP and port. This script would then dynamically update HAProxy’s configuration and reload HAProxy. Alternatively, tools like patronictl discover can be integrated.
Here’s a conceptual Python script snippet to achieve this:
import requests
import subprocess
import time
import os
PATRONI_API_URL = "http://localhost:8008/primary" # Or the IP of a Patroni node
HAPROXY_CONFIG_PATH = "/etc/haproxy/haproxy.cfg"
HAPROXY_RELOAD_CMD = ["systemctl", "reload", "haproxy"]
CHECK_INTERVAL = 10 # seconds
def get_primary_info():
try:
response = requests.get(PATRONI_API_URL, timeout=5)
response.raise_for_status()
data = response.json()
return data.get("host"), data.get("port")
except requests.exceptions.RequestException as e:
print(f"Error querying Patroni API: {e}")
return None, None
def update_haproxy_config(primary_host, primary_port):
# Read the current HAProxy config
with open(HAPROXY_CONFIG_PATH, 'r') as f:
lines = f.readlines()
new_lines = []
in_backend = False
primary_found = False
for line in lines:
if line.strip().startswith("backend pgsql_backend"):
in_backend = True
new_lines.append(line)
continue
if in_backend and line.strip().startswith("server "):
# This is a server line, skip it if it's not the primary
if not line.strip().startswith(f"server pgsql_primary {primary_host}:{primary_port}"):
continue
else:
primary_found = True
new_lines.append(f"server pgsql_primary {primary_host}:{primary_port} check port 8008\n") # Keep health check
elif in_backend and not line.strip().startswith("server "):
# End of backend section or other directives
in_backend = False
new_lines.append(line)
else:
new_lines.append(line)
# If primary wasn't found (e.g., first run or config changed), add it
if in_backend and not primary_found:
new_lines.append(f"server pgsql_primary {primary_host}:{primary_port} check port 8008\n")
# Write the new config if it's different
if "".join(new_lines) != "".join(lines):
with open(HAPROXY_CONFIG_PATH, 'w') as f:
f.writelines(new_lines)
print("HAProxy config updated. Reloading...")
try:
subprocess.run(HAPROXY_RELOAD_CMD, check=True)
print("HAProxy reloaded successfully.")
except subprocess.CalledProcessError as e:
print(f"Error reloading HAProxy: {e}")
else:
print("HAProxy config unchanged.")
if __name__ == "__main__":
print("Starting HAProxy dynamic configuration updater...")
while True:
host, port = get_primary_info()
if host and port:
update_haproxy_config(host, port)
else:
print("Could not determine primary. Retrying...")
time.sleep(CHECK_INTERVAL)
This script needs to be run on the HAProxy server and configured to start on boot. The PATRONI_API_URL should point to a reliable Patroni instance. The script assumes a basic HAProxy config structure and needs adaptation for more complex setups. Ensure the HAProxy user has write permissions to the config file and can execute systemctl reload haproxy.
WooCommerce Connection String
Update your wp-config.php to point to the HAProxy listener.
// wp-config.php define( 'DB_HOST', 'localhost:5433' ); // HAProxy listener IP and port define( 'DB_NAME', 'your_woocommerce_db' ); define( 'DB_USER', 'your_db_user' ); define( 'DB_PASSWORD', 'your_db_password' ); define( 'DB_CHARSET', 'utf8mb4' ); define( 'DB_COLLATE', '' );
Option 2: Using a Virtual IP (VIP) with Keepalived
Another common approach is to use Keepalived to manage a Virtual IP address that floats between the PostgreSQL nodes. WooCommerce applications would connect to this VIP. Keepalived uses VRRP (Virtual Router Redundancy Protocol) to achieve this.
Keepalived Configuration (keepalived.conf)
On each PostgreSQL node, configure Keepalived. One node will initially hold the VIP. If that node fails, Keepalived on another node will take over the VIP.
# keepalived.conf on each PostgreSQL node
vrrp_script chk_postgres {
script "/usr/local/bin/check_postgres.sh" # Custom script to check PostgreSQL health
interval 2
weight 20 # Higher weight for primary
fall 2
rise 2
}
vrrp_instance VI_1 {
state BACKUP # Start as BACKUP, will transition to MASTER if primary is healthy
interface eth0 # Network interface to bind VIP to
virtual_router_id 51
priority 100 # Lower priority for backup nodes
advert_int 1
authentication {
auth_type PASS
auth_pass your_vrrp_password
}
virtual_ipaddress {
192.168.1.100/24 dev eth0 # The floating VIP
}
track_script {
chk_postgres
}
}
/usr/local/bin/check_postgres.sh script:
#!/bin/bash
# Checks if PostgreSQL is running and accepting connections as primary.
# This script should be adapted to check Patroni's status or query PG directly.
PG_USER="postgres"
PG_HOST="127.0.0.1"
PG_PORT="5432"
# Check if Patroni reports this node as primary
# This is a more reliable check than just checking if PG is running.
# You might need to adjust the URL based on your Patroni setup.
PATRONI_PRIMARY_URL="http://localhost:8008/primary"
if curl --fail --silent "$PATRONI_PRIMARY_URL" > /dev/null; then
# Patroni reports this node as primary, or at least it's reachable.
# Further check if PostgreSQL is actually responding.
pg_isready -h $PG_HOST -p $PG_PORT -U $PG_USER >/dev/null 2>&1
if [ $? -eq 0 ]; then
exit 0 # PostgreSQL is ready and this node is likely primary
else
exit 1 # PostgreSQL not ready
fi
else
# Patroni does not report this node as primary.
# If you want the VIP to float even if PG is down but Patroni is up,
# you might adjust this logic. For HA, we want the VIP on the active primary.
exit 1
fi
Ensure the check_postgres.sh script is executable and has the correct permissions. The weight in chk_postgres should be higher for the intended primary node(s) to ensure it gets the VIP first. The priority in vrrp_instance determines which node becomes MASTER if multiple nodes are equally healthy; higher priority wins. Ensure your_vrrp_password is strong and consistent across nodes.
WooCommerce Connection String with VIP
Configure wp-config.php to use the VIP.
// wp-config.php define( 'DB_HOST', '192.168.1.100' ); // The floating VIP define( 'DB_NAME', 'your_woocommerce_db' ); define( 'DB_USER', 'your_db_user' ); define( 'DB_PASSWORD', 'your_db_password' ); define( 'DB_CHARSET', 'utf8mb4' ); define( 'DB_COLLATE', '' );
Testing Failover Scenarios
Thorough testing is critical. Simulate failures to ensure the automated failover works as expected.
- Simulate PostgreSQL Node Failure: Stop the PostgreSQL service on the current primary node (
sudo systemctl stop postgresql). Observe Patroni’s logs and etcd state to confirm a replica is promoted. Check if HAProxy or Keepalived redirects traffic to the new primary. Test WooCommerce by performing a read and write operation. - Simulate Patroni Service Failure: Stop the Patroni service on the primary node (
sudo systemctl stop patroni). Patroni should detect this and initiate a failover. - Simulate Network Partition: Isolate a PostgreSQL node from the etcd cluster and the other PostgreSQL nodes. Observe how Patroni handles this.
- Simulate etcd Node Failure: If using an etcd cluster, stop one or more etcd nodes to test its resilience.
During failover, expect a brief period of unavailability (seconds to a minute, depending on configuration and network latency). Monitor application logs for connection errors during this window.
Monitoring and Maintenance
Continuous monitoring is essential for a reliable HA setup.
- Patroni Health: Monitor the Patroni REST API endpoints (
/cluster,/primary,/replicas) for status. - etcd Health: Monitor etcd cluster health, leader election, and key presence.
- HAProxy/Keepalived Health: Monitor HAProxy stats and Keepalived status.
- PostgreSQL Metrics: Track replication lag, connection counts, query performance, and disk I/O.
- Log Aggregation: Centralize logs from PostgreSQL, Patroni, HAProxy, and Keepalived for easier debugging.
Regularly test your failover procedures and keep your Patroni, PostgreSQL, etcd, HAProxy, and Keepalived versions up-to-date. Consider implementing automated backups and point-in-time recovery (PITR) as a complementary disaster recovery strategy.