Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WooCommerce Deployments on Linode

Establishing a High-Availability PostgreSQL Cluster with Patroni

For a robust WooCommerce deployment, a highly available PostgreSQL database is non-negotiable. We’ll architect an automated failover solution using Patroni, a template for PostgreSQL HA. This setup will involve at least two PostgreSQL nodes and a distributed configuration store, typically etcd or Consul. For simplicity and ease of management on Linode, we’ll focus on etcd.

Our goal is to have a primary PostgreSQL instance and one or more replica instances. Patroni will manage the replication and orchestrate failover when the primary becomes unavailable. This involves configuring Patroni agents on each PostgreSQL node and ensuring they can communicate with the etcd cluster.

Setting up etcd for Configuration and Coordination

A minimal etcd cluster for this purpose can consist of three nodes for quorum. These can be provisioned on separate Linode instances or, for smaller deployments, co-located on dedicated nodes that are *not* running PostgreSQL. Ensure etcd is accessible over a secure network connection.

On each etcd node, install etcd (e.g., via `apt` or by downloading binaries). A basic configuration file (`etcd.conf.yml`) might look like this:

name: etcd-node-1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://:2380
initial-cluster: etcd-node-1=http://:2380,etcd-node-2=http://:2380,etcd-node-3=http://:2380
initial-cluster-state: new
cluster-activate-timeout: 10s

Replace `` with the actual IP addresses of your etcd nodes. Start the etcd service and verify cluster health:

sudo systemctl start etcd
sudo systemctl enable etcd
etcdctl member list

Installing and Configuring Patroni on PostgreSQL Nodes

On each Linode instance designated for PostgreSQL, install PostgreSQL and Patroni. The installation method will vary by OS. For Debian/Ubuntu:

sudo apt update
sudo apt install postgresql postgresql-contrib python3-pip
sudo pip3 install "patroni[etcd]"

Next, create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`) on each PostgreSQL node. This configuration is crucial and must be consistent across all nodes, with minor variations for node-specific IPs.

# /etc/patroni/patroni.yml on PostgreSQL Node 1
scope: my_pg_cluster
namespace: /service/

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  connect_address: :5432
  data_dir: /var/lib/postgresql/14/main # Adjust version and path as needed
  pg_hba:
    - host replication replicator /24 md5
    - host all all 0.0.0.0/0 md5 # Adjust for security in production
  replication:
    username: replicator
    password: &REPL_PASS your_replication_password
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    effective_cache_size: 768MB
    maintenance_work_mem: 64MB
    wal_level: replica
    wal_sync_method: fsync
    wal_writer_delay: 200ms
    checkpoint_completion_target: 0.9
    random_page_cost: 1.1
    effective_io_concurrency: 200
    work_mem: 16MB
    min_wal_size: 1GB
    max_wal_size: 4GB
    default_statistics_target: 100
    log_destination: stderr
    logging_collector: on
    log_directory: pg_log
    log_filename: postgresql-%Y-%m-%d.log
    log_min_duration_statement: 250ms
    log_checkpoints: on
    log_connections: on
    log_disconnections: on
    log_lock_waits: on
    log_temp_files: 0
    log_autovacuum_min_duration: 0
    log_error_verbosity: default
    log_statement: 'ddl' # Or 'all' for more verbose logging

# etcd configuration
etcd:
  hosts:
    - :2379
    - :2379
    - :2379
  protocol: http # Use https for production with TLS

# Replication configuration
replication:
  synchronous_mode: false # Set to true for synchronous replication if latency allows
  synchronous_node_count: 1 # If synchronous_mode is true

# Tags for node identification
tags:
  nofailover: false # Set to true for standby nodes that should not be promoted
  clonefrom: true # Allows cloning from another node if data_dir is empty

# REST API configuration
restapi:
  listen: 0.0.0.0:8008
  connect_address: :8008

# Logging configuration
log:
  level: INFO
  dir: /var/log/patroni
  file: patroni.log

# Callback scripts (optional, for advanced integration)
# callbacks:
#   on_role_change: /path/to/your/on_role_change.sh

Key points for the configuration:

scope: A unique name for your PostgreSQL cluster.
namespace: The etcd path where Patroni will store cluster state.
postgresql.connect_address: The IP address and port Patroni will use to connect to this PostgreSQL instance. This must be unique per node.
postgresql.pg_hba: Crucial for replication. Ensure the replicator user can connect from replica nodes.
postgresql.replication.password: The password for the replication user. Use a strong, unique password.
postgresql.parameters: Tune these for your workload. wal_level must be replica.
etcd.hosts: List of your etcd cluster endpoints.
replication.synchronous_mode: Set to true for higher data consistency guarantees, but be mindful of latency.
tags.nofailover: On replica nodes, you might set this to true if you want to manually control promotions or have specific standby nodes.
restapi.connect_address: The IP and port for Patroni’s REST API.

Create the necessary directories and set permissions:

sudo mkdir -p /etc/patroni
sudo chown -R postgres:postgres /etc/patroni
sudo mkdir -p /var/log/patroni
sudo chown -R postgres:postgres /var/log/patroni
sudo mkdir -p /var/lib/postgresql/14/main # Ensure this matches your data_dir
sudo chown -R postgres:postgres /var/lib/postgresql/14/main

Create a systemd service file for Patroni (e.g., `/etc/systemd/system/patroni.service`):

[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
KillMode=process
TimeoutSec=30

[Install]
WantedBy=multi-user.target

Reload systemd, enable, and start Patroni on all PostgreSQL nodes:

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

Verifying the HA Cluster and Failover Mechanism

After starting Patroni on all nodes, the cluster should self-organize. You can check the status via the Patroni REST API or by inspecting etcd.

Check etcd for cluster state:

etcdctl get /service/my_pg_cluster/state --prefix --keys-only

You should see entries like:

/service/my_pg_cluster/cluster
/service/my_pg_cluster/leader
/service/my_pg_cluster/members/node1_uuid/state
/service/my_pg_cluster/members/node2_uuid/state
...

The /leader key will indicate which node is currently the primary. You can also query the REST API:

curl http://:8008/cluster

This will return a JSON object detailing the cluster state, including the primary and replica nodes.

Simulating a Failover

To test the automatic failover, stop the Patroni service on the current primary node:

sudo systemctl stop patroni

Monitor the logs on the other PostgreSQL nodes. Patroni will detect the primary’s absence, and one of the replicas will be promoted. This process typically takes 30-60 seconds, depending on your etcd and PostgreSQL configurations.

After a successful failover, verify the new primary using the REST API or etcd. You can then restart Patroni on the original primary node; it should rejoin the cluster as a replica.

Architecting WooCommerce for High Availability with PostgreSQL

Connecting WooCommerce to a highly available PostgreSQL cluster requires a slight shift in how your application connects. Instead of a direct IP to a single database server, WooCommerce needs to connect to a virtual IP or a load balancer that always points to the *current* primary PostgreSQL instance.

Using a Virtual IP (VIP) with Keepalived

A common and effective method is to use Keepalived. Keepalived provides a Virtual Router Redundancy Protocol (VRRP) implementation, allowing a floating IP address to be managed by a cluster of servers. In our case, this cluster will be the PostgreSQL nodes themselves.

Install Keepalived on each PostgreSQL node:

sudo apt update
sudo apt install keepalived

Configure Keepalived. The configuration file is typically `/etc/keepalived/keepalived.conf`. You’ll need to define a VRRP instance that manages the VIP. This configuration should be identical on all PostgreSQL nodes, with the exception of the `priority` and `state` parameters.

# /etc/keepalived/keepalived.conf on PostgreSQL Node 1 (Primary Candidate)
vrrp_script pg_check {
    script "/usr/local/bin/check_pg_primary.sh"
    interval 2
    weight 20
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER # Set to BACKUP on other nodes
    interface eth0 # Adjust to your primary network interface
    virtual_router_id 51
    priority 150 # Higher priority for MASTER, lower for BACKUP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_vrrp_password
    }
    virtual_ipaddress {
        /24 dev eth0 # The floating IP for your database
    }
    track_script {
        pg_check
    }
}

Explanation:

vrrp_script pg_check: This defines a script that Keepalived will run periodically to check the health of the PostgreSQL primary.
vrrp_instance VI_1: Defines the VRRP instance.
state MASTER/BACKUP: Determines which node initially owns the VIP.
priority: Higher value means higher priority. The node with the highest priority and a healthy check script will own the VIP.
virtual_router_id: A unique identifier for this VRRP group.
virtual_ipaddress: The floating IP address that applications will connect to.
track_script: Tells Keepalived to monitor the health of the PostgreSQL primary via the script.

Create the health check script (`/usr/local/bin/check_pg_primary.sh`):

#!/bin/bash

# Check if this node is the primary according to Patroni
PATRONI_API_URL="http://localhost:8008"
IS_PRIMARY=$(curl -s "$PATRONI_API_URL/primary" | jq -r '.state // "unknown"')

if [ "$IS_PRIMARY" == "true" ]; then
    # Check if PostgreSQL is actually running and accepting connections
    PG_PORT=$(grep -A 1 'listen:' /etc/patroni/patroni.yml | tail -n 1 | awk '{print $2}')
    PG_PORT=${PG_PORT:-5432} # Default to 5432 if not found
    pg_isready -h 127.0.0.1 -p $PG_PORT -q
    if [ $? -eq 0 ]; then
        exit 0 # PostgreSQL is primary and healthy
    else
        exit 1 # PostgreSQL is not healthy
    fi
else
    exit 1 # Not the primary
fi

Make the script executable:

sudo chmod +x /usr/local/bin/check_pg_primary.sh

Ensure `jq` is installed for parsing JSON output from `curl`.

Restart Keepalived on all PostgreSQL nodes:

sudo systemctl restart keepalived

Verify that the VIP is assigned to the current primary node. You can check this using `ip addr show eth0` (replace `eth0` with your interface). When the primary fails, Keepalived should automatically move the VIP to a healthy replica.

Configuring WooCommerce Database Connection

In your WooCommerce `wp-config.php` file, update the database connection details to use the virtual IP address:

define( 'DB_HOST', ':5432' );
define( 'DB_NAME', 'your_database_name' );
define( 'DB_USER', 'your_woocommerce_db_user' );
define( 'DB_PASSWORD', 'your_woocommerce_db_password' );
define( 'DB_CHARSET', 'utf8mb4' );
define( 'DB_COLLATE', '' );

WooCommerce will now connect to the VIP. When a failover occurs, and the VIP moves to a new primary, WooCommerce will automatically connect to the new primary without any application-level changes, provided the database user and password remain the same.

Alternative: Load Balancer Approach

For more complex scenarios or if you prefer a dedicated load balancing solution, you could place a load balancer (e.g., HAProxy, Nginx Plus, or a cloud provider’s managed load balancer) in front of your PostgreSQL cluster. The load balancer would need to be configured to:

Monitor the health of PostgreSQL instances.
Direct traffic only to the current primary.
Update its routing rules dynamically as Patroni promotes new primaries.

This often involves integrating the load balancer with Patroni’s API or using a custom script that polls Patroni and reconfigures the load balancer. While more flexible, it adds another layer of complexity and potential failure points compared to the Keepalived VIP approach for simpler setups.

Monitoring and Maintenance

A robust HA setup requires continuous monitoring. Key metrics to track include:

PostgreSQL replication lag.
Patroni and etcd cluster health.
Keepalived status and VIP ownership.
Resource utilization (CPU, memory, disk I/O) on all nodes.
Network latency between nodes.

Tools like Prometheus with PostgreSQL and etcd exporters, along with Grafana for visualization, are invaluable. Set up alerts for any deviations from normal operating parameters.

Regularly test your failover procedures. Schedule maintenance windows to perform manual failovers, test upgrades, and ensure your disaster recovery plan remains effective. Documenting the entire process, including rollback procedures, is critical for operational readiness.