Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WooCommerce Deployments on Linode
Establishing a High-Availability PostgreSQL Cluster with Patroni
For a robust WooCommerce deployment, a highly available PostgreSQL database is non-negotiable. We’ll architect an automated failover solution using Patroni, a template for PostgreSQL HA. This setup will involve at least two PostgreSQL nodes and a distributed configuration store, typically etcd or Consul. For simplicity and ease of management on Linode, we’ll focus on etcd.
Our goal is to have a primary PostgreSQL instance and one or more replica instances. Patroni will manage the replication and orchestrate failover when the primary becomes unavailable. This involves configuring Patroni agents on each PostgreSQL node and ensuring they can communicate with the etcd cluster.
Setting up etcd for Configuration and Coordination
A minimal etcd cluster for this purpose can consist of three nodes for quorum. These can be provisioned on separate Linode instances or, for smaller deployments, co-located on dedicated nodes that are *not* running PostgreSQL. Ensure etcd is accessible over a secure network connection.
On each etcd node, install etcd (e.g., via `apt` or by downloading binaries). A basic configuration file (`etcd.conf.yml`) might look like this:
name: etcd-node-1 data-dir: /var/lib/etcd listen-client-urls: http://0.0.0.0:2379 advertise-client-urls: http://:2379 listen-peer-urls: http://0.0.0.0:2380 initial-advertise-peer-urls: http:// :2380 initial-cluster: etcd-node-1=http:// :2380,etcd-node-2=http:// :2380,etcd-node-3=http:// :2380 initial-cluster-state: new cluster-activate-timeout: 10s
Replace `
sudo systemctl start etcd sudo systemctl enable etcd etcdctl member list
Installing and Configuring Patroni on PostgreSQL Nodes
On each Linode instance designated for PostgreSQL, install PostgreSQL and Patroni. The installation method will vary by OS. For Debian/Ubuntu:
sudo apt update sudo apt install postgresql postgresql-contrib python3-pip sudo pip3 install "patroni[etcd]"
Next, create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`) on each PostgreSQL node. This configuration is crucial and must be consistent across all nodes, with minor variations for node-specific IPs.
# /etc/patroni/patroni.yml on PostgreSQL Node 1 scope: my_pg_cluster namespace: /service/ # PostgreSQL configuration postgresql: listen: 0.0.0.0:5432 connect_address::5432 data_dir: /var/lib/postgresql/14/main # Adjust version and path as needed pg_hba: - host replication replicator /24 md5 - host all all 0.0.0.0/0 md5 # Adjust for security in production replication: username: replicator password: &REPL_PASS your_replication_password parameters: max_connections: 100 shared_buffers: 256MB effective_cache_size: 768MB maintenance_work_mem: 64MB wal_level: replica wal_sync_method: fsync wal_writer_delay: 200ms checkpoint_completion_target: 0.9 random_page_cost: 1.1 effective_io_concurrency: 200 work_mem: 16MB min_wal_size: 1GB max_wal_size: 4GB default_statistics_target: 100 log_destination: stderr logging_collector: on log_directory: pg_log log_filename: postgresql-%Y-%m-%d.log log_min_duration_statement: 250ms log_checkpoints: on log_connections: on log_disconnections: on log_lock_waits: on log_temp_files: 0 log_autovacuum_min_duration: 0 log_error_verbosity: default log_statement: 'ddl' # Or 'all' for more verbose logging # etcd configuration etcd: hosts: - :2379 - :2379 - :2379 protocol: http # Use https for production with TLS # Replication configuration replication: synchronous_mode: false # Set to true for synchronous replication if latency allows synchronous_node_count: 1 # If synchronous_mode is true # Tags for node identification tags: nofailover: false # Set to true for standby nodes that should not be promoted clonefrom: true # Allows cloning from another node if data_dir is empty # REST API configuration restapi: listen: 0.0.0.0:8008 connect_address: :8008 # Logging configuration log: level: INFO dir: /var/log/patroni file: patroni.log # Callback scripts (optional, for advanced integration) # callbacks: # on_role_change: /path/to/your/on_role_change.sh
Key points for the configuration:
scope: A unique name for your PostgreSQL cluster.namespace: The etcd path where Patroni will store cluster state.postgresql.connect_address: The IP address and port Patroni will use to connect to this PostgreSQL instance. This must be unique per node.postgresql.pg_hba: Crucial for replication. Ensure thereplicatoruser can connect from replica nodes.postgresql.replication.password: The password for the replication user. Use a strong, unique password.postgresql.parameters: Tune these for your workload.wal_levelmust bereplica.etcd.hosts: List of your etcd cluster endpoints.replication.synchronous_mode: Set totruefor higher data consistency guarantees, but be mindful of latency.tags.nofailover: On replica nodes, you might set this totrueif you want to manually control promotions or have specific standby nodes.restapi.connect_address: The IP and port for Patroni’s REST API.
Create the necessary directories and set permissions:
sudo mkdir -p /etc/patroni sudo chown -R postgres:postgres /etc/patroni sudo mkdir -p /var/log/patroni sudo chown -R postgres:postgres /var/log/patroni sudo mkdir -p /var/lib/postgresql/14/main # Ensure this matches your data_dir sudo chown -R postgres:postgres /var/lib/postgresql/14/main
Create a systemd service file for Patroni (e.g., `/etc/systemd/system/patroni.service`):
[Unit] Description=Patroni PostgreSQL High-Availability After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure KillMode=process TimeoutSec=30 [Install] WantedBy=multi-user.target
Reload systemd, enable, and start Patroni on all PostgreSQL nodes:
sudo systemctl daemon-reload sudo systemctl enable patroni sudo systemctl start patroni
Verifying the HA Cluster and Failover Mechanism
After starting Patroni on all nodes, the cluster should self-organize. You can check the status via the Patroni REST API or by inspecting etcd.
Check etcd for cluster state:
etcdctl get /service/my_pg_cluster/state --prefix --keys-only
You should see entries like:
/service/my_pg_cluster/cluster /service/my_pg_cluster/leader /service/my_pg_cluster/members/node1_uuid/state /service/my_pg_cluster/members/node2_uuid/state ...
The /leader key will indicate which node is currently the primary. You can also query the REST API:
curl http://:8008/cluster
This will return a JSON object detailing the cluster state, including the primary and replica nodes.
Simulating a Failover
To test the automatic failover, stop the Patroni service on the current primary node:
sudo systemctl stop patroni
Monitor the logs on the other PostgreSQL nodes. Patroni will detect the primary’s absence, and one of the replicas will be promoted. This process typically takes 30-60 seconds, depending on your etcd and PostgreSQL configurations.
After a successful failover, verify the new primary using the REST API or etcd. You can then restart Patroni on the original primary node; it should rejoin the cluster as a replica.
Architecting WooCommerce for High Availability with PostgreSQL
Connecting WooCommerce to a highly available PostgreSQL cluster requires a slight shift in how your application connects. Instead of a direct IP to a single database server, WooCommerce needs to connect to a virtual IP or a load balancer that always points to the *current* primary PostgreSQL instance.
Using a Virtual IP (VIP) with Keepalived
A common and effective method is to use Keepalived. Keepalived provides a Virtual Router Redundancy Protocol (VRRP) implementation, allowing a floating IP address to be managed by a cluster of servers. In our case, this cluster will be the PostgreSQL nodes themselves.
Install Keepalived on each PostgreSQL node:
sudo apt update sudo apt install keepalived
Configure Keepalived. The configuration file is typically `/etc/keepalived/keepalived.conf`. You’ll need to define a VRRP instance that manages the VIP. This configuration should be identical on all PostgreSQL nodes, with the exception of the `priority` and `state` parameters.
# /etc/keepalived/keepalived.conf on PostgreSQL Node 1 (Primary Candidate)
vrrp_script pg_check {
script "/usr/local/bin/check_pg_primary.sh"
interval 2
weight 20
fall 2
rise 2
}
vrrp_instance VI_1 {
state MASTER # Set to BACKUP on other nodes
interface eth0 # Adjust to your primary network interface
virtual_router_id 51
priority 150 # Higher priority for MASTER, lower for BACKUP
advert_int 1
authentication {
auth_type PASS
auth_pass your_vrrp_password
}
virtual_ipaddress {
/24 dev eth0 # The floating IP for your database
}
track_script {
pg_check
}
}
Explanation:
vrrp_script pg_check: This defines a script that Keepalived will run periodically to check the health of the PostgreSQL primary.vrrp_instance VI_1: Defines the VRRP instance.state MASTER/BACKUP: Determines which node initially owns the VIP.priority: Higher value means higher priority. The node with the highest priority and a healthy check script will own the VIP.virtual_router_id: A unique identifier for this VRRP group.virtual_ipaddress: The floating IP address that applications will connect to.track_script: Tells Keepalived to monitor the health of the PostgreSQL primary via the script.
Create the health check script (`/usr/local/bin/check_pg_primary.sh`):
#!/bin/bash
# Check if this node is the primary according to Patroni
PATRONI_API_URL="http://localhost:8008"
IS_PRIMARY=$(curl -s "$PATRONI_API_URL/primary" | jq -r '.state // "unknown"')
if [ "$IS_PRIMARY" == "true" ]; then
# Check if PostgreSQL is actually running and accepting connections
PG_PORT=$(grep -A 1 'listen:' /etc/patroni/patroni.yml | tail -n 1 | awk '{print $2}')
PG_PORT=${PG_PORT:-5432} # Default to 5432 if not found
pg_isready -h 127.0.0.1 -p $PG_PORT -q
if [ $? -eq 0 ]; then
exit 0 # PostgreSQL is primary and healthy
else
exit 1 # PostgreSQL is not healthy
fi
else
exit 1 # Not the primary
fi
Make the script executable:
sudo chmod +x /usr/local/bin/check_pg_primary.sh
Ensure `jq` is installed for parsing JSON output from `curl`.
Restart Keepalived on all PostgreSQL nodes:
sudo systemctl restart keepalived
Verify that the VIP is assigned to the current primary node. You can check this using `ip addr show eth0` (replace `eth0` with your interface). When the primary fails, Keepalived should automatically move the VIP to a healthy replica.
Configuring WooCommerce Database Connection
In your WooCommerce `wp-config.php` file, update the database connection details to use the virtual IP address:
define( 'DB_HOST', ':5432' ); define( 'DB_NAME', 'your_database_name' ); define( 'DB_USER', 'your_woocommerce_db_user' ); define( 'DB_PASSWORD', 'your_woocommerce_db_password' ); define( 'DB_CHARSET', 'utf8mb4' ); define( 'DB_COLLATE', '' );
WooCommerce will now connect to the VIP. When a failover occurs, and the VIP moves to a new primary, WooCommerce will automatically connect to the new primary without any application-level changes, provided the database user and password remain the same.
Alternative: Load Balancer Approach
For more complex scenarios or if you prefer a dedicated load balancing solution, you could place a load balancer (e.g., HAProxy, Nginx Plus, or a cloud provider’s managed load balancer) in front of your PostgreSQL cluster. The load balancer would need to be configured to:
- Monitor the health of PostgreSQL instances.
- Direct traffic only to the current primary.
- Update its routing rules dynamically as Patroni promotes new primaries.
This often involves integrating the load balancer with Patroni’s API or using a custom script that polls Patroni and reconfigures the load balancer. While more flexible, it adds another layer of complexity and potential failure points compared to the Keepalived VIP approach for simpler setups.
Monitoring and Maintenance
A robust HA setup requires continuous monitoring. Key metrics to track include:
- PostgreSQL replication lag.
- Patroni and etcd cluster health.
- Keepalived status and VIP ownership.
- Resource utilization (CPU, memory, disk I/O) on all nodes.
- Network latency between nodes.
Tools like Prometheus with PostgreSQL and etcd exporters, along with Grafana for visualization, are invaluable. Set up alerts for any deviations from normal operating parameters.
Regularly test your failover procedures. Schedule maintenance windows to perform manual failovers, test upgrades, and ensure your disaster recovery plan remains effective. Documenting the entire process, including rollback procedures, is critical for operational readiness.