Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WordPress Deployments on DigitalOcean

Establishing a Highly Available PostgreSQL Cluster with Patroni

Achieving true disaster recovery for a critical application like WordPress hinges on a robust, self-healing database layer. For PostgreSQL, the de facto standard for high-availability (HA) and automated failover is Patroni. Patroni leverages a distributed consensus store (like etcd, Consul, or ZooKeeper) to manage cluster state, leader election, and failover orchestration. We’ll focus on etcd for this example due to its widespread adoption and ease of deployment on DigitalOcean.

Our goal is to set up a PostgreSQL cluster with at least two read-write nodes and one or more read-only replicas. Patroni will monitor the primary node and automatically promote a replica if the primary becomes unavailable.

Deploying etcd Cluster

A minimum of three etcd nodes is recommended for a production-ready distributed consensus store. Deploy these as separate Droplets on DigitalOcean. Ensure they can communicate with each other over a private network for optimal performance and security.

On each etcd node, install etcd. For Ubuntu, this typically involves downloading the binary or using a package manager.

Example etcd Configuration (etcd1.example.com)

Create an etcd configuration file (e.g., /etc/etcd/etcd.conf.yml) on each etcd node. The configuration will differ slightly for each node, primarily in the --initial-advertise-peer-urls and --listen-peer-urls parameters.

name: etcd1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://:2379
listen-peer-urls: http://:2380
initial-advertise-peer-urls: http://:2380
initial-cluster: etcd1=http://:2380,etcd2=http://:2380,etcd3=http://:2380
initial-cluster-token: my-etcd-cluster-token
initial-cluster-state: new

Start and enable the etcd service:

sudo systemctl start etcd
sudo systemctl enable etcd

Verify the etcd cluster health:

ETCDCTL_API=3 etcdctl member list
ETCDCTL_API=3 etcdctl endpoint health

Deploying PostgreSQL with Patroni

On each of your PostgreSQL Droplets, install PostgreSQL and Patroni. Ensure these Droplets are also on the same private network.

Install PostgreSQL (e.g., version 14):

sudo apt update
sudo apt install postgresql postgresql-contrib

Install Patroni and its dependencies (e.g., python3-pip, python3-venv):

sudo apt install python3-pip python3-venv
pip3 install --upgrade pip
pip3 install patroni[etcd] psycopg2-binary

Patroni Configuration

Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration defines how Patroni interacts with PostgreSQL and etcd.

# /etc/patroni/patroni.yml
scope: wordpress_db_cluster # Unique name for this cluster
namespace: /service/ # Base path in etcd for this cluster

etcd:
  host: :2379 # Point to one of your etcd nodes
  protocol: http

postgresql:
  listen: 0.0.0.0:5432
  connect_address: :5432 # This will be specific to each node
  data_dir: /var/lib/postgresql/14/main
  pg_hba:
    - host    all   all   0.0.0.0/0   md5
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    effective_cache_size: 768MB
    maintenance_work_mem: 64MB
    wal_level: replica
    wal_sync_method: fsync
    wal_writer_delay: 200ms
    wal_buffers: 16MB
    backend_flush_after: 256kB
    checkpoint_timeout: 5min
    max_wal_size: 1GB
    default_statistics_target: 100
    random_page_cost: 1.1
    effective_io_concurrency: 200
    work_mem: 16MB
    min_wal_size: 80MB
    max_worker_processes: 8
    shared_preload_libraries: pg_stat_statements
    auto_explain.log_min_duration: 250ms
    log_line_prefix: '%t [%p]: '
    log_statement: 'ddl'

replication:
  username: replicator
  password: your_replication_password
  network: 10.0.0.0/8 # Adjust to your private network CIDR

bootstrap:
  method: pg_basebackup
  initdb:
    - encoding: UTF8
    - locale: en_US.UTF-8
    - data-checksums
  pg_basebackup:
    command: pg_basebackup -D /var/lib/postgresql/14/main -h %H -p %P -U %u -v -P -R
    recovery_conf:
      standby_mode: 'on'
      primary_conninfo: 'host=%H port=%P user=%u password=%p sslmode=prefer'
      restore_command: 'cp /path/to/wal/%f %p' # Placeholder, will be managed by Patroni

tags:
  nofailover: false
  clonefrom: false

Important Notes on Configuration:

Replace <NODE_PRIVATE_IP> with the actual private IP of the Droplet.
The etcd.host should point to one of your etcd nodes. Patroni will discover the rest.
The pg_hba configuration allows all connections. In production, restrict this to your WordPress application servers.
wal_level must be set to replica.
The replication section defines credentials for replication. Ensure these are strong and unique.
The bootstrap.pg_basebackup.recovery_conf.restore_command is a placeholder. Patroni manages WAL archiving and retrieval for replicas.
tags.nofailover: false and tags.clonefrom: false are crucial for enabling automatic failover and cloning.

Systemd Service for Patroni

Create a systemd service file (e.g., /etc/systemd/system/patroni.service) to manage the Patroni process.

[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
KillMode=process
TimeoutSec=30

[Install]
WantedBy=multi-user.target

Start and enable the Patroni service on all PostgreSQL nodes:

sudo systemctl daemon-reload
sudo systemctl start patroni
sudo systemctl enable patroni

Initiating the Cluster and Verifying HA

When Patroni starts on the first node, it will attempt to bootstrap a new PostgreSQL cluster. Subsequent nodes will join as replicas. You can monitor the process via Patroni logs (e.g., journalctl -u patroni -f).

To verify the cluster state, use patronictl:

patronictl -c /etc/patroni/patroni.yml list

The output should show your cluster, the primary node, and any replicas. For example:

+ Cluster: wordpress_db_cluster (XXXXXXXXXXXXXX)
+         Members:
+             node1.example.com (XXXXXXXXXXXXXX) - Leader
+             node2.example.com (XXXXXXXXXXXXXX) - Replica
+             node3.example.com (XXXXXXXXXXXXXX) - Replica

Simulating a Failover

To test the failover mechanism, stop the Patroni service on the current leader node:

sudo systemctl stop patroni

Observe the patronictl list output. Within a short period (typically 30-60 seconds, depending on etcd and Patroni timeouts), one of the replicas should be promoted to primary. You can also simulate a node failure by shutting down the PostgreSQL Droplet.

Once a new primary is elected, restart the Patroni service on the original leader node. Patroni will detect that it’s no longer the leader and reconfigure itself as a replica of the new primary.

Architecting WordPress for High Availability with PostgreSQL

WordPress itself is stateless, making it relatively easy to achieve HA. The primary challenge lies in ensuring that all WordPress instances can connect to the *current* primary PostgreSQL instance, even after a failover. This requires a mechanism to abstract the database endpoint.

Database Connection Abstraction

Several strategies can be employed:

DNS-based Failover: Use a DNS provider that supports health checks and automated record updates. Point your WordPress wp-config.php to a DNS name that resolves to the current primary’s IP. This is often the simplest approach but can have DNS propagation delays.
Load Balancer with Health Checks: Deploy a load balancer (e.g., HAProxy, Nginx Plus, or DigitalOcean’s Managed Load Balancer) in front of your PostgreSQL cluster. Configure it to monitor the health of the primary node and automatically direct traffic to a healthy replica when the primary fails.
Virtual IP (VIP) with Keepalived: A more traditional approach involves using Keepalived to manage a floating IP address that is always assigned to the current primary PostgreSQL server.

For this guide, we’ll focus on the Load Balancer approach using HAProxy, as it offers more flexibility and control.

Deploying HAProxy for PostgreSQL Load Balancing

Deploy a dedicated Droplet for HAProxy. Ensure it has network connectivity to all your PostgreSQL nodes.

Install HAProxy:

sudo apt update
sudo apt install haproxy

HAProxy Configuration

Edit the HAProxy configuration file (/etc/haproxy/haproxy.cfg).

global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    tcp
    option  tcplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000

frontend pgsql_frontend
    bind *:5432
    mode tcp
    default_backend pgsql_backend

backend pgsql_backend
    mode tcp
    balance roundrobin
    option httpchk GET /
    # Health check for PostgreSQL. Patroni exposes a health check endpoint.
    # We'll use a simple TCP check here, but a more robust check could query Patroni's API.
    option pgsql-check user postgres # Use a dedicated replication user with minimal privileges
    # Replace with your PostgreSQL node private IPs and ports
    server pg_node1 :5432 check port 5432 inter 2s fall 3 rise 2
    server pg_node2 :5432 check port 5432 inter 2s fall 3 rise 2
    server pg_node3 :5432 check port 5432 inter 2s fall 3 rise 2

listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth admin:your_haproxy_stats_password

Explanation:

The frontend pgsql_frontend listens on port 5432 and forwards traffic to the pgsql_backend.
The backend pgsql_backend uses balance roundrobin. While round-robin is used for initial distribution, HAProxy’s health checks are key.
option pgsql-check user postgres attempts a basic PostgreSQL connection. For more advanced checks, you could integrate with Patroni’s REST API to determine the leader.
check port 5432 inter 2s fall 3 rise 2 configures health checks. HAProxy will ping each server every 2 seconds, consider it down after 3 failures, and back up after 2 successes.
The listen stats section provides a web interface for monitoring HAProxy’s status.

Restart and enable HAProxy:

sudo systemctl restart haproxy
sudo systemctl enable haproxy

Configuring WordPress

Update your WordPress wp-config.php file to point to the HAProxy IP address and port.

<?php
// ... other WordPress configurations

define( 'DB_HOST', ':5432' ); // Use the IP of your HAProxy Droplet
define( 'DB_USER', 'wordpress_user' );
define( 'DB_PASSWORD', 'your_wordpress_db_password' );
define( 'DB_NAME', 'wordpress_db' );

// ... rest of wp-config.php
?>

Ensure that the WordPress database user (wordpress_user in this example) exists and has the necessary privileges on the PostgreSQL database.

Deploying WordPress Application Servers

For WordPress itself, you’ll want multiple web server instances (e.g., Nginx or Apache) running your WordPress application. These should be configured to serve your site and connect to the HAProxy endpoint for database access.

Web Server Configuration (Nginx Example)

On each WordPress web server Droplet, configure Nginx to serve your WordPress files. The key is that all these instances share the same database configuration pointing to HAProxy.

server {
    listen 80;
    server_name yourdomain.com www.yourdomain.com;
    root /var/www/html/your-wordpress-site; # Path to your WordPress installation
    index index.php index.html index.htm;

    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust PHP-FPM version
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
    }

    # Deny access to sensitive files
    location ~ /\.ht {
        deny all;
    }

    # Add caching headers for static assets if desired
    location ~* \.(css|js|jpg|jpeg|png|gif|ico|svg|webp)$ {
        expires 1y;
        add_header Cache-Control "public";
    }
}

Ensure PHP-FPM is installed and configured correctly.

Load Balancing WordPress Instances

To handle incoming web traffic and distribute it across your WordPress web servers, deploy another load balancer (e.g., DigitalOcean’s Managed Load Balancer or another HAProxy instance). This load balancer will direct traffic to your Nginx/Apache Droplets.

This setup ensures that if one WordPress web server fails, traffic is automatically routed to the remaining healthy instances. The database connection remains stable via the PostgreSQL HAProxy endpoint.

Monitoring and Maintenance

A robust disaster recovery strategy includes continuous monitoring. Key areas to monitor:

etcd Cluster Health: Ensure all etcd members are healthy and responsive.
Patroni Status: Monitor Patroni logs for any errors or warnings related to cluster management or PostgreSQL. Use patronictl list regularly.
PostgreSQL Replication Lag: While Patroni manages failover, significant replication lag can still impact user experience. Monitor pg_stat_replication on the primary.
HAProxy Health Checks: Monitor the HAProxy stats page for backend server status.
Application Logs: Keep an eye on WordPress and web server logs for any unusual activity or errors.
DigitalOcean Monitoring: Utilize DigitalOcean’s built-in Droplet and resource monitoring for CPU, memory, disk I/O, and network traffic.

Automated Backups

While HA provides resilience against node failures, it does not protect against data corruption or accidental deletion. Implement a robust backup strategy for your PostgreSQL database. Tools like pg_dump, pg_basebackup with WAL archiving, or managed backup solutions can be used. Ensure backups are stored off-site or in a separate availability zone.

For WordPress files, regular backups of your web root directory are also essential.

Conclusion

By combining Patroni for PostgreSQL high availability and automated failover with a load-balanced WordPress deployment, you can architect a resilient system on DigitalOcean. This setup minimizes downtime and ensures your application remains available even in the face of infrastructure failures. Continuous monitoring and regular testing of your failover procedures are paramount to maintaining confidence in your disaster recovery capabilities.