Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and WordPress Deployments on DigitalOcean
Establishing a Highly Available PostgreSQL Cluster with Patroni
Achieving true disaster recovery for a critical application like WordPress hinges on a robust, self-healing database layer. For PostgreSQL, the de facto standard for high-availability (HA) and automated failover is Patroni. Patroni leverages a distributed consensus store (like etcd, Consul, or ZooKeeper) to manage cluster state, leader election, and failover orchestration. We’ll focus on etcd for this example due to its widespread adoption and ease of deployment on DigitalOcean.
Our goal is to set up a PostgreSQL cluster with at least two read-write nodes and one or more read-only replicas. Patroni will monitor the primary node and automatically promote a replica if the primary becomes unavailable.
Deploying etcd Cluster
A minimum of three etcd nodes is recommended for a production-ready distributed consensus store. Deploy these as separate Droplets on DigitalOcean. Ensure they can communicate with each other over a private network for optimal performance and security.
On each etcd node, install etcd. For Ubuntu, this typically involves downloading the binary or using a package manager.
Example etcd Configuration (etcd1.example.com)
Create an etcd configuration file (e.g., /etc/etcd/etcd.conf.yml) on each etcd node. The configuration will differ slightly for each node, primarily in the --initial-advertise-peer-urls and --listen-peer-urls parameters.
name: etcd1 data-dir: /var/lib/etcd listen-client-urls: http://0.0.0.0:2379 advertise-client-urls: http://:2379 listen-peer-urls: http:// :2380 initial-advertise-peer-urls: http:// :2380 initial-cluster: etcd1=http:// :2380,etcd2=http:// :2380,etcd3=http:// :2380 initial-cluster-token: my-etcd-cluster-token initial-cluster-state: new
Start and enable the etcd service:
sudo systemctl start etcd sudo systemctl enable etcd
Verify the etcd cluster health:
ETCDCTL_API=3 etcdctl member list ETCDCTL_API=3 etcdctl endpoint health
Deploying PostgreSQL with Patroni
On each of your PostgreSQL Droplets, install PostgreSQL and Patroni. Ensure these Droplets are also on the same private network.
Install PostgreSQL (e.g., version 14):
sudo apt update sudo apt install postgresql postgresql-contrib
Install Patroni and its dependencies (e.g., python3-pip, python3-venv):
sudo apt install python3-pip python3-venv pip3 install --upgrade pip pip3 install patroni[etcd] psycopg2-binary
Patroni Configuration
Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration defines how Patroni interacts with PostgreSQL and etcd.
# /etc/patroni/patroni.yml scope: wordpress_db_cluster # Unique name for this cluster namespace: /service/ # Base path in etcd for this cluster etcd: host::2379 # Point to one of your etcd nodes protocol: http postgresql: listen: 0.0.0.0:5432 connect_address: :5432 # This will be specific to each node data_dir: /var/lib/postgresql/14/main pg_hba: - host all all 0.0.0.0/0 md5 parameters: max_connections: 100 shared_buffers: 256MB effective_cache_size: 768MB maintenance_work_mem: 64MB wal_level: replica wal_sync_method: fsync wal_writer_delay: 200ms wal_buffers: 16MB backend_flush_after: 256kB checkpoint_timeout: 5min max_wal_size: 1GB default_statistics_target: 100 random_page_cost: 1.1 effective_io_concurrency: 200 work_mem: 16MB min_wal_size: 80MB max_worker_processes: 8 shared_preload_libraries: pg_stat_statements auto_explain.log_min_duration: 250ms log_line_prefix: '%t [%p]: ' log_statement: 'ddl' replication: username: replicator password: your_replication_password network: 10.0.0.0/8 # Adjust to your private network CIDR bootstrap: method: pg_basebackup initdb: - encoding: UTF8 - locale: en_US.UTF-8 - data-checksums pg_basebackup: command: pg_basebackup -D /var/lib/postgresql/14/main -h %H -p %P -U %u -v -P -R recovery_conf: standby_mode: 'on' primary_conninfo: 'host=%H port=%P user=%u password=%p sslmode=prefer' restore_command: 'cp /path/to/wal/%f %p' # Placeholder, will be managed by Patroni tags: nofailover: false clonefrom: false
Important Notes on Configuration:
- Replace
<NODE_PRIVATE_IP>with the actual private IP of the Droplet. - The
etcd.hostshould point to one of your etcd nodes. Patroni will discover the rest. - The
pg_hbaconfiguration allows all connections. In production, restrict this to your WordPress application servers. wal_levelmust be set toreplica.- The
replicationsection defines credentials for replication. Ensure these are strong and unique. - The
bootstrap.pg_basebackup.recovery_conf.restore_commandis a placeholder. Patroni manages WAL archiving and retrieval for replicas. tags.nofailover: falseandtags.clonefrom: falseare crucial for enabling automatic failover and cloning.
Systemd Service for Patroni
Create a systemd service file (e.g., /etc/systemd/system/patroni.service) to manage the Patroni process.
[Unit] Description=Patroni PostgreSQL High-Availability After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure KillMode=process TimeoutSec=30 [Install] WantedBy=multi-user.target
Start and enable the Patroni service on all PostgreSQL nodes:
sudo systemctl daemon-reload sudo systemctl start patroni sudo systemctl enable patroni
Initiating the Cluster and Verifying HA
When Patroni starts on the first node, it will attempt to bootstrap a new PostgreSQL cluster. Subsequent nodes will join as replicas. You can monitor the process via Patroni logs (e.g., journalctl -u patroni -f).
To verify the cluster state, use patronictl:
patronictl -c /etc/patroni/patroni.yml list
The output should show your cluster, the primary node, and any replicas. For example:
+ Cluster: wordpress_db_cluster (XXXXXXXXXXXXXX) + Members: + node1.example.com (XXXXXXXXXXXXXX) - Leader + node2.example.com (XXXXXXXXXXXXXX) - Replica + node3.example.com (XXXXXXXXXXXXXX) - Replica
Simulating a Failover
To test the failover mechanism, stop the Patroni service on the current leader node:
sudo systemctl stop patroni
Observe the patronictl list output. Within a short period (typically 30-60 seconds, depending on etcd and Patroni timeouts), one of the replicas should be promoted to primary. You can also simulate a node failure by shutting down the PostgreSQL Droplet.
Once a new primary is elected, restart the Patroni service on the original leader node. Patroni will detect that it’s no longer the leader and reconfigure itself as a replica of the new primary.
Architecting WordPress for High Availability with PostgreSQL
WordPress itself is stateless, making it relatively easy to achieve HA. The primary challenge lies in ensuring that all WordPress instances can connect to the *current* primary PostgreSQL instance, even after a failover. This requires a mechanism to abstract the database endpoint.
Database Connection Abstraction
Several strategies can be employed:
- DNS-based Failover: Use a DNS provider that supports health checks and automated record updates. Point your WordPress
wp-config.phpto a DNS name that resolves to the current primary’s IP. This is often the simplest approach but can have DNS propagation delays. - Load Balancer with Health Checks: Deploy a load balancer (e.g., HAProxy, Nginx Plus, or DigitalOcean’s Managed Load Balancer) in front of your PostgreSQL cluster. Configure it to monitor the health of the primary node and automatically direct traffic to a healthy replica when the primary fails.
- Virtual IP (VIP) with Keepalived: A more traditional approach involves using Keepalived to manage a floating IP address that is always assigned to the current primary PostgreSQL server.
For this guide, we’ll focus on the Load Balancer approach using HAProxy, as it offers more flexibility and control.
Deploying HAProxy for PostgreSQL Load Balancing
Deploy a dedicated Droplet for HAProxy. Ensure it has network connectivity to all your PostgreSQL nodes.
Install HAProxy:
sudo apt update sudo apt install haproxy
HAProxy Configuration
Edit the HAProxy configuration file (/etc/haproxy/haproxy.cfg).
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
stats timeout 30s
user haproxy
group haproxy
daemon
defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
frontend pgsql_frontend
bind *:5432
mode tcp
default_backend pgsql_backend
backend pgsql_backend
mode tcp
balance roundrobin
option httpchk GET /
# Health check for PostgreSQL. Patroni exposes a health check endpoint.
# We'll use a simple TCP check here, but a more robust check could query Patroni's API.
option pgsql-check user postgres # Use a dedicated replication user with minimal privileges
# Replace with your PostgreSQL node private IPs and ports
server pg_node1 :5432 check port 5432 inter 2s fall 3 rise 2
server pg_node2 :5432 check port 5432 inter 2s fall 3 rise 2
server pg_node3 :5432 check port 5432 inter 2s fall 3 rise 2
listen stats
bind *:8404
mode http
stats enable
stats uri /stats
stats realm Haproxy\ Statistics
stats auth admin:your_haproxy_stats_password
Explanation:
- The
frontend pgsql_frontendlistens on port 5432 and forwards traffic to thepgsql_backend. - The
backend pgsql_backendusesbalance roundrobin. While round-robin is used for initial distribution, HAProxy’s health checks are key. option pgsql-check user postgresattempts a basic PostgreSQL connection. For more advanced checks, you could integrate with Patroni’s REST API to determine the leader.check port 5432 inter 2s fall 3 rise 2configures health checks. HAProxy will ping each server every 2 seconds, consider it down after 3 failures, and back up after 2 successes.- The
listen statssection provides a web interface for monitoring HAProxy’s status.
Restart and enable HAProxy:
sudo systemctl restart haproxy sudo systemctl enable haproxy
Configuring WordPress
Update your WordPress wp-config.php file to point to the HAProxy IP address and port.
<?php // ... other WordPress configurations define( 'DB_HOST', ':5432' ); // Use the IP of your HAProxy Droplet define( 'DB_USER', 'wordpress_user' ); define( 'DB_PASSWORD', 'your_wordpress_db_password' ); define( 'DB_NAME', 'wordpress_db' ); // ... rest of wp-config.php ?>
Ensure that the WordPress database user (wordpress_user in this example) exists and has the necessary privileges on the PostgreSQL database.
Deploying WordPress Application Servers
For WordPress itself, you’ll want multiple web server instances (e.g., Nginx or Apache) running your WordPress application. These should be configured to serve your site and connect to the HAProxy endpoint for database access.
Web Server Configuration (Nginx Example)
On each WordPress web server Droplet, configure Nginx to serve your WordPress files. The key is that all these instances share the same database configuration pointing to HAProxy.
server {
listen 80;
server_name yourdomain.com www.yourdomain.com;
root /var/www/html/your-wordpress-site; # Path to your WordPress installation
index index.php index.html index.htm;
location / {
try_files $uri $uri/ /index.php?$args;
}
location ~ \.php$ {
include snippets/fastcgi-php.conf;
fastcgi_pass unix:/var/run/php/php7.4-fpm.sock; # Adjust PHP-FPM version
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
include fastcgi_params;
}
# Deny access to sensitive files
location ~ /\.ht {
deny all;
}
# Add caching headers for static assets if desired
location ~* \.(css|js|jpg|jpeg|png|gif|ico|svg|webp)$ {
expires 1y;
add_header Cache-Control "public";
}
}
Ensure PHP-FPM is installed and configured correctly.
Load Balancing WordPress Instances
To handle incoming web traffic and distribute it across your WordPress web servers, deploy another load balancer (e.g., DigitalOcean’s Managed Load Balancer or another HAProxy instance). This load balancer will direct traffic to your Nginx/Apache Droplets.
This setup ensures that if one WordPress web server fails, traffic is automatically routed to the remaining healthy instances. The database connection remains stable via the PostgreSQL HAProxy endpoint.
Monitoring and Maintenance
A robust disaster recovery strategy includes continuous monitoring. Key areas to monitor:
- etcd Cluster Health: Ensure all etcd members are healthy and responsive.
- Patroni Status: Monitor Patroni logs for any errors or warnings related to cluster management or PostgreSQL. Use
patronictl listregularly. - PostgreSQL Replication Lag: While Patroni manages failover, significant replication lag can still impact user experience. Monitor
pg_stat_replicationon the primary. - HAProxy Health Checks: Monitor the HAProxy stats page for backend server status.
- Application Logs: Keep an eye on WordPress and web server logs for any unusual activity or errors.
- DigitalOcean Monitoring: Utilize DigitalOcean’s built-in Droplet and resource monitoring for CPU, memory, disk I/O, and network traffic.
Automated Backups
While HA provides resilience against node failures, it does not protect against data corruption or accidental deletion. Implement a robust backup strategy for your PostgreSQL database. Tools like pg_dump, pg_basebackup with WAL archiving, or managed backup solutions can be used. Ensure backups are stored off-site or in a separate availability zone.
For WordPress files, regular backups of your web root directory are also essential.
Conclusion
By combining Patroni for PostgreSQL high availability and automated failover with a load-balanced WordPress deployment, you can architect a resilient system on DigitalOcean. This setup minimizes downtime and ensures your application remains available even in the face of infrastructure failures. Continuous monitoring and regular testing of your failover procedures are paramount to maintaining confidence in your disaster recovery capabilities.