Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Magento 2 Deployments on DigitalOcean
Establishing a Highly Available PostgreSQL Cluster
Achieving automated failover for PostgreSQL on DigitalOcean necessitates a robust, multi-node architecture. We’ll leverage a primary-replica setup with synchronous replication to minimize data loss during an outage. For automated failover detection and execution, we’ll integrate a tool like Patroni, which orchestrates PostgreSQL instances and manages cluster state using etcd or Consul for distributed consensus.
Setting up Patroni and etcd
First, provision at least four DigitalOcean Droplets: one for etcd (consensus) and three for PostgreSQL (one primary, two replicas). Place all of them in the same VPC network for low latency.
Install etcd on its dedicated Droplet. For simplicity, we’ll use a single-node etcd for this example, but a clustered etcd setup is recommended for production resilience.
etcd Installation (Ubuntu/Debian)
# Download and unpack the etcd release
wget -q --show-progress --https-only \
  https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz
tar -xvf etcd-v3.5.9-linux-amd64.tar.gz
sudo mv etcd-v3.5.9-linux-amd64/etcd* /usr/local/bin/
# Create the etcd system user before handing it the data directory
sudo useradd --system --no-create-home --shell /usr/sbin/nologin etcd
sudo mkdir -p /etc/etcd /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd
# Create etcd systemd service file
sudo nano /etc/systemd/system/etcd.service
Paste the following content into /etc/systemd/system/etcd.service:
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
User=etcd
ExecStart=/usr/local/bin/etcd \
  --name=etcd0 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=http://0.0.0.0:2379 \
  --advertise-client-urls=http://YOUR_ETCD_DROPLET_IP:2379 \
  --listen-peer-urls=http://0.0.0.0:2380 \
  --initial-advertise-peer-urls=http://YOUR_ETCD_DROPLET_IP:2380 \
  --initial-cluster=etcd0=http://YOUR_ETCD_DROPLET_IP:2380 \
  --initial-cluster-state=new \
  --enable-pprof

[Install]
WantedBy=multi-user.target
Replace YOUR_ETCD_DROPLET_IP with the actual IP address of your etcd Droplet. Then, enable and start the etcd service:
sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd
sudo systemctl status etcd
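Beyond systemctl status, it can help to confirm that etcd answers on its client port. The following is a minimal sketch that reads etcd’s /health endpoint; the function names are illustrative, and the base URL you pass in is assumed to be the advertise-client URL configured above.

```python
import json
import urllib.request


def parse_etcd_health(body: bytes) -> bool:
    """Return True when etcd's /health payload reports a healthy member."""
    doc = json.loads(body)
    # etcd reports the flag as the string "true", not as a JSON boolean.
    return str(doc.get("health", "")).lower() == "true"


def etcd_is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Fetch <base_url>/health; any network or parse error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_etcd_health(resp.read())
    except (OSError, ValueError):
        return False
```

For example, etcd_is_healthy("http://YOUR_ETCD_DROPLET_IP:2379") should return True once the service is up.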
Patroni Installation and Configuration
Install Patroni and its dependencies (Python 3, pip, PostgreSQL development headers) on each PostgreSQL Droplet.
sudo apt update
# Adjust the PostgreSQL version as needed
sudo apt install -y python3 python3-pip postgresql-server-dev-14
sudo pip3 install 'patroni[etcd]' psycopg2-binary
Create a Patroni configuration file (e.g., patroni.yml) on each PostgreSQL Droplet. This configuration defines the PostgreSQL settings, replication method, and etcd connection details.
# patroni.yml
scope: my_magento_cluster
namespace: /service/ # etcd namespace for this cluster
name: pg-node-1 # Must be unique on each PostgreSQL Droplet
restapi:
  listen: 0.0.0.0:8008
  connect_address: YOUR_POSTGRES_DROPLET_IP:8008
etcd3: # etcd 3.5 serves only the v3 API by default; the plain "etcd" section speaks v2
  host: YOUR_ETCD_DROPLET_IP:2379
  protocol: http
postgresql:
  listen: 0.0.0.0:5432
  connect_address: YOUR_POSTGRES_DROPLET_IP:5432
  data_dir: /var/lib/postgresql/14/main # Adjust path as per your PostgreSQL installation
  bin_dir: /usr/lib/postgresql/14/bin # Debian/Ubuntu location; adjust as needed
  pg_hba: # Narrow 0.0.0.0/0 to your VPC range in production
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
  authentication:
    replication:
      username: replicator
      password: YOUR_REPLICATION_PASSWORD
    superuser:
      username: postgres
      password: YOUR_SUPERUSER_PASSWORD
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    max_replication_slots: 5
    synchronous_commit: "on"
    # synchronous_standby_names is left out on purpose: Patroni sets it when synchronous_mode is enabled
    wal_sender_timeout: 60s # PostgreSQL default; 0 would disable dead-connection detection
    wal_receiver_timeout: 60s
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB, adjust as needed
    synchronous_mode: true # Patroni picks the synchronous standby itself
    postgresql:
      use_pg_rewind: true
      use_slots: true
Replace YOUR_POSTGRES_DROPLET_IP with the IP of the current Droplet and YOUR_ETCD_DROPLET_IP with the etcd Droplet’s IP, give each node a unique name, and set strong passwords for the replicator and postgres users. Synchronous replication is enabled through synchronous_mode in the bootstrap.dcs section; Patroni then chooses the synchronous standby and maintains synchronous_standby_names itself, so that parameter should not be set by hand.
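Because only a few fields differ between Droplets, the per-node part of the configuration can be generated rather than edited by hand. The sketch below renders just those node-specific keys; the template, the function name, and the sample node names and addresses are assumptions for illustration, and Patroni expects the name to be unique per node.

```python
from string import Template

# Only the keys that change from one PostgreSQL Droplet to the next.
NODE_TEMPLATE = Template(
    "name: $name\n"
    "restapi:\n"
    "  listen: 0.0.0.0:8008\n"
    "  connect_address: $ip:8008\n"
    "postgresql:\n"
    "  listen: 0.0.0.0:5432\n"
    "  connect_address: $ip:5432\n"
)


def render_node_fragment(name: str, ip: str) -> str:
    """Return the node-specific YAML fragment for one Droplet."""
    return NODE_TEMPLATE.substitute(name=name, ip=ip)


# Hypothetical node names and private IPs.
NODES = {"pg-node-1": "10.0.0.11", "pg-node-2": "10.0.0.12", "pg-node-3": "10.0.0.13"}
fragments = {name: render_node_fragment(name, ip) for name, ip in NODES.items()}
```

The rendered fragment still has to be merged with the shared settings (scope, DCS connection, parameters) before it is written to /etc/patroni/patroni.yml.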
Starting PostgreSQL with Patroni
Create a systemd service file for Patroni on each PostgreSQL Droplet (e.g., patroni.service).
[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Ensure the User and Group match your PostgreSQL installation. Place the patroni.yml file in /etc/patroni/. Then, enable and start the Patroni service:
sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni
sudo systemctl status patroni
Patroni will automatically initialize the first PostgreSQL instance as the primary and configure the others as replicas. Monitor the logs for any errors. You can verify the cluster status via the Patroni REST API or by querying etcd.
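One way to check the cluster from a script is Patroni’s REST API, whose /cluster endpoint returns the member list as JSON. The sketch below is a minimal reader for that document; the function names are illustrative, and the exact set of fields per member can vary between Patroni versions, so only name, role, and state are used.

```python
import json
import urllib.request


def summarize_cluster(cluster: dict) -> dict:
    """Reduce a /cluster document to the leader name and a line per other member."""
    members = cluster.get("members", [])
    leader = next((m["name"] for m in members if m.get("role") == "leader"), None)
    others = [
        f'{m["name"]} ({m.get("role")}, {m.get("state")})'
        for m in members
        if m.get("role") != "leader"
    ]
    return {"leader": leader, "others": others}


def fetch_cluster(host: str, port: int = 8008, timeout: float = 3.0) -> dict:
    """Read /cluster from any Patroni node; every node serves the same view."""
    url = f"http://{host}:{port}/cluster"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())
```

Calling summarize_cluster(fetch_cluster("YOUR_POSTGRES_DROPLET_IP")) shortly after startup should show one leader and the remaining nodes as replicas.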
Magento 2 Application Layer High Availability
For Magento 2, high availability at the application layer involves ensuring that multiple web servers can serve traffic and that session data, cache, and file storage are shared or replicated. We’ll use a load balancer, shared file storage, and a distributed session/cache backend.
Load Balancing with HAProxy
Deploy at least two Magento web servers. A third Droplet can host HAProxy for load balancing. Configure HAProxy to distribute traffic across the Magento web servers.
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
frontend http_frontend
    bind *:80
    # HAProxy probes the backends directly, so outside clients never need this path
    acl is_magento_health_check path /health_check.php
    http-request deny if is_magento_health_check
    default_backend magento_backend
backend magento_backend
    balance roundrobin
    option httpchk GET /health_check.php # Configure health check
    http-check expect status 200
    server magento1 YOUR_MAGENTO_SERVER_1_IP:80 check
    server magento2 YOUR_MAGENTO_SERVER_2_IP:80 check
    # Add more servers as needed
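Create a simple health_check.php file in the web root of each Magento server (pub/ in a standard Magento 2 install). Recent Magento 2 releases already ship a pub/health_check.php; if yours has one, you can keep it instead of the minimal version below:
<?php
// health_check.php
echo "OK";
exit(0);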
Install and configure HAProxy, then enable and start the service.
sudo apt update
sudo apt install -y haproxy
sudo systemctl enable haproxy
sudo systemctl start haproxy
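Once HAProxy is running, the admin socket declared in the global section can report how it currently sees each backend server. The following is a minimal sketch that sends show stat over that socket and extracts the status column; the function names are illustrative, and the socket path is assumed to match the stats socket line in the configuration above.

```python
import csv
import io
import socket


def parse_show_stat(raw: str) -> dict:
    """Map 'proxy/server' to the status column of HAProxy's `show stat` CSV."""
    # The header line begins with "# "; strip it so DictReader sees clean names.
    reader = csv.DictReader(io.StringIO(raw.lstrip("# ")))
    return {
        f"{row['pxname']}/{row['svname']}": row["status"]
        for row in reader
        if row.get("pxname")
    }


def query_haproxy(sock_path: str = "/run/haproxy/admin.sock") -> str:
    """Send `show stat` to the admin socket and return the raw CSV text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after answering
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

Run as a user with access to the socket, parse_show_stat(query_haproxy()) should list entries such as magento_backend/magento1 with a status of UP or DOWN.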
Shared Storage for Magento Files
Magento’s pub/media and var/generation (generated/ since Magento 2.2) directories need to be consistent across all web servers. DigitalOcean Spaces (S3-compatible) or an NFS share can be used. We’ll outline using a DigitalOcean Spaces bucket with the aws-sdk-php and a custom filesystem adapter.
Configuring DigitalOcean Spaces
Create a DigitalOcean Space and generate API credentials (Key and Secret).
Magento Filesystem Configuration
Install the AWS SDK for PHP and the Flysystem S3 adapter.
composer require aws/aws-sdk-php league/flysystem-aws-s3-v3
Add the storage settings to app/etc/env.php or a custom configuration file. Core Magento does not act on the keys shown here by itself; the snippet assumes the custom filesystem adapter mentioned above reads them, passes them to the Flysystem S3 adapter, and overrides the default media storage.
// app/etc/env.php (example snippet)
'filesystem' => [
    'allowed_paths' => [
        '/var/www/html/pub/media', // Ensure this path is accessible locally for CLI operations
    ],
    'media' => [
        'default' => 's3_magento',
    ],
    's3_magento' => [
        'driver' => 's3',
        'key' => 'YOUR_SPACES_KEY',
        'secret' => 'YOUR_SPACES_SECRET',
        'region' => 'nyc3', // Your Spaces region
        'bucket' => 'your-magento-media-bucket',
        'endpoint' => 'https://nyc3.digitaloceanspaces.com', // Your Spaces endpoint
        'url' => 'https://your-magento-media-bucket.nyc3.digitaloceanspaces.com', // Public URL for media
        'visibility' => 'public_read',
    ],
],
After configuring, rebuild the application and clear caches with the Magento CLI. These commands do not copy existing media files to the bucket; the initial sync needs a custom script or a tool like rclone.
php bin/magento setup:upgrade
php bin/magento setup:di:compile
php bin/magento setup:static-content:deploy -f
php bin/magento cache:clean
php bin/magento cache:flush
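For the initial media sync, a custom script could look roughly like the sketch below. It assumes the third-party boto3 package is installed and that object keys should mirror paths relative to pub/media; the function names, the public-read ACL, and that key layout are illustrative choices, not something Magento or Spaces prescribes.

```python
import os


def plan_uploads(media_root: str) -> list:
    """Return sorted (local_path, object_key) pairs for every file under media_root."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(media_root):
        for filename in filenames:
            local_path = os.path.join(dirpath, filename)
            # Object keys always use forward slashes, whatever the local OS uses.
            key = os.path.relpath(local_path, media_root).replace(os.sep, "/")
            pairs.append((local_path, key))
    return sorted(pairs)


def upload_all(pairs, bucket, endpoint, region, key_id, secret):
    """Upload each planned file to a Spaces bucket through the S3-compatible API."""
    import boto3  # third-party dependency: pip install boto3

    client = boto3.client(
        "s3",
        endpoint_url=endpoint,
        region_name=region,
        aws_access_key_id=key_id,
        aws_secret_access_key=secret,
    )
    for local_path, key in pairs:
        client.upload_file(local_path, bucket, key, ExtraArgs={"ACL": "public-read"})
```

Reviewing the output of plan_uploads("/var/www/html/pub/media") before calling upload_all gives a dry run of what would be copied.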
Distributed Session and Cache Management
Magento’s session and cache data must be shared across web servers. Redis is an excellent choice for this. Deploy a Redis cluster or a highly available Redis setup.
Redis Setup
Install Redis on a separate Droplet or on one of the existing Droplets if resources permit. For high availability, consider Redis Sentinel or Redis Cluster.
sudo apt update sudo apt install -y redis-server
Configure Redis to listen on a private IP address and adjust memory limits as needed.
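# /etc/redis/redis.conf
# redis.conf does not allow trailing comments, so each note sits on its own line.
# Bind to loopback and the private IP
bind 127.0.0.1 YOUR_REDIS_DROPLET_IP
# Disable protected mode only if firewall rules restrict access; otherwise keep it enabled
protected-mode no
# Adjust as needed
# maxmemory 256mb
# maxmemory-policy allkeys-lru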
Restart Redis after configuration changes.
sudo systemctl restart redis-server
Magento Redis Configuration
Update app/etc/env.php to use Redis for sessions and caching.
// app/etc/env.php (example snippet)
'cache' => [
    'frontend' => [
        'default' => [
            'backend' => 'Magento\\Framework\\Cache\\Backend\\Redis',
            'options' => [
                'server' => 'YOUR_REDIS_DROPLET_IP',
                'port' => 6379,
                'database' => 0, // For default cache
                'password' => '', // If Redis is password protected
            ],
        ],
        'page_cache' => [
            'backend' => 'Magento\\Framework\\Cache\\Backend\\Redis',
            'options' => [
                'server' => 'YOUR_REDIS_DROPLET_IP',
                'port' => 6379,
                'database' => 1, // For page cache
                'password' => '',
            ],
        ],
    ],
],
'session' => [
    'save' => 'redis',
    'redis' => [
        'host' => 'YOUR_REDIS_DROPLET_IP',
        'port' => 6379,
        'password' => '',
        'timeout' => 2.5,
        'persistent_identifier' => '',
        'database' => 2, // For session data
        'compression_threshold' => 2048,
        'compression_library' => 'gzip',
        'log_level' => 0,
    ],
],
Ensure your Magento web servers can reach the Redis Droplet on port 6379. After updating env.php, clear Magento caches.
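Reachability can be checked from each web server without installing a Redis client, since Redis accepts a plain inline PING. The sketch below is one minimal way to do that; the function names and the result labels are illustrative.

```python
import socket


def interpret_ping_reply(reply: bytes) -> str:
    """Classify the first bytes Redis sends back after an inline PING."""
    if reply.startswith(b"+PONG"):
        return "ok"
    if reply.startswith(b"-NOAUTH"):
        return "auth required"  # reachable, but a password is configured
    if reply.startswith(b"-DENIED"):
        return "protected mode"  # reachable, but protected mode rejects remote clients
    return "unexpected"


def ping_redis(host: str, port: int = 6379, timeout: float = 2.0) -> str:
    """Open a TCP connection, send PING, and classify the reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return interpret_ping_reply(sock.recv(128))
    except OSError:
        return "unreachable"
```

Running ping_redis("YOUR_REDIS_DROPLET_IP") on every Magento server should return "ok" before the env.php change goes live.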
Automated Failover Orchestration
The core of automated failover lies in the interplay between PostgreSQL’s high availability managed by Patroni and the application’s ability to adapt to changes. HAProxy’s health checks are critical for redirecting traffic away from unhealthy Magento instances. For PostgreSQL, Patroni handles the promotion of a replica to primary automatically.
PostgreSQL Failover Workflow
1. **Primary Node Failure:** If the primary PostgreSQL node becomes unreachable (e.g., network issue, crash), its Patroni agent can no longer renew the leader key in the DCS (etcd). Once the key’s TTL expires (the ttl setting, 30 seconds in this configuration), the replicas see that the cluster has no leader.
2. **Leader Election:** Patroni instances on the remaining replica nodes compare their WAL positions, and the eligible ones (those within maximum_lag_on_failover) race to acquire the leader lock in etcd. The first one to acquire the lock becomes the new leader.
3. **Replica Promotion:** The node that wins the lock promotes its own PostgreSQL instance to primary. The remaining replicas are repointed to follow it, and the old primary can rejoin as a replica (using pg_rewind where needed) once it is back.
4. **Application Reconfiguration (Implicit):** Magento applications configured to connect to the PostgreSQL cluster via a Virtual IP (VIP) or a DNS entry managed by a service that updates on Patroni’s leader change will automatically connect to the new primary. If using a direct connection string, this needs to be updated. A common pattern is to have a load balancer or proxy in front of PostgreSQL that directs traffic to the current primary. Patroni’s REST API can be queried to determine the current primary, and this information can be used to update the VIP or DNS.
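The last step can be sketched as a small probe against Patroni’s REST API. Recent Patroni releases answer GET /primary with HTTP 200 only on the node holding the leader lock and with 503 elsewhere (older releases expose the same check as /master); the function names below are illustrative.

```python
import urllib.error
import urllib.request


def probe_primary(host: str, port: int = 8008, timeout: float = 2.0) -> int:
    """Return the HTTP status of GET /primary, or 0 if the node is unreachable."""
    url = f"http://{host}:{port}/primary"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 503 on replicas
    except OSError:
        return 0


def pick_primary(status_by_host: dict):
    """Return the single host that answered 200, or None if zero or several did."""
    primaries = [host for host, code in status_by_host.items() if code == 200]
    return primaries[0] if len(primaries) == 1 else None
```

A caller would probe every PostgreSQL Droplet, pass the results to pick_primary, and only update the VIP or DNS record when it returns a host; returning None during an election avoids pointing clients at a node that is not yet primary.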
Magento Application Failover Workflow
1. **Magento Node Failure:** If a Magento web server becomes unresponsive, HAProxy’s health checks (e.g., checking /health_check.php) will fail.
2. **Traffic Redirection:** HAProxy automatically stops sending traffic to the unhealthy node and redirects all requests to the remaining healthy Magento web servers.
3. **Session Persistence:** Because sessions are stored in Redis, users will not lose their session state when traffic is redirected to a different web server.
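How quickly HAProxy reacts depends on its check thresholds. With a bare check keyword on the server line, HAProxy’s defaults are a 2-second interval, 3 consecutive failures to mark a server down, and 2 consecutive successes to bring it back. The toy model below illustrates that rise/fall logic; it is a simplification for reasoning about timing, not HAProxy’s actual implementation.

```python
def simulate_checks(results, fall=3, rise=2):
    """Return the up/down state after each check result (True means the check passed)."""
    up = True
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == up:
            streak = 0
        else:
            streak += 1
            if (up and streak >= fall) or (not up and streak >= rise):
                up = not up
                streak = 0
        states.append(up)
    return states


def seconds_until_down(fall=3, inter=2.0):
    """Rough worst-case detection time once every check starts failing."""
    return fall * inter
```

With the defaults, a crashed web server keeps receiving traffic for up to roughly 6 seconds; lowering inter or fall on the server lines shortens that window at the cost of more false positives.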
Integrating PostgreSQL and Magento Failover
The most seamless integration involves abstracting the PostgreSQL primary endpoint. This can be achieved by:
- Virtual IP (VIP): Using a tool like Keepalived to manage a floating IP address that always points to the current primary PostgreSQL instance. Patroni’s callback scripts (for example on_role_change) can signal the VIP manager on leader changes.
- DNS-based Failover: Employing a DNS service that can be updated programmatically. A Patroni callback script can trigger the DNS record update when the leader changes.
- Proxy Layer: Placing a TCP proxy such as HAProxy in front of PostgreSQL and health-checking each node’s Patroni REST API, so that only the current primary receives connections.
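For the VIP and DNS options, the glue is typically a Patroni callback, which Patroni invokes with three arguments: the action, the node’s new role, and the cluster name. The sketch below shows the decision logic plus the request a callback might send to move a DigitalOcean Reserved IP. The endpoint path and payload are assumptions that should be checked against the DigitalOcean API reference, and the function names are hypothetical.

```python
import json


def should_claim_endpoint(action: str, role: str) -> bool:
    """Claim the endpoint only when this node has just become (or started as) primary."""
    # Older Patroni releases report the role as "master", newer ones as "primary".
    return action in ("on_start", "on_role_change") and role in ("primary", "master")


def build_assign_request(reserved_ip: str, droplet_id: int) -> tuple:
    """Return (url, body) for reassigning a Reserved IP to this Droplet.

    Assumed API shape; verify against the DigitalOcean API documentation.
    """
    url = f"https://api.digitalocean.com/v2/reserved_ips/{reserved_ip}/actions"
    body = json.dumps({"type": "assign", "droplet_id": droplet_id}).encode()
    return url, body
```

A callback script would read the action and role from sys.argv[1] and sys.argv[2], and only when should_claim_endpoint returns True send the request with an API token in the Authorization header.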
For Magento, the HAProxy health checks ensure that traffic is only sent to healthy application instances. When a PostgreSQL failover occurs, the application needs to connect to the new primary. If using a VIP or dynamic DNS, this transition is largely transparent to the application. If Magento is configured with a static IP for the database, a mechanism must exist to update this configuration upon PostgreSQL failover.
Monitoring and Alerting
Robust monitoring is paramount. Implement checks for:
- PostgreSQL cluster health (via Patroni API or etcd health).
- Replication lag.
- HAProxy backend health.
- Redis cluster health and memory usage.
- Disk space on all Droplets.
- Application error rates and response times.
Tools like Prometheus with Alertmanager, Datadog, or DigitalOcean’s own monitoring can be leveraged. Configure alerts for critical events such as a PostgreSQL primary failure, HAProxy backend failures, or significant replication lag.
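As one concrete example of a replication lag check, the member list returned by Patroni’s /cluster endpoint includes a lag value in bytes for each replica. The sketch below flags replicas whose lag exceeds a threshold or is not reported as a number; the function name is illustrative, and the default threshold simply reuses the 1 MB maximum_lag_on_failover value from the Patroni configuration.

```python
def lagging_members(cluster: dict, max_lag_bytes: int = 1048576) -> list:
    """Return names of non-leader members whose lag is unknown or above the threshold."""
    flagged = []
    for member in cluster.get("members", []):
        if member.get("role") in ("leader", "standby_leader"):
            continue  # the leader has no replication lag to report
        lag = member.get("lag")
        # A missing or non-numeric lag (for example "unknown") is treated as a problem.
        if not isinstance(lag, int) or lag > max_lag_bytes:
            flagged.append(member["name"])
    return flagged
```

A monitoring job could fetch /cluster from any node on a schedule and raise an alert whenever this function returns a non-empty list.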