Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Magento 2 Deployments on DigitalOcean

Establishing a Highly Available PostgreSQL Cluster

Achieving automated failover for PostgreSQL on DigitalOcean necessitates a robust, multi-node architecture. We’ll leverage a primary-replica setup with synchronous replication to minimize data loss during an outage. For automated failover detection and execution, we’ll integrate a tool like Patroni, which orchestrates PostgreSQL instances and manages cluster state using etcd or Consul for distributed consensus.

Setting up Patroni and etcd

First, provision at least four DigitalOcean Droplets. One will host etcd (for consensus), and the other three will host PostgreSQL instances (one primary, two replicas). Ensure these Droplets are in the same VPC network so that replication and consensus traffic stays on low-latency private addresses.

Install etcd on its dedicated Droplet. For simplicity, we’ll use a single-node etcd for this example, but a clustered etcd setup is recommended for production resilience.
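To make the production recommendation concrete, the arithmetic below shows how many member failures a majority-based cluster such as etcd can survive at a given size. It is a minimal sketch, and the function names are illustrative rather than part of any etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} member(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

A single-node etcd tolerates no failures at all, and two nodes are no better than one, which is why three members is the usual minimum for production.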

etcd Installation (Ubuntu/Debian)

wget -q --show-progress --https-only \
  https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz

tar -xvf etcd-v3.5.9-linux-amd64.tar.gz
sudo mv etcd-v3.5.9-linux-amd64/etcd* /usr/local/bin/

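# Create the unprivileged system user that the service runs as
sudo useradd --system --no-create-home --shell /usr/sbin/nologin etcd
sudo mkdir -p /etc/etcd /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd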

# Create etcd systemd service file
sudo nano /etc/systemd/system/etcd.service

Paste the following content into /etc/systemd/system/etcd.service:

[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
User=etcd
ExecStart=/usr/local/bin/etcd \
  --name=etcd0 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://YOUR_ETCD_DROPLET_IP:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://YOUR_ETCD_DROPLET_IP:2380 \
  --initial-cluster etcd0=http://YOUR_ETCD_DROPLET_IP:2380 \
  --initial-cluster-state new
Restart=on-failure

[Install]
WantedBy=multi-user.target

Replace YOUR_ETCD_DROPLET_IP with the private (VPC) IP address of your etcd Droplet, since this example serves etcd over plain HTTP without authentication. Then, enable and start the etcd service:

sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd
sudo systemctl status etcd

Patroni Installation and Configuration

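Install PostgreSQL, Patroni, and Patroni’s dependencies (Python 3, pip, PostgreSQL development headers) on each PostgreSQL Droplet.

sudo apt update
sudo apt install -y python3 python3-pip postgresql-14 postgresql-server-dev-14 # Adjust version as needed
sudo pip3 install 'patroni[etcd3]' psycopg2-binary
# Patroni manages PostgreSQL itself, so remove the default cluster created by the package
sudo systemctl disable --now postgresql
sudo pg_dropcluster --stop 14 main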

Create a Patroni configuration file (e.g., patroni.yml) on each PostgreSQL Droplet. This configuration defines the PostgreSQL settings, replication method, and etcd connection details.

# patroni.yml
scope: my_magento_cluster
namespace: /service/ # etcd namespace for this cluster
name: pg-node1 # Must be unique on every node (e.g., pg-node1, pg-node2, pg-node3)

restapi:
  listen: 0.0.0.0:8008
  connect_address: YOUR_POSTGRES_DROPLET_IP:8008

etcd3: # etcd 3.4+ serves only the v3 API by default
  host: YOUR_ETCD_DROPLET_IP:2379
  protocol: http

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB, adjust as needed
    synchronous_mode: true # Patroni maintains synchronous_standby_names itself
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 100
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 5
        max_replication_slots: 5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: YOUR_POSTGRES_DROPLET_IP:5432
  data_dir: /var/lib/postgresql/14/main # Adjust path as per your PostgreSQL installation
  bin_dir: /usr/lib/postgresql/14/bin # Debian/Ubuntu location of the PostgreSQL binaries
  pg_hba:
    - local   all           all                          peer
    - host    all           all          127.0.0.1/32    md5
    - host    replication   replicator   YOUR_VPC_CIDR   md5
    - host    all           all          YOUR_VPC_CIDR   md5
  authentication:
    superuser:
      username: postgres
      password: YOUR_SUPERUSER_PASSWORD
    replication:
      username: replicator
      password: YOUR_REPLICATION_PASSWORD
  parameters:
    shared_buffers: 256MB
    synchronous_commit: "on"

Replace YOUR_POSTGRES_DROPLET_IP with the private IP of the current Droplet, YOUR_ETCD_DROPLET_IP with the etcd Droplet’s IP, and YOUR_VPC_CIDR with your VPC’s address range. Give each node a unique name, and set strong passwords for the superuser and replicator accounts. Do not set synchronous_standby_names by hand: with synchronous_mode enabled, Patroni chooses the synchronous standby and rewrites that parameter itself as nodes join, fail, or are promoted. The bootstrap.dcs section is applied only when the cluster is first initialized, so later changes to those settings go through patronictl edit-config.

Starting PostgreSQL with Patroni

Create a systemd service file for Patroni on each PostgreSQL Droplet (e.g., patroni.service).

[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Ensure the User and Group match your PostgreSQL installation. Place the patroni.yml file in /etc/patroni/ and make it readable only by that user, since it contains database passwords. Then, enable and start the Patroni service:

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni
sudo systemctl status patroni

The first node to start initializes a new PostgreSQL cluster and becomes the primary, and the remaining nodes clone it and join as replicas. Monitor the logs (journalctl -u patroni) for any errors. You can verify the cluster status with patronictl -c /etc/patroni/patroni.yml list, via the Patroni REST API, or by querying etcd.
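As an illustration of the REST API route, the sketch below reads a node’s /cluster endpoint and groups members by role. The node address in the docstring, the helper names, and the sample payload are assumptions made for this example, and the exact fields returned can vary between Patroni versions.

```python
import json
from urllib.request import urlopen


def fetch_cluster(api_base: str, timeout: float = 3.0) -> dict:
    """Fetch the cluster view from one Patroni node, e.g. http://10.0.0.11:8008."""
    with urlopen(f"{api_base}/cluster", timeout=timeout) as resp:
        return json.load(resp)


def members_by_role(cluster: dict) -> dict:
    """Group member names by the role Patroni reports for them."""
    roles = {}
    for member in cluster.get("members", []):
        roles.setdefault(member.get("role", "unknown"), []).append(member.get("name"))
    return roles


def lagging_members(cluster: dict, max_lag_bytes: int) -> list:
    """Names of non-leader members whose reported lag exceeds the threshold."""
    return [
        m["name"]
        for m in cluster.get("members", [])
        if m.get("role") != "leader"
        and isinstance(m.get("lag"), int)
        and m["lag"] > max_lag_bytes
    ]


# Made-up payload in the general shape of a /cluster response.
sample = {
    "members": [
        {"name": "pg-node1", "role": "leader", "state": "running"},
        {"name": "pg-node2", "role": "sync_standby", "state": "streaming", "lag": 0},
        {"name": "pg-node3", "role": "replica", "state": "streaming", "lag": 2097152},
    ]
}

print(members_by_role(sample))
print(lagging_members(sample, max_lag_bytes=1048576))
```

Against a live cluster, pass the result of fetch_cluster("http://YOUR_POSTGRES_DROPLET_IP:8008") to the same helpers instead of the sample dictionary.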

Magento 2 Application Layer High Availability

For Magento 2, high availability at the application layer involves ensuring that multiple web servers can serve traffic and that session data, cache, and file storage are shared or replicated. We’ll use a load balancer, shared file storage, and a distributed session/cache backend.

Load Balancing with HAProxy

Deploy at least two Magento web servers. A third Droplet can host HAProxy for load balancing. Configure HAProxy to distribute traffic across the Magento web servers.

# /etc/haproxy/haproxy.cfg
global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend http_frontend
    bind *:80
    # Block outside access to the health check endpoint. HAProxy probes the backends directly.
    acl is_magento_health_check path /health_check.php
    http-request deny if is_magento_health_check

    default_backend magento_backend

backend magento_backend
    balance roundrobin
    option httpchk GET /health_check.php # Configure health check
    server magento1 YOUR_MAGENTO_SERVER_1_IP:80 check
    server magento2 YOUR_MAGENTO_SERVER_2_IP:80 check
    # Add more servers as needed

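Create a simple health_check.php file in the web root (pub/ in a standard Magento 2 installation) on each web server. Magento 2 already ships a pub/health_check.php that also tests database and cache connectivity, so skip this step if you want to keep that one. The minimal version below only confirms that the web server and PHP are responding:

<?php
// health_check.php: minimal liveness probe for HAProxy
http_response_code(200);
header('Content-Type: text/plain');
echo "OK";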

Install and configure HAProxy, then enable and start the service.

sudo apt update
sudo apt install -y haproxy
sudo systemctl enable haproxy
sudo systemctl start haproxy

Shared Storage for Magento Files

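Magento’s pub/media directory, along with any var/ subdirectories that hold shared runtime files such as imports and exports, needs to be accessible by all web servers. Generated code (generated/, or var/generation in Magento 2.0 and 2.1) and static content should instead be built on each server during deployment. DigitalOcean Spaces (S3-compatible) or a managed NFS service can be used. Here we outline using a DigitalOcean Spaces bucket with aws-sdk-php and a custom filesystem adapter.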

Configuring DigitalOcean Spaces

Create a DigitalOcean Space and generate API credentials (Key and Secret).

Magento Filesystem Configuration

Install the AWS SDK for PHP and the Flysystem S3 adapter.

composer require aws/aws-sdk-php league/flysystem-aws-s3-v3

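Create a custom filesystem configuration in app/etc/env.php or a custom configuration file. The keys below are illustrative: they are read by your custom adapter rather than by Magento core, so name them to match your module. Magento 2.4.2 and later also include a built-in remote storage module with an S3 driver, which may remove the need for a custom adapter.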

// app/etc/env.php (example snippet)
'filesystem' => [
    'allowed_paths' => [
        '/var/www/html/pub/media', // Ensure this path is accessible locally for CLI operations
    ],
    'media' => [
        'default' => 's3_magento',
    ],
    's3_magento' => [
        'driver' => 's3',
        'key'    => 'YOUR_SPACES_KEY',
        'secret' => 'YOUR_SPACES_SECRET',
        'region' => 'nyc3', // Your Spaces region
        'bucket' => 'your-magento-media-bucket',
        'endpoint' => 'https://nyc3.digitaloceanspaces.com', // Your Spaces endpoint
        'url'    => 'https://your-magento-media-bucket.nyc3.digitaloceanspaces.com', // Public URL for media
        'visibility' => 'public_read',
    ],
],

After configuring, rebuild and redeploy Magento, then clear its caches. These commands do not copy existing media files to the bucket, so the initial synchronization needs a separate step.

php bin/magento setup:upgrade
php bin/magento setup:di:compile
php bin/magento setup:static-content:deploy -f
php bin/magento cache:clean
php bin/magento cache:flush
# For the initial sync, use a custom script or a tool like rclone
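A custom sync script usually starts by deciding which local files map to which object keys. The sketch below is a dry-run planner only: it performs no upload, and the key prefix and the skipped directories (the resized-image cache and tmp) are assumptions to adapt to your store.

```python
import tempfile
from pathlib import Path

# Assumed to be regenerable or temporary, so not worth uploading.
SKIP_PREFIXES = ("catalog/product/cache", "tmp")


def plan_upload(media_root, key_prefix="media", skip=SKIP_PREFIXES):
    """Return sorted (local_path, object_key) pairs for files under media_root."""
    root = Path(media_root)
    plan = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if any(rel == s or rel.startswith(s + "/") for s in skip):
            continue
        plan.append((str(path), f"{key_prefix}/{rel}"))
    return plan


# Demo against a throwaway directory that mimics part of pub/media.
with tempfile.TemporaryDirectory() as tmp:
    for rel in (
        "catalog/product/a/b/shoe.jpg",
        "catalog/product/cache/x/shoe.jpg",
        "tmp/upload.part",
        "wysiwyg/banner.png",
    ):
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"demo")
    keys = [key for _, key in plan_upload(tmp)]

print(keys)
```

The resulting pairs can be handed to whichever uploader you choose, for example the AWS SDK or an rclone invocation per directory.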

Distributed Session and Cache Management

Magento’s session and cache data must be shared across web servers. Redis is an excellent choice for this. Deploy a Redis cluster or a highly available Redis setup.

Redis Setup

Install Redis on a separate Droplet or on one of the existing Droplets if resources permit. For high availability, consider Redis Sentinel or Redis Cluster.

sudo apt update
sudo apt install -y redis-server

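Configure Redis to listen on a private IP address, require a password, and adjust memory limits as needed.

# /etc/redis/redis.conf
# redis.conf does not allow trailing comments, so keep each comment on its own line.
# Listen on loopback and the Droplet's private IP only.
bind 127.0.0.1 YOUR_REDIS_DROPLET_IP
# Keep protected mode on and require a password from clients.
protected-mode yes
requirepass YOUR_REDIS_PASSWORD
# maxmemory 256mb
# maxmemory-policy allkeys-lru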

Restart Redis after configuration changes.

sudo systemctl restart redis-server

Magento Redis Configuration

Update app/etc/env.php to use Redis for sessions and caching.

// app/etc/env.php (example snippet)
'cache' => [
    'frontend' => [
        'default' => [
            'backend' => 'Magento\\Framework\\Cache\\Backend\\Redis',
            'options' => [
                'server' => 'YOUR_REDIS_DROPLET_IP',
                'port' => 6379,
                'database' => 0, // For default cache
                'password' => '', // If Redis is password protected
            ],
        ],
        'page_cache' => [
            'backend' => 'Magento\\Framework\\Cache\\Backend\\Redis',
            'options' => [
                'server' => 'YOUR_REDIS_DROPLET_IP',
                'port' => 6379,
                'database' => 1, // For page cache
                'password' => '',
            ],
        ],
    ],
],
'session' => [
    'save' => 'redis',
    'redis' => [
        'host' => 'YOUR_REDIS_DROPLET_IP',
        'port' => 6379,
        'password' => '',
        'timeout' => 2.5,
        'persistent_identifier' => '',
        'database' => 2, // For session data
        'compression_threshold' => 2048,
        'compression_library' => 'gzip',
        'log_level' => 0,
    ],
],

Ensure your Magento web servers can reach the Redis Droplet on port 6379. After updating env.php, clear Magento caches.
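One way to confirm reachability from a web server is to send a Redis PING over a plain socket and inspect the reply. This is a minimal sketch using only the standard library. The function names are illustrative, and it deliberately does not authenticate, so a password-protected server is reported as reachable but requiring auth.

```python
import socket


def classify_reply(reply: bytes) -> str:
    """Interpret the first bytes of a Redis reply to an unauthenticated PING."""
    if reply.startswith(b"+PONG"):
        return "ok"
    if reply.startswith(b"-NOAUTH"):
        return "reachable, password required"
    if reply.startswith(b"-DENIED"):
        return "reachable, blocked by protected mode"
    return "unexpected reply"


def check_redis(host: str, port: int = 6379, timeout: float = 2.0) -> str:
    """Connect, send an inline PING, and classify the answer."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline command form of PING
            return classify_reply(conn.recv(128))
    except OSError:
        return "unreachable"


# Offline demonstration of the classifier with typical replies.
for raw in (b"+PONG\r\n", b"-NOAUTH Authentication required.\r\n", b""):
    print(repr(raw), "->", classify_reply(raw))
```

Run check_redis("YOUR_REDIS_DROPLET_IP") from each Magento web server. A result of "unreachable" points to a firewall or bind problem rather than a Magento misconfiguration.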

Automated Failover Orchestration

The core of automated failover lies in the interplay between PostgreSQL’s high availability managed by Patroni and the application’s ability to adapt to changes. HAProxy’s health checks are critical for redirecting traffic away from unhealthy Magento instances. For PostgreSQL, Patroni handles the promotion of a replica to primary automatically.

PostgreSQL Failover Workflow

1. **Primary Node Failure:** If the primary PostgreSQL node becomes unreachable (e.g., network issue, crash), its Patroni agent can no longer renew the leader key in the DCS (etcd), and the key expires after the configured ttl (30 seconds in the configuration above).

2. **Leader Election:** Patroni instances on the remaining replica nodes notice the missing leader key and race to acquire it in etcd. Replicas that lag too far behind (beyond maximum_lag_on_failover) stay out of the race, and in synchronous mode only a synchronous standby is eligible. The first eligible node to create the key becomes the new leader.

3. **Replica Promotion:** The node that wins the leader lock promotes its own PostgreSQL instance to primary. The other replicas then re-point their replication to the new primary, using pg_rewind where necessary to resynchronize a node whose timeline has diverged.

4. **Application Reconfiguration (Implicit):** Magento applications configured to connect to the PostgreSQL cluster via a Virtual IP (VIP) or a DNS entry managed by a service that updates on Patroni’s leader change will automatically connect to the new primary. If using a direct connection string, this needs to be updated. A common pattern is to have a load balancer or proxy in front of PostgreSQL that directs traffic to the current primary. Patroni’s REST API can be queried to determine the current primary, and this information can be used to update the VIP or DNS.
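The lease-based election in the steps above can be illustrated with a small in-memory model. This is a toy sketch, not Patroni’s implementation: LeaderLock stands in for the etcd leader key, time is passed in explicitly, and the node names and lag figures are made up.

```python
MAX_LAG = 1048576  # mirrors maximum_lag_on_failover from the configuration above


class LeaderLock:
    """In-memory stand-in for a DCS leader key with a TTL (not real etcd)."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node: str, now: float) -> bool:
        """Take the lock if it is free or expired (atomic in a real DCS)."""
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = node, now + self.ttl
            return True
        return False

    def renew(self, node: str, now: float) -> bool:
        """Only the current, unexpired holder may extend the lease."""
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


def run_election(lock, candidates, now):
    """Each (name, lag_bytes) candidate within the lag limit tries the lock."""
    for name, lag in candidates:
        if lag <= MAX_LAG and lock.acquire(name, now):
            return name
    return lock.holder if now < lock.expires_at else None


lock = LeaderLock(ttl=30)
lock.acquire("pg-node1", now=0)   # initial primary takes the lock
lock.renew("pg-node1", now=10)    # last heartbeat before the crash
replicas = [("pg-node3", 5_000_000), ("pg-node2", 0)]

early = run_election(lock, replicas, now=25)  # lease still valid
late = run_election(lock, replicas, now=40)   # lease expired
print(early, late)
```

The first attempt leaves pg-node1 as leader because its lease has not yet run out. Once it expires, pg-node2 wins, while pg-node3 never competes because its lag exceeds the limit.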

Magento Application Failover Workflow

1. **Magento Node Failure:** If a Magento web server becomes unresponsive, HAProxy’s health checks (e.g., checking /health_check.php) will fail.

2. **Traffic Redirection:** HAProxy automatically stops sending traffic to the unhealthy node and redirects all requests to the remaining healthy Magento web servers.

3. **Session Persistence:** Because sessions are stored in Redis, users will not lose their session state when traffic is redirected to a different web server.
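How a load balancer decides that a node is unhealthy can be sketched as a small state machine. The model below assumes HAProxy’s default thresholds (a server is marked down after 3 consecutive failed checks and up again after 2 consecutive successes). The class and function names are illustrative, not HAProxy internals.

```python
class BackendServer:
    """Tracks up/down state from consecutive health check results."""

    def __init__(self, name: str, rise: int = 2, fall: int = 3):
        self.name = name
        self.rise = rise    # successes needed to come back up
        self.fall = fall    # failures needed to be marked down
        self.up = True
        self.streak = 0     # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result and return the resulting up/down state."""
        if check_passed == self.up:
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.fall if self.up else self.rise
            if self.streak >= threshold:
                self.up = not self.up
                self.streak = 0
        return self.up


def healthy(servers):
    """Names of servers that should currently receive traffic."""
    return [s.name for s in servers if s.up]


magento1, magento2 = BackendServer("magento1"), BackendServer("magento2")
states = [magento1.record(ok) for ok in (False, False, False, True, True)]
print(states)
print(healthy([magento1, magento2]))
```

A single failed check does not remove a server, which avoids flapping on transient errors. Tightening fall or the check interval shortens detection time at the cost of more false positives.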

Integrating PostgreSQL and Magento Failover

The most seamless integration involves abstracting the PostgreSQL primary endpoint. This can be achieved by:

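Virtual IP (VIP): Using a tool like Keepalived, or a DigitalOcean Reserved IP reassigned through the API, to manage a floating IP address that always points to the current primary PostgreSQL instance. Patroni callback scripts (for example, on_role_change) can trigger the switch on leader changes.
DNS-based Failover: Employing a DNS service that can be updated programmatically. A Patroni callback, or a watcher polling Patroni’s REST API, can trigger DNS record updates. Keep the record’s TTL short so clients pick up the change quickly.
Proxy Layer: Implementing a proxy in front of PostgreSQL (e.g., HAProxy with health checks against Patroni’s REST API, optionally combined with PgBouncer for connection pooling) that routes connections to the active primary.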

For Magento, the HAProxy health checks ensure that traffic is only sent to healthy application instances. When a PostgreSQL failover occurs, the application needs to connect to the new primary. If using a VIP or dynamic DNS, this transition is largely transparent to the application. If Magento is configured with a static IP for the database, a mechanism must exist to update this configuration upon PostgreSQL failover.
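Patroni can run a callback script on events such as a role change, passing the action, the node’s new role, and the cluster name as arguments. The sketch below shows only the decision logic such a script might use. The "claim" and "release" outcomes are hypothetical hooks where you would reassign a floating IP or update DNS, and the role names accepted here are an assumption that covers both older and newer Patroni releases.

```python
import sys

# Older Patroni releases report "master" where newer ones report "primary".
PRIMARY_ROLES = {"primary", "master"}


def decide(action: str, role: str) -> str:
    """Map a callback invocation to 'claim', 'release', or 'ignore'."""
    if action in ("on_start", "on_restart", "on_role_change"):
        return "claim" if role in PRIMARY_ROLES else "release"
    if action == "on_stop":
        return "release"
    return "ignore"


def main(argv) -> int:
    """Entry point for use as a callback: <script> <action> <role> <cluster>."""
    if len(argv) != 4:
        print("usage: callback.py <action> <role> <cluster>", file=sys.stderr)
        return 2
    _, action, role, cluster = argv
    outcome = decide(action, role)
    # Hypothetical hook: reassign the floating IP or update DNS here.
    print(f"{cluster}: {action}/{role} -> {outcome}")
    return 0


# Offline demonstration of the decision table.
for action, role in (("on_role_change", "primary"), ("on_role_change", "replica"), ("on_stop", "primary")):
    print(action, role, "->", decide(action, role))
```

To use this as a real callback, end the file with sys.exit(main(sys.argv)) and reference the script from the callbacks section of the postgresql block in patroni.yml.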

Monitoring and Alerting

Robust monitoring is paramount. Implement checks for:

PostgreSQL cluster health (via Patroni API or etcd health).
Replication lag.
HAProxy backend health.
Redis cluster health and memory usage.
Disk space on all Droplets.
Application error rates and response times.

Tools like Prometheus with Alertmanager, Datadog, or DigitalOcean’s own monitoring can be leveraged. Configure alerts for critical events such as a PostgreSQL primary failure, HAProxy backend failures, or significant replication lag.
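For the replication lag check, PostgreSQL reports WAL positions as LSNs such as 16/B374D848 (for example from pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on a replica). The sketch below converts two such values into a byte lag and an alert level. The thresholds and function names are assumptions to tune for your workload.

```python
MIB = 1024 * 1024


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN like '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has yet to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


def alert_level(lag: int, warn: int = 1 * MIB, crit: int = 16 * MIB) -> str:
    """Classify a lag value against warning and critical thresholds."""
    if lag >= crit:
        return "critical"
    if lag >= warn:
        return "warning"
    return "ok"


small = lag_bytes("1/10", "0/FFFFFFF0")        # crosses a 4 GiB boundary
medium = lag_bytes("0/5000000", "0/4E00000")   # 2 MiB behind
print(small, alert_level(small))
print(medium, alert_level(medium))
```

A warning threshold equal to maximum_lag_on_failover is a reasonable starting point, since a replica beyond that limit would not be considered for promotion.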