Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Magento 2 Deployments on Linode

Establishing a High-Availability PostgreSQL Cluster with Patroni

For a robust Magento 2 deployment, a highly available PostgreSQL database is non-negotiable. We’ll leverage Patroni, a template for PostgreSQL HA, to manage automatic failover. Patroni uses a distributed configuration store (like etcd, Consul, or ZooKeeper) to maintain cluster state and elect a leader. For simplicity and ease of deployment on Linode, we’ll use etcd.

This setup assumes you have at least three Linode instances. One will host etcd, and the others will run PostgreSQL nodes managed by Patroni. For production, consider a dedicated etcd cluster across multiple availability zones.

Deploying etcd for Patroni Coordination

First, install etcd on a dedicated Linode instance. We’ll configure it for basic peer discovery.

etcd Installation and Configuration

On your etcd node (e.g., `etcd-node.yourdomain.com`):

Install etcd (package manager varies by OS, using `apt` for Debian/Ubuntu):

sudo apt update
sudo apt install etcd -y

Configure etcd by editing `/etc/etcd/etcd.conf.yml`. Ensure it’s accessible by your PostgreSQL nodes. Replace `ETCD_NODE_IP` with the actual IP of your etcd node.

name: etcd-node
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://ETCD_NODE_IP:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://ETCD_NODE_IP:2380
initial-cluster: etcd-node=http://ETCD_NODE_IP:2380
initial-cluster-token: my-etcd-cluster-token
initial-cluster-state: new
proxy: false

Restart and enable the etcd service:

sudo systemctl restart etcd
sudo systemctl enable etcd

Verify etcd is running and accessible:

ETCDCTL_API=3 etcdctl member list
ETCDCTL_API=3 etcdctl endpoint health

Setting up Patroni on PostgreSQL Nodes

On each PostgreSQL node (e.g., `pg-node-1.yourdomain.com`, `pg-node-2.yourdomain.com`, `pg-node-3.yourdomain.com`):

Patroni Installation

Install PostgreSQL and Patroni. The exact PostgreSQL version should match your Magento 2 requirements.

# Install PostgreSQL (example for PostgreSQL 13 on Ubuntu)
sudo apt update
sudo apt install postgresql postgresql-contrib postgresql-13 -y

# Install Patroni (using pip is recommended for latest features)
sudo apt install python3-pip python3-dev build-essential libpq-dev -y
sudo pip3 install --upgrade pip
sudo pip3 install "patroni[etcd]" psycopg2-binary

Patroni Configuration

Create the Patroni configuration file, typically at `/etc/patroni/patroni.yml`. This configuration is crucial and must be identical across all PostgreSQL nodes, with only the `scope` (cluster name) and `name` (node identifier) differing.

Example `patroni.yml` for `pg-node-1.yourdomain.com`:

scope: magento_cluster
name: pg-node-1.yourdomain.com

restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node-1.yourdomain.com:8008

etcd:
  host: ETCD_NODE_IP:2379
  protocol: http

postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg-node-1.yourdomain.com:5432
  data_dir: /var/lib/postgresql/13/main
  pg_hba:
    - host    all             all             0.0.0.0/0               md5
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    effective_cache_size: 384MB
    maintenance_work_mem: 64MB
    checkpoint_completion_target: 0.9
    wal_buffers: 16MB
    default_statistics_target: 100
    random_page_cost: 1.1
    effective_io_concurrency: 200
    work_mem: 16MB
    min_wal_size: 1GB
    max_wal_size: 4GB
    checkpoint_timeout: 5min
    max_age: 10min
    wal_keep_size: 1024MB # Patroni manages this, but good to set a baseline

replication:
  master:
    # Patroni will configure replication slots automatically
    # Ensure PostgreSQL user has replication privileges
    # This user will be created by Patroni if it doesn't exist.
    username: replicator
    password: YOUR_REPLICATION_PASSWORD

# Tags can be used for custom logic, e.g., routing read-only traffic
tags:
  nofailover: false
  clone: false

Important Notes:

Replace `ETCD_NODE_IP` with the actual IP of your etcd node.
Replace `pg-node-1.yourdomain.com` with the hostname of the current PostgreSQL node.
Ensure the `data_dir` matches your PostgreSQL installation.
The `pg_hba` configuration allows connections from any IP; restrict this in production.
`wal_keep_size` is managed by Patroni, but setting a reasonable default is good practice.
The `replicator` user and password will be managed by Patroni. Ensure the password is strong.
For production, consider using SSL for etcd communication and PostgreSQL connections.

Create a systemd service file for Patroni, e.g., `/etc/systemd/system/patroni.service`:

[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Ensure the `postgres` user has read access to `/etc/patroni/patroni.yml` and write access to `data_dir`.

Initializing the PostgreSQL Cluster

On *one* of the PostgreSQL nodes (e.g., `pg-node-1.yourdomain.com`), start Patroni and PostgreSQL:

sudo systemctl start patroni
sudo systemctl enable patroni

Check the Patroni logs for initialization progress:

sudo journalctl -u patroni -f

Once the first node is initialized, start Patroni on the *other* PostgreSQL nodes. They will detect the existing cluster in etcd and join as replicas.

# On pg-node-2.yourdomain.com and pg-node-3.yourdomain.com
sudo systemctl start patroni
sudo systemctl enable patroni

Verify the cluster status using Patroni’s REST API:

curl http://pg-node-1.yourdomain.com:8008/cluster

You should see output indicating the cluster members and their roles (one master, others replicas).

Configuring Magento 2 for High Availability

Magento 2 needs to be configured to connect to the PostgreSQL cluster. The key is to use a connection string that can be dynamically updated or to use a load balancer that directs traffic to the current master.

Using a Virtual IP or Load Balancer

The most robust approach is to use a load balancer or a virtual IP (VIP) that always points to the current PostgreSQL master. On Linode, you can achieve this with:

HAProxy: A popular, high-performance TCP/HTTP load balancer.
Keepalived: For managing a floating VIP.

Let’s focus on HAProxy as it can also handle health checks and direct traffic appropriately.

HAProxy Setup

Install HAProxy on a separate node or one of the existing nodes (ensure it’s not a PostgreSQL node itself to avoid single points of failure).

sudo apt update
sudo apt install haproxy -y

Configure HAProxy by editing `/etc/haproxy/haproxy.cfg`. We’ll set up a frontend that directs traffic to the PostgreSQL master based on Patroni’s API.

[global]
log /dev/log local0
maxconn 4096
daemon

defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000

listen pgsql_cluster
bind *:5432
mode tcp
option tcp-check
# Use Patroni's API to find the master. This requires a custom script or a Patroni integration.
# For simplicity, we'll use a basic health check and rely on Patroni's leader election.
# A more advanced setup would involve a script that queries Patroni's API for the master's IP.
# For this example, we'll list all potential PostgreSQL nodes and let HAProxy health checks do the work.
# This is NOT fully automatic failover without a script.
# A better approach: Use a script to query Patroni API and update backend servers dynamically.

# Example with static backend list (requires manual intervention or a script to update)
# server pg-node-1 pg-node-1.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
# server pg-node-2 pg-node-2.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
# server pg-node-3 pg-node-3.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2

# A more dynamic approach using a script to query Patroni API:
# This requires a script that runs periodically or on-demand to update the backend servers.
# For a truly automated solution, consider using Patroni's built-in API to find the master.
# Example using a script (not provided here, but conceptually):
# server pg-master `get_pg_master_ip.sh` check port 5432 inter 2s fall 3 rise 2

# For a simpler, though less dynamic, setup, we can use Patroni's REST API to get the master.
# This requires a tool like 'jq' and a script to parse the output.
# Example using a script that updates HAProxy config dynamically:
# Let's assume a script `update_haproxy_backends.sh` exists.
# This script would query http://ETCD_NODE_IP:8008/cluster, find the master, and update the HAProxy config.
# Then, it would reload HAProxy.

# For demonstration, let's use a basic TCP check on all nodes.
# Patroni's REST API is the key to true automation.
# A common pattern is to have a script that polls Patroni's /cluster endpoint,
# identifies the master, and dynamically updates HAProxy's backend list.
# Then, HAProxy is reloaded.

# Example of a basic setup that relies on Patroni's leader election and HAProxy's health checks.
# This is NOT fully automated failover without a script to manage HAProxy backends.
# The 'check' directive will probe port 5432. Patroni ensures only the master listens on 5432 for external connections.
# However, replicas also listen on 5432 for replication.
# A better check would be to query Patroni's API.

# Let's use a more robust approach with a script that queries Patroni's API.
# This script would run via cron or a systemd timer.
# For now, we'll show the HAProxy config that *would* be managed.

# Example of a backend definition that would be dynamically managed:
# server pg-node-1 pg-node-1.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
# server pg-node-2 pg-node-2.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
# server pg-node-3 pg-node-3.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2

# A more direct way to use Patroni's API with HAProxy:
# HAProxy can be configured to use external scripts for health checks.
# Or, a separate process can update HAProxy's configuration.

# For a simpler, but less ideal, approach:
# Configure HAProxy to point to all nodes and rely on Patroni to ensure only the master is available for writes.
# This is not ideal as HAProxy might try to connect to replicas for writes.

# The recommended approach:
# 1. A script queries Patroni's /cluster API.
# 2. It identifies the master's IP and port.
# 3. It dynamically updates HAProxy's backend configuration.
# 4. It reloads HAProxy.

# Let's assume such a script exists and manages the 'server' lines below.
# The 'check port 5432' is a basic TCP check. A more advanced check would be to
# query Patroni's API for the master status.

# Example of backend servers that would be dynamically managed:
server pg-node-1 pg-node-1.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
server pg-node-2 pg-node-2.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
server pg-node-3 pg-node-3.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2

# Optional: Add a read-only listener if you want to direct read traffic to replicas
# listen pgsql_replicas
# bind *:5433
# mode tcp
# option tcp-check
# # This would point to replicas. Requires logic to identify replicas.
# server pg-node-2 pg-node-2.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2
# server pg-node-3 pg-node-3.yourdomain.com:5432 check port 5432 inter 2s fall 3 rise 2

Dynamic Backend Management Script (Conceptual):

import requests
import json
import subprocess
import os

PATRONI_API_URL = "http://ETCD_NODE_IP:8008/cluster"
HAPROXY_CONF_FILE = "/etc/haproxy/haproxy.cfg"
HAPROXY_BACKEND_SECTION = "pgsql_cluster"
PG_NODES = ["pg-node-1.yourdomain.com", "pg-node-2.yourdomain.com", "pg-node-3.yourdomain.com"]
PG_PORT = 5432

def get_master_node():
    try:
        response = requests.get(PATRONI_API_URL, timeout=5)
        response.raise_for_status()
        cluster_data = response.json()
        for member in cluster_data.get("members", []):
            if member.get("role") == "master":
                return member.get("api_url").split(":")[0].replace("http://", "")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching Patroni API: {e}")
    return None

def update_haproxy_config(master_node):
    if not master_node:
        print("No master node found, skipping HAProxy update.")
        return

    new_backend_lines = []
    for node in PG_NODES:
        server_line = f"  server {node.replace('.', '_')} {node}:{PG_PORT} check port {PG_PORT} inter 2s fall 3 rise 2"
        if node == master_node:
            # Optionally add specific checks for master, or just ensure it's in the list
            new_backend_lines.append(server_line)
        else:
            # For replicas, you might want different checks or just list them.
            # In a true HA setup, HAProxy should only direct writes to the master.
            # This example assumes HAProxy is configured to direct writes to the master.
            # If HAProxy is only for writes, replicas might not need to be in this list.
            # If HAProxy is for both reads and writes, you'd need a separate backend for replicas.
            new_backend_lines.append(server_line) # Including replicas for now, but writes should go to master

    try:
        with open(HAPROXY_CONF_FILE, 'r') as f:
            lines = f.readlines()

        in_backend_section = False
        new_lines = []
        for line in lines:
            if line.strip().startswith(f"listen {HAPROXY_BACKEND_SECTION}"):
                in_backend_section = True
                new_lines.append(line)
                # Add the dynamically generated server lines
                new_lines.extend(new_backend_lines)
                # Skip original server lines if they exist
                continue
            elif in_backend_section and (line.strip().startswith("server ") or line.strip() == ""):
                # Skip original server lines and empty lines within the section until the next listen/global
                if line.strip().startswith("server "):
                    continue
                elif line.strip() == "" and not new_backend_lines: # If no new backends, keep empty lines
                    new_lines.append(line)
                elif line.strip() == "" and new_backend_lines: # If new backends, skip empty lines that were before them
                    continue
            elif line.strip().startswith("listen ") or line.strip().startswith("[global]"):
                in_backend_section = False
                new_lines.append(line)
            else:
                new_lines.append(line)

        # Ensure the backend section is properly closed if it was the last section
        if in_backend_section and not new_lines[-1].strip().startswith("listen ") and not new_lines[-1].strip().startswith("[global]"):
             new_lines.append("\n") # Add a newline if needed

        with open(HAPROXY_CONF_FILE, 'w') as f:
            f.writelines(new_lines)

        print("HAProxy config updated. Reloading HAProxy...")
        subprocess.run(["sudo", "systemctl", "reload", "haproxy"], check=True)
        print("HAProxy reloaded.")

    except Exception as e:
        print(f"Error updating HAProxy config: {e}")

if __name__ == "__main__":
    master = get_master_node()
    if master:
        print(f"Detected master: {master}")
        update_haproxy_config(master)
    else:
        print("Could not detect master node.")

This script should be scheduled to run periodically (e.g., every minute) using cron or a systemd timer. It queries Patroni’s API, finds the master, and updates the HAProxy configuration file, then reloads HAProxy. Ensure the script has the necessary permissions to write to `/etc/haproxy/haproxy.cfg` and reload the `haproxy` service.

Restart HAProxy after updating the configuration:

sudo systemctl restart haproxy
sudo systemctl enable haproxy

Magento 2 Database Configuration

In your Magento 2 `app/etc/env.php` file, configure the database connection to point to the HAProxy listener.

<?php
return [
    'backend' => [
        'frontName' => 'admin_your_secret_path'
    ],
    'crypt' => [
        'key' => 'your_crypt_key_here'
    ],
    'db' => [
        'connection' => [
            'default' => [
                'host' => 'HAPROXY_IP_OR_HOSTNAME', // IP or hostname of your HAProxy server
                'port' => '5432', // The port HAProxy listens on for PostgreSQL
                'dbname' => 'magento_db',
                'username' => 'magento_user',
                'password' => 'YOUR_MAGENTO_DB_PASSWORD',
                'model' => 'mysql4',
                'initStatements' => 'SET NAMES utf8',
                'driver_options' => [
                    1002 => 'SET NAMES utf8'
                ]
            ],
            'innodb_strict_mode' => true,
            'use_ssl' => '0' // Set to '1' if using SSL
        ],
        'default_setup' => [
            'host' => 'HAPROXY_IP_OR_HOSTNAME',
            'port' => '5432',
            'dbname' => 'magento_db',
            'username' => 'magento_user',
            'password' => 'YOUR_MAGENTO_DB_PASSWORD',
            'model' => 'mysql4',
            'initStatements' => 'SET NAMES utf8',
            'driver_options' => [
                1002 => 'SET NAMES utf8'
            ]
        ]
    ],
    // ... other Magento configuration
];

Replace `HAPROXY_IP_OR_HOSTNAME` with the IP address or hostname of your HAProxy server. Magento will now connect through HAProxy, which will direct traffic to the current PostgreSQL master.

Testing Failover

To test the automatic failover:

Identify the current master node using Patroni’s API: `curl http://ETCD_NODE_IP:8008/cluster`.
Gracefully stop the PostgreSQL service on the master node: `sudo systemctl stop patroni`.
Observe Patroni logs on the remaining nodes. Patroni should detect the master’s failure, and one of the replicas should be promoted to master.
Verify the new master using Patroni’s API again.
Check if Magento can still connect and perform operations. The HAProxy script should have updated its backend to point to the new master.
Once the old master is back online, Patroni will configure it as a replica of the new master.

For a more aggressive test, you can simulate a network partition or forcefully kill the PostgreSQL process on the master node. Ensure your HAProxy health checks and the dynamic backend script are robust enough to handle these scenarios.

Securing the PostgreSQL Cluster

Production deployments require enhanced security:

SSL/TLS: Encrypt communication between Patroni nodes, etcd, PostgreSQL clients, and HAProxy.
Firewall Rules: Restrict access to PostgreSQL ports (5432) and Patroni API ports (8008) to only necessary IPs (e.g., HAProxy, Magento application servers).
etcd Security: Configure etcd with TLS and authentication.
PostgreSQL Authentication: Use strong passwords and consider certificate-based authentication.
HAProxy ACLs: Implement access control lists in HAProxy to further restrict access.

By implementing Patroni for PostgreSQL HA and HAProxy with dynamic backend management for traffic routing, you can achieve a highly available and resilient database layer for your Magento 2 deployment on Linode.