Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Magento 2 Deployments on OVH

Designing for Resilience: PostgreSQL and Magento 2 Auto-Failover on OVH

This document outlines a robust, automated failover architecture for PostgreSQL databases and Magento 2 applications deployed on OVHcloud infrastructure. The primary objective is to minimize downtime during hardware failures, network disruptions, or planned maintenance by implementing near-synchronous replication and automated detection/switching mechanisms.

PostgreSQL High Availability with Patroni and etcd

For PostgreSQL, we leverage Patroni, a template for PostgreSQL high availability. Patroni manages the PostgreSQL cluster lifecycle, including bootstrapping, replication, and failover. It relies on a distributed consensus store for leader election and cluster state management. We’ll use etcd for this purpose due to its robustness and widespread adoption.

Infrastructure Setup on OVH

A typical setup involves at least three PostgreSQL nodes for quorum and two etcd nodes. These should be deployed across different Availability Zones (AZs) within an OVH region for maximum resilience. For this example, assume the following:

PostgreSQL Nodes: pg-replica-01, pg-replica-02, pg-replica-03 (each with PostgreSQL installed)
etcd Nodes: etcd-01, etcd-02 (for a minimal HA setup, though 3+ is recommended for production)
Network: Private network for inter-node communication.

etcd Cluster Configuration

Each etcd node requires a configuration file. The following is a representative example for etcd-01. Ensure the --initial-cluster and --listen-peer-urls reflect your actual node hostnames and IP addresses.

`/etc/etcd/etcd.conf.yml` on `etcd-01`

name: etcd-01
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://:2379
listen-peer-urls: http://:2380
initial-advertise-peer-urls: http://:2380
initial-cluster: etcd-01=http://:2380,etcd-02=http://:2380
initial-cluster-token: my-etcd-cluster
initial-cluster-state: new
proxy: false
quota-backend-bytes: 8589934592
heartbeat-interval: 100
election-timeout: 1000

Repeat this configuration for etcd-02, adjusting the name and IP addresses accordingly. Start and enable the etcd service on all nodes:

sudo systemctl start etcd
sudo systemctl enable etcd

Patroni Configuration for PostgreSQL

Patroni configuration is typically provided in YAML format. This configuration defines how Patroni interacts with PostgreSQL and etcd. The scope should be unique for each cluster. The etcd section points to your etcd endpoints.

`/etc/patroni/patroni.yml` on each PostgreSQL node

scope: magento_pg_cluster
namespace: /service/magento_pg

restapi:
  listen: 0.0.0.0:8008
  connect_address: %(host)s:8008

etcd:
  hosts:
    - :2379
    - :2379
  protocol: http

postgresql:
  listen: 0.0.0.0:5432
  connect_address: %(host)s:5432
  data_dir: /var/lib/postgresql/14/main
  bin_dir: /usr/lib/postgresql/14/bin
  pg_hba:
    - host replication replicator %(replica_ip)s/32 md5
    - host all all 0.0.0.0/0 md5
  replication:
    master_command: |
      pg_basebackup -h %(master_host)s -p %(master_port)s -U replicator -D %(target_dir)s --wal-method=stream --slot=replica_slot
    replica_command: |
      pg_basebackup -h %(master_host)s -p %(master_port)s -U replicator -D %(target_dir)s --wal-method=stream --slot=replica_slot
  parameters:
    max_connections: 200
    shared_buffers: 1GB
    effective_cache_size: 3GB
    maintenance_work_mem: 256MB
    checkpoint_completion_target: 0.9
    wal_buffers: 16MB
    default_statistics_target: 100
    random_page_cost: 1.1
    effective_io_concurrency: 200
    work_mem: 16MB
    min_wal_size: 1GB
    max_wal_size: 4GB
    cron:
      - tag: archive_wal
        schedule: "0 0 * * *"
        command: psql -c 'SELECT pg_catalog.pg_wal_archive_cleanup();'
        log: true
      - tag: vacuum_db
        schedule: "0 1 * * *"
        command: vacuumdb --all --analyze-in-stages
        log: true

tags:
  nofailover: false
  clonefrom: true

Important Notes:

Replace <ETCD_NODE_01_IP> and <ETCD_NODE_02_IP> with actual IPs.
Ensure the PostgreSQL user replicator exists and has replication privileges.
The pg_hba.conf entries are crucial for replication and client access. Adjust IPs and authentication methods as per your security policy.
data_dir and bin_dir must match your PostgreSQL installation.
The clonefrom: true tag allows Patroni to bootstrap a new replica from an existing master if needed.

Start Patroni on all PostgreSQL nodes. The first node to start will initialize the cluster. Subsequent nodes will join as replicas. Patroni will manage the PostgreSQL service itself.

sudo systemctl start patroni
sudo systemctl enable patroni

Connecting Magento 2 to the HA PostgreSQL Cluster

Magento 2 needs to connect to the *virtual IP* or *DNS name* that always points to the current PostgreSQL master. Patroni exposes a REST API that can be queried to find the master. A simple script or a load balancer can be used to abstract this.

Example: Dynamic PostgreSQL Host Resolution for Magento

This Python script acts as a simple proxy/resolver. It queries Patroni’s REST API for the master’s address and forwards the connection. This script would run on each Magento application server or a dedicated proxy server.

import http.client
import json
import sys
import socket
import select
import threading
import time

PATRONI_HOST = "localhost" # Or IP of a Patroni REST API endpoint
PATRONI_PORT = 8008
LISTEN_PORT = 5432
BUFFER_SIZE = 4096

def get_master_address():
    try:
        conn = http.client.HTTPConnection(PATRONI_HOST, PATRONI_PORT, timeout=5)
        conn.request("GET", "/primary")
        response = conn.getresponse()
        if response.status == 200:
            data = json.loads(response.read().decode())
            return data['host'], data['port']
        else:
            print(f"Error from Patroni: {response.status} {response.reason}", file=sys.stderr)
            return None, None
    except Exception as e:
        print(f"Error connecting to Patroni: {e}", file=sys.stderr)
        return None, None

def handle_client(client_socket, master_host, master_port):
    try:
        master_socket = socket.create_connection((master_host, master_port), timeout=10)
        client_connected, master_connected = True, True

        while client_connected and master_connected:
            rlist, wlist, xlist = select.select([client_socket, master_socket], [], [client_socket, master_socket], 5)

            if master_socket in xlist or client_socket in xlist:
                print("Socket error detected.", file=sys.stderr)
                break

            if client_socket in rlist:
                data = client_socket.recv(BUFFER_SIZE)
                if data:
                    master_socket.sendall(data)
                else:
                    client_connected = False

            if master_socket in rlist:
                data = master_socket.recv(BUFFER_SIZE)
                if data:
                    client_socket.sendall(data)
                else:
                    master_connected = False

    except socket.timeout:
        print("Socket timeout.", file=sys.stderr)
    except Exception as e:
        print(f"Error in client handler: {e}", file=sys.stderr)
    finally:
        client_socket.close()
        master_socket.close()
        print("Connection closed.")

def start_proxy():
    server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_socket.bind(('0.0.0.0', LISTEN_PORT))
    server_socket.listen(5)
    print(f"Proxy listening on port {LISTEN_PORT}")

    while True:
        master_host, master_port = get_master_address()
        if not master_host:
            print("Could not get master address, retrying in 5 seconds...", file=sys.stderr)
            time.sleep(5)
            continue

        print(f"Current master: {master_host}:{master_port}")

        try:
            client_socket, addr = server_socket.accept()
            print(f"Accepted connection from {addr}")
            thread = threading.Thread(target=handle_client, args=(client_socket, master_host, master_port))
            thread.daemon = True
            thread.start()
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"Error accepting connection: {e}", file=sys.stderr)

    server_socket.close()

if __name__ == "__main__":
    start_proxy()

Configure Magento’s app/etc/env.php to point to the IP address where this proxy script is running, on port 5432.

<?php
return [
    'backend' => [
        'frontName' => 'admin123'
    ],
    'db' => [
        'connection' => [
            'default' => [
                'host' => '127.0.0.1', // IP of the server running the proxy script
                'dbname' => 'magento_db',
                'username' => 'magento_user',
                'password' => 'secure_password',
                'model' => 'mysql4',
                'initStatements' => 'SET NAMES utf8',
                'engine' => 'innodb',
                'initStatements' => 'SET sql_mode = "";',
                'options' => [
                    PDO::ATTR_PERSISTENT => true,
                    PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true
                ]
            ]
        ],
        'default_setup' => [
            'connection' => [
                'default' => [
                    'host' => '127.0.0.1',
                    'dbname' => 'magento_db',
                    'username' => 'magento_user',
                    'password' => 'secure_password',
                    'model' => 'mysql4',
                    'initStatements' => 'SET NAMES utf8',
                    'engine' => 'innodb',
                    'initStatements' => 'SET sql_mode = "";',
                    'options' => [
                        PDO::ATTR_PERSISTENT => true,
                        PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true
                    ]
                ]
            ]
        ]
    ],
    // ... other configuration
];

Magento 2 Application Deployment and Failover

For Magento 2, high availability is achieved through stateless application servers and a shared, highly available storage solution. The database HA is handled by the PostgreSQL setup above. For the application layer, we focus on redundancy and load balancing.

OVH Load Balancer Configuration

OVHcloud’s Load Balancer service is essential. We’ll configure it to distribute traffic across multiple Magento application servers and perform health checks.

Load Balancer Setup

Frontend Configuration: Listen on port 80 (HTTP) and 443 (HTTPS). Use SSL termination here.
Backend Pool: Add all your Magento application servers (e.g., app-server-01, app-server-02, app-server-03).
Health Checks: Configure health checks to target a specific endpoint on your Magento application (e.g., /healthcheck.php). This endpoint should perform basic checks like database connectivity and cache status.
Algorithm: Round Robin or Least Connections are suitable.
Persistence: Enable session persistence (sticky sessions) if your Magento setup relies heavily on file-based sessions. However, using a distributed cache like Redis for sessions is highly recommended for true statelessness.

Magento Application Servers

Each Magento application server should be configured identically. Key considerations:

Statelessness: Ensure sessions are handled by an external store (Redis is ideal). Cache (Varnish, Redis) should also be external.
Shared Media: The pub/media directory must be accessible by all application servers. This can be achieved using NFS, GlusterFS, or a distributed object storage solution. OVHcloud’s Object Storage can be integrated for this purpose, though it adds latency. A more direct approach for performance is often an NFS share on dedicated storage.
Configuration: app/etc/env.php should point to the HA PostgreSQL proxy/resolver.
Deployment: Use a CI/CD pipeline to deploy code changes to all application servers simultaneously.

Example Health Check Endpoint (`public_html/healthcheck.php`)

<?php
require __DIR__ . '/../app/bootstrap.php';

use Magento\Framework\App\Bootstrap;
use Magento\Framework\App\State;

try {
    $bootstrap = Bootstrap::create(BP, $_SERVER);
    $appState = $bootstrap->getObjectManager()->get(State::class);
    $appState->setAreaCode('frontend'); // Or 'adminhtml' if needed

    // Attempt to get a database connection
    $connection = $bootstrap->getObjectManager()->get('Magento\Framework\DB\Adapter\AdapterInterface');
    $connection->getConnection()->query('SELECT 1'); // Simple query to check connection

    // Add checks for cache, Redis, etc. as needed

    http_response_code(200);
    echo "OK";
} catch (\Exception $e) {
    http_response_code(503); // Service Unavailable
    echo "Service Unavailable: " . $e->getMessage();
    // Log the error for debugging
    error_log("Healthcheck failed: " . $e->getMessage());
}

Ensure this file is accessible via HTTP and that your load balancer’s health check is configured to hit it. The response code is critical: 200 for healthy, non-2xx for unhealthy.

Automated Failover Workflow Summary

1. PostgreSQL Failure: If the primary PostgreSQL node fails, Patroni detects the loss of quorum or the primary’s heartbeat in etcd. It initiates a leader election among the remaining replicas. The replica with the highest priority (or first to respond) becomes the new primary.

2. Master Address Update: Patroni’s REST API is updated to reflect the new primary’s address. The Python proxy script (or equivalent) running on application servers periodically polls this API. Upon detecting a change, it updates its internal mapping of the PostgreSQL master.

3. Magento Application Reconnection: New connections from Magento application servers will now be directed to the new PostgreSQL primary via the updated proxy. Existing connections might be dropped and need to be re-established by Magento’s connection pool.

4. Load Balancer Failover: If a Magento application server fails, the OVH Load Balancer’s health checks will detect it. The LB will stop sending traffic to the unhealthy server and redirect all incoming requests to the remaining healthy application servers.

5. Recovery: Failed PostgreSQL nodes can be rebuilt and rejoin the cluster. Patroni will handle re-synchronization. Failed application servers can be replaced or repaired, and once healthy, the Load Balancer will automatically include them in the traffic distribution.

Monitoring and Alerting

A robust monitoring system is non-negotiable. Key metrics and alerts include:

PostgreSQL: Patroni cluster status (via REST API or etcd), replication lag, connection counts, query performance. Alerts on cluster instability or high lag.
etcd: Cluster health, leader status, disk I/O, network latency between nodes. Alerts on etcd cluster issues.
Magento Application Servers: CPU, memory, disk usage, Nginx/Apache error rates, application response times, PHP-FPM status. Alerts on resource exhaustion or high error rates.
Load Balancer: Backend server health status, traffic volume, error rates. Alerts on backend pool degradation.
External Services: Redis, Varnish, NFS server health.

Tools like Prometheus with Grafana, Zabbix, or Datadog can be integrated for comprehensive monitoring and alerting. OVHcloud’s own monitoring tools should also be leveraged.