Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Magento 2 Deployments on OVH
Designing for Resilience: PostgreSQL and Magento 2 Auto-Failover on OVH
This document outlines a robust, automated failover architecture for PostgreSQL databases and Magento 2 applications deployed on OVHcloud infrastructure. The primary objective is to minimize downtime during hardware failures, network disruptions, or planned maintenance by implementing near-synchronous replication and automated detection/switching mechanisms.
PostgreSQL High Availability with Patroni and etcd
For PostgreSQL, we leverage Patroni, a template for PostgreSQL high availability. Patroni manages the PostgreSQL cluster lifecycle, including bootstrapping, replication, and failover. It relies on a distributed consensus store for leader election and cluster state management. We’ll use etcd for this purpose due to its robustness and widespread adoption.
Infrastructure Setup on OVH
A typical setup involves at least three PostgreSQL nodes for quorum and two etcd nodes. These should be deployed across different Availability Zones (AZs) within an OVH region for maximum resilience. For this example, assume the following:
- PostgreSQL Nodes:
pg-replica-01,pg-replica-02,pg-replica-03(each with PostgreSQL installed) - etcd Nodes:
etcd-01,etcd-02(for a minimal HA setup, though 3+ is recommended for production) - Network: Private network for inter-node communication.
etcd Cluster Configuration
Each etcd node requires a configuration file. The following is a representative example for etcd-01. Ensure the --initial-cluster and --listen-peer-urls reflect your actual node hostnames and IP addresses.
/etc/etcd/etcd.conf.yml on etcd-01
name: etcd-01 data-dir: /var/lib/etcd listen-client-urls: http://0.0.0.0:2379 advertise-client-urls: http://:2379 listen-peer-urls: http:// :2380 initial-advertise-peer-urls: http:// :2380 initial-cluster: etcd-01=http:// :2380,etcd-02=http:// :2380 initial-cluster-token: my-etcd-cluster initial-cluster-state: new proxy: false quota-backend-bytes: 8589934592 heartbeat-interval: 100 election-timeout: 1000
Repeat this configuration for etcd-02, adjusting the name and IP addresses accordingly. Start and enable the etcd service on all nodes:
sudo systemctl start etcd sudo systemctl enable etcd
Patroni Configuration for PostgreSQL
Patroni configuration is typically provided in YAML format. This configuration defines how Patroni interacts with PostgreSQL and etcd. The scope should be unique for each cluster. The etcd section points to your etcd endpoints.
/etc/patroni/patroni.yml on each PostgreSQL node
scope: magento_pg_cluster
namespace: /service/magento_pg
restapi:
listen: 0.0.0.0:8008
connect_address: %(host)s:8008
etcd:
hosts:
- :2379
- :2379
protocol: http
postgresql:
listen: 0.0.0.0:5432
connect_address: %(host)s:5432
data_dir: /var/lib/postgresql/14/main
bin_dir: /usr/lib/postgresql/14/bin
pg_hba:
- host replication replicator %(replica_ip)s/32 md5
- host all all 0.0.0.0/0 md5
replication:
master_command: |
pg_basebackup -h %(master_host)s -p %(master_port)s -U replicator -D %(target_dir)s --wal-method=stream --slot=replica_slot
replica_command: |
pg_basebackup -h %(master_host)s -p %(master_port)s -U replicator -D %(target_dir)s --wal-method=stream --slot=replica_slot
parameters:
max_connections: 200
shared_buffers: 1GB
effective_cache_size: 3GB
maintenance_work_mem: 256MB
checkpoint_completion_target: 0.9
wal_buffers: 16MB
default_statistics_target: 100
random_page_cost: 1.1
effective_io_concurrency: 200
work_mem: 16MB
min_wal_size: 1GB
max_wal_size: 4GB
cron:
- tag: archive_wal
schedule: "0 0 * * *"
command: psql -c 'SELECT pg_catalog.pg_wal_archive_cleanup();'
log: true
- tag: vacuum_db
schedule: "0 1 * * *"
command: vacuumdb --all --analyze-in-stages
log: true
tags:
nofailover: false
clonefrom: true
Important Notes:
- Replace
<ETCD_NODE_01_IP>and<ETCD_NODE_02_IP>with actual IPs. - Ensure the PostgreSQL user
replicatorexists and has replication privileges. - The
pg_hba.confentries are crucial for replication and client access. Adjust IPs and authentication methods as per your security policy. data_dirandbin_dirmust match your PostgreSQL installation.- The
clonefrom: truetag allows Patroni to bootstrap a new replica from an existing master if needed.
Start Patroni on all PostgreSQL nodes. The first node to start will initialize the cluster. Subsequent nodes will join as replicas. Patroni will manage the PostgreSQL service itself.
sudo systemctl start patroni sudo systemctl enable patroni
Connecting Magento 2 to the HA PostgreSQL Cluster
Magento 2 needs to connect to the *virtual IP* or *DNS name* that always points to the current PostgreSQL master. Patroni exposes a REST API that can be queried to find the master. A simple script or a load balancer can be used to abstract this.
Example: Dynamic PostgreSQL Host Resolution for Magento
This Python script acts as a simple proxy/resolver. It queries Patroni’s REST API for the master’s address and forwards the connection. This script would run on each Magento application server or a dedicated proxy server.
import http.client
import json
import sys
import socket
import select
import threading
import time
PATRONI_HOST = "localhost" # Or IP of a Patroni REST API endpoint
PATRONI_PORT = 8008
LISTEN_PORT = 5432
BUFFER_SIZE = 4096
def get_master_address():
try:
conn = http.client.HTTPConnection(PATRONI_HOST, PATRONI_PORT, timeout=5)
conn.request("GET", "/primary")
response = conn.getresponse()
if response.status == 200:
data = json.loads(response.read().decode())
return data['host'], data['port']
else:
print(f"Error from Patroni: {response.status} {response.reason}", file=sys.stderr)
return None, None
except Exception as e:
print(f"Error connecting to Patroni: {e}", file=sys.stderr)
return None, None
def handle_client(client_socket, master_host, master_port):
try:
master_socket = socket.create_connection((master_host, master_port), timeout=10)
client_connected, master_connected = True, True
while client_connected and master_connected:
rlist, wlist, xlist = select.select([client_socket, master_socket], [], [client_socket, master_socket], 5)
if master_socket in xlist or client_socket in xlist:
print("Socket error detected.", file=sys.stderr)
break
if client_socket in rlist:
data = client_socket.recv(BUFFER_SIZE)
if data:
master_socket.sendall(data)
else:
client_connected = False
if master_socket in rlist:
data = master_socket.recv(BUFFER_SIZE)
if data:
client_socket.sendall(data)
else:
master_connected = False
except socket.timeout:
print("Socket timeout.", file=sys.stderr)
except Exception as e:
print(f"Error in client handler: {e}", file=sys.stderr)
finally:
client_socket.close()
master_socket.close()
print("Connection closed.")
def start_proxy():
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_socket.bind(('0.0.0.0', LISTEN_PORT))
server_socket.listen(5)
print(f"Proxy listening on port {LISTEN_PORT}")
while True:
master_host, master_port = get_master_address()
if not master_host:
print("Could not get master address, retrying in 5 seconds...", file=sys.stderr)
time.sleep(5)
continue
print(f"Current master: {master_host}:{master_port}")
try:
client_socket, addr = server_socket.accept()
print(f"Accepted connection from {addr}")
thread = threading.Thread(target=handle_client, args=(client_socket, master_host, master_port))
thread.daemon = True
thread.start()
except KeyboardInterrupt:
break
except Exception as e:
print(f"Error accepting connection: {e}", file=sys.stderr)
server_socket.close()
if __name__ == "__main__":
start_proxy()
Configure Magento’s app/etc/env.php to point to the IP address where this proxy script is running, on port 5432.
<?php
return [
'backend' => [
'frontName' => 'admin123'
],
'db' => [
'connection' => [
'default' => [
'host' => '127.0.0.1', // IP of the server running the proxy script
'dbname' => 'magento_db',
'username' => 'magento_user',
'password' => 'secure_password',
'model' => 'mysql4',
'initStatements' => 'SET NAMES utf8',
'engine' => 'innodb',
'initStatements' => 'SET sql_mode = "";',
'options' => [
PDO::ATTR_PERSISTENT => true,
PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true
]
]
],
'default_setup' => [
'connection' => [
'default' => [
'host' => '127.0.0.1',
'dbname' => 'magento_db',
'username' => 'magento_user',
'password' => 'secure_password',
'model' => 'mysql4',
'initStatements' => 'SET NAMES utf8',
'engine' => 'innodb',
'initStatements' => 'SET sql_mode = "";',
'options' => [
PDO::ATTR_PERSISTENT => true,
PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true
]
]
]
]
],
// ... other configuration
];
Magento 2 Application Deployment and Failover
For Magento 2, high availability is achieved through stateless application servers and a shared, highly available storage solution. The database HA is handled by the PostgreSQL setup above. For the application layer, we focus on redundancy and load balancing.
OVH Load Balancer Configuration
OVHcloud’s Load Balancer service is essential. We’ll configure it to distribute traffic across multiple Magento application servers and perform health checks.
Load Balancer Setup
- Frontend Configuration: Listen on port 80 (HTTP) and 443 (HTTPS). Use SSL termination here.
- Backend Pool: Add all your Magento application servers (e.g.,
app-server-01,app-server-02,app-server-03). - Health Checks: Configure health checks to target a specific endpoint on your Magento application (e.g.,
/healthcheck.php). This endpoint should perform basic checks like database connectivity and cache status. - Algorithm: Round Robin or Least Connections are suitable.
- Persistence: Enable session persistence (sticky sessions) if your Magento setup relies heavily on file-based sessions. However, using a distributed cache like Redis for sessions is highly recommended for true statelessness.
Magento Application Servers
Each Magento application server should be configured identically. Key considerations:
- Statelessness: Ensure sessions are handled by an external store (Redis is ideal). Cache (Varnish, Redis) should also be external.
- Shared Media: The
pub/mediadirectory must be accessible by all application servers. This can be achieved using NFS, GlusterFS, or a distributed object storage solution. OVHcloud’s Object Storage can be integrated for this purpose, though it adds latency. A more direct approach for performance is often an NFS share on dedicated storage. - Configuration:
app/etc/env.phpshould point to the HA PostgreSQL proxy/resolver. - Deployment: Use a CI/CD pipeline to deploy code changes to all application servers simultaneously.
Example Health Check Endpoint (public_html/healthcheck.php)
<?php
require __DIR__ . '/../app/bootstrap.php';
use Magento\Framework\App\Bootstrap;
use Magento\Framework\App\State;
try {
$bootstrap = Bootstrap::create(BP, $_SERVER);
$appState = $bootstrap->getObjectManager()->get(State::class);
$appState->setAreaCode('frontend'); // Or 'adminhtml' if needed
// Attempt to get a database connection
$connection = $bootstrap->getObjectManager()->get('Magento\Framework\DB\Adapter\AdapterInterface');
$connection->getConnection()->query('SELECT 1'); // Simple query to check connection
// Add checks for cache, Redis, etc. as needed
http_response_code(200);
echo "OK";
} catch (\Exception $e) {
http_response_code(503); // Service Unavailable
echo "Service Unavailable: " . $e->getMessage();
// Log the error for debugging
error_log("Healthcheck failed: " . $e->getMessage());
}
Ensure this file is accessible via HTTP and that your load balancer’s health check is configured to hit it. The response code is critical: 200 for healthy, non-2xx for unhealthy.
Automated Failover Workflow Summary
1. PostgreSQL Failure: If the primary PostgreSQL node fails, Patroni detects the loss of quorum or the primary’s heartbeat in etcd. It initiates a leader election among the remaining replicas. The replica with the highest priority (or first to respond) becomes the new primary.
2. Master Address Update: Patroni’s REST API is updated to reflect the new primary’s address. The Python proxy script (or equivalent) running on application servers periodically polls this API. Upon detecting a change, it updates its internal mapping of the PostgreSQL master.
3. Magento Application Reconnection: New connections from Magento application servers will now be directed to the new PostgreSQL primary via the updated proxy. Existing connections might be dropped and need to be re-established by Magento’s connection pool.
4. Load Balancer Failover: If a Magento application server fails, the OVH Load Balancer’s health checks will detect it. The LB will stop sending traffic to the unhealthy server and redirect all incoming requests to the remaining healthy application servers.
5. Recovery: Failed PostgreSQL nodes can be rebuilt and rejoin the cluster. Patroni will handle re-synchronization. Failed application servers can be replaced or repaired, and once healthy, the Load Balancer will automatically include them in the traffic distribution.
Monitoring and Alerting
A robust monitoring system is non-negotiable. Key metrics and alerts include:
- PostgreSQL: Patroni cluster status (via REST API or etcd), replication lag, connection counts, query performance. Alerts on cluster instability or high lag.
- etcd: Cluster health, leader status, disk I/O, network latency between nodes. Alerts on etcd cluster issues.
- Magento Application Servers: CPU, memory, disk usage, Nginx/Apache error rates, application response times, PHP-FPM status. Alerts on resource exhaustion or high error rates.
- Load Balancer: Backend server health status, traffic volume, error rates. Alerts on backend pool degradation.
- External Services: Redis, Varnish, NFS server health.
Tools like Prometheus with Grafana, Zabbix, or Datadog can be integrated for comprehensive monitoring and alerting. OVHcloud’s own monitoring tools should also be leveraged.