Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Shopify Deployments on OVH
Leveraging OVH’s Infrastructure for Automated PostgreSQL Failover
For mission-critical applications, particularly those powered by PostgreSQL and fronted by platforms like Shopify, achieving robust disaster recovery through automated failover is paramount. This document outlines a practical, production-ready architecture leveraging OVH’s cloud offerings, focusing on high availability for PostgreSQL and seamless integration with Shopify deployments.
Our strategy centers on a multi-region or multi-zone PostgreSQL setup with automated detection and failover mechanisms. We’ll utilize OVH’s Public Cloud infrastructure, specifically their robust network capabilities and instance management, to build a resilient database layer. The core components include a primary PostgreSQL instance, a streaming replica, and a dedicated failover manager.
PostgreSQL High Availability with Streaming Replication
The foundation of our high availability setup is PostgreSQL’s built-in streaming replication. This ensures that data written to the primary server is asynchronously or synchronously replicated to one or more standby servers. For automated failover, we need at least one standby that can be promoted to primary.
Consider a scenario with two OVH Public Cloud instances in different Availability Zones within the same region for low latency, or across different regions for true disaster recovery. We’ll configure one as the primary and the other as a streaming replica.
Primary PostgreSQL Server Configuration
On the primary server (e.g., `pg-primary.example.com`), ensure `postgresql.conf` is tuned for replication. Key parameters include:
wal_level = replica max_wal_senders = 5 wal_keep_size = 1024 # Or a sufficiently large value to prevent WAL file deletion before replica can fetch them archive_mode = on archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Example: local archive, consider S3/Object Storage for DR
Also, configure `pg_hba.conf` to allow replication connections from the standby server. Assuming the standby has an IP of `192.168.1.101`:
host replication replicator 192.168.1.101/32 md5
Create a replication user:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'your_replication_password';
Streaming Replica Configuration
On the replica server (e.g., `pg-replica.example.com`), after initial data synchronization (using `pg_basebackup`), configure `postgresql.conf` for replication:
hot_standby = on primary_conninfo = 'host=pg-primary.example.com port=5432 user=replicator password=your_replication_password' primary_slot_name = 'replication_slot_for_replica' # Recommended for robust replication
Create a replication slot on the primary server to prevent WAL files from being removed prematurely if the replica is temporarily down:
-- On primary server
SELECT pg_create_physical_replication_slot('replication_slot_for_replica');
Ensure the replica can connect to the primary. If using private IPs within OVH’s network, ensure firewall rules (Security Groups/Network ACLs) permit traffic on port 5432.
Automated Failover Orchestration with Patroni
While PostgreSQL streaming replication provides data redundancy, it doesn’t automate the failover process. For this, we introduce Patroni, a template for PostgreSQL high-availability solutions. Patroni manages the PostgreSQL cluster state, monitors instances, and orchestrates failover.
Patroni requires a distributed configuration store. etcd is a popular choice, but for simplicity and integration within OVH’s ecosystem, we can also consider using a managed Kubernetes cluster (OVH Public Cloud Kubernetes Service) where etcd can be deployed as a highly available cluster, or even leverage a managed Redis service if available and suitable.
Setting up Patroni with etcd
Deploy etcd as a clustered service. For a production setup, a minimum of 3 etcd nodes are recommended for quorum. If using OVH Public Cloud Kubernetes, you can deploy etcd as StatefulSets.
Install Patroni on each PostgreSQL server. A typical Patroni configuration file (`patroni.yml`) would look like this:
# patroni.yml on pg-primary.example.com
scope: my_pg_cluster
name: pg-primary.example.com
etcd:
host: etcd-node1.example.com:2379,etcd-node2.example.com:2379,etcd-node3.example.com:2379
protocol: http # Use https for production with TLS
# PostgreSQL configuration
postgresql:
listen: 0.0.0.0:5432
data_dir: /var/lib/postgresql/data
config_dir: /etc/postgresql/{{.PGVersion}}/main
bin_dir: /usr/lib/postgresql/{{.PGVersion}}/bin
pg_hba:
- host all all 0.0.0.0/0 md5
parameters:
wal_level: replica
max_wal_senders: 5
hot_standby: on
archive_mode: on
archive_command: 'cp %p /var/lib/postgresql/wal_archive/%f'
# Replication configuration
replication:
username: replicator
password: your_replication_password
ssl: false # Set to true if using SSL
# Tags for instance identification
tags:
nofailover: false
clonefrom: false
On the replica server, the `patroni.yml` would be similar, with the `name` parameter set to the replica’s hostname and potentially different `listen` addresses if running multiple PostgreSQL instances on the same host. Crucially, the `postgresql.parameters` section will include `primary_conninfo` and `primary_slot_name` which Patroni will manage.
Patroni will automatically configure the replica based on the primary’s settings and the `primary_conninfo`. When the primary fails, Patroni nodes will detect this via etcd, elect a leader, and promote one of the healthy replicas.
Starting Patroni and PostgreSQL
Ensure PostgreSQL is installed and configured to *not* auto-start. Patroni will manage its lifecycle.
# On each PostgreSQL node sudo systemctl stop postgresql sudo systemctl disable postgresql # Start Patroni sudo systemctl start patroni sudo systemctl enable patroni
Patroni will then start PostgreSQL, configure it for replication, and register itself in etcd. The first node to start will initialize the cluster as primary. Subsequent nodes will join as replicas.
Integrating with Shopify: Connection Pooling and Load Balancing
Shopify applications typically connect to PostgreSQL via a connection string. For high availability, this connection string needs to be dynamic, pointing to the *current* primary. Directing traffic to a single IP address is a single point of failure. We need a layer of abstraction.
Using HAProxy for Dynamic Endpoint Discovery
HAProxy is an excellent choice for load balancing and high availability. We can configure HAProxy to monitor the PostgreSQL cluster’s health and dynamically update its backend list to point to the active primary.
Patroni exposes an API that HAProxy can query to determine the current primary. We’ll deploy HAProxy on separate instances, potentially in an active-passive or active-active configuration for HAProxy itself.
HAProxy Configuration for PostgreSQL
A simplified HAProxy configuration (`haproxy.cfg`) for PostgreSQL failover:
global
log /dev/log local0
log /dev/log local1 notice
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode tcp
option tcplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
retries 3
listen pg_cluster
bind *:5432
mode tcp
option tcp-check
# Patroni API endpoint for health checks
tcp-check connect port 8008 proto HTTP
tcp-check send GET /primary HTTP/1.1\r\nHost: localhost\r\n\r\n
tcp-check expect status 200
# Backend servers - these will be dynamically managed or updated
# For simplicity, we list potential candidates. Patroni's API is the true source of truth.
server pg-primary pg-primary.example.com:5432 check port 5432 inter 2s fall 3 rise 2
server pg-replica pg-replica.example.com:5432 port 5432 inter 2s fall 3 rise 2
Important Note: The above HAProxy configuration is a starting point. For true dynamic backend management based on Patroni’s API, you would typically use a script that periodically queries Patroni’s `/primary` endpoint and reloads HAProxy. Alternatively, tools like `patroni-vip` or custom Kubernetes service discovery can manage this more elegantly.
The Shopify application’s database connection string should then point to the HAProxy instance’s IP address and port (e.g., `postgresql://user:[email protected]:5432/mydatabase`).
OVH Specific Considerations and Best Practices
Network Configuration
Utilize OVH’s private network capabilities for PostgreSQL replication traffic. This ensures low latency and avoids consuming public bandwidth. Configure security groups (Firewall as a Service) to allow necessary ports (5432 for PostgreSQL, 2379/2380 for etcd, 8008 for Patroni API) between your instances.
Instance Sizing and Storage
Choose OVH instances with sufficient CPU, RAM, and IOPS for your PostgreSQL workload. For database storage, consider OVH’s Block Storage (SSD) for performance. Ensure the storage is mounted and configured correctly on all PostgreSQL nodes. For WAL archiving, consider OVH’s Object Storage for cost-effective, durable off-site backups.
Monitoring and Alerting
Implement comprehensive monitoring using tools like Prometheus and Grafana. Key metrics to track include:
- PostgreSQL replication lag (via `pg_stat_replication` and Patroni’s API).
- etcd cluster health.
- Patroni cluster status.
- HAProxy backend health.
- Instance resource utilization (CPU, memory, disk I/O, network).
Configure alerts for critical events such as replication lag exceeding a threshold, primary unavailability, or HAProxy backend failures. OVH’s native monitoring tools can also be integrated.
Testing Failover
Regularly test the failover process. This can be done by gracefully shutting down the primary PostgreSQL instance (allowing Patroni to initiate a controlled failover) or by simulating a failure (e.g., stopping the Patroni service on the primary, or even terminating the instance). Verify that HAProxy correctly redirects traffic to the new primary and that the Shopify application remains accessible with minimal interruption.
Automated failover is not a set-and-forget solution. It requires diligent setup, continuous monitoring, and periodic testing to ensure its effectiveness in a real disaster scenario.