Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Shopify Deployments on OVH

Leveraging OVH’s Infrastructure for Automated PostgreSQL Failover

For mission-critical applications, particularly those powered by PostgreSQL and fronted by platforms like Shopify, achieving robust disaster recovery through automated failover is paramount. This document outlines a practical, production-ready architecture leveraging OVH’s cloud offerings, focusing on high availability for PostgreSQL and seamless integration with Shopify deployments.

Our strategy centers on a multi-region or multi-zone PostgreSQL setup with automated detection and failover mechanisms. We’ll utilize OVH’s Public Cloud infrastructure, specifically their robust network capabilities and instance management, to build a resilient database layer. The core components include a primary PostgreSQL instance, a streaming replica, and a dedicated failover manager.

PostgreSQL High Availability with Streaming Replication

The foundation of our high availability setup is PostgreSQL’s built-in streaming replication. This ensures that data written to the primary server is asynchronously or synchronously replicated to one or more standby servers. For automated failover, we need at least one standby that can be promoted to primary.

Consider a scenario with two OVH Public Cloud instances in different Availability Zones within the same region for low latency, or across different regions for true disaster recovery. We’ll configure one as the primary and the other as a streaming replica.

Primary PostgreSQL Server Configuration

On the primary server (e.g., `pg-primary.example.com`), ensure `postgresql.conf` is tuned for replication. Key parameters include:

wal_level = replica
max_wal_senders = 5
wal_keep_size = 1024 # Or a sufficiently large value to prevent WAL file deletion before replica can fetch them
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Example: local archive, consider S3/Object Storage for DR

Also, configure `pg_hba.conf` to allow replication connections from the standby server. Assuming the standby has an IP of `192.168.1.101`:

host    replication     replicator      192.168.1.101/32        md5

Create a replication user:

CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'your_replication_password';

Streaming Replica Configuration

On the replica server (e.g., `pg-replica.example.com`), after initial data synchronization (using `pg_basebackup`), configure `postgresql.conf` for replication:

hot_standby = on
primary_conninfo = 'host=pg-primary.example.com port=5432 user=replicator password=your_replication_password'
primary_slot_name = 'replication_slot_for_replica' # Recommended for robust replication

Create a replication slot on the primary server to prevent WAL files from being removed prematurely if the replica is temporarily down:

-- On primary server
SELECT pg_create_physical_replication_slot('replication_slot_for_replica');

Ensure the replica can connect to the primary. If using private IPs within OVH’s network, ensure firewall rules (Security Groups/Network ACLs) permit traffic on port 5432.

Automated Failover Orchestration with Patroni

While PostgreSQL streaming replication provides data redundancy, it doesn’t automate the failover process. For this, we introduce Patroni, a template for PostgreSQL high-availability solutions. Patroni manages the PostgreSQL cluster state, monitors instances, and orchestrates failover.

Patroni requires a distributed configuration store. etcd is a popular choice, but for simplicity and integration within OVH’s ecosystem, we can also consider using a managed Kubernetes cluster (OVH Public Cloud Kubernetes Service) where etcd can be deployed as a highly available cluster, or even leverage a managed Redis service if available and suitable.

Setting up Patroni with etcd

Deploy etcd as a clustered service. For a production setup, a minimum of 3 etcd nodes are recommended for quorum. If using OVH Public Cloud Kubernetes, you can deploy etcd as StatefulSets.

Install Patroni on each PostgreSQL server. A typical Patroni configuration file (`patroni.yml`) would look like this:

# patroni.yml on pg-primary.example.com
scope: my_pg_cluster
name: pg-primary.example.com

etcd:
  host: etcd-node1.example.com:2379,etcd-node2.example.com:2379,etcd-node3.example.com:2379
  protocol: http # Use https for production with TLS

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/data
  config_dir: /etc/postgresql/{{.PGVersion}}/main
  bin_dir: /usr/lib/postgresql/{{.PGVersion}}/bin
  pg_hba:
    - host    all       all       0.0.0.0/0       md5
  parameters:
    wal_level: replica
    max_wal_senders: 5
    hot_standby: on
    archive_mode: on
    archive_command: 'cp %p /var/lib/postgresql/wal_archive/%f'

# Replication configuration
replication:
  username: replicator
  password: your_replication_password
  ssl: false # Set to true if using SSL

# Tags for instance identification
tags:
  nofailover: false
  clonefrom: false

On the replica server, the `patroni.yml` would be similar, with the `name` parameter set to the replica’s hostname and potentially different `listen` addresses if running multiple PostgreSQL instances on the same host. Crucially, the `postgresql.parameters` section will include `primary_conninfo` and `primary_slot_name` which Patroni will manage.

Patroni will automatically configure the replica based on the primary’s settings and the `primary_conninfo`. When the primary fails, Patroni nodes will detect this via etcd, elect a leader, and promote one of the healthy replicas.

Starting Patroni and PostgreSQL

Ensure PostgreSQL is installed and configured to *not* auto-start. Patroni will manage its lifecycle.

# On each PostgreSQL node
sudo systemctl stop postgresql
sudo systemctl disable postgresql

# Start Patroni
sudo systemctl start patroni
sudo systemctl enable patroni

Patroni will then start PostgreSQL, configure it for replication, and register itself in etcd. The first node to start will initialize the cluster as primary. Subsequent nodes will join as replicas.

Integrating with Shopify: Connection Pooling and Load Balancing

Shopify applications typically connect to PostgreSQL via a connection string. For high availability, this connection string needs to be dynamic, pointing to the *current* primary. Directing traffic to a single IP address is a single point of failure. We need a layer of abstraction.

Using HAProxy for Dynamic Endpoint Discovery

HAProxy is an excellent choice for load balancing and high availability. We can configure HAProxy to monitor the PostgreSQL cluster’s health and dynamically update its backend list to point to the active primary.

Patroni exposes an API that HAProxy can query to determine the current primary. We’ll deploy HAProxy on separate instances, potentially in an active-passive or active-active configuration for HAProxy itself.

HAProxy Configuration for PostgreSQL

A simplified HAProxy configuration (`haproxy.cfg`) for PostgreSQL failover:

global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode tcp
    option tcplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    retries 3

listen pg_cluster
    bind *:5432
    mode tcp
    option tcp-check
    # Patroni API endpoint for health checks
    tcp-check connect port 8008 proto HTTP
    tcp-check send GET /primary HTTP/1.1\r\nHost: localhost\r\n\r\n
    tcp-check expect status 200
    # Backend servers - these will be dynamically managed or updated
    # For simplicity, we list potential candidates. Patroni's API is the true source of truth.
    server pg-primary pg-primary.example.com:5432 check port 5432 inter 2s fall 3 rise 2
    server pg-replica pg-replica.example.com:5432 port 5432 inter 2s fall 3 rise 2

Important Note: The above HAProxy configuration is a starting point. For true dynamic backend management based on Patroni’s API, you would typically use a script that periodically queries Patroni’s `/primary` endpoint and reloads HAProxy. Alternatively, tools like `patroni-vip` or custom Kubernetes service discovery can manage this more elegantly.

The Shopify application’s database connection string should then point to the HAProxy instance’s IP address and port (e.g., `postgresql://user:[email protected]:5432/mydatabase`).

OVH Specific Considerations and Best Practices

Network Configuration

Utilize OVH’s private network capabilities for PostgreSQL replication traffic. This ensures low latency and avoids consuming public bandwidth. Configure security groups (Firewall as a Service) to allow necessary ports (5432 for PostgreSQL, 2379/2380 for etcd, 8008 for Patroni API) between your instances.

Instance Sizing and Storage

Choose OVH instances with sufficient CPU, RAM, and IOPS for your PostgreSQL workload. For database storage, consider OVH’s Block Storage (SSD) for performance. Ensure the storage is mounted and configured correctly on all PostgreSQL nodes. For WAL archiving, consider OVH’s Object Storage for cost-effective, durable off-site backups.

Monitoring and Alerting

Implement comprehensive monitoring using tools like Prometheus and Grafana. Key metrics to track include:

PostgreSQL replication lag (via `pg_stat_replication` and Patroni’s API).
etcd cluster health.
Patroni cluster status.
HAProxy backend health.
Instance resource utilization (CPU, memory, disk I/O, network).

Configure alerts for critical events such as replication lag exceeding a threshold, primary unavailability, or HAProxy backend failures. OVH’s native monitoring tools can also be integrated.

Testing Failover

Regularly test the failover process. This can be done by gracefully shutting down the primary PostgreSQL instance (allowing Patroni to initiate a controlled failover) or by simulating a failure (e.g., stopping the Patroni service on the primary, or even terminating the instance). Verify that HAProxy correctly redirects traffic to the new primary and that the Shopify application remains accessible with minimal interruption.

Automated failover is not a set-and-forget solution. It requires diligent setup, continuous monitoring, and periodic testing to ensure its effectiveness in a real disaster scenario.