Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Shopify Deployments on Linode

Establishing a High-Availability PostgreSQL Cluster on Linode

Achieving true disaster recovery for a critical application like Shopify necessitates a robust, automated failover strategy for its data layer. For PostgreSQL, this means moving beyond simple replication to a setup that can detect primary node failure and seamlessly promote a replica without manual intervention. We’ll architect this using Patroni, a template for PostgreSQL HA, and leverage Linode’s infrastructure for health checks and load balancing.

Patroni: The Core of PostgreSQL HA

Patroni simplifies PostgreSQL HA by managing replication, failover, and configuration. It relies on a distributed consensus store (like etcd, Consul, or ZooKeeper) to maintain cluster state. For this setup, we’ll use etcd due to its simplicity and widespread adoption.

Setting up etcd Cluster

A minimum of three etcd nodes is recommended for quorum and fault tolerance. Deploy these on separate Linode instances, ideally in different availability zones if your region supports it.

etcd Configuration (etcd.conf.yml)

On each etcd node, create an etcd.conf.yml file. This example assumes a simple setup; production environments should incorporate TLS for secure communication.

name: etcd-node-1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://:2380
initial-cluster: etcd-node-1=http://:2380,etcd-node-2=http://:2380,etcd-node-3=http://:2380
initial-cluster-state: new
cluster-state: new

Replace <ETCD_NODE_X_IP> with the actual private IP addresses of your etcd nodes. Start the etcd service:

sudo systemctl enable etcd
sudo systemctl start etcd

Deploying Patroni with PostgreSQL

We’ll deploy Patroni and PostgreSQL on separate Linode instances. A common pattern is to have at least two PostgreSQL nodes (one primary, one replica) and potentially more for read scaling. Patroni will manage the promotion of replicas.

Patroni Configuration (patroni.yml)

Create a patroni.yml file on each PostgreSQL node. This configuration points to the etcd cluster and defines PostgreSQL settings.

scope: shopify-pg-cluster
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
  connect_address: <POSTGRES_NODE_IP>:8008
etcd:
  host: <ETCD_NODE_1_IP>:2379, <ETCD_NODE_2_IP>:2379, <ETCD_NODE_3_IP>:2379
postgresql:
  listen: 0.0.0.0:5432
  connect_address: <POSTGRES_NODE_IP>:5432
  data_dir: /var/lib/postgresql/14/main
  bin_dir: /usr/lib/postgresql/14/bin
  pg_hba:
    - host    all             all             0.0.0.0/0               md5
  parameters:
    max_connections: 100
    shared_buffers: 1GB
    effective_cache_size: 3GB
    maintenance_work_mem: 512MB
    logging_collector: on
    log_directory: /var/log/postgresql
    log_filename: postgresql-%Y-%m-%d_%H-%M-%S.log
    log_statement: 'ddl'
    log_replication_commands: on
    wal_level: replica
    hot_standby: on
    max_wal_senders: 10
    max_replication_slots: 10
    synchronous_commit: on
    synchronous_standby_names: "1" # Set to '1' for synchronous replication, or a replica name for specific sync
replication:
  synchronous: true # Set to true for synchronous replication
  synchronous_mode: 'remote_write' # or 'remote_apply'
  retry_timeout: 10
  max_retries: 5

Replace <POSTGRES_NODE_IP> with the IP of the current PostgreSQL node and <ETCD_NODE_X_IP> with your etcd node IPs. Ensure the PostgreSQL user has permissions to create databases and roles. Install Patroni and PostgreSQL on each node. Then, start Patroni:

sudo systemctl enable patroni
sudo systemctl start patroni

Health Checks and Load Balancing with Linode NodeBalancers

Linode NodeBalancers are crucial for directing traffic to the active PostgreSQL primary. We’ll configure a NodeBalancer to listen on port 5432 and forward traffic to the Patroni REST API endpoints (port 8008) on each PostgreSQL node for health checks.

NodeBalancer Configuration

Create a NodeBalancer in the Linode Cloud Manager. Add a frontend listener for port 5432 (TCP). For the backend nodes, add your PostgreSQL instances. Configure the health check to target the Patroni API’s /primary endpoint. A simple check would be:

Protocol: HTTP
Port: 8008
Path: /primary
Check Interval: 5s
Response Timeout: 2s
Healthy Threshold: 2
Unhealthy Threshold: 3

This setup ensures that the NodeBalancer only sends traffic to the node that Patroni has designated as the primary. When a failover occurs, Patroni updates etcd, and the NodeBalancer’s health checks will automatically redirect traffic to the newly promoted primary.

Automating Shopify Application Failover

While PostgreSQL handles its own data layer failover, the Shopify application instances need to be aware of and adapt to this change. This involves updating connection strings and potentially restarting services.

Application Configuration Management

Store your database connection details in a configuration file or environment variables that can be dynamically updated. For example, a config/database.php in a Laravel application:

<?php

return [
    'default' => env('DB_CONNECTION', 'mysql'),
    'connections' => [
        'pgsql' => [
            'driver' => 'pgsql',
            'host' => env('DB_HOST', 'localhost'), // This will be the NodeBalancer IP
            'port' => env('DB_PORT', '5432'),
            'database' => env('DB_DATABASE', 'forge'),
            'username' => env('DB_USERNAME', 'forge'),
            'password' => env('DB_PASSWORD', ''),
            'charset' => 'utf8',
            'prefix' => '',
            'schema' => 'public',
            'sslmode' => 'prefer',
        ],
    ],
];
?>

The DB_HOST environment variable should point to the IP address of your Linode NodeBalancer for PostgreSQL.

Implementing a Failover Trigger Mechanism

When a PostgreSQL failover occurs, Patroni can be configured to execute custom scripts. This is done via the callbacks section in patroni.yml.

Callback Script Example (`/opt/patroni/scripts/on_role_change.sh`)

This script will be executed on the node that changes its role (e.g., from replica to primary). It needs to trigger an update of application configuration or restart application services.

#!/bin/bash

ROLE=$1
CLUSTER_NAME=$2
ETCD_HOSTS="$3" # Comma-separated list of etcd hosts

# Function to get the current primary IP from Patroni API
get_primary_ip() {
    # Iterate through Patroni REST API endpoints to find the primary
    # This is a simplified example; a more robust solution might query etcd directly
    # or have a dedicated service discovery mechanism.
    # For simplicity, we assume Patroni REST API is accessible on each node.
    # A better approach would be to query etcd for the leader key.
    local leader_key="/service/shopify-pg-cluster/leader"
    local leader_info=$(etcdctl --endpoints="$ETCD_HOSTS" get "$leader_key" --print-value-only 2>&1)

    if [[ $? -eq 0 && -n "$leader_info" ]]; then
        echo "$leader_info" | jq -r '.host' | sed 's/:.*//'
    else
        echo ""
    fi
}

# Function to update application configuration (e.g., .env file)
update_app_config() {
    local primary_ip=$1
    if [ -z "$primary_ip" ]; then
        echo "Error: Could not determine primary PostgreSQL IP."
        exit 1
    fi

    echo "Detected PostgreSQL primary change. New primary IP: $primary_ip"

    # Example: Update .env file for a Laravel application
    # This requires the application to be configured to read DB_HOST from .env
    if [ -f "/var/www/shopify-app/.env" ]; then
        sed -i "s/^DB_HOST=.*/DB_HOST=$primary_ip/" "/var/www/shopify-app/.env"
        echo "Updated DB_HOST in /var/www/shopify-app/.env"

        # Optionally, clear cache or restart services
        # cd /var/www/shopify-app && php artisan config:cache && sudo systemctl restart php-fpm
    else
        echo "Warning: .env file not found at /var/www/shopify-app/.env. Manual intervention may be required."
    fi
}

# Main logic
if [ "$ROLE" == "master" ]; then
    # This node has become the primary
    PRIMARY_IP=$(get_primary_ip)
    update_app_config "$PRIMARY_IP"
    # Potentially restart application services here if not handled by config:cache
    # sudo systemctl restart your-app-service
elif [ "$ROLE" == "replica" ]; then
    # This node has become a replica
    echo "Node $HOSTNAME is now a replica."
    # No action needed for application config, as it points to NodeBalancer
fi

exit 0

Make the script executable and add it to your Patroni configuration:

# ... within patroni.yml ...
callbacks:
  on_role_change: /opt/patroni/scripts/on_role_change.sh

The get_primary_ip function in the script is a simplified example. A more robust approach would involve querying etcd directly for the leader key or using a dedicated service discovery tool. The script assumes etcdctl is installed and configured to reach the etcd cluster. The update_app_config function demonstrates updating a Laravel .env file. Adjust this part based on your application’s configuration management strategy.

Orchestrating Application Restarts

After the configuration is updated, application services might need to be restarted to pick up the new database host. This can be done directly within the callback script or orchestrated by a configuration management tool like Ansible or Chef. For immediate effect, a `systemctl restart` command for your application’s service (e.g., `php-fpm`, `gunicorn`, `uwsgi`) can be added to the callback script.

Testing and Validation

Thorough testing is paramount. Simulate failures by:

Stopping the PostgreSQL process on the primary node.
Stopping the Patroni process on the primary node.
Simulating network partitions between PostgreSQL nodes and etcd.
Restarting the NodeBalancer health check interval to observe failover timing.

Monitor etcd, Patroni logs, PostgreSQL logs, and application logs during these tests to ensure the failover is smooth and the application correctly reconnects to the new primary. Verify that the NodeBalancer correctly directs traffic only to the active primary.