Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Ruby Deployments on Linode

Establishing PostgreSQL High Availability with Patroni

For robust PostgreSQL deployments, achieving automatic failover is paramount. We’ll leverage Patroni, a template for PostgreSQL high availability, which uses distributed configuration stores like etcd or Consul for leader election and configuration management. This example focuses on etcd for its simplicity and widespread adoption.

First, ensure you have PostgreSQL installed on your Linode instances. We’ll assume a setup with at least two nodes for replication and one for etcd (though etcd can be clustered for its own HA).

Installing and Configuring Patroni and etcd

On each PostgreSQL node, install Patroni and its dependencies. For etcd, install it on its dedicated node(s).

Node 1 (PostgreSQL Primary/Replica) & Node 2 (PostgreSQL Replica)

Install Patroni and python3-pip. Then, install the etcd3gw client for Patroni.

sudo apt update
sudo apt install postgresql postgresql-contrib python3-pip -y
pip3 install "patroni[etcd3]"

Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration points to your etcd cluster and defines PostgreSQL parameters.

[patroni]
scope: my_pg_cluster
name: patroni-node-1 # Change for each node
restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.1.10:8008 # Node 1's IP
etcd:
  host: 192.168.1.5:2379 # etcd server IP
  protocol: http

[postgresql]
listen: 0.0.0.0:5432
connect_address: 192.168.1.10:5432 # Node 1's IP
data_dir: /var/lib/postgresql/14/main # Adjust version and path
pg_hba:
  - host all all 0.0.0.0/0 md5
replication:
  username: replicator
  password: your_replication_password
parameters:
  max_connections: 100
  shared_buffers: 128MB
  wal_level: replica
  hot_standby: "on"
  max_wal_senders: 10
  max_replication_slots: 10

Important:

Replace my_pg_cluster with a unique name for your cluster.
Ensure name and connect_address are unique per node.
Adjust data_dir to match your PostgreSQL installation.
Set a strong password for your_replication_password.
Configure pg_hba appropriately for your security needs.
The etcd.host should point to your etcd server’s IP address.

Node 3 (etcd Server)

Install etcd. For production, a clustered etcd setup is recommended.

sudo apt update
sudo apt install etcd -y

Configure etcd by editing its configuration file (e.g., /etc/etcd/etcd.conf.yml). Ensure it’s accessible by your PostgreSQL nodes.

name: etcd-node-1 # Unique name for etcd node
data-dir: /var/lib/etcd

initial-advertise-peer-urls: http://192.168.1.5:2380 # etcd node IP
listen-peer-urls: http://192.168.1.5:2380 # etcd node IP

initial-cluster: etcd-node-1=http://192.168.1.5:2380 # List all etcd nodes
advertise-client-urls: http://192.168.1.5:2379 # etcd node IP
listen-client-urls: http://0.0.0.0:2379 # etcd node IP

initial-cluster-state: new

Start and enable etcd:

sudo systemctl start etcd
sudo systemctl enable etcd

Initializing the PostgreSQL Cluster

On one of your PostgreSQL nodes (this will be the initial primary), start Patroni. Patroni will initialize PostgreSQL, create the replication user, and configure replication.

sudo systemctl start patroni

Check the Patroni logs for initialization progress. Once the first node is up as a primary, you can start Patroni on the other PostgreSQL nodes. Patroni will detect the existing cluster in etcd and configure them as replicas.

# On Node 2 (Replica)
sudo systemctl start patroni

Verify the replication status using Patroni’s REST API or by querying PostgreSQL directly.

# On any node, query Patroni API
curl http://localhost:8008/cluster

The output should show the cluster members and their roles (leader/replica).

Integrating with Ruby Applications

Your Ruby application needs to be aware of the PostgreSQL cluster’s primary. Direct connection strings to a single IP will break during failover. We’ll use a load balancer or a DNS-based approach.

Option 1: Using HAProxy for Load Balancing

HAProxy can be configured to direct traffic only to the PostgreSQL primary. Patroni’s REST API can be used to query the current leader, and HAProxy can be reconfigured dynamically or periodically check the leader’s status.

Install HAProxy on a dedicated server or one of your existing nodes.

sudo apt update
sudo apt install haproxy -y

Configure HAProxy (/etc/haproxy/haproxy.cfg). We’ll use a script to update the backend server list based on Patroni’s leader.

frontend pg_frontend
    bind *:5432
    mode tcp
    default_backend pg_backend

backend pg_backend
    mode tcp
    balance roundrobin
    option httpchk GET /primary # Patroni health check endpoint
    http-check expect status 200 # Expecting 200 OK for primary
    server pg_node_1 192.168.1.10:5432 check port 8008 # Node 1 IP and Patroni API port
    server pg_node_2 192.168.1.11:5432 check port 8008 # Node 2 IP and Patroni API port

Note: The option httpchk and http-check expect status 200 are crucial. HAProxy will poll the /primary endpoint on Patroni’s REST API. If the node is the primary, Patroni returns a 200 OK. If it’s a replica, it returns a 404 or 405. HAProxy will mark servers returning non-200 as down.

Restart HAProxy:

sudo systemctl restart haproxy

Update your Ruby application’s database configuration to point to the HAProxy IP address on port 5432.

# config/database.yml (Rails example)
production:
  adapter: postgresql
  encoding: unicode
  database: myapp_production
  pool: 5
  username: app_user
  password: your_app_password
  host: 192.168.1.20 # HAProxy IP address
  port: 5432

Option 2: DNS-Based Failover with a Scripted Update

This approach involves updating a DNS record (e.g., an A record for db.myapp.com) to point to the IP address of the current PostgreSQL primary. This requires a DNS provider that supports API-driven record updates (like AWS Route 53, Cloudflare, or Linode’s own DNS API).

You’ll need a script that periodically queries Patroni’s REST API to determine the leader and then uses your DNS provider’s API to update the DNS record if the leader has changed.

import requests
import json
import time
import os

# Configuration
PATRONI_API_URL = "http://localhost:8008/primary" # Or the IP of a Patroni node
DNS_RECORD_NAME = "db.myapp.com"
DNS_PROVIDER = "linode" # Example: 'linode', 'cloudflare', 'aws'
LINODE_API_TOKEN = os.environ.get("LINODE_API_TOKEN")
LINODE_DOMAIN_ID = 12345 # Your Linode domain ID
LINODE_RECORD_ID = 67890 # The specific DNS record ID to update

CURRENT_LEADER_IP = None

def get_primary_ip():
    try:
        response = requests.get(PATRONI_API_URL, timeout=5)
        response.raise_for_status()
        data = response.json()
        return data.get("host")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching primary IP from Patroni: {e}")
        return None

def update_linode_dns(ip_address):
    if not LINODE_API_TOKEN or not LINODE_DOMAIN_ID or not LINODE_RECORD_ID:
        print("Linode API token, domain ID, or record ID not configured.")
        return False

    url = f"https://api.linode.com/v4/domains/{LINODE_DOMAIN_ID}/records/{LINODE_RECORD_ID}"
    headers = {
        "Authorization": f"Bearer {LINODE_API_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "address": ip_address,
        "type": "A",
        "name": DNS_RECORD_NAME.split('.')[0] # Linode API expects just the subdomain part
    }

    try:
        response = requests.put(url, headers=headers, data=json.dumps(payload))
        response.raise_for_status()
        print(f"Successfully updated DNS record {DNS_RECORD_NAME} to {ip_address}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Error updating Linode DNS record: {e}")
        return False

def main():
    global CURRENT_LEADER_IP
    while True:
        primary_ip = get_primary_ip()
        if primary_ip and primary_ip != CURRENT_LEADER_IP:
            print(f"Primary PostgreSQL node changed from {CURRENT_LEADER_IP} to {primary_ip}")
            if DNS_PROVIDER == "linode":
                if update_linode_dns(primary_ip):
                    CURRENT_LEADER_IP = primary_ip
            # Add logic for other DNS providers here
            else:
                print(f"Unsupported DNS provider: {DNS_PROVIDER}")
        elif primary_ip and primary_ip == CURRENT_LEADER_IP:
            # print(f"Primary IP {primary_ip} remains unchanged.")
            pass
        elif not primary_ip:
            print("Could not determine primary IP. Waiting...")

        time.sleep(30) # Check every 30 seconds

if __name__ == "__main__":
    # Ensure the script runs with the correct Patroni API endpoint if not localhost
    # For example, if Patroni runs on 192.168.1.10:8008, PATRONI_API_URL should be "http://192.168.1.10:8008/primary"
    # You might need to configure Patroni's restapi.listen to be accessible from where this script runs.
    main()

Setup for DNS Script:

Replace placeholder values for API tokens, domain IDs, and record IDs.
Ensure the script has network access to both Patroni’s REST API and your DNS provider’s API.
Set environment variables for API credentials (e.g., LINODE_API_TOKEN).
Run this script on a reliable server that will monitor the cluster.
Configure your Ruby application to use the DNS name (e.g., db.myapp.com) in its database configuration.

Testing Failover

To test the failover mechanism:

Gracefully stop the Patroni service on the current primary node: sudo systemctl stop patroni.
Observe Patroni logs on the remaining nodes. Patroni will detect the leader’s absence, and one of the replicas will be promoted.
If using HAProxy, verify that it redirects traffic to the new primary.
If using DNS, check that the DNS record has been updated to the new primary’s IP address.
Test application connectivity to the database using the HAProxy IP or the DNS name.
Once the original primary is back online, Patroni will automatically configure it as a replica of the new primary.

This comprehensive setup ensures that your PostgreSQL database remains available even in the event of a node failure, minimizing downtime for your Ruby applications.