Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Ruby Deployments on DigitalOcean

Establishing a Highly Available PostgreSQL Cluster with Patroni

For mission-critical applications, a single PostgreSQL instance is a single point of failure. Architecting for high availability (HA) necessitates a robust failover mechanism. Patroni, a template for PostgreSQL HA, leverages distributed configuration stores like etcd, Consul, or ZooKeeper to manage cluster state and orchestrate automatic failovers. This section details setting up a Patroni-managed PostgreSQL cluster on DigitalOcean, focusing on etcd for configuration management.

We’ll deploy three PostgreSQL nodes and two etcd nodes for a resilient setup. The PostgreSQL nodes will run Patroni, which monitors the cluster health and promotes a replica to primary when the current primary becomes unavailable.

Prerequisites

Three DigitalOcean Droplets for PostgreSQL nodes (e.g., `pg-node-1`, `pg-node-2`, `pg-node-3`).
Two DigitalOcean Droplets for etcd nodes (e.g., `etcd-1`, `etcd-2`).
SSH access to all Droplets.
Basic understanding of Linux system administration and PostgreSQL.

Setting up etcd Cluster

First, install and configure etcd on the two etcd Droplets. Ensure they can communicate with each other. For simplicity, we’ll use a basic configuration. In a production environment, consider more robust security and discovery mechanisms.

Installing etcd

On both `etcd-1` and `etcd-2` Droplets:

Download the latest etcd release from the official GitHub repository.
Extract the binaries.
Add etcd to your PATH.

Example using `wget` and `tar` (adjust version as needed):

wget https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz
tar xzvf etcd-v3.5.9-linux-amd64.tar.gz
sudo mv etcd-v3.5.9-linux-amd64/etcd* /usr/local/bin/

Configuring etcd Service

Create a systemd service file for etcd. This example assumes a single-node etcd setup for simplicity during initial deployment, which will be extended to a cluster.

On `etcd-1`:

[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
User=etcd
Type=simple
ExecStart=/usr/local/bin/etcd \
  --name etcd-1 \
  --data-dir /var/lib/etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<ETCD_1_IP>:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://<ETCD_1_IP>:2380 \
  --initial-cluster etcd-1=http://<ETCD_1_IP>:2380,etcd-2=http://<ETCD_2_IP>:2380 \
  --initial-cluster-state new \
  --enable-pprof

[Install]
WantedBy=multi-user.target

Replace <ETCD_1_IP> and <ETCD_2_IP> with the actual private IP addresses of your etcd Droplets. Create the /var/lib/etcd directory and set ownership.

sudo mkdir -p /var/lib/etcd
sudo chown etcd:etcd /var/lib/etcd

Repeat the process for `etcd-2`, adjusting the --name and --advertise-client-urls parameters accordingly.

Starting and Verifying etcd Cluster

Start the etcd services and enable them to start on boot:

sudo systemctl daemon-reload
sudo systemctl start etcd
sudo systemctl enable etcd

Verify the cluster status by querying etcd from one of the nodes:

ETCDCTL_API=3 etcdctl member list

You should see both etcd members listed. Ensure network connectivity between etcd nodes and PostgreSQL nodes on ports 2379 (client) and 2380 (peer).

Installing and Configuring Patroni on PostgreSQL Nodes

Patroni requires Python 3 and pip. Install PostgreSQL and Patroni on each of the three PostgreSQL Droplets.

Installing PostgreSQL and Patroni

On each PostgreSQL Droplet (e.g., `pg-node-1`, `pg-node-2`, `pg-node-3`):

Install PostgreSQL (version 12+ recommended for robust replication features).
Install Patroni and its dependencies (e.g., psycopg2-binary).

Example for Ubuntu/Debian:

sudo apt update
sudo apt install postgresql postgresql-contrib python3 python3-pip -y
sudo pip3 install "patroni[etcd]" psycopg2-binary

Configuring Patroni

Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration points to the etcd cluster and defines PostgreSQL settings.

# /etc/patroni/patroni.yml
scope: my_postgres_cluster
namespace: /service/

# etcd configuration
etcd:
  host: <ETCD_1_IP>:2379, <ETCD_2_IP>:2379
  protocol: http

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  connect_address: <NODE_IP>:5432
  data_dir: /var/lib/postgresql/14/main # Adjust version as needed
  pg_hba:
    - host replication replicator <POSTGRES_SUBNET>/24 md5
    - host all all <POSTGRES_SUBNET>/24 md5
  replication:
    username: replicator
    password: <REPLICATION_PASSWORD>
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10

# Patroni configuration
patroni:
  initdb:
    - encoding: UTF8
    - locale: en_US.UTF-8
  restapi:
    listen: 0.0.0.0:8008
    connect_address: <NODE_IP>:8008
  replication:
    synchronous_mode: false # Set to true for synchronous replication if needed
    synchronous_node_count: 0
  tags:
    nofailover: false
    clonefrom: false

# Logging
log:
  level: INFO
  dir: /var/log/patroni
  file: patroni.log

Key points:

scope: A unique identifier for this PostgreSQL cluster.
namespace: The etcd path prefix for cluster state.
etcd.host: Comma-separated list of etcd endpoints.
postgresql.connect_address: The IP address of the current node. This must be unique per node.
postgresql.data_dir: The PostgreSQL data directory. Ensure it exists and has correct permissions.
postgresql.pg_hba: Configure access for replication and application users. Replace <POSTGRES_SUBNET> with your PostgreSQL Droplets’ private subnet (e.g., 10.10.0.0/24).
postgresql.replication: Credentials for the replication user.
patroni.restapi.connect_address: The IP address for Patroni’s REST API.
patroni.tags: Control failover behavior.

Create the Patroni log directory and set permissions:

sudo mkdir -p /var/log/patroni
sudo chown postgres:postgres /var/log/patroni

Creating a Systemd Service for Patroni

Create a systemd service file for Patroni (e.g., /etc/systemd/system/patroni.service) on each PostgreSQL node.

[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Reload systemd, start Patroni, and enable it:

sudo systemctl daemon-reload
sudo systemctl start patroni
sudo systemctl enable patroni

Initial Cluster Bootstrap and Verification

When Patroni starts on the first node, it will initialize the PostgreSQL cluster and register itself as the primary in etcd. Subsequent nodes will detect the existing cluster and join as replicas.

Bootstraping the Primary

On one of the PostgreSQL nodes (e.g., `pg-node-1`), Patroni will attempt to initialize PostgreSQL if no data directory exists. Ensure the PostgreSQL user has the necessary permissions.

Check the Patroni logs for initialization messages:

sudo tail -f /var/log/patroni/patroni.log

You should see messages indicating it’s initializing the cluster and becoming the leader.

Joining Replicas

On the other PostgreSQL nodes (`pg-node-2`, `pg-node-3`), Patroni will detect the primary and configure PostgreSQL to replicate from it. This might involve cloning the data directory if it’s empty.

Monitor their logs for messages indicating they are joining the cluster as replicas.

Verifying Cluster Status

You can query Patroni’s REST API to check the cluster status. From any node, use `curl`:

curl http://localhost:8008/cluster

This will output a JSON object detailing the cluster state, including the current primary and its replicas. Ensure all three PostgreSQL nodes are listed and one is marked as the primary.

Automated Failover Testing

To test the automatic failover, simulate a failure of the current primary node.

Simulating Primary Failure

On the current primary node:

Stop the Patroni service: sudo systemctl stop patroni
Stop the PostgreSQL service: sudo systemctl stop postgresql
Alternatively, you can simply reboot the primary node or simulate network partition.

Observing Failover

Patroni on the remaining nodes will detect the primary’s unavailability through etcd. After a short grace period (configured in Patroni), one of the replicas will be promoted to become the new primary. This process can take 30-60 seconds depending on network latency and Patroni’s configuration.

Monitor the logs of the replica nodes. You should see messages indicating that a node has taken over as the leader.

Query the cluster status again using curl http://localhost:8008/cluster from one of the surviving nodes. The output should reflect the new primary.

Integrating with Ruby Applications

Your Ruby application needs to be aware of the HA PostgreSQL setup. Instead of connecting to a fixed primary IP, it should connect to a Virtual IP (VIP) or a DNS name that resolves to the current primary. For DigitalOcean, this can be achieved using a Floating IP.

Using DigitalOcean Floating IPs

1. **Create a Floating IP:** In your DigitalOcean control panel, create a Floating IP and assign it to one of your PostgreSQL Droplets (initially, it doesn’t matter which one, as Patroni will manage its movement).

2. **Configure Patroni to manage the Floating IP:** Patroni can be configured to automatically move the Floating IP to the current primary node. This requires the DigitalOcean API token and the Floating IP address.

Add the following to your patroni.yml on all PostgreSQL nodes:

# ... other configurations ...
dynamic_configuration:
  digitalocean:
    api_token: <YOUR_DO_API_TOKEN>
    floating_ip: <YOUR_FLOATING_IP_ADDRESS>
    region: <YOUR_DO_REGION>

Replace placeholders with your DigitalOcean API token, Floating IP address, and region (e.g., `nyc3`). Ensure the `postgres` user has network access to the DigitalOcean API.

3. **Configure your Ruby application:** Update your application’s database configuration to use the Floating IP address as the PostgreSQL host.

# config/database.yml (example for Rails)
production:
  adapter: postgresql
  encoding: unicode
  database: myapp_production
  pool: 5
  username: app_user
  password: <APP_PASSWORD>
  host: <YOUR_FLOATING_IP_ADDRESS> # Use the Floating IP here
  port: 5432

When a failover occurs, Patroni will automatically reassign the Floating IP to the new primary. Your application, connecting to the Floating IP, will seamlessly continue to operate against the new primary without code changes.

Advanced Considerations

Replication Slots: For robust replication, especially with potential network interruptions, use replication slots. Patroni can manage these.
Synchronous Replication: For zero data loss, configure synchronous_mode: true in Patroni. This requires careful tuning of synchronous_node_count and understanding the performance implications.
Monitoring: Implement comprehensive monitoring of etcd, Patroni, and PostgreSQL health. Tools like Prometheus and Grafana are excellent choices.
Backups: HA is not a substitute for backups. Implement a robust backup strategy (e.g., using pg_basebackup or dedicated tools like Barman) and store backups off-cluster.
Connection Pooling: Use a connection pooler like PgBouncer, especially for applications with high connection churn, to manage connections to the active primary efficiently.
Network Configuration: Ensure low latency and reliable network connectivity between PostgreSQL nodes and between PostgreSQL and etcd nodes. DigitalOcean’s private networking is crucial here.
Security: Secure etcd communication (TLS), use strong passwords for replication and application users, and restrict access via firewalls.