Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Ruby Deployments on DigitalOcean
Establishing a Highly Available PostgreSQL Cluster with Patroni
For mission-critical applications, a single PostgreSQL instance is a single point of failure. Architecting for high availability (HA) necessitates a robust failover mechanism. Patroni, a template for PostgreSQL HA, leverages distributed configuration stores like etcd, Consul, or ZooKeeper to manage cluster state and orchestrate automatic failovers. This section details setting up a Patroni-managed PostgreSQL cluster on DigitalOcean, focusing on etcd for configuration management.
We’ll deploy three PostgreSQL nodes and two etcd nodes for a resilient setup. The PostgreSQL nodes will run Patroni, which monitors the cluster health and promotes a replica to primary when the current primary becomes unavailable.
Prerequisites
- Three DigitalOcean Droplets for PostgreSQL nodes (e.g., `pg-node-1`, `pg-node-2`, `pg-node-3`).
- Two DigitalOcean Droplets for etcd nodes (e.g., `etcd-1`, `etcd-2`).
- SSH access to all Droplets.
- Basic understanding of Linux system administration and PostgreSQL.
Setting up etcd Cluster
First, install and configure etcd on the two etcd Droplets. Ensure they can communicate with each other. For simplicity, we’ll use a basic configuration. In a production environment, consider more robust security and discovery mechanisms.
Installing etcd
On both `etcd-1` and `etcd-2` Droplets:
- Download the latest etcd release from the official GitHub repository.
- Extract the binaries.
- Add etcd to your PATH.
Example using `wget` and `tar` (adjust version as needed):
wget https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz tar xzvf etcd-v3.5.9-linux-amd64.tar.gz sudo mv etcd-v3.5.9-linux-amd64/etcd* /usr/local/bin/
Configuring etcd Service
Create a systemd service file for etcd. This example assumes a single-node etcd setup for simplicity during initial deployment, which will be extended to a cluster.
On `etcd-1`:
[Unit] Description=etcd key-value store Documentation=https://github.com/etcd-io/etcd After=network.target [Service] User=etcd Type=simple ExecStart=/usr/local/bin/etcd \ --name etcd-1 \ --data-dir /var/lib/etcd \ --listen-client-urls http://0.0.0.0:2379 \ --advertise-client-urls http://<ETCD_1_IP>:2379 \ --listen-peer-urls http://0.0.0.0:2380 \ --initial-advertise-peer-urls http://<ETCD_1_IP>:2380 \ --initial-cluster etcd-1=http://<ETCD_1_IP>:2380,etcd-2=http://<ETCD_2_IP>:2380 \ --initial-cluster-state new \ --enable-pprof [Install] WantedBy=multi-user.target
Replace <ETCD_1_IP> and <ETCD_2_IP> with the actual private IP addresses of your etcd Droplets. Create the /var/lib/etcd directory and set ownership.
sudo mkdir -p /var/lib/etcd sudo chown etcd:etcd /var/lib/etcd
Repeat the process for `etcd-2`, adjusting the --name and --advertise-client-urls parameters accordingly.
Starting and Verifying etcd Cluster
Start the etcd services and enable them to start on boot:
sudo systemctl daemon-reload sudo systemctl start etcd sudo systemctl enable etcd
Verify the cluster status by querying etcd from one of the nodes:
ETCDCTL_API=3 etcdctl member list
You should see both etcd members listed. Ensure network connectivity between etcd nodes and PostgreSQL nodes on ports 2379 (client) and 2380 (peer).
Installing and Configuring Patroni on PostgreSQL Nodes
Patroni requires Python 3 and pip. Install PostgreSQL and Patroni on each of the three PostgreSQL Droplets.
Installing PostgreSQL and Patroni
On each PostgreSQL Droplet (e.g., `pg-node-1`, `pg-node-2`, `pg-node-3`):
- Install PostgreSQL (version 12+ recommended for robust replication features).
- Install Patroni and its dependencies (e.g.,
psycopg2-binary).
Example for Ubuntu/Debian:
sudo apt update sudo apt install postgresql postgresql-contrib python3 python3-pip -y sudo pip3 install "patroni[etcd]" psycopg2-binary
Configuring Patroni
Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration points to the etcd cluster and defines PostgreSQL settings.
# /etc/patroni/patroni.yml
scope: my_postgres_cluster
namespace: /service/
# etcd configuration
etcd:
host: <ETCD_1_IP>:2379, <ETCD_2_IP>:2379
protocol: http
# PostgreSQL configuration
postgresql:
listen: 0.0.0.0:5432
connect_address: <NODE_IP>:5432
data_dir: /var/lib/postgresql/14/main # Adjust version as needed
pg_hba:
- host replication replicator <POSTGRES_SUBNET>/24 md5
- host all all <POSTGRES_SUBNET>/24 md5
replication:
username: replicator
password: <REPLICATION_PASSWORD>
parameters:
max_connections: 100
shared_buffers: 128MB
wal_level: replica
hot_standby: "on"
max_wal_senders: 10
max_replication_slots: 10
# Patroni configuration
patroni:
initdb:
- encoding: UTF8
- locale: en_US.UTF-8
restapi:
listen: 0.0.0.0:8008
connect_address: <NODE_IP>:8008
replication:
synchronous_mode: false # Set to true for synchronous replication if needed
synchronous_node_count: 0
tags:
nofailover: false
clonefrom: false
# Logging
log:
level: INFO
dir: /var/log/patroni
file: patroni.log
Key points:
scope: A unique identifier for this PostgreSQL cluster.namespace: The etcd path prefix for cluster state.etcd.host: Comma-separated list of etcd endpoints.postgresql.connect_address: The IP address of the current node. This must be unique per node.postgresql.data_dir: The PostgreSQL data directory. Ensure it exists and has correct permissions.postgresql.pg_hba: Configure access for replication and application users. Replace<POSTGRES_SUBNET>with your PostgreSQL Droplets’ private subnet (e.g., 10.10.0.0/24).postgresql.replication: Credentials for the replication user.patroni.restapi.connect_address: The IP address for Patroni’s REST API.patroni.tags: Control failover behavior.
Create the Patroni log directory and set permissions:
sudo mkdir -p /var/log/patroni sudo chown postgres:postgres /var/log/patroni
Creating a Systemd Service for Patroni
Create a systemd service file for Patroni (e.g., /etc/systemd/system/patroni.service) on each PostgreSQL node.
[Unit] Description=Patroni PostgreSQL High-Availability After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure RestartSec=5 [Install] WantedBy=multi-user.target
Reload systemd, start Patroni, and enable it:
sudo systemctl daemon-reload sudo systemctl start patroni sudo systemctl enable patroni
Initial Cluster Bootstrap and Verification
When Patroni starts on the first node, it will initialize the PostgreSQL cluster and register itself as the primary in etcd. Subsequent nodes will detect the existing cluster and join as replicas.
Bootstraping the Primary
On one of the PostgreSQL nodes (e.g., `pg-node-1`), Patroni will attempt to initialize PostgreSQL if no data directory exists. Ensure the PostgreSQL user has the necessary permissions.
Check the Patroni logs for initialization messages:
sudo tail -f /var/log/patroni/patroni.log
You should see messages indicating it’s initializing the cluster and becoming the leader.
Joining Replicas
On the other PostgreSQL nodes (`pg-node-2`, `pg-node-3`), Patroni will detect the primary and configure PostgreSQL to replicate from it. This might involve cloning the data directory if it’s empty.
Monitor their logs for messages indicating they are joining the cluster as replicas.
Verifying Cluster Status
You can query Patroni’s REST API to check the cluster status. From any node, use `curl`:
curl http://localhost:8008/cluster
This will output a JSON object detailing the cluster state, including the current primary and its replicas. Ensure all three PostgreSQL nodes are listed and one is marked as the primary.
Automated Failover Testing
To test the automatic failover, simulate a failure of the current primary node.
Simulating Primary Failure
On the current primary node:
- Stop the Patroni service:
sudo systemctl stop patroni - Stop the PostgreSQL service:
sudo systemctl stop postgresql - Alternatively, you can simply reboot the primary node or simulate network partition.
Observing Failover
Patroni on the remaining nodes will detect the primary’s unavailability through etcd. After a short grace period (configured in Patroni), one of the replicas will be promoted to become the new primary. This process can take 30-60 seconds depending on network latency and Patroni’s configuration.
Monitor the logs of the replica nodes. You should see messages indicating that a node has taken over as the leader.
Query the cluster status again using curl http://localhost:8008/cluster from one of the surviving nodes. The output should reflect the new primary.
Integrating with Ruby Applications
Your Ruby application needs to be aware of the HA PostgreSQL setup. Instead of connecting to a fixed primary IP, it should connect to a Virtual IP (VIP) or a DNS name that resolves to the current primary. For DigitalOcean, this can be achieved using a Floating IP.
Using DigitalOcean Floating IPs
1. **Create a Floating IP:** In your DigitalOcean control panel, create a Floating IP and assign it to one of your PostgreSQL Droplets (initially, it doesn’t matter which one, as Patroni will manage its movement).
2. **Configure Patroni to manage the Floating IP:** Patroni can be configured to automatically move the Floating IP to the current primary node. This requires the DigitalOcean API token and the Floating IP address.
Add the following to your patroni.yml on all PostgreSQL nodes:
# ... other configurations ...
dynamic_configuration:
digitalocean:
api_token: <YOUR_DO_API_TOKEN>
floating_ip: <YOUR_FLOATING_IP_ADDRESS>
region: <YOUR_DO_REGION>
Replace placeholders with your DigitalOcean API token, Floating IP address, and region (e.g., `nyc3`). Ensure the `postgres` user has network access to the DigitalOcean API.
3. **Configure your Ruby application:** Update your application’s database configuration to use the Floating IP address as the PostgreSQL host.
# config/database.yml (example for Rails) production: adapter: postgresql encoding: unicode database: myapp_production pool: 5 username: app_user password: <APP_PASSWORD> host: <YOUR_FLOATING_IP_ADDRESS> # Use the Floating IP here port: 5432
When a failover occurs, Patroni will automatically reassign the Floating IP to the new primary. Your application, connecting to the Floating IP, will seamlessly continue to operate against the new primary without code changes.
Advanced Considerations
- Replication Slots: For robust replication, especially with potential network interruptions, use replication slots. Patroni can manage these.
- Synchronous Replication: For zero data loss, configure
synchronous_mode: truein Patroni. This requires careful tuning ofsynchronous_node_countand understanding the performance implications. - Monitoring: Implement comprehensive monitoring of etcd, Patroni, and PostgreSQL health. Tools like Prometheus and Grafana are excellent choices.
- Backups: HA is not a substitute for backups. Implement a robust backup strategy (e.g., using
pg_basebackupor dedicated tools like Barman) and store backups off-cluster. - Connection Pooling: Use a connection pooler like PgBouncer, especially for applications with high connection churn, to manage connections to the active primary efficiently.
- Network Configuration: Ensure low latency and reliable network connectivity between PostgreSQL nodes and between PostgreSQL and etcd nodes. DigitalOcean’s private networking is crucial here.
- Security: Secure etcd communication (TLS), use strong passwords for replication and application users, and restrict access via firewalls.