Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Ruby Deployments on Linode
Establishing PostgreSQL High Availability with Patroni
For robust PostgreSQL deployments, achieving automatic failover is paramount. We’ll leverage Patroni, a template for PostgreSQL high availability, which uses distributed configuration stores like etcd or Consul for leader election and configuration management. This example focuses on etcd for its simplicity and widespread adoption.
First, ensure you have PostgreSQL installed on your Linode instances. We’ll assume a setup with at least two nodes for replication and one for etcd (though etcd can be clustered for its own HA).
Installing and Configuring Patroni and etcd
On each PostgreSQL node, install Patroni and its dependencies. For etcd, install it on its dedicated node(s).
Node 1 (PostgreSQL Primary/Replica) & Node 2 (PostgreSQL Replica)
Install Patroni and python3-pip. Then, install the etcd3gw client for Patroni.
sudo apt update sudo apt install postgresql postgresql-contrib python3-pip -y pip3 install "patroni[etcd3]"
Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration points to your etcd cluster and defines PostgreSQL parameters.
[patroni] scope: my_pg_cluster name: patroni-node-1 # Change for each node restapi: listen: 0.0.0.0:8008 connect_address: 192.168.1.10:8008 # Node 1's IP etcd: host: 192.168.1.5:2379 # etcd server IP protocol: http [postgresql] listen: 0.0.0.0:5432 connect_address: 192.168.1.10:5432 # Node 1's IP data_dir: /var/lib/postgresql/14/main # Adjust version and path pg_hba: - host all all 0.0.0.0/0 md5 replication: username: replicator password: your_replication_password parameters: max_connections: 100 shared_buffers: 128MB wal_level: replica hot_standby: "on" max_wal_senders: 10 max_replication_slots: 10
Important:
- Replace
my_pg_clusterwith a unique name for your cluster. - Ensure
nameandconnect_addressare unique per node. - Adjust
data_dirto match your PostgreSQL installation. - Set a strong password for
your_replication_password. - Configure
pg_hbaappropriately for your security needs. - The
etcd.hostshould point to your etcd server’s IP address.
Node 3 (etcd Server)
Install etcd. For production, a clustered etcd setup is recommended.
sudo apt update sudo apt install etcd -y
Configure etcd by editing its configuration file (e.g., /etc/etcd/etcd.conf.yml). Ensure it’s accessible by your PostgreSQL nodes.
name: etcd-node-1 # Unique name for etcd node data-dir: /var/lib/etcd initial-advertise-peer-urls: http://192.168.1.5:2380 # etcd node IP listen-peer-urls: http://192.168.1.5:2380 # etcd node IP initial-cluster: etcd-node-1=http://192.168.1.5:2380 # List all etcd nodes advertise-client-urls: http://192.168.1.5:2379 # etcd node IP listen-client-urls: http://0.0.0.0:2379 # etcd node IP initial-cluster-state: new
Start and enable etcd:
sudo systemctl start etcd sudo systemctl enable etcd
Initializing the PostgreSQL Cluster
On one of your PostgreSQL nodes (this will be the initial primary), start Patroni. Patroni will initialize PostgreSQL, create the replication user, and configure replication.
sudo systemctl start patroni
Check the Patroni logs for initialization progress. Once the first node is up as a primary, you can start Patroni on the other PostgreSQL nodes. Patroni will detect the existing cluster in etcd and configure them as replicas.
# On Node 2 (Replica) sudo systemctl start patroni
Verify the replication status using Patroni’s REST API or by querying PostgreSQL directly.
# On any node, query Patroni API curl http://localhost:8008/cluster
The output should show the cluster members and their roles (leader/replica).
Integrating with Ruby Applications
Your Ruby application needs to be aware of the PostgreSQL cluster’s primary. Direct connection strings to a single IP will break during failover. We’ll use a load balancer or a DNS-based approach.
Option 1: Using HAProxy for Load Balancing
HAProxy can be configured to direct traffic only to the PostgreSQL primary. Patroni’s REST API can be used to query the current leader, and HAProxy can be reconfigured dynamically or periodically check the leader’s status.
Install HAProxy on a dedicated server or one of your existing nodes.
sudo apt update sudo apt install haproxy -y
Configure HAProxy (/etc/haproxy/haproxy.cfg). We’ll use a script to update the backend server list based on Patroni’s leader.
frontend pg_frontend
bind *:5432
mode tcp
default_backend pg_backend
backend pg_backend
mode tcp
balance roundrobin
option httpchk GET /primary # Patroni health check endpoint
http-check expect status 200 # Expecting 200 OK for primary
server pg_node_1 192.168.1.10:5432 check port 8008 # Node 1 IP and Patroni API port
server pg_node_2 192.168.1.11:5432 check port 8008 # Node 2 IP and Patroni API port
Note: The option httpchk and http-check expect status 200 are crucial. HAProxy will poll the /primary endpoint on Patroni’s REST API. If the node is the primary, Patroni returns a 200 OK. If it’s a replica, it returns a 404 or 405. HAProxy will mark servers returning non-200 as down.
Restart HAProxy:
sudo systemctl restart haproxy
Update your Ruby application’s database configuration to point to the HAProxy IP address on port 5432.
# config/database.yml (Rails example) production: adapter: postgresql encoding: unicode database: myapp_production pool: 5 username: app_user password: your_app_password host: 192.168.1.20 # HAProxy IP address port: 5432
Option 2: DNS-Based Failover with a Scripted Update
This approach involves updating a DNS record (e.g., an A record for db.myapp.com) to point to the IP address of the current PostgreSQL primary. This requires a DNS provider that supports API-driven record updates (like AWS Route 53, Cloudflare, or Linode’s own DNS API).
You’ll need a script that periodically queries Patroni’s REST API to determine the leader and then uses your DNS provider’s API to update the DNS record if the leader has changed.
import requests
import json
import time
import os
# Configuration
PATRONI_API_URL = "http://localhost:8008/primary" # Or the IP of a Patroni node
DNS_RECORD_NAME = "db.myapp.com"
DNS_PROVIDER = "linode" # Example: 'linode', 'cloudflare', 'aws'
LINODE_API_TOKEN = os.environ.get("LINODE_API_TOKEN")
LINODE_DOMAIN_ID = 12345 # Your Linode domain ID
LINODE_RECORD_ID = 67890 # The specific DNS record ID to update
CURRENT_LEADER_IP = None
def get_primary_ip():
try:
response = requests.get(PATRONI_API_URL, timeout=5)
response.raise_for_status()
data = response.json()
return data.get("host")
except requests.exceptions.RequestException as e:
print(f"Error fetching primary IP from Patroni: {e}")
return None
def update_linode_dns(ip_address):
if not LINODE_API_TOKEN or not LINODE_DOMAIN_ID or not LINODE_RECORD_ID:
print("Linode API token, domain ID, or record ID not configured.")
return False
url = f"https://api.linode.com/v4/domains/{LINODE_DOMAIN_ID}/records/{LINODE_RECORD_ID}"
headers = {
"Authorization": f"Bearer {LINODE_API_TOKEN}",
"Content-Type": "application/json"
}
payload = {
"address": ip_address,
"type": "A",
"name": DNS_RECORD_NAME.split('.')[0] # Linode API expects just the subdomain part
}
try:
response = requests.put(url, headers=headers, data=json.dumps(payload))
response.raise_for_status()
print(f"Successfully updated DNS record {DNS_RECORD_NAME} to {ip_address}")
return True
except requests.exceptions.RequestException as e:
print(f"Error updating Linode DNS record: {e}")
return False
def main():
global CURRENT_LEADER_IP
while True:
primary_ip = get_primary_ip()
if primary_ip and primary_ip != CURRENT_LEADER_IP:
print(f"Primary PostgreSQL node changed from {CURRENT_LEADER_IP} to {primary_ip}")
if DNS_PROVIDER == "linode":
if update_linode_dns(primary_ip):
CURRENT_LEADER_IP = primary_ip
# Add logic for other DNS providers here
else:
print(f"Unsupported DNS provider: {DNS_PROVIDER}")
elif primary_ip and primary_ip == CURRENT_LEADER_IP:
# print(f"Primary IP {primary_ip} remains unchanged.")
pass
elif not primary_ip:
print("Could not determine primary IP. Waiting...")
time.sleep(30) # Check every 30 seconds
if __name__ == "__main__":
# Ensure the script runs with the correct Patroni API endpoint if not localhost
# For example, if Patroni runs on 192.168.1.10:8008, PATRONI_API_URL should be "http://192.168.1.10:8008/primary"
# You might need to configure Patroni's restapi.listen to be accessible from where this script runs.
main()
Setup for DNS Script:
- Replace placeholder values for API tokens, domain IDs, and record IDs.
- Ensure the script has network access to both Patroni’s REST API and your DNS provider’s API.
- Set environment variables for API credentials (e.g.,
LINODE_API_TOKEN). - Run this script on a reliable server that will monitor the cluster.
- Configure your Ruby application to use the DNS name (e.g.,
db.myapp.com) in its database configuration.
Testing Failover
To test the failover mechanism:
- Gracefully stop the Patroni service on the current primary node:
sudo systemctl stop patroni. - Observe Patroni logs on the remaining nodes. Patroni will detect the leader’s absence, and one of the replicas will be promoted.
- If using HAProxy, verify that it redirects traffic to the new primary.
- If using DNS, check that the DNS record has been updated to the new primary’s IP address.
- Test application connectivity to the database using the HAProxy IP or the DNS name.
- Once the original primary is back online, Patroni will automatically configure it as a replica of the new primary.
This comprehensive setup ensures that your PostgreSQL database remains available even in the event of a node failure, minimizing downtime for your Ruby applications.