Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Perl Deployments on DigitalOcean
Automated PostgreSQL Failover with Patroni and DigitalOcean Load Balancers
Achieving high availability for PostgreSQL databases is paramount for mission-critical applications. This section details an automated failover architecture leveraging Patroni, a template for PostgreSQL HA, and DigitalOcean’s Load Balancers for seamless traffic redirection. We’ll focus on a three-node PostgreSQL cluster (one primary, two replicas) managed by Patroni, with a DigitalOcean Load Balancer directing application traffic to the current primary.
Patroni Cluster Setup
Patroni relies on a distributed configuration store for leader election and state management. etcd is a robust choice for this. We’ll deploy a small, highly available etcd cluster (typically 3 or 5 nodes) separate from the PostgreSQL nodes for better isolation.
etcd Cluster Deployment (Example on Ubuntu 22.04)
On three separate DigitalOcean Droplets (e.g., `etcd-01`, `etcd-02`, `etcd-03`), install etcd:
First, download the latest etcd binary:
wget https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz tar xzvf etcd-v3.5.9-linux-amd64.tar.gz sudo mv etcd-v3.5.9-linux-amd64/etcd* /usr/local/bin/
Create an etcd systemd service file (e.g., `/etc/systemd/system/etcd.service`):
[Unit] Description=etcd key-value store Documentation=man:etcd(1) [Service] User=etcd ExecStart=/usr/local/bin/etcd \ --name etcd-01 \ --data-dir /var/lib/etcd \ --listen-client-urls http://0.0.0.0:2379 \ --advertise-client-urls http://<ETCD_NODE_IP>:2379 \ --listen-peer-urls http://0.0.0.0:2380 \ --initial-advertise-peer-urls http://<ETCD_NODE_IP>:2380 \ --initial-cluster etcd-01=http://<ETCD_01_IP>:2380,etcd-02=http://<ETCD_02_IP>:2380,etcd-03=http://<ETCD_03_IP>:2380 \ --initial-cluster-state new \ --discovery-fallback false \ --enable-pprof [Install] Remains=multi-user.target
Replace <ETCD_NODE_IP>, <ETCD_01_IP>, <ETCD_02_IP>, and <ETCD_03_IP> with the actual private IP addresses of your etcd nodes. Ensure the --name parameter is unique for each node. Create the etcd user and data directory:
sudo groupadd --system etcd sudo useradd -s /sbin/nologin --system -g etcd etcd sudo mkdir -p /var/lib/etcd sudo chown etcd:etcd /var/lib/etcd sudo systemctl daemon-reload sudo systemctl enable etcd sudo systemctl start etcd
Repeat this process on each etcd node, adjusting the --name and IP addresses accordingly. Verify cluster health:
ETCDCTL_API=3 etcdctl member list
PostgreSQL Cluster with Patroni
On your three PostgreSQL Droplets (e.g., `pg-01`, `pg-02`, `pg-03`), install PostgreSQL and Patroni. We’ll use a configuration file for Patroni.
Install PostgreSQL (e.g., version 15):
sudo apt update sudo apt install postgresql postgresql-contrib -y
Install Patroni and its dependencies (e.g., using pip):
sudo apt install python3-pip python3-dev python3-venv -y pip3 install --upgrade pip pip3 install patroni[etcd] psycopg2-binary
Create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`) on each PostgreSQL node. Ensure the scope is identical across all nodes, representing your cluster name.
# /etc/patroni/patroni.yml
scope: my_pg_cluster
namespace: /service/
name: <NODE_HOSTNAME> # e.g., pg-01, pg-02, pg-03
restapi:
listen: 0.0.0.0:8000
connect_address: <NODE_IP>:8000 # e.g., 10.10.0.1:8000
etcd:
host: <ETCD_01_IP>:2379, <ETCD_02_IP>:2379, <ETCD_03_IP>:2379
protocol: http
postgresql:
listen: 0.0.0.0:5432
connect_address: <NODE_IP>:5432 # e.g., 10.10.0.1:5432
data_dir: /var/lib/postgresql/15/main
pg_hba:
- host all all 0.0.0.0/0 md5
parameters:
max_connections: 100
shared_buffers: 256MB
wal_level: replica
hot_standby: "on"
max_wal_senders: 10
max_replication_slots: 10
replication:
replication_slots: true
synchronous_mode: false # Set to true for synchronous replication if latency allows
tags:
nofailover: false
clonefrom: false
Replace <NODE_HOSTNAME> with the actual hostname of the Droplet, <NODE_IP> with its private IP, and <ETCD_X_IP> with the etcd node IPs. Ensure the data_dir matches your PostgreSQL installation. Create the Patroni systemd service file (e.g., `/etc/systemd/system/patroni.service`):
[Unit] Description=Patroni PostgreSQL High-Availability Manager After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure KillMode=process [Install] WantedBy=multi-user.target
Enable and start Patroni on all PostgreSQL nodes:
sudo systemctl daemon-reload sudo systemctl enable patroni sudo systemctl start patroni
After starting, Patroni will initialize the primary and replicas. You can check the status via the Patroni REST API or by inspecting the PostgreSQL logs. The first node to acquire the leader lock in etcd will become the primary.
DigitalOcean Load Balancer Configuration
Create a DigitalOcean Load Balancer. Configure it to forward TCP traffic on port 5432 to your PostgreSQL nodes. Crucially, set up a health check that targets the Patroni REST API health endpoint.
1. Create Load Balancer: In the DigitalOcean control panel, navigate to Networking -> Load Balancers and click “Create Load Balancer”.
2. Add Droplets: Select your three PostgreSQL Droplets (`pg-01`, `pg-02`, `pg-03`).
3. Configure Frontend:
- Protocol: TCP
- Port: 5432
4. Configure Backend:
- Protocol: TCP
- Port: 5432
- Health Check:
- Protocol: HTTP
- Port: 8000
- Path:
/primary - Check Interval: 10s
- Response Timeout: 5s
- Healthy Threshold: 3
- Unhealthy Threshold: 3
The health check on port 8000, path /primary, is key. Patroni’s REST API returns a 200 OK with a JSON body containing the primary’s information when the node is the primary. If a node is a replica or unavailable, it will return a different status code or an empty response, causing the Load Balancer to mark it as unhealthy. This ensures traffic is only sent to the active primary.
Testing Failover
To simulate a failure:
- Stop Patroni on the Primary: SSH into the current primary node and run
sudo systemctl stop patroni. - Observe Load Balancer: The Load Balancer’s health check for the stopped node will fail.
- Patroni Re-election: Patroni will detect the loss of the leader lock in etcd and initiate a new election. One of the replicas will be promoted to primary.
- Application Connectivity: Your application, connecting via the Load Balancer’s IP address, will automatically be directed to the new primary after a brief period (determined by Load Balancer and Patroni’s election/startup times).
You can also test by rebooting the primary node. Patroni will automatically restart and rejoin the cluster, potentially becoming primary again if it wins the election.
Integrating Perl Applications with PostgreSQL HA
Perl applications typically interact with PostgreSQL using modules like DBI and DBD::Pg. To ensure your Perl application seamlessly connects to the highly available PostgreSQL cluster, you need to configure your connection string to use the DigitalOcean Load Balancer’s IP address.
Connection String Configuration
Instead of hardcoding the IP address of a specific PostgreSQL instance, use the IP address assigned to your DigitalOcean Load Balancer. This IP address is static and will always point to the current active PostgreSQL primary.
<?php
// Example using PDO in PHP, often used by Perl developers or in mixed environments
// For pure Perl, the DBI connection string would be similar.
$db_host = 'YOUR_DO_LOAD_BALANCER_IP'; // e.g., '165.227.100.200'
$db_port = 5432;
$db_name = 'your_database';
$db_user = 'your_user';
$db_pass = 'your_password';
try {
$dsn = "pgsql:host={$db_host};port={$db_port};dbname={$db_name}";
$dbh = new PDO($dsn, $db_user, $db_pass);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
echo "Connected successfully to PostgreSQL via Load Balancer!\n";
// Example query
$stmt = $dbh->query("SELECT version()");
$row = $stmt->fetch(PDO::FETCH_ASSOC);
print_r($row);
} catch (PDOException $e) {
echo "Error connecting to database: " . $e->getMessage() . "\n";
// Implement retry logic or error handling here
}
?>
In a pure Perl environment, the DBI connection string would look like this:
use DBI;
my $db_host = 'YOUR_DO_LOAD_BALANCER_IP'; # e.g., '165.227.100.200'
my $db_port = 5432;
my $db_name = 'your_database';
my $db_user = 'your_user';
my $db_pass = 'your_password';
my $dsn = "DBI:Pg:host=$db_host;port=$db_port;dbname=$db_name";
my $dbh;
eval {
$dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1, AutoCommit => 1 });
print "Connected successfully to PostgreSQL via Load Balancer!\n";
# Example query
my $sth = $dbh->prepare("SELECT version()");
$sth->execute();
my $row = $sth->fetchrow_hashref();
use Data::Dumper;
print Dumper($row);
$dbh->disconnect();
};
if ($@) {
print "Error connecting to database: $@\n";
# Implement retry logic or error handling here
}
The key takeaway is to abstract the database endpoint. The Load Balancer IP becomes the single point of access for your application.
Handling Transient Errors and Retries
During a failover event, there’s a small window where the old primary is unavailable, and the new primary is being promoted and initialized. Your application might encounter connection errors during this brief period. Robust applications should implement retry logic.
For Perl applications using DBI, you can implement a retry loop around your database connection and query execution. A common pattern involves:
- Attempting the database operation.
- Catching specific database exceptions (e.g., connection refused, timeout).
- If an exception occurs, waiting for a short, randomized interval (exponential backoff is recommended).
- Retrying the operation up to a maximum number of attempts.
use DBI;
use Time::HiRes qw(sleep);
use POSIX qw(strftime);
my $db_host = 'YOUR_DO_LOAD_BALANCER_IP';
my $db_port = 5432;
my $db_name = 'your_database';
my $db_user = 'your_user';
my $db_pass = 'your_password';
my $max_retries = 5;
my $base_delay = 2; # seconds
my $dbh;
for (my $attempt = 1; $attempt <= $max_retries; $attempt++) {
my $dsn = "DBI:Pg:host=$db_host;port=$db_port;dbname=$db_name";
eval {
$dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1, AutoCommit => 1 });
print strftime("%Y-%m-%d %H:%M:%S", localtime) . " Connected successfully on attempt $attempt.\n";
# Perform your database operations here
my $sth = $dbh->prepare("SELECT 1");
$sth->execute();
my $result = $sth->fetchrow_array();
print "Query successful: $result\n";
$dbh->disconnect();
last; # Exit loop on success
};
if ($@) {
print strftime("%Y-%m-%d %H:%M:%S", localtime) . " Connection failed on attempt $attempt: $@\n";
if ($attempt == $max_retries) {
die "Failed to connect to database after $max_retries attempts.\n";
}
my $delay = $base_delay * (2 ** ($attempt - 1)) + rand(1); # Exponential backoff with jitter
print "Retrying in $delay seconds...\n";
sleep($delay);
}
}
This script demonstrates a basic retry mechanism. In a production Perl application, you would integrate this logic within your data access layer or service functions.
Monitoring and Alerting
Effective disaster recovery relies on proactive monitoring. Key metrics to track include:
- Patroni Cluster Health: Monitor the Patroni REST API endpoints on each node. Check for the presence of a primary and the health of replicas.
- etcd Cluster Health: Ensure the etcd cluster is healthy and all members are reachable.
- PostgreSQL Replication Lag: Monitor
pg_stat_replicationon the primary to ensure replicas are keeping up. - Load Balancer Health Checks: DigitalOcean provides metrics on load balancer health check success/failure rates.
- Application Error Rates: Monitor your application logs for increased database connection errors, which might indicate failover events or underlying issues.
Tools like Prometheus with exporters (e.g., postgres_exporter, etcd_exporter) and Alertmanager can be integrated to provide comprehensive monitoring and automated alerts for critical events, such as a prolonged primary unavailability or failed failover attempts.