Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Perl Deployments on DigitalOcean

Automated PostgreSQL Failover with Patroni and DigitalOcean Load Balancers

Achieving high availability for PostgreSQL databases is paramount for mission-critical applications. This section details an automated failover architecture leveraging Patroni, a template for PostgreSQL HA, and DigitalOcean’s Load Balancers for seamless traffic redirection. We’ll focus on a three-node PostgreSQL cluster (one primary, two replicas) managed by Patroni, with a DigitalOcean Load Balancer directing application traffic to the current primary.

Patroni Cluster Setup

Patroni relies on a distributed configuration store for leader election and state management. etcd is a robust choice for this. We’ll deploy a small, highly available etcd cluster (typically 3 or 5 nodes) separate from the PostgreSQL nodes for better isolation.

etcd Cluster Deployment (Example on Ubuntu 22.04)

On three separate DigitalOcean Droplets (e.g., `etcd-01`, `etcd-02`, `etcd-03`), install etcd:

First, download the latest etcd binary:

wget https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz
tar xzvf etcd-v3.5.9-linux-amd64.tar.gz
sudo mv etcd-v3.5.9-linux-amd64/etcd* /usr/local/bin/

Create an etcd systemd service file (e.g., `/etc/systemd/system/etcd.service`):

[Unit]
Description=etcd key-value store
Documentation=man:etcd(1)

[Service]
User=etcd
ExecStart=/usr/local/bin/etcd \
  --name etcd-01 \
  --data-dir /var/lib/etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://<ETCD_NODE_IP>:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://<ETCD_NODE_IP>:2380 \
  --initial-cluster etcd-01=http://<ETCD_01_IP>:2380,etcd-02=http://<ETCD_02_IP>:2380,etcd-03=http://<ETCD_03_IP>:2380 \
  --initial-cluster-state new \
  --discovery-fallback false \
  --enable-pprof

[Install]
Remains=multi-user.target

Replace <ETCD_NODE_IP>, <ETCD_01_IP>, <ETCD_02_IP>, and <ETCD_03_IP> with the actual private IP addresses of your etcd nodes. Ensure the --name parameter is unique for each node. Create the etcd user and data directory:

sudo groupadd --system etcd
sudo useradd -s /sbin/nologin --system -g etcd etcd
sudo mkdir -p /var/lib/etcd
sudo chown etcd:etcd /var/lib/etcd
sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd

Repeat this process on each etcd node, adjusting the --name and IP addresses accordingly. Verify cluster health:

ETCDCTL_API=3 etcdctl member list

PostgreSQL Cluster with Patroni

On your three PostgreSQL Droplets (e.g., `pg-01`, `pg-02`, `pg-03`), install PostgreSQL and Patroni. We’ll use a configuration file for Patroni.

Install PostgreSQL (e.g., version 15):

sudo apt update
sudo apt install postgresql postgresql-contrib -y

Install Patroni and its dependencies (e.g., using pip):

sudo apt install python3-pip python3-dev python3-venv -y
pip3 install --upgrade pip
pip3 install patroni[etcd] psycopg2-binary

Create a Patroni configuration file (e.g., `/etc/patroni/patroni.yml`) on each PostgreSQL node. Ensure the scope is identical across all nodes, representing your cluster name.

# /etc/patroni/patroni.yml
scope: my_pg_cluster
namespace: /service/
name: <NODE_HOSTNAME> # e.g., pg-01, pg-02, pg-03

restapi:
  listen: 0.0.0.0:8000
  connect_address: <NODE_IP>:8000 # e.g., 10.10.0.1:8000

etcd:
  host: <ETCD_01_IP>:2379, <ETCD_02_IP>:2379, <ETCD_03_IP>:2379
  protocol: http

postgresql:
  listen: 0.0.0.0:5432
  connect_address: <NODE_IP>:5432 # e.g., 10.10.0.1:5432
  data_dir: /var/lib/postgresql/15/main
  pg_hba:
    - host    all             all             0.0.0.0/0               md5
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10

replication:
  replication_slots: true
  synchronous_mode: false # Set to true for synchronous replication if latency allows

tags:
  nofailover: false
  clonefrom: false

Replace <NODE_HOSTNAME> with the actual hostname of the Droplet, <NODE_IP> with its private IP, and <ETCD_X_IP> with the etcd node IPs. Ensure the data_dir matches your PostgreSQL installation. Create the Patroni systemd service file (e.g., `/etc/systemd/system/patroni.service`):

[Unit]
Description=Patroni PostgreSQL High-Availability Manager
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
KillMode=process

[Install]
WantedBy=multi-user.target

Enable and start Patroni on all PostgreSQL nodes:

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

After starting, Patroni will initialize the primary and replicas. You can check the status via the Patroni REST API or by inspecting the PostgreSQL logs. The first node to acquire the leader lock in etcd will become the primary.

DigitalOcean Load Balancer Configuration

Create a DigitalOcean Load Balancer. Configure it to forward TCP traffic on port 5432 to your PostgreSQL nodes. Crucially, set up a health check that targets the Patroni REST API health endpoint.

1. Create Load Balancer: In the DigitalOcean control panel, navigate to Networking -> Load Balancers and click “Create Load Balancer”.

2. Add Droplets: Select your three PostgreSQL Droplets (`pg-01`, `pg-02`, `pg-03`).

3. Configure Frontend:

Protocol: TCP
Port: 5432

4. Configure Backend:

Protocol: TCP
Port: 5432
Health Check:

Protocol: HTTP
Port: 8000
Path: /primary
Check Interval: 10s
Response Timeout: 5s
Healthy Threshold: 3
Unhealthy Threshold: 3

The health check on port 8000, path /primary, is key. Patroni’s REST API returns a 200 OK with a JSON body containing the primary’s information when the node is the primary. If a node is a replica or unavailable, it will return a different status code or an empty response, causing the Load Balancer to mark it as unhealthy. This ensures traffic is only sent to the active primary.

Testing Failover

To simulate a failure:

Stop Patroni on the Primary: SSH into the current primary node and run sudo systemctl stop patroni.
Observe Load Balancer: The Load Balancer’s health check for the stopped node will fail.
Patroni Re-election: Patroni will detect the loss of the leader lock in etcd and initiate a new election. One of the replicas will be promoted to primary.
Application Connectivity: Your application, connecting via the Load Balancer’s IP address, will automatically be directed to the new primary after a brief period (determined by Load Balancer and Patroni’s election/startup times).

You can also test by rebooting the primary node. Patroni will automatically restart and rejoin the cluster, potentially becoming primary again if it wins the election.

Integrating Perl Applications with PostgreSQL HA

Perl applications typically interact with PostgreSQL using modules like DBI and DBD::Pg. To ensure your Perl application seamlessly connects to the highly available PostgreSQL cluster, you need to configure your connection string to use the DigitalOcean Load Balancer’s IP address.

Connection String Configuration

Instead of hardcoding the IP address of a specific PostgreSQL instance, use the IP address assigned to your DigitalOcean Load Balancer. This IP address is static and will always point to the current active PostgreSQL primary.

<?php
// Example using PDO in PHP, often used by Perl developers or in mixed environments
// For pure Perl, the DBI connection string would be similar.

$db_host = 'YOUR_DO_LOAD_BALANCER_IP'; // e.g., '165.227.100.200'
$db_port = 5432;
$db_name = 'your_database';
$db_user = 'your_user';
$db_pass = 'your_password';

try {
    $dsn = "pgsql:host={$db_host};port={$db_port};dbname={$db_name}";
    $dbh = new PDO($dsn, $db_user, $db_pass);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    echo "Connected successfully to PostgreSQL via Load Balancer!\n";

    // Example query
    $stmt = $dbh->query("SELECT version()");
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    print_r($row);

} catch (PDOException $e) {
    echo "Error connecting to database: " . $e->getMessage() . "\n";
    // Implement retry logic or error handling here
}
?>

In a pure Perl environment, the DBI connection string would look like this:

use DBI;

my $db_host = 'YOUR_DO_LOAD_BALANCER_IP'; # e.g., '165.227.100.200'
my $db_port = 5432;
my $db_name = 'your_database';
my $db_user = 'your_user';
my $db_pass = 'your_password';

my $dsn = "DBI:Pg:host=$db_host;port=$db_port;dbname=$db_name";
my $dbh;

eval {
    $dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1, AutoCommit => 1 });
    print "Connected successfully to PostgreSQL via Load Balancer!\n";

    # Example query
    my $sth = $dbh->prepare("SELECT version()");
    $sth->execute();
    my $row = $sth->fetchrow_hashref();
    use Data::Dumper;
    print Dumper($row);

    $dbh->disconnect();
};
if ($@) {
    print "Error connecting to database: $@\n";
    # Implement retry logic or error handling here
}

The key takeaway is to abstract the database endpoint. The Load Balancer IP becomes the single point of access for your application.

Handling Transient Errors and Retries

During a failover event, there’s a small window where the old primary is unavailable, and the new primary is being promoted and initialized. Your application might encounter connection errors during this brief period. Robust applications should implement retry logic.

For Perl applications using DBI, you can implement a retry loop around your database connection and query execution. A common pattern involves:

Attempting the database operation.
Catching specific database exceptions (e.g., connection refused, timeout).
If an exception occurs, waiting for a short, randomized interval (exponential backoff is recommended).
Retrying the operation up to a maximum number of attempts.

use DBI;
use Time::HiRes qw(sleep);
use POSIX qw(strftime);

my $db_host = 'YOUR_DO_LOAD_BALANCER_IP';
my $db_port = 5432;
my $db_name = 'your_database';
my $db_user = 'your_user';
my $db_pass = 'your_password';

my $max_retries = 5;
my $base_delay = 2; # seconds
my $dbh;

for (my $attempt = 1; $attempt <= $max_retries; $attempt++) {
    my $dsn = "DBI:Pg:host=$db_host;port=$db_port;dbname=$db_name";
    eval {
        $dbh = DBI->connect($dsn, $db_user, $db_pass, { RaiseError => 1, AutoCommit => 1 });
        print strftime("%Y-%m-%d %H:%M:%S", localtime) . " Connected successfully on attempt $attempt.\n";

        # Perform your database operations here
        my $sth = $dbh->prepare("SELECT 1");
        $sth->execute();
        my $result = $sth->fetchrow_array();
        print "Query successful: $result\n";

        $dbh->disconnect();
        last; # Exit loop on success
    };
    if ($@) {
        print strftime("%Y-%m-%d %H:%M:%S", localtime) . " Connection failed on attempt $attempt: $@\n";
        if ($attempt == $max_retries) {
            die "Failed to connect to database after $max_retries attempts.\n";
        }
        my $delay = $base_delay * (2 ** ($attempt - 1)) + rand(1); # Exponential backoff with jitter
        print "Retrying in $delay seconds...\n";
        sleep($delay);
    }
}

This script demonstrates a basic retry mechanism. In a production Perl application, you would integrate this logic within your data access layer or service functions.

Monitoring and Alerting

Effective disaster recovery relies on proactive monitoring. Key metrics to track include:

Patroni Cluster Health: Monitor the Patroni REST API endpoints on each node. Check for the presence of a primary and the health of replicas.
etcd Cluster Health: Ensure the etcd cluster is healthy and all members are reachable.
PostgreSQL Replication Lag: Monitor pg_stat_replication on the primary to ensure replicas are keeping up.
Load Balancer Health Checks: DigitalOcean provides metrics on load balancer health check success/failure rates.
Application Error Rates: Monitor your application logs for increased database connection errors, which might indicate failover events or underlying issues.

Tools like Prometheus with exporters (e.g., postgres_exporter, etcd_exporter) and Alertmanager can be integrated to provide comprehensive monitoring and automated alerts for critical events, such as a prolonged primary unavailability or failed failover attempts.