Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Perl Deployments on Linode

Establishing a High-Availability PostgreSQL Cluster with Patroni

For critical applications, a single PostgreSQL instance is a single point of failure. Architecting for high availability (HA) necessitates a robust failover mechanism. Patroni, a template for PostgreSQL HA, leverages distributed configuration stores like etcd, Consul, or ZooKeeper to manage cluster state and orchestrate automated failovers. This section details setting up a Patroni-managed PostgreSQL cluster on Linode, focusing on etcd for state management.

We’ll deploy three Linode instances. One will host etcd, and the other two will run PostgreSQL nodes managed by Patroni. For simplicity, we’ll use Debian 11 (Bullseye) on all instances.

1. Setting up etcd Cluster

A highly available etcd cluster is paramount. For a minimal setup, three etcd nodes are recommended. For this example, we’ll consolidate etcd onto a single Linode instance for simplicity, acknowledging that a production deployment would distribute etcd across multiple nodes for its own HA. Install etcd on your designated Linode:

First, add the etcd repository and install the package:

curl -L https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://packages.cloud.google.com/apt/ cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
apt-get update
apt-get install etcd -y

Next, configure etcd. The default configuration is often sufficient for initial testing, but for production, you’ll want to secure it with TLS and configure peer discovery. For this example, we’ll use a basic configuration. Ensure the etcd service is enabled and started:

systemctl enable etcd
systemctl start etcd

Verify etcd is running and accessible:

etcdctl member list

If you are running etcd on a separate node, ensure its firewall allows traffic on ports 2379 (client API) and 2380 (peer API).

2. Installing and Configuring Patroni

Patroni is typically installed via pip. Ensure you have Python 3 and pip installed on your PostgreSQL nodes.

apt-get update
apt-get install python3-pip python3-dev build-essential libpq-dev postgresql-server-dev-13 -y
pip3 install --upgrade pip
pip3 install "patroni[etcd]" psycopg2-binary

Now, create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration tells Patroni how to connect to etcd, how to manage PostgreSQL, and replication settings.

# /etc/patroni/patroni.yml
scope: my_pg_cluster
namespace: /service/ # Optional: prefix for keys in etcd

restapi:
  listen: 0.0.0.0:8008
  connect_address: &connect_address :8008

etcd:
  host: :2379
  protocol: http # Use https for TLS secured etcd

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  connect_address: *connect_address
  data_dir: /var/lib/postgresql/13/main
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
  replication:
    username: replicator
    password: your_replication_password
    ssl: false
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10

# Tags for node identification
tags:
  nofailover: false
  clonefrom: false

# Bootstrap configuration (only for the initial primary)
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
    postgresql:
      use_pg_rewind: true
      parameters:
        recovery_target_timeline: "latest"
        hot_standby: "on"
        max_connections: 100
        shared_buffers: 256MB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
      replication_slots:
        - name: patroni_slot_1
          database: template1
          plugin: pgoutput
        - name: patroni_slot_2
          database: template1
          plugin: pgoutput
      users:
        replicator:
          password: your_replication_password
          options:
            - REPLICATION
        admin:
          password: your_admin_password
          options:
            - SUPERUSER
            - CREATEDB
            - CREATEROLE
            - REPLICATION
            - LOGIN

# Patroni REST API configuration
# This is for the Patroni API itself, not the PostgreSQL API
# If you want to expose Patroni API to other services, configure it here
# api:
#   listen: 0.0.0.0:8008
#   ssl: false

# Logging configuration
log:
  level: INFO
  dir: /var/log/patroni
  file: patroni.log

Replace <IP_ADDRESS_OF_THIS_NODE> with the actual IP address of the Linode instance where this configuration is being applied. Also, replace <ETCD_NODE_IP> with the IP address of your etcd node. Ensure the PostgreSQL data directory (/var/lib/postgresql/13/main) exists and has the correct ownership (typically postgres:postgres).

Create the log directory and set permissions:

mkdir -p /var/log/patroni
chown -R postgres:postgres /var/log/patroni

Now, create a systemd service file for Patroni (e.g., /etc/systemd/system/patroni.service):

[Unit]
Description=Patroni PostgreSQL High-Availability Manager
After=network.target

[Service]
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
KillMode=process
TimeoutSec=30

[Install]
WantedBy=multi-user.target

Enable and start the Patroni service on all PostgreSQL nodes:

systemctl daemon-reload
systemctl enable patroni
systemctl start patroni

Monitor the Patroni logs (/var/log/patroni/patroni.log) for any errors. The first node to start will bootstrap the cluster, initializing PostgreSQL and creating the necessary replication user and slots. Subsequent nodes will join as replicas.

3. Verifying Cluster Status and Failover

You can check the cluster status by querying the Patroni REST API. On any node, you can use curl:

curl http://localhost:8008/cluster

This will output JSON detailing the cluster state, including the current primary and replicas. To test failover, you can stop the Patroni service on the current primary node:

systemctl stop patroni

Within a short period (determined by the ttl and loop_wait in your patroni.yml), Patroni will detect the primary’s absence, elect a new primary from the available replicas, and reconfigure the remaining replicas. You can observe this process in the logs of the remaining nodes. After the failover, query the cluster status again to confirm the new primary.

For applications connecting to PostgreSQL, it’s crucial to use a mechanism that can dynamically discover the current primary. This can be achieved by querying the Patroni API or by using a load balancer that integrates with Patroni (e.g., HAProxy with a script to check Patroni’s API).

Integrating Perl Applications with Auto-Failover PostgreSQL

Connecting Perl applications to a highly available PostgreSQL cluster requires a strategy to handle primary failovers. Hardcoding the primary’s IP address is a recipe for disaster. Instead, applications should be designed to query for the current primary’s endpoint dynamically.

1. Dynamic Connection String Resolution

The most straightforward approach is to have your Perl application query the Patroni REST API to determine the current primary’s IP address and port. This can be done within your application’s connection logic or via a wrapper script.

Here’s a Perl script snippet demonstrating how to fetch the primary’s connection details from Patroni:

use strict;
use warnings;
use LWP::UserAgent;
use JSON;

sub get_pg_primary_connection {
    my ($patroni_api_url) = @_;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10); # Set a reasonable timeout

    my $response = $ua->get($patroni_api_url . '/cluster');

    if ($response->is_success) {
        my $json_data = $response->decoded_content;
        my $cluster_info = decode_json($json_data);

        if ($cluster_info && $cluster_info->{primary}) {
            my $primary_node = $cluster_info->{primary};
            my $host = $primary_node->{host};
            my $port = $primary_node->{port};
            my $dbname = 'your_database'; # Replace with your database name
            my $user = 'your_user';       # Replace with your database user
            my $password = 'your_password'; # Replace with your database password

            # Construct DSN (Data Source Name)
            return "dbi:Pg:host=$host;port=$port;dbname=$dbname", $user, $password;
        } else {
            warn "Could not find primary node in Patroni cluster info.\n";
            return undef, undef, undef;
        }
    } else {
        warn "Failed to fetch cluster info from Patroni API: " . $response->status_line . "\n";
        return undef, undef, undef;
    }
}

# --- Usage Example ---
my $patroni_base_url = 'http://localhost:8008'; # Or the IP of a Patroni API endpoint

my ($dsn, $user, $password) = get_pg_primary_connection($patroni_base_url);

if ($dsn) {
    print "Successfully retrieved connection details:\n";
    print "DSN: $dsn\n";
    # Now you can use this DSN to connect to PostgreSQL
    # Example using DBI:
    # use DBI;
    # my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1, AutoCommit => 1 });
    # print "Connected to PostgreSQL!\n";
    # $dbh->disconnect;
} else {
    print "Failed to get PostgreSQL primary connection details.\n";
}

This function get_pg_primary_connection takes the base URL of the Patroni REST API, queries the /cluster endpoint, parses the JSON response, and extracts the host and port of the current primary. It then constructs a DBI DSN string.

2. Implementing Retry Logic and Connection Pooling

Even with dynamic endpoint resolution, network glitches or brief moments during failover can cause connection failures. Robust Perl applications should implement retry logic for database connections.

use strict;
use warnings;
use DBI;
use LWP::UserAgent;
use JSON;
use constant MAX_RETRIES => 5;
use constant RETRY_DELAY => 5; # seconds

# (Include the get_pg_primary_connection function from above)

sub connect_to_postgres {
    my ($patroni_api_url, $db_name, $db_user, $db_pass) = @_;
    my $dbh = undef;
    my $attempt = 0;

    while ($attempt < MAX_RETRIES) {
        my ($dsn, $user, $password) = get_pg_primary_connection($patroni_api_url);

        if ($dsn) {
            eval {
                $dbh = DBI->connect($dsn, $user, $password, {
                    RaiseError => 1,
                    AutoCommit => 1,
                    PrintError => 0, # Let RaiseError handle it
                    pg_enable_utf8 => 1,
                });
            };

            if ($@) {
                # Connection failed
                warn "Attempt $attempt: Failed to connect to PostgreSQL: $@\n";
                $dbh = undef; # Ensure $dbh is undef on failure
            } else {
                # Connection successful
                print "Successfully connected to PostgreSQL!\n";
                return $dbh;
            }
        } else {
            warn "Attempt $attempt: Could not resolve PostgreSQL primary.\n";
        }

        $attempt++;
        if ($attempt < MAX_RETRIES) {
            print "Retrying in " . RETRY_DELAY . " seconds...\n";
            sleep(RETRY_DELAY);
        }
    }

    die "Failed to connect to PostgreSQL after $MAX_RETRIES attempts.\n";
}

# --- Usage Example ---
my $patroni_base_url = 'http://localhost:8008';
my $db_name = 'your_database';
my $db_user = 'your_user';
my $db_pass = 'your_password';

my $dbh = connect_to_postgres($patroni_base_url, $db_name, $db_user, $db_pass);

# Now you can use $dbh for database operations
# Example:
# my $sth = $dbh->prepare("SELECT version()");
# $sth->execute;
# my ($version) = $sth->fetchrow_array;
# print "PostgreSQL Version: $version\n";

# $dbh->disconnect;

This enhanced connect_to_postgres function wraps the connection attempt in a loop, retrying up to MAX_RETRIES times with a delay of RETRY_DELAY seconds between attempts. This significantly improves resilience during transient network issues or during the brief window of a failover event.

For applications with high connection churn or that need to maintain consistent performance, consider implementing connection pooling. Libraries like DBI::Pool can be used, but ensure your pooling strategy correctly invalidates and re-establishes connections when the primary changes. A common pattern is to re-resolve the primary and obtain a new connection from the pool when a query fails due to a connection error.

3. Application-Level Considerations

Beyond connection management, your Perl application should be designed to tolerate brief periods of database unavailability. This might involve:

Implementing idempotent operations where possible, so retrying a failed write operation doesn’t cause duplicate data.
Queuing writes if the database is temporarily unavailable, to be replayed once connectivity is restored.
Gracefully degrading functionality rather than crashing entirely when database access is lost.

By combining Patroni for automated PostgreSQL failover with intelligent connection handling in your Perl applications, you can build a resilient and highly available system on Linode.

Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Perl Deployments on Linode

Establishing a High-Availability PostgreSQL Cluster with Patroni

1. Setting up etcd Cluster

2. Installing and Configuring Patroni

3. Verifying Cluster Status and Failover

Integrating Perl Applications with Auto-Failover PostgreSQL

1. Dynamic Connection String Resolution

2. Implementing Retry Logic and Connection Pooling

3. Application-Level Considerations

Recent Posts

Top Categories

Our Products

Our Services