Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Perl Deployments on Linode
Establishing a High-Availability PostgreSQL Cluster with Patroni
For critical applications, a single PostgreSQL instance is a single point of failure. Architecting for high availability (HA) necessitates a robust failover mechanism. Patroni, a template for PostgreSQL HA, leverages distributed configuration stores like etcd, Consul, or ZooKeeper to manage cluster state and orchestrate automated failovers. This section details setting up a Patroni-managed PostgreSQL cluster on Linode, focusing on etcd for state management.
We’ll deploy three Linode instances. One will host etcd, and the other two will run PostgreSQL nodes managed by Patroni. For simplicity, we’ll use Debian 11 (Bullseye) on all instances.
1. Setting up etcd Cluster
A highly available etcd cluster is paramount. For a minimal setup, three etcd nodes are recommended. For this example, we’ll consolidate etcd onto a single Linode instance for simplicity, acknowledging that a production deployment would distribute etcd across multiple nodes for its own HA. Install etcd on your designated Linode:
First, add the etcd repository and install the package:
curl -L https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - echo "deb https://packages.cloud.google.com/apt/ cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list apt-get update apt-get install etcd -y
Next, configure etcd. The default configuration is often sufficient for initial testing, but for production, you’ll want to secure it with TLS and configure peer discovery. For this example, we’ll use a basic configuration. Ensure the etcd service is enabled and started:
systemctl enable etcd systemctl start etcd
Verify etcd is running and accessible:
etcdctl member list
If you are running etcd on a separate node, ensure its firewall allows traffic on ports 2379 (client API) and 2380 (peer API).
2. Installing and Configuring Patroni
Patroni is typically installed via pip. Ensure you have Python 3 and pip installed on your PostgreSQL nodes.
apt-get update apt-get install python3-pip python3-dev build-essential libpq-dev postgresql-server-dev-13 -y pip3 install --upgrade pip pip3 install "patroni[etcd]" psycopg2-binary
Now, create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration tells Patroni how to connect to etcd, how to manage PostgreSQL, and replication settings.
# /etc/patroni/patroni.yml scope: my_pg_cluster namespace: /service/ # Optional: prefix for keys in etcd restapi: listen: 0.0.0.0:8008 connect_address: &connect_address:8008 etcd: host: :2379 protocol: http # Use https for TLS secured etcd # PostgreSQL configuration postgresql: listen: 0.0.0.0:5432 connect_address: *connect_address data_dir: /var/lib/postgresql/13/main pg_hba: - host replication replicator 0.0.0.0/0 md5 - host all all 0.0.0.0/0 md5 replication: username: replicator password: your_replication_password ssl: false parameters: max_connections: 100 shared_buffers: 256MB wal_level: replica hot_standby: "on" max_wal_senders: 10 max_replication_slots: 10 # Tags for node identification tags: nofailover: false clonefrom: false # Bootstrap configuration (only for the initial primary) bootstrap: dcs: ttl: 30 loop_wait: 10 retry_timeout: 10 maximum_lag_on_failover: 1048576 # 1MB postgresql: use_pg_rewind: true parameters: recovery_target_timeline: "latest" hot_standby: "on" max_connections: 100 shared_buffers: 256MB wal_level: replica max_wal_senders: 10 max_replication_slots: 10 replication_slots: - name: patroni_slot_1 database: template1 plugin: pgoutput - name: patroni_slot_2 database: template1 plugin: pgoutput users: replicator: password: your_replication_password options: - REPLICATION admin: password: your_admin_password options: - SUPERUSER - CREATEDB - CREATEROLE - REPLICATION - LOGIN # Patroni REST API configuration # This is for the Patroni API itself, not the PostgreSQL API # If you want to expose Patroni API to other services, configure it here # api: # listen: 0.0.0.0:8008 # ssl: false # Logging configuration log: level: INFO dir: /var/log/patroni file: patroni.log
Replace <IP_ADDRESS_OF_THIS_NODE> with the actual IP address of the Linode instance where this configuration is being applied. Also, replace <ETCD_NODE_IP> with the IP address of your etcd node. Ensure the PostgreSQL data directory (/var/lib/postgresql/13/main) exists and has the correct ownership (typically postgres:postgres).
Create the log directory and set permissions:
mkdir -p /var/log/patroni chown -R postgres:postgres /var/log/patroni
Now, create a systemd service file for Patroni (e.g., /etc/systemd/system/patroni.service):
[Unit] Description=Patroni PostgreSQL High-Availability Manager After=network.target [Service] User=postgres Group=postgres ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure KillMode=process TimeoutSec=30 [Install] WantedBy=multi-user.target
Enable and start the Patroni service on all PostgreSQL nodes:
systemctl daemon-reload systemctl enable patroni systemctl start patroni
Monitor the Patroni logs (/var/log/patroni/patroni.log) for any errors. The first node to start will bootstrap the cluster, initializing PostgreSQL and creating the necessary replication user and slots. Subsequent nodes will join as replicas.
3. Verifying Cluster Status and Failover
You can check the cluster status by querying the Patroni REST API. On any node, you can use curl:
curl http://localhost:8008/cluster
This will output JSON detailing the cluster state, including the current primary and replicas. To test failover, you can stop the Patroni service on the current primary node:
systemctl stop patroni
Within a short period (determined by the ttl and loop_wait in your patroni.yml), Patroni will detect the primary’s absence, elect a new primary from the available replicas, and reconfigure the remaining replicas. You can observe this process in the logs of the remaining nodes. After the failover, query the cluster status again to confirm the new primary.
For applications connecting to PostgreSQL, it’s crucial to use a mechanism that can dynamically discover the current primary. This can be achieved by querying the Patroni API or by using a load balancer that integrates with Patroni (e.g., HAProxy with a script to check Patroni’s API).
Integrating Perl Applications with Auto-Failover PostgreSQL
Connecting Perl applications to a highly available PostgreSQL cluster requires a strategy to handle primary failovers. Hardcoding the primary’s IP address is a recipe for disaster. Instead, applications should be designed to query for the current primary’s endpoint dynamically.
1. Dynamic Connection String Resolution
The most straightforward approach is to have your Perl application query the Patroni REST API to determine the current primary’s IP address and port. This can be done within your application’s connection logic or via a wrapper script.
Here’s a Perl script snippet demonstrating how to fetch the primary’s connection details from Patroni:
use strict;
use warnings;
use LWP::UserAgent;
use JSON;
sub get_pg_primary_connection {
my ($patroni_api_url) = @_;
my $ua = LWP::UserAgent->new;
$ua->timeout(10); # Set a reasonable timeout
my $response = $ua->get($patroni_api_url . '/cluster');
if ($response->is_success) {
my $json_data = $response->decoded_content;
my $cluster_info = decode_json($json_data);
if ($cluster_info && $cluster_info->{primary}) {
my $primary_node = $cluster_info->{primary};
my $host = $primary_node->{host};
my $port = $primary_node->{port};
my $dbname = 'your_database'; # Replace with your database name
my $user = 'your_user'; # Replace with your database user
my $password = 'your_password'; # Replace with your database password
# Construct DSN (Data Source Name)
return "dbi:Pg:host=$host;port=$port;dbname=$dbname", $user, $password;
} else {
warn "Could not find primary node in Patroni cluster info.\n";
return undef, undef, undef;
}
} else {
warn "Failed to fetch cluster info from Patroni API: " . $response->status_line . "\n";
return undef, undef, undef;
}
}
# --- Usage Example ---
my $patroni_base_url = 'http://localhost:8008'; # Or the IP of a Patroni API endpoint
my ($dsn, $user, $password) = get_pg_primary_connection($patroni_base_url);
if ($dsn) {
print "Successfully retrieved connection details:\n";
print "DSN: $dsn\n";
# Now you can use this DSN to connect to PostgreSQL
# Example using DBI:
# use DBI;
# my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1, AutoCommit => 1 });
# print "Connected to PostgreSQL!\n";
# $dbh->disconnect;
} else {
print "Failed to get PostgreSQL primary connection details.\n";
}
This function get_pg_primary_connection takes the base URL of the Patroni REST API, queries the /cluster endpoint, parses the JSON response, and extracts the host and port of the current primary. It then constructs a DBI DSN string.
2. Implementing Retry Logic and Connection Pooling
Even with dynamic endpoint resolution, network glitches or brief moments during failover can cause connection failures. Robust Perl applications should implement retry logic for database connections.
use strict;
use warnings;
use DBI;
use LWP::UserAgent;
use JSON;
use constant MAX_RETRIES => 5;
use constant RETRY_DELAY => 5; # seconds
# (Include the get_pg_primary_connection function from above)
sub connect_to_postgres {
my ($patroni_api_url, $db_name, $db_user, $db_pass) = @_;
my $dbh = undef;
my $attempt = 0;
while ($attempt < MAX_RETRIES) {
my ($dsn, $user, $password) = get_pg_primary_connection($patroni_api_url);
if ($dsn) {
eval {
$dbh = DBI->connect($dsn, $user, $password, {
RaiseError => 1,
AutoCommit => 1,
PrintError => 0, # Let RaiseError handle it
pg_enable_utf8 => 1,
});
};
if ($@) {
# Connection failed
warn "Attempt $attempt: Failed to connect to PostgreSQL: $@\n";
$dbh = undef; # Ensure $dbh is undef on failure
} else {
# Connection successful
print "Successfully connected to PostgreSQL!\n";
return $dbh;
}
} else {
warn "Attempt $attempt: Could not resolve PostgreSQL primary.\n";
}
$attempt++;
if ($attempt < MAX_RETRIES) {
print "Retrying in " . RETRY_DELAY . " seconds...\n";
sleep(RETRY_DELAY);
}
}
die "Failed to connect to PostgreSQL after $MAX_RETRIES attempts.\n";
}
# --- Usage Example ---
my $patroni_base_url = 'http://localhost:8008';
my $db_name = 'your_database';
my $db_user = 'your_user';
my $db_pass = 'your_password';
my $dbh = connect_to_postgres($patroni_base_url, $db_name, $db_user, $db_pass);
# Now you can use $dbh for database operations
# Example:
# my $sth = $dbh->prepare("SELECT version()");
# $sth->execute;
# my ($version) = $sth->fetchrow_array;
# print "PostgreSQL Version: $version\n";
# $dbh->disconnect;
This enhanced connect_to_postgres function wraps the connection attempt in a loop, retrying up to MAX_RETRIES times with a delay of RETRY_DELAY seconds between attempts. This significantly improves resilience during transient network issues or during the brief window of a failover event.
For applications with high connection churn or that need to maintain consistent performance, consider implementing connection pooling. Libraries like DBI::Pool can be used, but ensure your pooling strategy correctly invalidates and re-establishes connections when the primary changes. A common pattern is to re-resolve the primary and obtain a new connection from the pool when a query fails due to a connection error.
3. Application-Level Considerations
Beyond connection management, your Perl application should be designed to tolerate brief periods of database unavailability. This might involve:
- Implementing idempotent operations where possible, so retrying a failed write operation doesn’t cause duplicate data.
- Queuing writes if the database is temporarily unavailable, to be replayed once connectivity is restored.
- Gracefully degrading functionality rather than crashing entirely when database access is lost.
By combining Patroni for automated PostgreSQL failover with intelligent connection handling in your Perl applications, you can build a resilient and highly available system on Linode.