Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and C++ Deployments on AWS

Automated PostgreSQL Failover with Patroni and AWS RDS

Achieving true high availability for PostgreSQL, especially in a cloud-native environment like AWS, necessitates an automated failover strategy. Manual intervention during an outage is unacceptable for production systems. We’ll explore architecting this using Patroni, a template for highly available PostgreSQL solutions, integrated with AWS infrastructure.

While AWS RDS offers Multi-AZ deployments, which provide automatic failover, understanding the underlying mechanisms and how to achieve similar resilience with self-managed PostgreSQL on EC2 (or even within EKS for containerized workloads) is crucial for specific control or cost optimization scenarios. Patroni excels here by managing PostgreSQL cluster states, leader election, and failover orchestration.

Patroni Architecture Overview

Patroni relies on a distributed consensus store for cluster state management. etcd, ZooKeeper, or Consul are common choices. For AWS, etcd is often preferred due to its simplicity and widespread adoption. Patroni monitors the health of PostgreSQL instances and uses the consensus store to elect a leader (primary) and promote replicas to primary if the current leader fails.

Setting up Patroni with etcd on AWS EC2

This section outlines the steps for a minimal Patroni setup. We’ll assume you have at least three EC2 instances: one for etcd, and two for PostgreSQL nodes (which will eventually form a primary/replica pair).

1. Deploying etcd Cluster

A robust etcd cluster requires an odd number of members for quorum. For simplicity, we’ll demonstrate a single-node etcd setup, but in production, a 3 or 5-node cluster is recommended. Install etcd on your designated EC2 instance.

# On the etcd EC2 instance
sudo apt-get update
sudo apt-get install etcd -y

# Configure etcd (example for a single node)
sudo nano /etc/etcd/etcd.conf.yml

name: etcd0
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://<ETCD_PRIVATE_IP>:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://<ETCD_PRIVATE_IP>:2380
initial-cluster: etcd0=http://<ETCD_PRIVATE_IP>:2380
initial-cluster-state: new

Replace <ETCD_PRIVATE_IP> with the private IP address of your etcd EC2 instance. Ensure your EC2 security group allows traffic on ports 2379 (client) and 2380 (peer) from your PostgreSQL nodes.

sudo systemctl enable etcd
sudo systemctl start etcd
sudo systemctl status etcd

2. Installing PostgreSQL and Patroni

On each PostgreSQL EC2 instance, install PostgreSQL and Patroni. We’ll use Python 3 and pip for Patroni.

# On each PostgreSQL EC2 instance
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib python3 python3-pip -y

# Install Patroni and dependencies
sudo pip3 install "patroni[etcd]" psycopg2-binary

3. Configuring Patroni

Create a Patroni configuration file (e.g., patroni.yml) on each PostgreSQL node. The configuration needs to point to the etcd cluster and define PostgreSQL settings.

# patroni.yml on each PostgreSQL EC2 instance
scope: my_postgres_cluster # Unique name for your cluster
namespace: /service/ # etcd namespace for this cluster

restapi:
  listen: 0.0.0.0:8008
  connect_address: <POSTGRES_PRIVATE_IP>:8008 # IP of this PG node

etcd:
  host: <ETCD_PRIVATE_IP>:2379
  protocol: http

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  connect_address: <POSTGRES_PRIVATE_IP>:5432
  data_dir: /var/lib/postgresql/13/main # Adjust version as needed
  bin_dir: /usr/lib/postgresql/13/bin # Adjust version as needed
  pg_hba:
    - host replication replicator <POSTGRES_PRIVATE_IP>/32 md5
    - host replication replicator <POSTGRES_OTHER_NODE_IP>/32 md5
    - host all all 0.0.0.0/0 md5 # For application access, adjust as needed
  replication:
    username: replicator
    password: <REPLICATION_PASSWORD>
  parameters:
    max_connections: 100
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    max_replication_slots: 5

# Replication settings (for replicas to connect to primary)
replication_mode: async # or sync, depending on RPO needs

# Tags for node identification
tags:
  nofailover: false
  clonefrom: false

Key points:

Replace <POSTGRES_PRIVATE_IP> with the private IP of the current PostgreSQL EC2 instance.
Replace <ETCD_PRIVATE_IP> with the etcd instance’s private IP.
Replace <POSTGRES_OTHER_NODE_IP> with the private IP of the *other* PostgreSQL node.
Ensure the data_dir and bin_dir match your PostgreSQL installation.
Configure replication user and password.
Adjust pg_hba for your network access requirements.
scope must be identical across all nodes in the cluster.
namespace is the etcd path prefix.
restapi.connect_address and postgresql.connect_address are crucial for Patroni to communicate with itself and for other nodes to connect.

4. Initializing the PostgreSQL Cluster

Start Patroni on one node. It will initialize PostgreSQL, create the replication user, and register itself in etcd as the primary. Then, start Patroni on the second node. It will detect the existing primary and configure itself as a replica.

# On the first PostgreSQL EC2 instance (will become primary)
sudo patroni /etc/patroni/patroni.yml

# On the second PostgreSQL EC2 instance (will become replica)
sudo patroni /etc/patroni/patroni.yml

Monitor the logs of both Patroni instances. You should see the primary initializing and the replica connecting and streaming replication.

Automated Failover Mechanism

Patroni continuously monitors the health of the PostgreSQL primary. This is typically done by attempting to connect to the primary and executing a simple query. If the primary becomes unresponsive:

Patroni on the replica nodes detects the primary’s failure (e.g., via a lost lock in etcd or failed health checks).
Patroni nodes communicate via etcd to elect a new leader.
The replica that is most up-to-date and passes health checks is promoted to primary.
Other nodes reconfigure themselves to replicate from the new primary.
Patroni’s REST API can be used to query cluster status and trigger manual failovers if needed.

Integrating with AWS Services

1. Security Groups and IAM

Ensure your EC2 security groups allow:

PostgreSQL nodes to communicate with the etcd node on port 2379.
PostgreSQL nodes to communicate with each other on port 5432 (for replication and application access).
Patroni REST API ports (8008 by default) if accessed externally.
Outbound access for Patroni to reach etcd.

Use IAM roles for EC2 instances to grant necessary permissions if you integrate with other AWS services (e.g., S3 for backups).

2. Load Balancing for Application Access

Directing application traffic to the current PostgreSQL primary is critical. An AWS Network Load Balancer (NLB) or Application Load Balancer (ALB) can be configured to point to the PostgreSQL port (5432) of your active primary. The challenge is dynamically updating the load balancer target group when a failover occurs.

Option A: Patroni API Integration (Recommended)

You can write a small script or Lambda function that polls Patroni’s REST API (e.g., /primary endpoint) to determine the current primary’s IP. This script can then use the AWS SDK to update the NLB/ALB target group’s registered instances. This script would be triggered by Patroni’s callbacks or run on a schedule.

import boto3
import requests
import json

# AWS Configuration
REGION_NAME = "us-east-1"
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-pg-tg/abcdef1234567890"
PATRONI_PRIMARY_URL = "http://<PATRONI_NODE_IP>:8008/primary" # Poll one of the Patroni nodes

def get_primary_ip():
    try:
        response = requests.get(PATRONI_PRIMARY_URL, timeout=5)
        response.raise_for_status() # Raise an exception for bad status codes
        data = response.json()
        return data.get("host")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching primary info: {e}")
        return None

def update_target_group(primary_ip):
    elbv2_client = boto3.client("elbv2", region_name=REGION_NAME)

    # Get current registered targets
    current_targets = elbv2_client.describe_target_groups(TargetGroupArns=[TARGET_GROUP_ARN])['TargetGroups'][0]['TargetType'] # Simplified, actual logic needs to fetch targets

    # Logic to determine if update is needed (compare primary_ip with current registered targets)
    # This is a placeholder. Real implementation needs to:
    # 1. Fetch current registered targets for the TG.
    # 2. Deregister old targets if they are not the new primary.
    # 3. Register the new primary if it's not already registered.

    print(f"Simulating update for primary IP: {primary_ip}")
    # Example: Deregistering a hypothetical old primary
    # elbv2_client.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[{'Id': 'old_primary_instance_id'}])
    # Example: Registering the new primary
    # elbv2_client.register_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[{'Id': 'new_primary_instance_id', 'Port': 5432}])

def lambda_handler(event, context):
    primary_ip = get_primary_ip()
    if primary_ip:
        update_target_group(primary_ip)
        return {"statusCode": 200, "body": json.dumps(f"Target group updated for primary: {primary_ip}")}
    else:
        return {"statusCode": 500, "body": json.dumps("Failed to determine PostgreSQL primary.")}

# For local testing:
# if __name__ == "__main__":
#     # Replace with a valid Patroni node IP for testing
#     PATRONI_PRIMARY_URL = "http://10.0.1.10:8008/primary"
#     update_target_group(get_primary_ip())

This script can be deployed as an AWS Lambda function triggered by CloudWatch Events (e.g., every minute) or, more effectively, by Patroni’s callback.url mechanism, which executes a webhook upon state changes (like failover).

Option B: DNS-based Failover

Use AWS Route 53 with health checks. Create a health check that monitors the PostgreSQL port on the current primary. Configure a DNS record (e.g., db.yourdomain.com) to point to the primary’s IP, with a failover record pointing to a standby. When the primary health check fails, Route 53 automatically updates the DNS to point to the failover record. This requires careful management of the failover record’s IP address, often updated by a similar script as above.

C++ Application Considerations for High Availability

Your C++ application needs to be resilient to database connection interruptions and be able to reconnect to the new primary after a failover.

1. Connection Pooling and Retries

Implement robust connection pooling. When a connection fails (due to a failover), the pool should attempt to establish new connections to the database endpoint (e.g., your NLB/ALB DNS name or Route 53 record). Use exponential backoff for retry attempts to avoid overwhelming the newly promoted primary.

#include <iostream>
#include <pqxx/pqxx> // Example using libpqxx
#include <chrono>
#include <thread>

// Assume db_connection_string is dynamically updated or points to a load balancer
const std::string DB_CONNECTION_STRING = "postgresql://user:[email protected]:5432/mydatabase";
const int MAX_RETRIES = 5;
const std::chrono::seconds INITIAL_BACKOFF(2);

void execute_query(const std::string& query) {
    int retry_count = 0;
    std::chrono::seconds current_backoff = INITIAL_BACKOFF;

    while (retry_count <= MAX_RETRIES) {
        try {
            pqxx::connection C(DB_CONNECTION_STRING);
            if (C.is_open()) {
                std::cout << "Connected to database." << std::endl;
                pqxx::work W(C);
                W.exec(query);
                W.commit();
                std::cout << "Query executed successfully." << std::endl;
                return; // Success
            } else {
                throw std::runtime_error("Connection not open.");
            }
        } catch (const pqxx::broken_connection& e) {
            std::cerr << "Database connection broken: " << e.what() << std::endl;
            retry_count++;
            if (retry_count <= MAX_RETRIES) {
                std::cout << "Retrying in " << current_backoff.count() << " seconds... (Attempt " << retry_count << "/" << MAX_RETRIES << ")" << std::endl;
                std::this_thread::sleep_for(current_backoff);
                current_backoff *= 2; // Exponential backoff
            }
        } catch (const std::exception& e) {
            std::cerr << "An error occurred: " << e.what() << std::endl;
            retry_count++;
            if (retry_count <= MAX_RETRIES) {
                std::cout << "Retrying in " << current_backoff.count() << " seconds... (Attempt " << retry_count << "/" << MAX_RETRIES << ")" << std::endl;
                std::this_thread::sleep_for(current_backoff);
                current_backoff *= 2; // Exponential backoff
            }
        }
    }
    std::cerr << "Failed to execute query after " << MAX_RETRIES << " retries." << std::endl;
    // Consider more drastic error handling here: alert, shutdown, etc.
}

int main() {
    execute_query("SELECT 1;"); // Example query
    // execute_query("INSERT INTO logs (message) VALUES ('Application started');");
    return 0;
}

2. Application Configuration Management

The database connection string (or endpoint) used by your C++ application must be easily updatable. This can be managed via environment variables, configuration files, or a configuration service (like AWS Systems Manager Parameter Store or Consul). When a failover occurs and the database endpoint changes (e.g., Route 53 record updates), your application's configuration needs to reflect this. A common pattern is to have the application periodically refresh its configuration or to restart/reconfigure pods (if containerized) when the endpoint changes.

Monitoring and Alerting

Comprehensive monitoring is non-negotiable. Use tools like Prometheus with PostgreSQL exporters, CloudWatch, or Datadog to track:

PostgreSQL replication lag.
Primary/Replica status.
Patroni health checks.
etcd cluster health.
EC2 instance health.
Load balancer health checks.
Application-level connection success/failure rates.

Configure alerts for critical metrics, such as high replication lag, primary unavailability, or failed failover attempts. Integrate these alerts with your incident management system (e.g., PagerDuty, Opsgenie).

Advanced Considerations

1. Synchronous Replication

For zero data loss (RPO=0), configure synchronous replication in Patroni. This ensures that a transaction is committed on the primary only after it has been acknowledged by at least one synchronous replica. This comes at the cost of increased write latency. Patroni's postgresql.replication_sync_priority and postgresql.replication_sync_state parameters are key here.

# In patroni.yml
postgresql:
  # ... other params
  replication_sync_priority: 1 # Higher number means more preferred for sync
  replication_sync_state: "sync" # or "async"

2. Backup and Restore Strategy

Automated backups are essential. Use tools like pg_basebackup or wal-g to perform base backups and continuous archiving of WAL segments to S3. Patroni can be configured with backup. parameters to trigger backups. Ensure your restore process is regularly tested.

# In patroni.yml
backup:
  enabled: true
  archive_command: "aws s3 cp %p s3://your-backup-bucket/pg_wal/%f"
  restore_command: "aws s3 cp s3://your-backup-bucket/pg_wal/%f %p"
  path: /path/to/backups # Local path for base backups
  target_reset: false
  max_concurrency: 4
  upload_concurrency: 4
  # ... other backup settings

3. Multi-Region Deployments

For disaster recovery across AWS regions, you'd typically set up a Patroni cluster in each region, potentially using Route 53's latency-based routing or geolocation routing. Failover between regions is more complex and often involves manual promotion or a more sophisticated orchestration layer, as network latency and consistency guarantees become significant challenges.

Conclusion

Architecting automated failover for PostgreSQL and its associated C++ applications on AWS requires a layered approach. Patroni provides the core PostgreSQL cluster management, while AWS services like EC2, etcd, NLB/ALB, Route 53, and Lambda handle the infrastructure and integration. Robust C++ application design with proper error handling and configuration management ensures seamless operation through failover events. Continuous testing and monitoring are paramount to validating the resilience of your setup.