Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Laravel Deployments on AWS

Leveraging AWS RDS Multi-AZ for PostgreSQL High Availability

For mission-critical PostgreSQL deployments, Amazon RDS Multi-AZ offers a robust, managed solution for automatic failover. This configuration provisions a synchronous standby replica in a different Availability Zone (AZ). In the event of a primary instance failure (e.g., instance hardware failure, AZ outage, network disruption), RDS automatically initiates a failover to the standby replica. The DNS record for your DB instance is updated to point to the standby, and the failover process typically completes within 60-120 seconds. This minimizes downtime without requiring manual intervention.

When setting up a new RDS instance or modifying an existing one, ensure the “Multi-AZ deployment” option is set to “Yes”.

Configuring RDS for Automatic Failover

The primary configuration for Multi-AZ is a simple toggle within the AWS RDS console or via the AWS CLI/SDK. No application-level changes are strictly necessary for basic failover, as RDS handles the underlying infrastructure and DNS updates. However, understanding the implications for your application is crucial.

AWS CLI Example: Creating a Multi-AZ PostgreSQL Instance

# RDS for PostgreSQL rejects "admin" as a master username (reserved word), so use another name
aws rds create-db-instance \
    --db-instance-identifier my-postgres-ha-instance \
    --db-instance-class db.r5.large \
    --engine postgres \
    --allocated-storage 100 \
    --master-username dbadmin \
    --master-user-password YOUR_SECURE_PASSWORD \
    --vpc-security-group-ids sg-0123456789abcdef0 \
    --db-subnet-group-name my-db-subnet-group \
    --multi-az \
    --backup-retention-period 7 \
    --preferred-backup-window "03:00-04:00" \
    --preferred-maintenance-window "sun:04:00-sun:05:00" \
    --tags Key=Environment,Value=Production Key=Project,Value=MyApp

Key parameters:

--db-instance-identifier: Unique name for your RDS instance.
--db-instance-class: The compute and memory capacity.
--engine: Specifies PostgreSQL.
--multi-az: This flag is essential for enabling automatic failover.
--db-subnet-group-name: Must span at least two AZs within your VPC.
--vpc-security-group-ids: Controls network access to the RDS instance.

Application Connection Handling During Failover

While RDS handles the infrastructure failover, your Laravel application needs to gracefully handle the brief connection interruption. During a failover, the DB endpoint remains the same, but the underlying IP address changes. Applications that maintain persistent connections or have aggressive connection pooling might experience issues. The standard behavior for most database drivers is to attempt a reconnect upon encountering a connection error. However, the duration of the interruption (up to 2 minutes) can lead to timeouts.

Strategies for Laravel Applications

1. Connection Retries with Backoff: Implement a robust retry mechanism in your application’s database connection logic. Laravel’s database layer reconnects and re-runs a query once when it detects a lost connection outside a transaction, but that single attempt does not cover a failover window of a minute or more. A simple approach is to catch connection exceptions and retry after a short, increasing delay (exponential backoff).

2. Connection Pooling (Caution): If using external connection poolers (e.g., PgBouncer), ensure they are configured to detect and re-establish broken connections. However, for RDS Multi-AZ, relying on the application’s built-in retry is often simpler and more effective, as the DNS change is managed by RDS.

3. Idempotent Operations: Design critical database operations to be idempotent. This means that executing the operation multiple times has the same effect as executing it once. This is crucial for operations that might be retried after a failover, preventing duplicate data or unintended side effects.

Example: Basic Connection Retry in Laravel (Conceptual)

While Laravel reconnects once after a lost connection, for longer outages you might wrap critical queries or operations in a retry loop. Retry the whole unit of work (for example, the entire transaction), because a query that fails inside a transaction is not re-run automatically. This is often best handled at a lower level or through a custom database query builder extension.

A more practical approach is to ensure your application’s web server (e.g., PHP-FPM) is configured to restart gracefully or that your load balancer can detect unhealthy application instances that might be stuck waiting for a database connection.
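The same retry-with-backoff idea applies outside PHP, for example in deploy scripts or scheduled jobs that touch the database while a failover is in progress. The following is a minimal bash sketch, not a definitive implementation; the function name retry_with_backoff and its argument order are illustrative assumptions.

```shell
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper) Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The wait doubles after every failure: BASE, 2*BASE, 4*BASE, ...
retry_with_backoff() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if (( attempt >= max_attempts )); then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}
```

For instance, `retry_with_backoff 7 2 pg_isready -h "$DB_HOST" -p 5432` sleeps 2, 4, 8, 16, 32 and 64 seconds between its seven attempts (126 seconds in total), which is slightly longer than the failover window quoted above. Only wrap commands that are safe to run more than once.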

Monitoring and Testing Failover

Regularly testing your failover mechanism is paramount. The RDS “Reboot” action triggers a real failover when you select the “Reboot with failover” option. The instance is briefly unavailable while the standby takes over, so rehearse in a staging environment first and schedule production tests for a low-traffic window.

Simulating Failover via AWS Console

1. Navigate to the RDS console.
2. Select your Multi-AZ DB instance.
3. Click the “Actions” dropdown.
4. Choose “Reboot”.
5. In the confirmation dialog, select “Reboot with failover”.
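6. Monitor the instance status; it will transition through “rebooting” to “available”. CloudWatch has no built-in failover-time metric, so open the instance’s “Logs & events” tab and compare the timestamps of the “Multi-AZ instance failover started” and “Multi-AZ instance failover completed” events.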
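The console steps can also be scripted with the AWS CLI, which makes failover drills repeatable. This is a rough sketch under stated assumptions: the function name force_failover is hypothetical, and the elapsed time it prints is coarse because the CLI waiter polls at fixed intervals; use the event timestamps for the actual failover duration.

```shell
#!/usr/bin/env bash
# force_failover DB_INSTANCE_ID
# (hypothetical helper) Reboots a Multi-AZ instance with failover, waits until
# it is available again, then prints the RDS events of the last 30 minutes.
force_failover() {
    local instance_id=$1
    local started=$SECONDS

    aws rds reboot-db-instance \
        --db-instance-identifier "$instance_id" \
        --force-failover > /dev/null || return 1

    # Blocks until the instance status is "available"
    aws rds wait db-instance-available \
        --db-instance-identifier "$instance_id" || return 1

    echo "instance available again after $(( SECONDS - started ))s"

    # --duration is expressed in minutes
    aws rds describe-events \
        --source-type db-instance \
        --source-identifier "$instance_id" \
        --duration 30 \
        --query 'Events[].[Date,Message]' \
        --output text
}
```

Calling `force_failover my-postgres-ha-instance` against the instance created earlier would list the failover events alongside their timestamps.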

Monitoring Key Metrics

Utilize Amazon CloudWatch to monitor critical RDS metrics:

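ReplicaLag: Applies to read replicas only; it does not describe the Multi-AZ standby, which is replicated synchronously.
CPUUtilization, FreeableMemory, ReadIOPS, WriteIOPS: General instance health.
DatabaseConnections: Drops sharply during a failover and should recover as clients reconnect.
Custom Metric: EngineUptime: Publish this yourself (for example, from a periodic connectivity probe) to track the duration of downtime during failover.
Custom Metric: FailedLoginAttempts: Publish this from application or PostgreSQL logs; a spike can indicate issues during connection re-establishment.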

Considerations for Read Replicas

If you are using RDS Read Replicas for scaling read traffic, understand that they are not failover targets for a Multi-AZ deployment: a Multi-AZ failover always promotes the standby, never a read replica. Existing read replicas automatically switch to the new primary as their replication source, although replication may pause briefly during the switch. Promoting a read replica is a separate, deliberate step that you take when the whole Multi-AZ deployment or its Region is lost. For automated disaster recovery of that kind, consider cross-region replication and automated promotion strategies.

Architecting Auto-Failover for Self-Managed PostgreSQL on EC2

When managing PostgreSQL on EC2 instances, achieving automatic failover requires a more sophisticated, multi-component architecture. This typically involves a primary database server, a streaming replication standby server in a different AZ, a virtual IP (VIP) address that floats between the primary and standby, and a cluster management tool to monitor health and orchestrate failover.

Core Components of a Self-Managed HA Setup

PostgreSQL Streaming Replication: Configured for synchronous or asynchronous replication from the primary to the standby. Synchronous replication guarantees no data loss but can introduce latency. Asynchronous replication offers lower latency but carries a small risk of data loss during failover.
Pacemaker/Corosync (or similar): A robust cluster resource manager that monitors the health of the PostgreSQL service and the VIP. It orchestrates the promotion of the standby and the movement of the VIP.
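Virtual IP (VIP) Address: A floating IP address that your application connects to. Pacemaker assigns this VIP to the active (primary) PostgreSQL server. During failover, it’s moved to the standby. On AWS a subnet cannot span AZs, so a private IP cannot simply float between AZs; cross-AZ setups usually reassign an Elastic IP, update a route table entry, or put a DNS name or load balancer in front instead.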
Patroni (Recommended): A popular, open-source template for PostgreSQL HA. Patroni automates many of the complexities of PostgreSQL HA, including replication management, leader election (using etcd, Consul, or ZooKeeper), and failover orchestration. It often works in conjunction with HAProxy for connection routing.
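The data-loss exposure of asynchronous replication can be quantified by measuring how far each standby lags behind the primary. The sketch below queries pg_stat_replication on the primary; the function names, the postgres user and database, and the threshold used in the usage note are assumptions for illustration.

```shell
#!/usr/bin/env bash
# replication_lag_bytes PRIMARY_HOST
# (hypothetical helper) Prints one line per standby:
#   <application_name> <sync_state> <replay lag in bytes>
replication_lag_bytes() {
    local primary_host=$1
    psql -h "$primary_host" -U postgres -d postgres -At -F ' ' -c "
        SELECT application_name,
               sync_state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::bigint
        FROM pg_stat_replication;"
}

# lag_exceeds THRESHOLD_BYTES
# (hypothetical helper) Reads replication_lag_bytes output on stdin and
# returns 0 (true) if any standby is further behind than THRESHOLD_BYTES.
lag_exceeds() {
    local threshold=$1 name state lag
    while read -r name state lag; do
        # Skip rows without a lag value (replay_lsn can be NULL)
        if [ -n "$lag" ] && [ "$lag" -gt "$threshold" ]; then
            echo "standby $name ($state) is $lag bytes behind" >&2
            return 0
        fi
    done
    return 1
}
```

A cron job could run `replication_lag_bytes 10.0.1.10 | lag_exceeds $((16 * 1024 * 1024)) && echo "ALERT"` to flag any standby more than 16 MiB behind. The bytes reported are roughly what an asynchronous failover could lose at that moment.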

Example: Patroni with etcd and HAProxy

This is a common and highly effective pattern. Patroni manages the PostgreSQL cluster state and leader election using etcd. HAProxy sits in front of the PostgreSQL instances, routing read/write traffic to the current primary and read traffic to replicas.

1. Setting up etcd Cluster

Deploy a highly available etcd cluster (e.g., 3 or 5 nodes) across different AZs. Ensure it’s accessible from your PostgreSQL nodes.

# Example etcdctl command to check cluster health
ETCDCTL_API=3 etcdctl endpoint health --endpoints=http://etcd-node1:2379,http://etcd-node2:2379,http://etcd-node3:2379

2. Configuring Patroni

Each PostgreSQL node runs Patroni. The configuration file (e.g., patroni.yml) specifies etcd endpoints, PostgreSQL data directory, replication settings, and callback scripts for actions like promoting a replica.

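# Example patroni.yml (one file per node; name and addresses differ per node)
scope: my-postgres-cluster
namespace: /service/
name: pg-node1 # Must be unique within the cluster

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.10:8008

etcd3:
  hosts: etcd-node1:2379,etcd-node2:2379,etcd-node3:2379
  protocol: http

bootstrap:
  dcs:
    # Cluster-wide settings live in etcd, not in per-node sections.
    # With synchronous_mode on, Patroni manages synchronous_standby_names itself.
    synchronous_mode: true

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432
  data_dir: /var/lib/postgresql/data
  config_dir: /etc/postgresql
  bin_dir: /usr/lib/postgresql/13/bin

  # Credentials Patroni uses to manage PostgreSQL and set up replication
  authentication:
    superuser:
      username: postgres
      password: CHANGE_ME
    replication:
      username: replicator
      password: CHANGE_ME

  # Callbacks for failover events
  callbacks:
    on_stop: /usr/local/bin/patroni_callback_stop.sh
    on_start: /usr/local/bin/patroni_callback_start.sh
    on_role_change: /usr/local/bin/patroni_callback_role_change.sh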

3. Configuring HAProxy

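HAProxy instances should be deployed in an active/passive or active/active configuration, also spanning multiple AZs. HAProxy finds the current primary by health-checking each node’s Patroni REST API, which answers 200 on /primary only on the leader. Patroni callbacks can additionally push changes to HAProxy when a role change occurs.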

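# Example haproxy.cfg
frontend pgsql_frontend
    bind *:5000
    mode tcp
    default_backend pgsql_backend

backend pgsql_backend
    mode tcp
    balance roundrobin
    option httpchk GET /primary # Patroni REST API returns 200 only on the current primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server primary_db 10.0.1.10:5432 check port 8008 # Health check targets Patroni's REST API (default port 8008)
    server replica_db 10.0.1.20:5432 check port 8008 # Stays marked down until this node is promoted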

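With this health check, HAProxy follows a failover on its own. Patroni’s on_role_change callback is useful when routing cannot be derived from health checks alone, for example to re-point a server entry through the HAProxy Runtime API.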

Application Connection String

Your Laravel application connects to the HAProxy endpoint:

// config/database.php
'pgsql' => [
    'driver' => 'pgsql',
    'host' => env('DB_HOST', 'haproxy-endpoint.yourdomain.com'), // HAProxy VIP or DNS
    'port' => env('DB_PORT', 5000), // HAProxy port
    'database' => env('DB_DATABASE', 'your_database'),
    'username' => env('DB_USERNAME', 'your_user'),
    'password' => env('DB_PASSWORD', 'your_password'),
    'charset' => 'utf8',
    'prefix' => '',
    'schema' => 'public',
    'sslmode' => 'prefer',
],

Automating Failover with Patroni Callbacks

The on_role_change callback script is crucial when routing depends on it. Patroni runs the script on the node whose role changes (for example, on the standby being promoted), passing the action, the new role, and the cluster name as arguments. The script’s responsibility is to update the connection routing mechanism (e.g., HAProxy) to point to the new primary. This ensures that subsequent application connections are directed correctly.

#!/bin/bash
# Example patroni_callback_role_change.sh (simplified)
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>

ACTION=$1  # e.g. 'on_role_change'
ROLE=$2    # 'primary' ('master' on older Patroni versions) or 'replica'
CLUSTER=$3 # Patroni scope, e.g. 'my-postgres-cluster'

# Callbacks run on the node whose role changed, so use this node's own address
PG_HOST=$(hostname -I | awk '{print $1}')
PG_PORT=5432
# Assumed path; requires a 'stats socket ... level admin' line in haproxy.cfg
# and an HAProxy instance running on this host
HAPROXY_SOCKET="/var/run/haproxy.sock"

case "$ROLE" in
    master|primary)
        echo "$CLUSTER: $ACTION, this node ($PG_HOST:$PG_PORT) is the new primary"
        # Point the primary_db server entry at this node via the HAProxy Runtime API
        echo "set server pgsql_backend/primary_db addr $PG_HOST port $PG_PORT" \
            | socat stdio "$HAPROXY_SOCKET"
        ;;
    replica)
        echo "$CLUSTER: $ACTION, this node ($PG_HOST:$PG_PORT) is now a replica"
        # Read-only routing, if any, would be updated here.
        # The logic depends heavily on your HAProxy setup.
        ;;
esac
exit 0

Monitoring and Alerting for Self-Managed HA

Comprehensive monitoring is non-negotiable:

etcd Health: Monitor etcd cluster health, leader status, and latency.
Patroni Status: Check Patroni’s API endpoints for cluster state (leader, replicas, health).
PostgreSQL Replication Lag: Monitor pg_stat_replication on the primary and replication status on the standby.
HAProxy Health Checks: Ensure HAProxy is correctly reporting backend server health.
Application Connectivity: Monitor application errors related to database connection timeouts or failures.
System Metrics: CPU, memory, disk I/O, and network traffic on all database and etcd nodes.

Set up alerts for any deviations from normal operating parameters, especially for critical events like leader loss in etcd, Patroni reporting unhealthy nodes, or sustained replication lag.
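As one way to check Patroni status from a monitoring host, each node’s REST API can be polled directly. This is a minimal sketch assuming Patroni’s default REST port 8008 and plain HTTP; the function name find_primary is hypothetical.

```shell
#!/usr/bin/env bash
# find_primary HOST [HOST...]
# (hypothetical helper) Prints the first host whose Patroni REST API answers
# 200 on /primary. Returns 1 if no node reports itself as primary.
find_primary() {
    local host status
    for host in "$@"; do
        # curl prints only the HTTP status code; unreachable nodes yield 000
        status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 \
            "http://$host:8008/primary" || true)
        if [ "$status" = "200" ]; then
            echo "$host"
            return 0
        fi
    done
    echo "no primary found among: $*" >&2
    return 1
}
```

Running `find_primary 10.0.1.10 10.0.1.20 || echo "ALERT: cluster has no leader"` every minute gives a simple leader-loss alert that does not depend on HAProxy.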

Architecting Auto-Failover for Laravel Queues

Laravel’s queue system, especially when using Redis or database drivers, also requires consideration for high availability. A failure in the queue worker or the underlying queue driver can halt background job processing.

Redis-Based Queues with ElastiCache

For Redis-based queues, leveraging Amazon ElastiCache for Redis provides managed high availability. Configure ElastiCache for Redis with Multi-AZ replication. This ensures that if the primary Redis node fails, ElastiCache automatically fails over to a replica node in another AZ. Your Laravel application connects to the replication group’s primary endpoint (or to the configuration endpoint when cluster mode is enabled). Similar to RDS, ElastiCache keeps the endpoint name stable and repoints it during failover. Your application should be resilient to brief connection interruptions.

ElastiCache for Redis Configuration

When creating an ElastiCache for Redis cluster, select the “Multi-AZ with automatic failover” option. This provisions a primary node and one or more replica nodes in different AZs.
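As with RDS, an ElastiCache failover can be rehearsed from the AWS CLI. The sketch below is one possible approach under stated assumptions: the function name test_redis_failover is hypothetical, and the default node group ID 0001 is the usual value for a cluster-mode-disabled replication group but should be checked against your own setup.

```shell
#!/usr/bin/env bash
# test_redis_failover REPLICATION_GROUP_ID [NODE_GROUP_ID]
# (hypothetical helper) Triggers a test failover, waits for the replication
# group to report "available", then prints its events of the last 30 minutes.
test_redis_failover() {
    local replication_group_id=$1 node_group_id=${2:-0001}

    aws elasticache test-failover \
        --replication-group-id "$replication_group_id" \
        --node-group-id "$node_group_id" > /dev/null || return 1

    # Blocks until the replication group status is "available"
    aws elasticache wait replication-group-available \
        --replication-group-id "$replication_group_id" || return 1

    # --duration is expressed in minutes
    aws elasticache describe-events \
        --source-type replication-group \
        --source-identifier "$replication_group_id" \
        --duration 30 \
        --query 'Events[].[Date,Message]' \
        --output text
}
```

Running this while queue workers are active shows whether they reconnect on their own or need a restart from the process supervisor.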

Laravel Queue Worker Resilience

Ensure your queue workers are managed by a process supervisor like Supervisor or systemd. These tools can automatically restart failed workers. If a worker crashes due to a temporary Redis outage, the supervisor will attempt to restart it. Run worker processes on multiple EC2 instances or containers spread across AZs; because workers pull jobs from the queue, they do not need a load balancer in front of them.

# Example Supervisor configuration for Laravel queue worker
[program:laravel-queue]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/artisan queue:work redis --queue=high-priority,default --sleep=3 --tries=3 --max-time=3600
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=www-data
; Run 4 worker processes
numprocs=4
redirect_stderr=true
stdout_logfile=/var/log/supervisor/laravel-queue.log
; Give a running job time to finish before Supervisor kills the worker
stopwaitsecs=3600

Database-Based Queues

If using the database driver for Laravel queues, the availability of your queue table is directly tied to the availability of your primary PostgreSQL database. Therefore, the RDS Multi-AZ or self-managed HA setup described earlier for PostgreSQL is essential. Ensure your queue worker processes are robust and can handle temporary database connection drops, similar to the main application’s database connection handling.

SQS as a Highly Available Queue Service

For maximum resilience and minimal operational overhead, consider using Amazon Simple Queue Service (SQS). SQS is a fully managed, highly available message queuing service. It’s designed for durability and availability, with messages stored redundantly across multiple AZs. Laravel has excellent built-in support for SQS.

Laravel SQS Configuration

// config/queue.php
'connections' => [
    // ... other connections
    'sqs' => [
        'driver' => 'sqs',
        'key' => env('AWS_ACCESS_KEY_ID'),
        'secret' => env('AWS_SECRET_ACCESS_KEY'),
        'region' => env('AWS_DEFAULT_REGION', 'us-east-1'),
        'queue' => env('AWS_SQS_QUEUE_URL'),
        'after_commit' => false,
    ],
    // ...
],

Using SQS offloads the complexity of managing queue infrastructure availability entirely to AWS. Your Laravel application simply needs to be able to communicate with the SQS service endpoints. Ensure your IAM roles or credentials have the necessary permissions for SQS.

Conclusion: A Layered Approach to Resilience

Architecting for auto-failover is not a single solution but a layered strategy. For PostgreSQL, RDS Multi-AZ offers a managed, highly available solution. For self-managed PostgreSQL, tools like Patroni combined with etcd and HAProxy provide robust, albeit more complex, HA. For Laravel queues, ElastiCache for Redis or, ideally, Amazon SQS provides managed resilience. By implementing these strategies, you can significantly reduce downtime and ensure the continuous availability of your critical applications and background processing.