Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Laravel Deployments on DigitalOcean

Leveraging PostgreSQL Streaming Replication for High Availability

Achieving automated failover for a PostgreSQL database is paramount for any production-grade application. The cornerstone of this strategy is PostgreSQL’s built-in streaming replication. We’ll configure a primary and at least one replica, ensuring data consistency and enabling a swift switchover in case of primary failure.

Our setup will involve two DigitalOcean Droplets, one for the primary PostgreSQL instance and another for the replica. For simplicity, we’ll use Ubuntu 22.04 LTS. We’ll also assume SSH access and root privileges for these commands.

Primary PostgreSQL Server Configuration

On the primary server, we need to enable WAL (Write-Ahead Logging) archiving and configure it for streaming replication. This involves modifying the postgresql.conf and pg_hba.conf files.

First, locate your PostgreSQL configuration directory. On Ubuntu, this is typically /etc/postgresql/14/main/ (version may vary).

Edit postgresql.conf:

# postgresql.conf
listen_addresses = '*'
wal_level = replica
wal_sync_method = fsync
wal_compression = on
max_wal_senders = 5
wal_keep_size = 1024 # Or a sufficiently large value to accommodate network latency
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Ensure this directory exists and has correct permissions
hot_standby = on

Next, edit pg_hba.conf to allow replication connections from the replica server. Replace <replica_ip_address> with the actual IP of your replica Droplet.

# pg_hba.conf
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     replicator      <replica_ip_address>/32   scram-sha-256
host    replication     replicator      127.0.0.1/32            scram-sha-256 # For local replication testing
host    all             all             0.0.0.0/0               scram-sha-256 # Adjust for security as needed

Create the WAL archive directory and set appropriate permissions:

sudo mkdir -p /var/lib/postgresql/wal_archive
sudo chown -R postgres:postgres /var/lib/postgresql/wal_archive
sudo chmod 700 /var/lib/postgresql/wal_archive

Create a replication user and grant necessary privileges:

-- Connect to PostgreSQL as the postgres user
sudo -u postgres psql

-- Inside psql:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'your_strong_replication_password';
\q

Restart PostgreSQL to apply the changes:

sudo systemctl restart postgresql

Replica PostgreSQL Server Configuration

On the replica server, we’ll first stop PostgreSQL, clean its data directory, and then initialize it as a replica of the primary.

sudo systemctl stop postgresql

Remove the existing data directory (ensure no critical data is present here, this is a fresh setup):

sudo rm -rf /var/lib/postgresql/14/main/*

Initialize the replica using the pg_basebackup utility. Replace <primary_ip_address> and your_strong_replication_password.

sudo -u postgres pg_basebackup -h <primary_ip_address> -U replicator -D /var/lib/postgresql/14/main -vP -W

After the base backup completes, create a standby.signal file in the data directory to indicate it’s a standby server:

sudo touch /var/lib/postgresql/14/main/standby.signal

Edit postgresql.conf on the replica to enable hot standby. Ensure hot_standby is set to on. You might also want to adjust primary_conninfo if you’re using a different IP or port for the primary.

# postgresql.conf (on replica)
hot_standby = on
primary_conninfo = 'host=<primary_ip_address> port=5432 user=replicator password=your_strong_replication_password'
# Optional: For faster failover detection, consider tuning recovery settings
# recovery_timeout = 60 # seconds
# standby_mode = 'on' # This is implicitly set by standby.signal, but good to be aware of

Start PostgreSQL on the replica:

sudo systemctl start postgresql

Verify replication status on both servers. On the primary:

sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"

You should see an entry for the replica. On the replica, check the PostgreSQL logs (e.g., /var/log/postgresql/postgresql-14-main.log) for messages indicating it’s connected and streaming.

Automating Failover with Patroni

While streaming replication ensures data availability, it doesn’t automate the failover process. For robust automated failover, we’ll integrate Patroni, a template for PostgreSQL HA.

Patroni requires a distributed configuration store. etcd is a popular choice. We’ll set up a small etcd cluster (or use a managed etcd service if available).

Setting up etcd

For a minimal setup, we can run etcd on a separate Droplet or even on the PostgreSQL nodes themselves (though a dedicated node is recommended for production). Install etcd:

# On etcd node(s)
wget -qO- https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz | tar xz --strip-components=1 -C /usr/local/bin
sudo systemctl enable etcd
sudo systemctl start etcd

A basic etcd configuration (/etc/etcd/etcd.conf.yml) might look like this for a single node:

# /etc/etcd/etcd.conf.yml
name: etcd-node-1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://<etcd_ip_address>:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://<etcd_ip_address>:2380
initial-cluster: etcd-node-1=http://<etcd_ip_address>:2380
initial-cluster-token: my-etcd-cluster-token
initial-cluster-state: new

Ensure the etcd service is running and accessible from your PostgreSQL nodes.

Installing and Configuring Patroni

Install Patroni on each PostgreSQL node. Using pip is a common method.

# On each PostgreSQL node
sudo apt update
sudo apt install python3-pip python3-dev build-essential libpq-dev -y
sudo pip3 install --upgrade pip
sudo pip3 install "patroni[etcd]" psycopg2-binary

Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration needs to be consistent across all nodes, with slight variations for IP addresses.

# /etc/patroni/patroni.yml
scope: my_pg_cluster # Unique name for your PostgreSQL cluster
namespace: /service/ # Namespace within etcd for this cluster

# etcd configuration
etcd:
  host: <etcd_ip_address>:2379
  protocol: http

# PostgreSQL configuration
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/14/main
  bin_dir: /usr/lib/postgresql/14/bin
  config_dir: /etc/postgresql/14/main
  pg_hba:
    - host    replication     replicator      <replica_ip_address>/32   scram-sha-256
    - host    replication     replicator      <primary_ip_address>/32   scram-sha-256
    - host    all             all             0.0.0.0/0               scram-sha-256
  replication:
    username: replicator
    password: your_strong_replication_password
  authentication:
    replication:
      username: replicator
      password: your_strong_replication_password
    superuser:
      username: postgres
      password: your_strong_postgres_password # Use a strong password
  parameters:
    wal_level: replica
    max_wal_senders: 5
    wal_keep_size: 1024
    hot_standby: on
    archive_mode: on
    archive_command: 'cp %p /var/lib/postgresql/wal_archive/%f'

# Patroni REST API configuration
restapi:
  listen: 0.0.0.0:8008
  connect_address: <node_ip_address>:8008

# Tags for identifying nodes
tags:
  nofailover: false
  clonefrom: false

# Replication settings for initial bootstrap and failover
replication:
  synchronous_mode: false # Set to true for synchronous replication if latency permits
  synchronous_node_count: 1 # If synchronous_mode is true

# Bootstrap configuration (only used if no cluster exists in etcd)
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB, adjust as needed
    postgresql:
      use_pg_rewind: true
      recovery_conf:
        standby_mode: 'on'
        primary_conninfo: 'host=%s port=%d user=%s password=%s'
        # Optional: tune recovery settings for faster failover
        # recovery_timeout: 60
        # wal_receiver_status_interval: 10
        # wal_retrieve_retry_interval: 5

# Logging configuration
log:
  level: INFO
  dir: /var/log/patroni
  file: patroni.log

Ensure the <node_ip_address> in the restapi.connect_address is the correct IP for each respective node. Also, create the log directory:

sudo mkdir -p /var/log/patroni
sudo chown -R patroni:patroni /var/log/patroni

Create a systemd service file for Patroni (e.g., /etc/systemd/system/patroni.service):

[Unit]
Description=Patroni PostgreSQL High-Availability
After=network.target

[Service]
User=patroni
Group=patroni
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Start and enable the Patroni service on all PostgreSQL nodes:

sudo systemctl daemon-reload
sudo systemctl start patroni
sudo systemctl enable patroni

Monitor the Patroni logs. The first node to start will bootstrap the cluster. Subsequent nodes will join as replicas. Patroni will automatically manage leader election and failover.

Integrating Laravel with Patroni’s API

Your Laravel application needs to be aware of the current primary PostgreSQL instance. Patroni exposes a REST API that provides this information. We’ll create a simple service or helper in Laravel to query this API and dynamically update the database connection details.

First, ensure your Laravel application can reach the Patroni REST API endpoints (port 8008) on your PostgreSQL nodes. You might need to configure firewall rules on DigitalOcean.

Create a new service provider or a dedicated service class in your Laravel application (e.g., app/Services/PostgresHA.php).

namespace App\Services;

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

class PostgresHA
{
    protected array $patroniNodes;
    protected string $cacheKey = 'patroni_primary_db_config';
    protected int $cacheDuration = 60; // Cache for 60 seconds

    public function __construct()
    {
        // Load these from your .env file
        $this->patroniNodes = explode(',', env('PATRONI_NODES', 'http://node1:8008,http://node2:8008'));
    }

    public function getPrimaryDatabaseConfig(): ?array
    {
        // Try to get from cache first
        if (Cache::has($this->cacheKey)) {
            return Cache::get($this->cacheKey);
        }

        foreach ($this->patroniNodes as $nodeUrl) {
            try {
                $response = Http::timeout(5)->get("{$nodeUrl}/primary");
                if ($response->successful()) {
                    $data = $response->json();
                    if (isset($data['host']) && isset($data['port'])) {
                        $config = [
                            'host' => $data['host'],
                            'port' => $data['port'],
                            'database' => env('DB_DATABASE'),
                            'username' => env('DB_USERNAME'),
                            'password' => env('DB_PASSWORD'),
                            'charset' => 'utf8mb4',
                            'collation' => 'utf8mb4_unicode_ci',
                            'prefix' => '',
                            'strict' => true,
                            'engine' => null,
                        ];
                        Cache::put($this->cacheKey, $config, now()->addSeconds($this->cacheDuration));
                        return $config;
                    }
                }
            } catch (\Exception $e) {
                Log::error("Failed to connect to Patroni node {$nodeUrl}: " . $e->getMessage());
            }
        }

        Log::error("Could not determine primary PostgreSQL node from Patroni.");
        return null;
    }

    public function updateDatabaseConnection(): bool
    {
        $config = $this->getPrimaryDatabaseConfig();

        if ($config) {
            config(['database.connections.pgsql' => $config]);
            return true;
        }

        return false;
    }
}

Add the Patroni nodes to your .env file:

PATRONI_NODES=http://<node1_ip>:8008,http://<node2_ip>:8008

Now, you need to ensure this configuration is applied before your application tries to access the database. A good place is within your config/database.php or a service provider that runs early in the application’s bootstrap process.

Modify config/database.php to dynamically fetch the connection details:

use App\Services\PostgresHA;

// ... other configurations

'connections' => [
    // ... other connections

    'pgsql' => [
        'driver' => 'pgsql',
        'url' => env('DATABASE_URL'),
        'host' => env('DB_HOST', '127.0.0.1'),
        'port' => env('DB_PORT', '5432'),
        'database' => env('DB_DATABASE', 'forge'),
        'username' => env('DB_USERNAME', 'forge'),
        'password' => env('DB_PASSWORD', ''),
        'charset' => 'utf8mb4',
        'prefix' => '',
        'prefix_indexes' => true,
        'schema' => 'public',
        'sslmode' => 'prefer',
    ],

    // ... other connections
],

// ...

// Add this logic at the end of the file, or within a service provider
// Ensure PostgresHA service is resolved correctly
if (env('APP_ENV') !== 'testing') { // Avoid running in tests if not configured
    try {
        $postgresHA = app(PostgresHA::class);
        if ($postgresHA->updateDatabaseConnection()) {
            // Configuration updated, Laravel will now use the dynamic settings
            // You might want to log this for debugging
            Log::info('Database connection dynamically updated via Patroni.');
        } else {
            Log::error('Failed to dynamically update database connection via Patroni.');
            // Fallback to default or throw an exception
            // For critical applications, you might want to halt execution here
        }
    } catch (\Exception $e) {
        Log::error('Error initializing PostgresHA service: ' . $e->getMessage());
        // Handle error, potentially halt or use fallback
    }
}

Alternatively, and often cleaner, is to use a dedicated service provider:

// app/Providers/DatabaseServiceProvider.php
namespace App\Providers;

use App\Services\PostgresHA;
use Illuminate\Support\ServiceProvider;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\DB;

class DatabaseServiceProvider extends ServiceProvider
{
    /**
     * Register services.
     */
    public function register(): void
    {
        //
    }

    /**
     * Bootstrap services.
     */
    public function boot(): void
    {
        if (env('APP_ENV') !== 'testing' && env('DB_CONNECTION') === 'pgsql') {
            try {
                $postgresHA = $this->app->make(PostgresHA::class);
                if ($postgresHA->updateDatabaseConnection()) {
                    Log::info('Database connection dynamically updated via Patroni.');
                } else {
                    Log::error('Failed to dynamically update database connection via Patroni.');
                    // Consider throwing an exception to halt if critical
                    // throw new \RuntimeException('Database connection failed to initialize.');
                }
            } catch (\Exception $e) {
                Log::error('Error bootstrapping database connection: ' . $e->getMessage());
                // Consider throwing an exception
                // throw new \RuntimeException('Database connection failed to initialize.');
            }
        }
    }
}

// config/app.php
// ...
'providers' => [
    // ...
    App\Providers\DatabaseServiceProvider::class,
    // ...
],
// ...

Testing Failover

To test the failover, you can simulate a failure of the primary PostgreSQL instance. The safest way is to stop the Patroni service on the current primary node:

# On the current primary node
sudo systemctl stop patroni

Observe the Patroni logs on the remaining nodes. Patroni should detect the primary’s absence, elect a new primary from the available replicas, and reconfigure them. Your Laravel application, upon its next database access (or after its cache for the primary config expires), should query Patroni again and connect to the new primary.

You can also manually trigger a failover using Patroni’s API, though stopping the service is a more direct test of its failure detection capabilities.

Considerations for Production

Monitoring: Implement robust monitoring for PostgreSQL, Patroni, and etcd. Alert on replication lag, Patroni health checks, and etcd cluster status.
Network: Ensure low latency and high bandwidth between your PostgreSQL nodes and between PostgreSQL nodes and etcd. Use DigitalOcean’s private networking.
Security: Use SSL/TLS for etcd communication and PostgreSQL replication. Secure your Patroni API endpoints.
Backups: Automated failover is not a substitute for backups. Implement a separate, reliable backup strategy (e.g., using pg_dump or DigitalOcean’s managed database backups if applicable).
Read Replicas: For read-heavy workloads, provision additional read replicas and configure your application to direct read traffic to them. Patroni can manage these as well.
Testing: Regularly test your failover procedures in a staging environment that mirrors production.
Configuration Management: Use tools like Ansible, Chef, or Puppet to manage your PostgreSQL and Patroni configurations across all nodes.