Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Laravel Deployments on DigitalOcean
Leveraging PostgreSQL Streaming Replication for High Availability
Achieving automated failover for a PostgreSQL database is paramount for any production-grade application. The cornerstone of this strategy is PostgreSQL’s built-in streaming replication. We’ll configure a primary and at least one replica, ensuring data consistency and enabling a swift switchover in case of primary failure.
Our setup will involve two DigitalOcean Droplets, one for the primary PostgreSQL instance and another for the replica. For simplicity, we’ll use Ubuntu 22.04 LTS. We’ll also assume SSH access and root privileges for these commands.
Primary PostgreSQL Server Configuration
On the primary server, we need to enable WAL (Write-Ahead Logging) archiving and configure it for streaming replication. This involves modifying the postgresql.conf and pg_hba.conf files.
First, locate your PostgreSQL configuration directory. On Ubuntu, this is typically /etc/postgresql/14/main/ (version may vary).
Edit postgresql.conf:
# postgresql.conf listen_addresses = '*' wal_level = replica wal_sync_method = fsync wal_compression = on max_wal_senders = 5 wal_keep_size = 1024 # Or a sufficiently large value to accommodate network latency archive_mode = on archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f' # Ensure this directory exists and has correct permissions hot_standby = on
Next, edit pg_hba.conf to allow replication connections from the replica server. Replace <replica_ip_address> with the actual IP of your replica Droplet.
# pg_hba.conf # TYPE DATABASE USER ADDRESS METHOD host replication replicator <replica_ip_address>/32 scram-sha-256 host replication replicator 127.0.0.1/32 scram-sha-256 # For local replication testing host all all 0.0.0.0/0 scram-sha-256 # Adjust for security as needed
Create the WAL archive directory and set appropriate permissions:
sudo mkdir -p /var/lib/postgresql/wal_archive sudo chown -R postgres:postgres /var/lib/postgresql/wal_archive sudo chmod 700 /var/lib/postgresql/wal_archive
Create a replication user and grant necessary privileges:
-- Connect to PostgreSQL as the postgres user sudo -u postgres psql -- Inside psql: CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'your_strong_replication_password'; \q
Restart PostgreSQL to apply the changes:
sudo systemctl restart postgresql
Replica PostgreSQL Server Configuration
On the replica server, we’ll first stop PostgreSQL, clean its data directory, and then initialize it as a replica of the primary.
sudo systemctl stop postgresql
Remove the existing data directory (ensure no critical data is present here, this is a fresh setup):
sudo rm -rf /var/lib/postgresql/14/main/*
Initialize the replica using the pg_basebackup utility. Replace <primary_ip_address> and your_strong_replication_password.
sudo -u postgres pg_basebackup -h <primary_ip_address> -U replicator -D /var/lib/postgresql/14/main -vP -W
After the base backup completes, create a standby.signal file in the data directory to indicate it’s a standby server:
sudo touch /var/lib/postgresql/14/main/standby.signal
Edit postgresql.conf on the replica to enable hot standby. Ensure hot_standby is set to on. You might also want to adjust primary_conninfo if you’re using a different IP or port for the primary.
# postgresql.conf (on replica) hot_standby = on primary_conninfo = 'host=<primary_ip_address> port=5432 user=replicator password=your_strong_replication_password' # Optional: For faster failover detection, consider tuning recovery settings # recovery_timeout = 60 # seconds # standby_mode = 'on' # This is implicitly set by standby.signal, but good to be aware of
Start PostgreSQL on the replica:
sudo systemctl start postgresql
Verify replication status on both servers. On the primary:
sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"
You should see an entry for the replica. On the replica, check the PostgreSQL logs (e.g., /var/log/postgresql/postgresql-14-main.log) for messages indicating it’s connected and streaming.
Automating Failover with Patroni
While streaming replication ensures data availability, it doesn’t automate the failover process. For robust automated failover, we’ll integrate Patroni, a template for PostgreSQL HA.
Patroni requires a distributed configuration store. etcd is a popular choice. We’ll set up a small etcd cluster (or use a managed etcd service if available).
Setting up etcd
For a minimal setup, we can run etcd on a separate Droplet or even on the PostgreSQL nodes themselves (though a dedicated node is recommended for production). Install etcd:
# On etcd node(s) wget -qO- https://github.com/etcd-io/etcd/releases/download/v3.5.9/etcd-v3.5.9-linux-amd64.tar.gz | tar xz --strip-components=1 -C /usr/local/bin sudo systemctl enable etcd sudo systemctl start etcd
A basic etcd configuration (/etc/etcd/etcd.conf.yml) might look like this for a single node:
# /etc/etcd/etcd.conf.yml name: etcd-node-1 data-dir: /var/lib/etcd listen-client-urls: http://0.0.0.0:2379 advertise-client-urls: http://<etcd_ip_address>:2379 listen-peer-urls: http://0.0.0.0:2380 initial-advertise-peer-urls: http://<etcd_ip_address>:2380 initial-cluster: etcd-node-1=http://<etcd_ip_address>:2380 initial-cluster-token: my-etcd-cluster-token initial-cluster-state: new
Ensure the etcd service is running and accessible from your PostgreSQL nodes.
Installing and Configuring Patroni
Install Patroni on each PostgreSQL node. Using pip is a common method.
# On each PostgreSQL node sudo apt update sudo apt install python3-pip python3-dev build-essential libpq-dev -y sudo pip3 install --upgrade pip sudo pip3 install "patroni[etcd]" psycopg2-binary
Create a Patroni configuration file (e.g., /etc/patroni/patroni.yml) on each PostgreSQL node. This configuration needs to be consistent across all nodes, with slight variations for IP addresses.
# /etc/patroni/patroni.yml
scope: my_pg_cluster # Unique name for your PostgreSQL cluster
namespace: /service/ # Namespace within etcd for this cluster
# etcd configuration
etcd:
host: <etcd_ip_address>:2379
protocol: http
# PostgreSQL configuration
postgresql:
listen: 0.0.0.0:5432
data_dir: /var/lib/postgresql/14/main
bin_dir: /usr/lib/postgresql/14/bin
config_dir: /etc/postgresql/14/main
pg_hba:
- host replication replicator <replica_ip_address>/32 scram-sha-256
- host replication replicator <primary_ip_address>/32 scram-sha-256
- host all all 0.0.0.0/0 scram-sha-256
replication:
username: replicator
password: your_strong_replication_password
authentication:
replication:
username: replicator
password: your_strong_replication_password
superuser:
username: postgres
password: your_strong_postgres_password # Use a strong password
parameters:
wal_level: replica
max_wal_senders: 5
wal_keep_size: 1024
hot_standby: on
archive_mode: on
archive_command: 'cp %p /var/lib/postgresql/wal_archive/%f'
# Patroni REST API configuration
restapi:
listen: 0.0.0.0:8008
connect_address: <node_ip_address>:8008
# Tags for identifying nodes
tags:
nofailover: false
clonefrom: false
# Replication settings for initial bootstrap and failover
replication:
synchronous_mode: false # Set to true for synchronous replication if latency permits
synchronous_node_count: 1 # If synchronous_mode is true
# Bootstrap configuration (only used if no cluster exists in etcd)
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB, adjust as needed
postgresql:
use_pg_rewind: true
recovery_conf:
standby_mode: 'on'
primary_conninfo: 'host=%s port=%d user=%s password=%s'
# Optional: tune recovery settings for faster failover
# recovery_timeout: 60
# wal_receiver_status_interval: 10
# wal_retrieve_retry_interval: 5
# Logging configuration
log:
level: INFO
dir: /var/log/patroni
file: patroni.log
Ensure the <node_ip_address> in the restapi.connect_address is the correct IP for each respective node. Also, create the log directory:
sudo mkdir -p /var/log/patroni sudo chown -R patroni:patroni /var/log/patroni
Create a systemd service file for Patroni (e.g., /etc/systemd/system/patroni.service):
[Unit] Description=Patroni PostgreSQL High-Availability After=network.target [Service] User=patroni Group=patroni ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml Restart=on-failure RestartSec=5s [Install] WantedBy=multi-user.target
Start and enable the Patroni service on all PostgreSQL nodes:
sudo systemctl daemon-reload sudo systemctl start patroni sudo systemctl enable patroni
Monitor the Patroni logs. The first node to start will bootstrap the cluster. Subsequent nodes will join as replicas. Patroni will automatically manage leader election and failover.
Integrating Laravel with Patroni’s API
Your Laravel application needs to be aware of the current primary PostgreSQL instance. Patroni exposes a REST API that provides this information. We’ll create a simple service or helper in Laravel to query this API and dynamically update the database connection details.
First, ensure your Laravel application can reach the Patroni REST API endpoints (port 8008) on your PostgreSQL nodes. You might need to configure firewall rules on DigitalOcean.
Create a new service provider or a dedicated service class in your Laravel application (e.g., app/Services/PostgresHA.php).
namespace App\Services;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;
class PostgresHA
{
protected array $patroniNodes;
protected string $cacheKey = 'patroni_primary_db_config';
protected int $cacheDuration = 60; // Cache for 60 seconds
public function __construct()
{
// Load these from your .env file
$this->patroniNodes = explode(',', env('PATRONI_NODES', 'http://node1:8008,http://node2:8008'));
}
public function getPrimaryDatabaseConfig(): ?array
{
// Try to get from cache first
if (Cache::has($this->cacheKey)) {
return Cache::get($this->cacheKey);
}
foreach ($this->patroniNodes as $nodeUrl) {
try {
$response = Http::timeout(5)->get("{$nodeUrl}/primary");
if ($response->successful()) {
$data = $response->json();
if (isset($data['host']) && isset($data['port'])) {
$config = [
'host' => $data['host'],
'port' => $data['port'],
'database' => env('DB_DATABASE'),
'username' => env('DB_USERNAME'),
'password' => env('DB_PASSWORD'),
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',
'prefix' => '',
'strict' => true,
'engine' => null,
];
Cache::put($this->cacheKey, $config, now()->addSeconds($this->cacheDuration));
return $config;
}
}
} catch (\Exception $e) {
Log::error("Failed to connect to Patroni node {$nodeUrl}: " . $e->getMessage());
}
}
Log::error("Could not determine primary PostgreSQL node from Patroni.");
return null;
}
public function updateDatabaseConnection(): bool
{
$config = $this->getPrimaryDatabaseConfig();
if ($config) {
config(['database.connections.pgsql' => $config]);
return true;
}
return false;
}
}
Add the Patroni nodes to your .env file:
PATRONI_NODES=http://<node1_ip>:8008,http://<node2_ip>:8008
Now, you need to ensure this configuration is applied before your application tries to access the database. A good place is within your config/database.php or a service provider that runs early in the application’s bootstrap process.
Modify config/database.php to dynamically fetch the connection details:
use App\Services\PostgresHA;
// ... other configurations
'connections' => [
// ... other connections
'pgsql' => [
'driver' => 'pgsql',
'url' => env('DATABASE_URL'),
'host' => env('DB_HOST', '127.0.0.1'),
'port' => env('DB_PORT', '5432'),
'database' => env('DB_DATABASE', 'forge'),
'username' => env('DB_USERNAME', 'forge'),
'password' => env('DB_PASSWORD', ''),
'charset' => 'utf8mb4',
'prefix' => '',
'prefix_indexes' => true,
'schema' => 'public',
'sslmode' => 'prefer',
],
// ... other connections
],
// ...
// Add this logic at the end of the file, or within a service provider
// Ensure PostgresHA service is resolved correctly
if (env('APP_ENV') !== 'testing') { // Avoid running in tests if not configured
try {
$postgresHA = app(PostgresHA::class);
if ($postgresHA->updateDatabaseConnection()) {
// Configuration updated, Laravel will now use the dynamic settings
// You might want to log this for debugging
Log::info('Database connection dynamically updated via Patroni.');
} else {
Log::error('Failed to dynamically update database connection via Patroni.');
// Fallback to default or throw an exception
// For critical applications, you might want to halt execution here
}
} catch (\Exception $e) {
Log::error('Error initializing PostgresHA service: ' . $e->getMessage());
// Handle error, potentially halt or use fallback
}
}
Alternatively, and often cleaner, is to use a dedicated service provider:
// app/Providers/DatabaseServiceProvider.php
namespace App\Providers;
use App\Services\PostgresHA;
use Illuminate\Support\ServiceProvider;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\DB;
class DatabaseServiceProvider extends ServiceProvider
{
/**
* Register services.
*/
public function register(): void
{
//
}
/**
* Bootstrap services.
*/
public function boot(): void
{
if (env('APP_ENV') !== 'testing' && env('DB_CONNECTION') === 'pgsql') {
try {
$postgresHA = $this->app->make(PostgresHA::class);
if ($postgresHA->updateDatabaseConnection()) {
Log::info('Database connection dynamically updated via Patroni.');
} else {
Log::error('Failed to dynamically update database connection via Patroni.');
// Consider throwing an exception to halt if critical
// throw new \RuntimeException('Database connection failed to initialize.');
}
} catch (\Exception $e) {
Log::error('Error bootstrapping database connection: ' . $e->getMessage());
// Consider throwing an exception
// throw new \RuntimeException('Database connection failed to initialize.');
}
}
}
}
Register this service provider in config/app.php:
// config/app.php
// ...
'providers' => [
// ...
App\Providers\DatabaseServiceProvider::class,
// ...
],
// ...
Testing Failover
To test the failover, you can simulate a failure of the primary PostgreSQL instance. The safest way is to stop the Patroni service on the current primary node:
# On the current primary node sudo systemctl stop patroni
Observe the Patroni logs on the remaining nodes. Patroni should detect the primary’s absence, elect a new primary from the available replicas, and reconfigure them. Your Laravel application, upon its next database access (or after its cache for the primary config expires), should query Patroni again and connect to the new primary.
You can also manually trigger a failover using Patroni’s API, though stopping the service is a more direct test of its failure detection capabilities.
Considerations for Production
- Monitoring: Implement robust monitoring for PostgreSQL, Patroni, and etcd. Alert on replication lag, Patroni health checks, and etcd cluster status.
- Network: Ensure low latency and high bandwidth between your PostgreSQL nodes and between PostgreSQL nodes and etcd. Use DigitalOcean’s private networking.
- Security: Use SSL/TLS for etcd communication and PostgreSQL replication. Secure your Patroni API endpoints.
- Backups: Automated failover is not a substitute for backups. Implement a separate, reliable backup strategy (e.g., using
pg_dumpor DigitalOcean’s managed database backups if applicable). - Read Replicas: For read-heavy workloads, provision additional read replicas and configure your application to direct read traffic to them. Patroni can manage these as well.
- Testing: Regularly test your failover procedures in a staging environment that mirrors production.
- Configuration Management: Use tools like Ansible, Chef, or Puppet to manage your PostgreSQL and Patroni configurations across all nodes.