Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and Shopify Deployments on Linode
Establishing a High-Availability PostgreSQL Cluster on Linode
Achieving true disaster recovery for a critical application like Shopify necessitates a robust, automated failover strategy for its data layer. For PostgreSQL, this means moving beyond simple replication to a setup that can detect primary node failure and seamlessly promote a replica without manual intervention. We’ll architect this using Patroni, a template for PostgreSQL HA, and leverage Linode’s infrastructure for health checks and load balancing.
Patroni: The Core of PostgreSQL HA
Patroni simplifies PostgreSQL HA by managing replication, failover, and configuration. It relies on a distributed consensus store (like etcd, Consul, or ZooKeeper) to maintain cluster state. For this setup, we’ll use etcd due to its simplicity and widespread adoption.
Setting up etcd Cluster
A minimum of three etcd nodes is recommended for quorum and fault tolerance. Deploy these on separate Linode instances, ideally in different availability zones if your region supports it.
etcd Configuration (etcd.conf.yml)
On each etcd node, create an etcd.conf.yml file. This example assumes a simple setup; production environments should incorporate TLS for secure communication.
name: etcd-node-1 data-dir: /var/lib/etcd listen-client-urls: http://0.0.0.0:2379 advertise-client-urls: http://:2379 listen-peer-urls: http://0.0.0.0:2380 initial-advertise-peer-urls: http:// :2380 initial-cluster: etcd-node-1=http:// :2380,etcd-node-2=http:// :2380,etcd-node-3=http:// :2380 initial-cluster-state: new cluster-state: new
Replace <ETCD_NODE_X_IP> with the actual private IP addresses of your etcd nodes. Start the etcd service:
sudo systemctl enable etcd sudo systemctl start etcd
Deploying Patroni with PostgreSQL
We’ll deploy Patroni and PostgreSQL on separate Linode instances. A common pattern is to have at least two PostgreSQL nodes (one primary, one replica) and potentially more for read scaling. Patroni will manage the promotion of replicas.
Patroni Configuration (patroni.yml)
Create a patroni.yml file on each PostgreSQL node. This configuration points to the etcd cluster and defines PostgreSQL settings.
scope: shopify-pg-cluster
namespace: /service/
restapi:
listen: 0.0.0.0:8008
connect_address: <POSTGRES_NODE_IP>:8008
etcd:
host: <ETCD_NODE_1_IP>:2379, <ETCD_NODE_2_IP>:2379, <ETCD_NODE_3_IP>:2379
postgresql:
listen: 0.0.0.0:5432
connect_address: <POSTGRES_NODE_IP>:5432
data_dir: /var/lib/postgresql/14/main
bin_dir: /usr/lib/postgresql/14/bin
pg_hba:
- host all all 0.0.0.0/0 md5
parameters:
max_connections: 100
shared_buffers: 1GB
effective_cache_size: 3GB
maintenance_work_mem: 512MB
logging_collector: on
log_directory: /var/log/postgresql
log_filename: postgresql-%Y-%m-%d_%H-%M-%S.log
log_statement: 'ddl'
log_replication_commands: on
wal_level: replica
hot_standby: on
max_wal_senders: 10
max_replication_slots: 10
synchronous_commit: on
synchronous_standby_names: "1" # Set to '1' for synchronous replication, or a replica name for specific sync
replication:
synchronous: true # Set to true for synchronous replication
synchronous_mode: 'remote_write' # or 'remote_apply'
retry_timeout: 10
max_retries: 5
Replace <POSTGRES_NODE_IP> with the IP of the current PostgreSQL node and <ETCD_NODE_X_IP> with your etcd node IPs. Ensure the PostgreSQL user has permissions to create databases and roles. Install Patroni and PostgreSQL on each node. Then, start Patroni:
sudo systemctl enable patroni sudo systemctl start patroni
Health Checks and Load Balancing with Linode NodeBalancers
Linode NodeBalancers are crucial for directing traffic to the active PostgreSQL primary. We’ll configure a NodeBalancer to listen on port 5432 and forward traffic to the Patroni REST API endpoints (port 8008) on each PostgreSQL node for health checks.
NodeBalancer Configuration
Create a NodeBalancer in the Linode Cloud Manager. Add a frontend listener for port 5432 (TCP). For the backend nodes, add your PostgreSQL instances. Configure the health check to target the Patroni API’s /primary endpoint. A simple check would be:
- Protocol: HTTP
- Port: 8008
- Path: /primary
- Check Interval: 5s
- Response Timeout: 2s
- Healthy Threshold: 2
- Unhealthy Threshold: 3
This setup ensures that the NodeBalancer only sends traffic to the node that Patroni has designated as the primary. When a failover occurs, Patroni updates etcd, and the NodeBalancer’s health checks will automatically redirect traffic to the newly promoted primary.
Automating Shopify Application Failover
While PostgreSQL handles its own data layer failover, the Shopify application instances need to be aware of and adapt to this change. This involves updating connection strings and potentially restarting services.
Application Configuration Management
Store your database connection details in a configuration file or environment variables that can be dynamically updated. For example, a config/database.php in a Laravel application:
<?php
return [
'default' => env('DB_CONNECTION', 'mysql'),
'connections' => [
'pgsql' => [
'driver' => 'pgsql',
'host' => env('DB_HOST', 'localhost'), // This will be the NodeBalancer IP
'port' => env('DB_PORT', '5432'),
'database' => env('DB_DATABASE', 'forge'),
'username' => env('DB_USERNAME', 'forge'),
'password' => env('DB_PASSWORD', ''),
'charset' => 'utf8',
'prefix' => '',
'schema' => 'public',
'sslmode' => 'prefer',
],
],
];
?>
The DB_HOST environment variable should point to the IP address of your Linode NodeBalancer for PostgreSQL.
Implementing a Failover Trigger Mechanism
When a PostgreSQL failover occurs, Patroni can be configured to execute custom scripts. This is done via the callbacks section in patroni.yml.
Callback Script Example (/opt/patroni/scripts/on_role_change.sh)
This script will be executed on the node that changes its role (e.g., from replica to primary). It needs to trigger an update of application configuration or restart application services.
#!/bin/bash
ROLE=$1
CLUSTER_NAME=$2
ETCD_HOSTS="$3" # Comma-separated list of etcd hosts
# Function to get the current primary IP from Patroni API
get_primary_ip() {
# Iterate through Patroni REST API endpoints to find the primary
# This is a simplified example; a more robust solution might query etcd directly
# or have a dedicated service discovery mechanism.
# For simplicity, we assume Patroni REST API is accessible on each node.
# A better approach would be to query etcd for the leader key.
local leader_key="/service/shopify-pg-cluster/leader"
local leader_info=$(etcdctl --endpoints="$ETCD_HOSTS" get "$leader_key" --print-value-only 2>&1)
if [[ $? -eq 0 && -n "$leader_info" ]]; then
echo "$leader_info" | jq -r '.host' | sed 's/:.*//'
else
echo ""
fi
}
# Function to update application configuration (e.g., .env file)
update_app_config() {
local primary_ip=$1
if [ -z "$primary_ip" ]; then
echo "Error: Could not determine primary PostgreSQL IP."
exit 1
fi
echo "Detected PostgreSQL primary change. New primary IP: $primary_ip"
# Example: Update .env file for a Laravel application
# This requires the application to be configured to read DB_HOST from .env
if [ -f "/var/www/shopify-app/.env" ]; then
sed -i "s/^DB_HOST=.*/DB_HOST=$primary_ip/" "/var/www/shopify-app/.env"
echo "Updated DB_HOST in /var/www/shopify-app/.env"
# Optionally, clear cache or restart services
# cd /var/www/shopify-app && php artisan config:cache && sudo systemctl restart php-fpm
else
echo "Warning: .env file not found at /var/www/shopify-app/.env. Manual intervention may be required."
fi
}
# Main logic
if [ "$ROLE" == "master" ]; then
# This node has become the primary
PRIMARY_IP=$(get_primary_ip)
update_app_config "$PRIMARY_IP"
# Potentially restart application services here if not handled by config:cache
# sudo systemctl restart your-app-service
elif [ "$ROLE" == "replica" ]; then
# This node has become a replica
echo "Node $HOSTNAME is now a replica."
# No action needed for application config, as it points to NodeBalancer
fi
exit 0
Make the script executable and add it to your Patroni configuration:
# ... within patroni.yml ... callbacks: on_role_change: /opt/patroni/scripts/on_role_change.sh
The get_primary_ip function in the script is a simplified example. A more robust approach would involve querying etcd directly for the leader key or using a dedicated service discovery tool. The script assumes etcdctl is installed and configured to reach the etcd cluster. The update_app_config function demonstrates updating a Laravel .env file. Adjust this part based on your application’s configuration management strategy.
Orchestrating Application Restarts
After the configuration is updated, application services might need to be restarted to pick up the new database host. This can be done directly within the callback script or orchestrated by a configuration management tool like Ansible or Chef. For immediate effect, a `systemctl restart` command for your application’s service (e.g., `php-fpm`, `gunicorn`, `uwsgi`) can be added to the callback script.
Testing and Validation
Thorough testing is paramount. Simulate failures by:
- Stopping the PostgreSQL process on the primary node.
- Stopping the Patroni process on the primary node.
- Simulating network partitions between PostgreSQL nodes and etcd.
- Restarting the NodeBalancer health check interval to observe failover timing.
Monitor etcd, Patroni logs, PostgreSQL logs, and application logs during these tests to ensure the failover is smooth and the application correctly reconnects to the new primary. Verify that the NodeBalancer correctly directs traffic only to the active primary.