Disaster Recovery 101: Architecting Auto-Failovers for PostgreSQL and PHP Deployments on DigitalOcean

Designing for High Availability: PostgreSQL and PHP Auto-Failover on DigitalOcean

This document outlines a robust, automated failover strategy for a typical PostgreSQL and PHP application stack deployed on DigitalOcean. The core objective is to minimize downtime by automatically redirecting application traffic to a standby database instance in the event of a primary database failure. We will leverage DigitalOcean’s managed PostgreSQL, a simple yet effective health check mechanism, and a lightweight orchestration script.

Prerequisites and Setup

Before diving into the failover logic, ensure the following are in place:

Two DigitalOcean Managed PostgreSQL databases: one designated as primary, the other as standby. Both should be in the same region for minimal latency.
A DigitalOcean Droplet (or multiple for application redundancy) running your PHP application. This Droplet will host the failover orchestration script.
SSH access to the Droplet designated for orchestration.
Basic familiarity with PostgreSQL replication and SQL commands.
The psql command-line utility installed on the orchestration Droplet.

Configuring PostgreSQL Replication

DigitalOcean’s Managed PostgreSQL handles the underlying replication setup. When you create a second database cluster and designate it as a read-only replica of your primary, the streaming replication is automatically configured. The key is to ensure the standby database is indeed replicating from the primary. You can verify this via the DigitalOcean control panel or programmatically.

Health Check Mechanism

A reliable health check is paramount. We’ll implement a simple check that attempts to connect to the primary PostgreSQL instance and execute a trivial query. If this fails, we assume the primary is unhealthy.

PHP Application Configuration

Your PHP application’s database connection configuration should be dynamic. Instead of hardcoding the primary database endpoint, use environment variables or a configuration file that can be updated by our failover script. For this example, we’ll assume a configuration file config.php that is included by your application.

 PDO::ERRMODE_EXCEPTION]);
    // Optional: Perform a simple query to ensure connection is active
    $stmt = $pdo->query("SELECT 1");
    $stmt->execute();
    if ($stmt->fetchColumn() !== '1') {
        throw new PDOException("Database check query failed.");
    }
    // echo "Successfully connected to database: " . DB_HOST . "\n";
} catch (PDOException $e) {
    // In a real application, you'd log this and potentially trigger a notification
    // For failover, this exception is the signal to initiate the process.
    error_log("Database connection failed: " . $e->getMessage());
    // In a production scenario, you might want to redirect to a maintenance page
    // or attempt to connect to a known standby. For this script, we'll let it fail
    // and rely on the external script to fix it.
    throw $e; // Re-throw to indicate application failure
}
?>

The key here is the fallback to environment variables. This allows our failover script to dynamically set DB_HOST and potentially other connection parameters.

The Failover Orchestration Script (Bash)

We’ll create a Bash script that runs periodically (e.g., via cron) on the application server. This script will:

Attempt to connect to the primary PostgreSQL instance.
If the connection fails, initiate the failover process.
Promote the standby database to primary.
Update the application’s database configuration (via environment variables or file modification).
Perform a quick verification of the new primary.
Optionally, notify administrators.

Script Implementation

Save the following script as /opt/scripts/pgsql_failover.sh on your application server. Ensure it’s executable (`chmod +x /opt/scripts/pgsql_failover.sh`).

#!/bin/bash

# --- Configuration ---
PRIMARY_DB_HOST="your_primary_db_host.oregon.db.ondigitalocean.com"
STANDBY_DB_HOST="your_standby_db_host.oregon.db.ondigitalocean.com"
DB_PORT="25060"
DB_USER="doadmin"
DB_PASSWORD="your_db_password"
DB_NAME="your_database_name"
APP_CONFIG_FILE="/var/www/your_app/config.php" # Path to your PHP config file
ENV_VAR_FILE="/etc/your_app/.env" # Path to a .env file if you use one
LOG_FILE="/var/log/pgsql_failover.log"
NOTIFICATION_EMAIL="[email protected]"
HEALTH_CHECK_QUERY="SELECT 1"

# --- Functions ---
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

send_notification() {
    SUBJECT="PostgreSQL Failover Alert: $1"
    BODY="$2"
    echo "$BODY" | mail -s "$SUBJECT" "$NOTIFICATION_EMAIL"
    log_message "Notification sent: $SUBJECT"
}

check_primary_db() {
    PGPASSWORD="$DB_PASSWORD" psql -h "$PRIMARY_DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -c "$HEALTH_CHECK_QUERY" &>/dev/null
    return $?
}

update_app_config() {
    # Option 1: Update a PHP config file directly (requires careful parsing/writing)
    # This is more complex and prone to errors if the file format changes.
    # For simplicity, we'll focus on updating environment variables.

    # Option 2: Update a .env file (common with frameworks like Laravel)
    if [ -f "$ENV_VAR_FILE" ]; then
        log_message "Updating environment variable file: $ENV_VAR_FILE"
        # Use sed to replace the DB_HOST line. Assumes DB_HOST is on its own line.
        # Be cautious with special characters in your hostnames.
        sed -i "s/^DB_HOST=.*/DB_HOST=$STANDBY_DB_HOST/" "$ENV_VAR_FILE"
        # If your framework requires a cache clear or config reload, add it here.
        # Example for Laravel: php artisan config:cache
        # Example for Symfony: php bin/console cache:clear
        log_message "Successfully updated DB_HOST to $STANDBY_DB_HOST in $ENV_VAR_FILE"
        return 0
    else
        log_message "Error: Environment variable file not found at $ENV_VAR_FILE"
        return 1
    fi
}

promote_standby() {
    log_message "Attempting to promote standby database: $STANDBY_DB_HOST"
    PGPASSWORD="$DB_PASSWORD" psql -h "$STANDBY_DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -c "SELECT pg_promote();" &>/dev/null
    if [ $? -eq 0 ]; then
        log_message "Standby database $STANDBY_DB_HOST successfully promoted."
        return 0
    else
        log_message "Error: Failed to promote standby database $STANDBY_DB_HOST."
        return 1
    fi
}

verify_new_primary() {
    log_message "Verifying connection to new primary: $STANDBY_DB_HOST"
    PGPASSWORD="$DB_PASSWORD" psql -h "$STANDBY_DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -c "$HEALTH_CHECK_QUERY" &>/dev/null
    return $?
}

# --- Main Logic ---
log_message "Starting PostgreSQL health check..."

if check_primary_db; then
    log_message "Primary database $PRIMARY_DB_HOST is healthy."
    exit 0
else
    log_message "Primary database $PRIMARY_DB_HOST is UNHEALTHY. Initiating failover."

    # 1. Promote the standby
    if promote_standby; then
        # 2. Update application configuration
        if update_app_config; then
            # 3. Verify the new primary
            if verify_new_primary; then
                log_message "Failover successful. New primary is $STANDBY_DB_HOST."
                send_notification "SUCCESSFUL" "PostgreSQL failover completed. New primary is $STANDBY_DB_HOST. Application configuration updated."
                exit 0
            else
                log_message "Error: Failed to verify connection to the new primary database $STANDBY_DB_HOST after promotion."
                send_notification "FAILED - Verification Error" "PostgreSQL failover initiated, standby promoted, but verification of new primary $STANDBY_DB_HOST failed. Manual intervention required."
                exit 1
            fi
        else
            log_message "Error: Failed to update application configuration after promoting standby."
            send_notification "FAILED - Config Update Error" "PostgreSQL failover initiated, standby promoted, but application configuration update failed. Manual intervention required."
            exit 1
        fi
    else
        log_message "Error: Failed to promote standby database $STANDBY_DB_HOST."
        send_notification "FAILED - Promotion Error" "PostgreSQL failover initiated, but promotion of standby $STANDBY_DB_HOST failed. Manual intervention required."
        exit 1
    fi
fi

Script Configuration Details

Crucially, update the placeholder values in the --- Configuration --- section of the script with your actual database credentials, hostnames, database name, and the correct paths to your application’s configuration files. The script prioritizes updating a .env file, which is a common pattern for managing environment-specific configurations in PHP applications and frameworks.

Automating the Health Check with Cron

To make this process automatic, schedule the script to run at regular intervals using cron. A check every 1-5 minutes is usually sufficient for most applications.

Cron Job Setup

Edit the crontab for the user that has permissions to write to the log file and potentially the .env file (often the web server user, e.g., www-data, or a dedicated deployment user).

# Edit crontab for the www-data user (or your application user)
sudo crontab -u www-data -e

# Add the following line to run the script every 2 minutes:
*/2 * * * * /opt/scripts/pgsql_failover.sh >> /var/log/pgsql_failover.log 2>&1

Note: Ensure the cron user has the necessary permissions to execute the script, write to the log file, and modify the .env file. The >> /var/log/pgsql_failover.log 2>&1 part redirects both standard output and standard error to the log file, which is essential for debugging.

Important Considerations and Enhancements

Replication Lag and Stale Reads

Streaming replication is not instantaneous. There can be a small lag between the primary and standby. During a failover, if your application performs reads immediately after the switch, it might read slightly stale data. For critical applications, consider:

Implementing a read-after-write consistency check in your application logic.
Using a more sophisticated monitoring tool that can detect replication lag and only trigger failover if the lag is within acceptable bounds.
For very high-consistency requirements, explore synchronous replication (though this impacts write performance).

Idempotency and Race Conditions

The script is designed to be idempotent: running it multiple times when the primary is down should not cause harm. However, in highly distributed systems or with very frequent cron jobs, race conditions are possible. For instance, two instances of the script could theoretically run simultaneously. Ensure your cron jobs are configured to prevent overlapping execution if this is a concern (e.g., using flock).

Database Credentials Management

Storing database passwords directly in scripts or configuration files is a security risk. For production environments, consider:

Using DigitalOcean’s secrets management or a dedicated secrets manager (like HashiCorp Vault).
Granting the orchestration script minimal necessary privileges.
Ensuring strict file permissions on sensitive configuration files.

Application-Level Failover Logic

For even more resilience, your PHP application itself could be aware of the standby database. When a connection error occurs, instead of just failing, it could attempt to connect to the standby. This provides an immediate, application-driven failover before the external script even kicks in. The external script then acts as a fallback and ensures the standby is promoted and configuration is permanently updated.

Monitoring and Alerting

While the script sends email notifications, integrate with a more comprehensive monitoring system (e.g., Prometheus, Datadog, New Relic) to:

Monitor the health check script’s execution.
Track replication lag.
Alert on failed failover attempts.
Monitor the overall health of both database instances.

Promoting the Old Primary

After a successful failover, the original primary database is now out of sync. It should be treated as a potential new standby. You will need to manually reconfigure it to replicate from the *new* primary and then potentially switch back during a planned maintenance window if desired, or keep the new primary as is.

Conclusion

This automated failover strategy provides a solid foundation for ensuring the high availability of your PostgreSQL and PHP deployments on DigitalOcean. By combining DigitalOcean’s managed services with a well-crafted orchestration script and robust monitoring, you can significantly reduce Mean Time To Recovery (MTTR) and protect your application from unexpected database outages.