Server Monitoring Best Practices: Keeping Your Laravel App and PostgreSQL Clusters Alive on Google Cloud
Proactive PostgreSQL Health Checks with pg_cron and Custom Scripts
Maintaining the health of a PostgreSQL cluster, especially in a distributed cloud environment like Google Cloud, requires more than just reactive alerts. Proactive, automated checks are crucial for identifying and mitigating potential issues before they impact your Laravel application. We’ll leverage pg_cron, a PostgreSQL extension, for in-database scheduling and combine it with custom shell scripts for comprehensive health assessments.
First, ensure pg_cron is installed and enabled on your PostgreSQL instances. This typically involves modifying postgresql.conf and restarting the PostgreSQL service. On Google Cloud SQL for PostgreSQL, this is often managed through instance settings or by enabling extensions via the Cloud Console.
Once enabled, you can schedule SQL commands directly. A fundamental check is monitoring connection counts and identifying long-running queries. We’ll create a stored procedure to encapsulate this logic.
PostgreSQL Health Check Stored Procedure
This procedure will log critical metrics to a dedicated table. We’ll create this table first.
CREATE TABLE IF NOT EXISTS pg_health_logs (
log_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
instance_name TEXT,
active_connections INT,
max_connections INT,
longest_query_duration INTERVAL,
idle_in_transaction_aborted_sessions INT,
replication_lag_seconds INT NULL
);
Now, the stored procedure. This example assumes you have a way to identify your instance (e.g., an environment variable or a hardcoded name for simplicity in this example). For production, consider a more dynamic approach.
CREATE OR REPLACE FUNCTION log_pg_health()
RETURNS VOID AS $$
DECLARE
v_instance_name TEXT := COALESCE(current_setting('custom.instance_name', true), 'unknown_instance');
v_active_connections INT;
v_max_connections INT;
v_longest_query_duration INTERVAL;
v_idle_in_transaction_aborted_sessions INT;
v_replication_lag_seconds INT := NULL;
BEGIN
-- Get active and max connections
SELECT count(*), setting::int
INTO v_active_connections, v_max_connections
FROM pg_stat_activity, pg_settings
WHERE pg_settings.name = 'max_connections';
-- Get longest running query
SELECT COALESCE(EXTRACT(EPOCH FROM MAX(now() - query_start)), 0)
INTO v_longest_query_duration
FROM pg_stat_activity
WHERE state = 'active' AND query_start IS NOT NULL;
-- Get sessions with idle in transaction that are aborted
SELECT COUNT(*)
INTO v_idle_in_transaction_aborted_sessions
FROM pg_stat_activity
WHERE state = 'idle in transaction (aborted)';
-- Get replication lag (if applicable, requires superuser or specific grants)
-- This is a simplified example; robust replication monitoring is more complex.
-- For Cloud SQL, consider using Cloud Monitoring metrics for replication.
-- IF pg_is_in_recovery() THEN
-- SELECT COALESCE(EXTRACT(EPOCH FROM MAX(now() - pg_last_xact_replay_timestamp())), 0)
-- INTO v_replication_lag_seconds
-- FROM pg_stat_replication;
-- END IF;
INSERT INTO pg_health_logs (
instance_name,
active_connections,
max_connections,
longest_query_duration,
idle_in_transaction_aborted_sessions,
replication_lag_seconds
) VALUES (
v_instance_name,
v_active_connections,
v_max_connections,
v_longest_query_duration,
v_idle_in_transaction_aborted_sessions,
v_replication_lag_seconds
);
EXCEPTION WHEN OTHERS THEN
-- Log any errors encountered during the health check itself
RAISE WARNING 'Error during pg_health_logs: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;
Now, schedule this function using pg_cron. We’ll run it every 5 minutes.
-- Schedule the health check to run every 5 minutes
SELECT cron.schedule('5 min', 'SELECT log_pg_health();');
To make the custom.instance_name setting available, you can add it to postgresql.conf or set it dynamically. For Cloud SQL, you might need to use a startup script or a custom configuration parameter if available.
Laravel Application Health Checks via HTTP Endpoints
Your Laravel application needs its own health check mechanisms. A common and effective approach is to expose dedicated HTTP endpoints that perform checks on critical dependencies.
Create a new controller, for example, HealthCheckController.
<?php
namespace App\Http\Controllers;
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Redis;
use Exception;
class HealthCheckController extends Controller
{
/**
* Perform a comprehensive health check.
*
* @return \Illuminate\Http\JsonResponse
*/
public function index()
{
$checks = [
'database' => $this->checkDatabaseConnection(),
'cache' => $this->checkCache(),
'redis' => $this->checkRedis(),
// Add more checks as needed (e.g., external APIs, file system permissions)
];
$allHealthy = collect($checks)->every(function ($status) {
return $status['status'] === 'ok';
});
$statusCode = $allHealthy ? 200 : 503; // Service Unavailable
return response()->json(array_merge([
'status' => $allHealthy ? 'ok' : 'error',
'message' => $allHealthy ? 'Application is healthy.' : 'Application is unhealthy.',
], $checks), $statusCode);
}
/**
* Check database connection.
*
* @return array
*/
protected function checkDatabaseConnection(): array
{
try {
DB::connection()->getPdo();
return ['status' => 'ok', 'message' => 'Database connection successful.'];
} catch (Exception $e) {
return ['status' => 'error', 'message' => 'Database connection failed: ' . $e->getMessage()];
}
}
/**
* Check cache service.
*
* @return array
*/
protected function checkCache(): array
{
try {
$key = '_health_check_cache_test_' . uniqid();
Cache::put($key, 'test', 1); // Try to put an item
if (Cache::has($key)) {
Cache::forget($key); // Clean up
return ['status' => 'ok', 'message' => 'Cache service is responsive.'];
}
return ['status' => 'error', 'message' => 'Cache service failed to store/retrieve data.'];
} catch (Exception $e) {
return ['status' => 'error', 'message' => 'Cache service error: ' . $e->getMessage()];
}
}
/**
* Check Redis connection.
*
* @return array
*/
protected function checkRedis(): array
{
try {
// Assuming Redis is configured in config/database.php and used for caching or sessions
// If Redis is used for other purposes, adjust the connection check.
$redis = Redis::connection();
$redis->ping(); // Simple command to check connectivity
return ['status' => 'ok', 'message' => 'Redis is responsive.'];
} catch (Exception $e) {
return ['status' => 'error', 'message' => 'Redis connection failed: ' . $e->getMessage()];
}
}
}
Register this controller in your routes/api.php (or routes/web.php if appropriate).
use App\Http\Controllers\HealthCheckController;
Route::get('/health', [HealthCheckController::class, 'index']);
This endpoint will return a JSON response indicating the status of each checked component. A 200 OK status code signifies all checks passed, while a 503 Service Unavailable indicates one or more failures.
Google Cloud Monitoring Integration
Google Cloud’s operations suite (formerly Stackdriver) is your primary tool for aggregating metrics, logs, and traces. We’ll configure it to ingest our custom PostgreSQL logs and application health check results.
Ingesting PostgreSQL Health Logs
To get the pg_health_logs into Cloud Monitoring, you can use the Cloud Logging agent (Ops Agent) and configure it to tail the PostgreSQL logs where these entries are written. If you’re writing to a specific file, configure the agent to collect it. Alternatively, you can use a custom script to periodically query pg_health_logs and push structured logs to Cloud Logging.
Let’s assume you have a script that queries the logs and sends them to Cloud Logging. Here’s a Python example using the Google Cloud client libraries.
import os
import time
from google.cloud import logging_v2
from google.api_core import exceptions
import psycopg2
from datetime import datetime, timezone
# --- Configuration ---
DB_HOST = os.environ.get("DB_HOST", "localhost")
DB_PORT = os.environ.get("DB_PORT", "5432")
DB_NAME = os.environ.get("DB_NAME", "mydatabase")
DB_USER = os.environ.get("DB_USER", "myuser")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "mypassword")
INSTANCE_NAME = os.environ.get("INSTANCE_NAME", "unknown_instance")
LOGGING_INTERVAL_SECONDS = int(os.environ.get("LOGGING_INTERVAL_SECONDS", 60))
CLOUD_LOGGING_NAME = "postgresql-health"
CLOUD_LOGGING_RESOURCE_TYPE = "cloudsql_database" # Or 'generic_node' if not using Cloud SQL
CLOUD_LOGGING_RESOURCE_LABELS = {
"project_id": os.environ.get("GOOGLE_CLOUD_PROJECT"),
"database_id": os.environ.get("CLOUD_SQL_INSTANCE_ID", "my-cloudsql-instance"), # Replace with your Cloud SQL instance ID
"region": os.environ.get("CLOUD_SQL_REGION", "us-central1"),
}
# If not using Cloud SQL, adjust resource labels, e.g.:
# CLOUD_LOGGING_RESOURCE_TYPE = "generic_node"
# CLOUD_LOGGING_RESOURCE_LABELS = {
# "project_id": os.environ.get("GOOGLE_CLOUD_PROJECT"),
# "location": os.environ.get("GCE_ZONE", "us-central1-a"),
# "namespace": "your-app-namespace",
# "node_id": os.environ.get("HOSTNAME", "node-1"),
# }
# --- Initialize Cloud Logging Client ---
logging_client = logging_v2.Client()
logger = logging_client.logger(CLOUD_LOGGING_NAME)
def get_db_connection():
try:
conn = psycopg2.connect(
host=DB_HOST,
port=DB_PORT,
dbname=DB_NAME,
user=DB_USER,
password=DB_PASSWORD
)
return conn
except psycopg2.OperationalError as e:
print(f"Database connection error: {e}")
return None
def fetch_health_logs(conn):
try:
with conn.cursor() as cur:
# Fetch recent logs, e.g., from the last LOGGING_INTERVAL_SECONDS
# Adjust the query to fetch logs generated since the last run.
# For simplicity, we'll fetch all logs here and rely on the script's execution interval.
# A more robust solution would use a timestamp column to fetch only new entries.
cur.execute("""
SELECT log_timestamp, instance_name, active_connections, max_connections,
longest_query_duration, idle_in_transaction_aborted_sessions, replication_lag_seconds
FROM pg_health_logs
ORDER BY log_timestamp DESC
LIMIT 1; -- Fetch the latest log entry
""")
return cur.fetchone()
except psycopg2.Error as e:
print(f"Error fetching health logs: {e}")
return None
def log_to_cloud(log_entry):
if not log_entry:
return
timestamp, instance_name, active_connections, max_connections, \
longest_query_duration, idle_in_transaction_aborted_sessions, replication_lag_seconds = log_entry
# Convert interval to seconds for easier metric handling
longest_query_duration_sec = longest_query_duration.total_seconds() if longest_query_duration else 0
replication_lag_sec = replication_lag_seconds if replication_lag_seconds is not None else 0
log_payload = {
"instance_name": instance_name,
"active_connections": active_connections,
"max_connections": max_connections,
"longest_query_duration_seconds": longest_query_duration_sec,
"idle_in_transaction_aborted_sessions": idle_in_transaction_aborted_sessions,
"replication_lag_seconds": replication_lag_sec,
"log_timestamp_utc": timestamp.isoformat(),
}
try:
logger.log_struct(
log_payload,
severity="INFO",
timestamp=timestamp,
resource_type=CLOUD_LOGGING_RESOURCE_TYPE,
resource_labels=CLOUD_LOGGING_RESOURCE_LABELS
)
print(f"Logged health data for {instance_name} at {timestamp}")
except exceptions.GoogleAPIError as e:
print(f"Failed to log to Cloud Logging: {e}")
def main():
conn = get_db_connection()
if not conn:
print("Exiting due to database connection failure.")
return
try:
while True:
log_entry = fetch_health_logs(conn)
log_to_cloud(log_entry)
time.sleep(LOGGING_INTERVAL_SECONDS)
except KeyboardInterrupt:
print("Exiting script.")
finally:
if conn:
conn.close()
if __name__ == "__main__":
main()
Deploy this script to a Compute Engine instance or a GKE pod that has network access to your PostgreSQL cluster. Ensure the service account running the script has the roles/logging.logWriter IAM role. Configure the environment variables correctly, especially CLOUD_LOGGING_RESOURCE_LABELS to match your Cloud SQL instance details.
Monitoring Application Health Endpoints
Google Cloud Monitoring can directly monitor HTTP endpoints. You can set up an “External Uptime Check” to periodically hit your /health endpoint.
Steps in Google Cloud Console:
- Navigate to Monitoring > Uptime checks.
- Click Create uptime check.
- Title: e.g., “Laravel App Health Check”
- Check type: HTTP(S)
- Resource type: Global
- Protocol: HTTP
- Hostname: Your application’s public hostname or IP address.
- Path:
/health - Check frequency: e.g., 1 minute.
- Timeout: e.g., 30 seconds.
- Response content validation: (Optional but recommended) You can validate that the response body contains
"status": "ok". - Alerting: Configure an alerting policy to notify you (e.g., via PagerDuty, Slack, email) when the uptime check fails.
This provides an external perspective on your application’s availability. For more granular insights, you can also push the results of your Laravel health check endpoint to Cloud Logging manually from within the application or via a separate collector.
Custom Metrics and Alerting Thresholds
With PostgreSQL health data flowing into Cloud Logging, you can create custom metrics. These metrics can then be used to build dashboards and, crucially, set up alerting policies.
Example Custom Metrics:
- High Active Connections: Alert when
active_connectionsexceeds 80% ofmax_connections. - Long Running Queries: Alert when
longest_query_duration_secondsexceeds a defined threshold (e.g., 60 seconds). - Aborted Transactions: Alert immediately if
idle_in_transaction_aborted_sessionsis greater than 0. - Replication Lag: Alert if
replication_lag_secondsexceeds a critical threshold (e.g., 30 seconds).
Steps in Google Cloud Console:
- Navigate to Monitoring > Metrics explorer.
- Select resource type: e.g.,
Cloud SQL DatabaseorGeneric Node. - Find your custom metric (e.g.,
logging.googleapis.com/user/postgresql_health/active_connections– you might need to define this metric based on the log structure). - Use the Aggregator and Aligner functions to visualize the data.
- Click Create Alert from the Metrics Explorer or navigate to Alerting > Create policy.
- Configure the condition based on your custom metric and desired thresholds.
- Set notification channels and documentation.
For the Laravel application health endpoint, you can create a metric based on the HTTP status code returned by the uptime check. If the uptime check returns 503, Cloud Monitoring will automatically generate an `uptime_check/check_passed` metric with a value of 0, which can trigger alerts.
Advanced Considerations: Distributed Tracing and Log Correlation
For complex applications, correlating logs and metrics across different services (Laravel app, PostgreSQL, load balancers, etc.) is vital for root cause analysis. Consider implementing distributed tracing using tools like OpenTelemetry. Ensure your Laravel application is configured to propagate trace context, and that your PostgreSQL logs (if using extensions like pg_stat_statements with custom logging) can include trace IDs.
Google Cloud’s operations suite integrates with Cloud Trace and Cloud Logging, allowing you to jump from a trace span to relevant logs, significantly speeding up debugging.
By combining proactive database health checks, robust application-level monitoring, and comprehensive cloud-native observability tools, you can build a resilient and highly available Laravel application on Google Cloud.