Server Monitoring Best Practices: Keeping Your Laravel App and Elasticsearch Clusters Alive on DigitalOcean

Proactive Health Checks for Laravel Applications

Maintaining the health of a Laravel application deployed on DigitalOcean requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to ensure the application itself is responsive and its critical components are functioning. This involves implementing application-level health checks and integrating them with a robust monitoring system.

Implementing a Laravel Health Check Endpoint

A dedicated health check endpoint within your Laravel application is the first line of defense. This endpoint should perform essential checks, such as database connectivity, cache availability, and the status of any critical external services. We’ll create a simple controller and route for this.

First, generate a new controller:

php artisan make:controller HealthCheckController

Next, define the health check logic within the controller. This example checks database connectivity and Redis availability.

<?php

namespace App\Http\Controllers;

use Illuminate\Http\JsonResponse;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

class HealthCheckController extends Controller
{
    /**
     * Check the database and cache; respond 200 if both pass, 503 otherwise.
     *
     * @return \Illuminate\Http\JsonResponse
     */
    public function index(): JsonResponse
    {
        $checks = [];
        $status = 200;

        // Check the database connection; getPdo() forces a real connection attempt.
        try {
            DB::connection()->getPdo();
            $checks['database'] = 'OK';
        } catch (\Throwable $e) {
            // Log the details, but keep them out of the response body.
            $checks['database'] = 'FAILED';
            $status = 503; // Service Unavailable
            Log::error('Database connection failed: ' . $e->getMessage());
        }

        // Check the default cache store (assumed to be Redis) with a write/read round trip.
        try {
            Cache::put('health_check_test', 'value', 10); // TTL in seconds (Laravel 5.8+)
            if (Cache::get('health_check_test') === 'value') {
                $checks['cache'] = 'OK';
                Cache::forget('health_check_test');
            } else {
                $checks['cache'] = 'FAILED';
                $status = 503;
                Log::error('Cache write/read failed.');
            }
        } catch (\Throwable $e) {
            $checks['cache'] = 'FAILED';
            $status = 503;
            Log::error('Cache connection failed: ' . $e->getMessage());
        }

        // Add more checks here (e.g., external API calls, queue status)

        return response()->json($checks, $status);
    }
}
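As more checks are added, the repeated try/catch blocks can be folded into a small runner. The following is a minimal sketch in plain PHP, assuming each check is a callable that returns true on success; runHealthChecks is a hypothetical helper name, not part of Laravel.

```php
<?php

/**
 * Run named health checks and collect their results.
 *
 * Each check is a callable returning true on success. Returning anything
 * else, or throwing, marks the check as failed.
 * (Hypothetical helper for illustration; not a Laravel API.)
 */
function runHealthChecks(array $checks): array
{
    $results = [];
    $status = 200;

    foreach ($checks as $name => $check) {
        try {
            $ok = $check() === true;
        } catch (\Throwable $e) {
            $ok = false;
        }

        $results[$name] = $ok ? 'OK' : 'FAILED';
        if (!$ok) {
            $status = 503; // one failed check makes the whole endpoint unhealthy
        }
    }

    return ['status' => $status, 'checks' => $results];
}

// Stand-in checks; in the controller these would wrap the DB, cache, or queue calls.
$report = runHealthChecks([
    'database' => fn () => true,
    'queue' => function () {
        throw new \RuntimeException('connection refused');
    },
]);

echo json_encode($report), PHP_EOL;
// {"status":503,"checks":{"database":"OK","queue":"FAILED"}}
```

In the controller, the returned status and checks array could be passed straight to response()->json().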

Now, register a route for this endpoint in routes/api.php. Routes in that file are served under the /api prefix by default, so the endpoint is reachable at /api/health. Protect this route, especially in production, with middleware that limits access to trusted IP addresses or internal networks.

use App\Http\Controllers\HealthCheckController;

Route::get('/health', [HealthCheckController::class, 'index'])->middleware('throttle:100,1'); // Basic throttling

For production, consider a more robust middleware that restricts access by IP. You can create a custom middleware for this purpose.
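The core of such a middleware is the allow-list test itself. Below is a minimal sketch in plain PHP, assuming IPv4 addresses and 64-bit PHP; ipv4InRange and isTrustedIp are hypothetical helper names, and the addresses are placeholders. A middleware's handle() method could call isTrustedIp($request->ip(), $allowed) and abort with a 403 when it returns false.

```php
<?php

/**
 * Return true if an IPv4 address falls inside a CIDR range, or equals a
 * single address when the range has no "/bits" suffix.
 * (Illustrative helper; IPv4 only, assumes 64-bit PHP.)
 */
function ipv4InRange(string $ip, string $range): bool
{
    [$subnet, $bits] = array_pad(explode('/', $range, 2), 2, '32');

    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    if ($ipLong === false || $subnetLong === false || !ctype_digit($bits)) {
        return false; // malformed input is never trusted
    }

    $bits = (int) $bits;
    if ($bits > 32) {
        return false;
    }

    // Build the network mask, e.g. /24 becomes 0xFFFFFF00.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($subnetLong & $mask);
}

/** Return true if the address matches any entry in the allow-list. */
function isTrustedIp(string $ip, array $allowed): bool
{
    foreach ($allowed as $range) {
        if (ipv4InRange($ip, $range)) {
            return true;
        }
    }

    return false;
}

// Placeholder allow-list: a private network plus one monitoring host.
$allowed = ['10.0.0.0/8', '203.0.113.7'];

var_dump(isTrustedIp('10.12.0.4', $allowed));    // bool(true)
var_dump(isTrustedIp('198.51.100.9', $allowed)); // bool(false)
```

If IPv6 support is needed, Symfony's IpUtils::checkIp(), which ships with Laravel's HTTP foundation dependency, is probably a better fit than hand-rolled bit masks.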

Integrating with DigitalOcean Monitoring and Uptime Checks

DigitalOcean’s built-in monitoring provides essential infrastructure metrics. However, for application-level checks, we need to leverage external services or configure DigitalOcean’s Uptime Checks.

DigitalOcean Uptime Checks:

1. Navigate to your Droplet in the DigitalOcean control panel.
2. Go to the “Monitoring” tab.
3. Under “Uptime Checks,” click “Add Uptime Check.”
4. Configure the check:
   - Protocol: HTTP/HTTPS
   - Port: 80 or 443
   - Path: /api/health (the endpoint we created; routes in routes/api.php get the /api prefix by default)
   - Check Interval: e.g., 1 minute
   - Alerting: Configure email alerts for failures.

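This setup requests your health endpoint over HTTP at the specified interval. If the endpoint returns a non-2xx status code or times out, DigitalOcean will trigger an alert. Because these probes originate outside your network, any IP allow-list on the route must let them through, or the check will fail even when the application is healthy. This is a good first step for external validation.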

Advanced Monitoring with Prometheus and Grafana

For more granular control and richer visualization, integrating Prometheus and Grafana is a standard practice. We’ll use the PromPHP prometheus_client_php library to expose application metrics and configure Prometheus to scrape them.

Install the Prometheus client library:

composer require promphp/prometheus_client_php

Create a new endpoint that exposes metrics in the Prometheus text format for Prometheus to scrape.

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\InMemory;

// ... inside a new controller or a dedicated metrics handler

public function metrics()
{
    // InMemory storage starts empty on every PHP request, so it only suits a demo.
    // To keep counters and histograms across requests, use a shared adapter
    // such as Prometheus\Storage\Redis or Prometheus\Storage\APC.
    $registry = new CollectorRegistry(new InMemory());

    // Example: Gauge for active users (requires logic to track)
    $activeUsers = $registry->getOrRegisterGauge('myapp', 'active_users', 'Number of currently active users');
    $activeUsers->set(rand(10, 100)); // Placeholder: replace with the actual user count

    // Example: Counter for processed orders
    $ordersProcessed = $registry->getOrRegisterCounter('myapp', 'orders_processed_total', 'Total number of orders processed');
    // Increment this counter when an order is successfully processed

    // Example: Histogram for request duration.
    // The fourth argument is the list of label names; the buckets come fifth.
    $requestDuration = $registry->getOrRegisterHistogram('myapp', 'request_duration_seconds', 'Duration of HTTP requests', [], [0.1, 0.5, 1, 5, 10]);
    // Record request duration in middleware or controller

    $renderer = new RenderTextFormat();

    // Return a normal Laravel response instead of echo/exit so middleware still runs.
    return response($renderer->render($registry->getMetricFamilySamples()), 200)
        ->header('Content-Type', RenderTextFormat::MIME_TYPE);
}

Register a route for the metrics endpoint:

use App\Http\Controllers\MetricsController; // Assuming you put it in MetricsController

Route::get('/metrics', [MetricsController::class, 'metrics'])->middleware('auth.prometheus'); // Custom middleware for Prometheus IP restriction

Note that auth.prometheus is not a built-in Laravel middleware. You’ll need to create and register it yourself so that only your Prometheus server’s IP address can reach the endpoint.
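The histogram in the metrics example uses the bucket bounds 0.1, 0.5, 1, 5, and 10 seconds. Prometheus histograms are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final +Inf bucket counts them all. The sketch below illustrates that counting in plain PHP; cumulativeBuckets is a hypothetical function written for this explanation and is not how the client library stores data internally.

```php
<?php

/**
 * Count observations into cumulative "less than or equal" buckets,
 * mirroring how a Prometheus histogram reports them.
 * (Illustrative only; not the client library's implementation.)
 */
function cumulativeBuckets(array $observations, array $bounds): array
{
    sort($bounds);

    $counts = [];
    foreach ($bounds as $bound) {
        $counts[(string) $bound] = 0;
    }
    $counts['+Inf'] = 0;

    foreach ($observations as $value) {
        foreach ($bounds as $bound) {
            if ($value <= $bound) {
                $counts[(string) $bound]++; // every bucket at or above the value
            }
        }
        $counts['+Inf']++; // the catch-all bucket counts every observation
    }

    return $counts;
}

// Five sample request durations in seconds (made-up values).
$durations = [0.05, 0.3, 0.7, 2.0, 12.0];
$counts = cumulativeBuckets($durations, [0.1, 0.5, 1, 5, 10]);

foreach ($counts as $bound => $count) {
    echo "le={$bound}: {$count}", PHP_EOL;
}
// le=0.1: 1
// le=0.5: 2
// le=1: 3
// le=5: 4
// le=10: 4
// le=+Inf: 5
```

The 12-second request only appears in the +Inf bucket, which is why bucket bounds should cover the latencies you actually expect to see.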

Elasticsearch Cluster Health and Performance Monitoring

Monitoring Elasticsearch clusters is critical for maintaining the performance and availability of your search and logging infrastructure. This involves tracking cluster health, node status, indexing rates, search latency, and resource utilization.

Elasticsearch Cluster Health API

The Elasticsearch Cluster Health API (_cluster/health) provides a high-level overview of the cluster’s status. It returns information about the number of nodes, indices, shards, and the overall health status (green, yellow, red).

You can query this endpoint using curl:

curl -X GET "localhost:9200/_cluster/health?pretty"

A typical output:

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue" : 0,
  "active_shards_percent_as_number" : 100.0
}

A status of green indicates that all primary and replica shards are allocated. A yellow status means all primary shards are allocated, but some replicas are not. A red status signifies that some primary shards are not allocated, meaning data might be unavailable.
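The three statuses map naturally onto alert severities. Here is a minimal sketch in plain PHP, assuming the response body has already been fetched as a JSON string; classifyClusterHealth and the severity names ok, warning, and critical are this example's own choices, not Elasticsearch terms.

```php
<?php

/**
 * Map a _cluster/health response body to an alert severity.
 * (Hypothetical helper; the severity names are this sketch's own.)
 */
function classifyClusterHealth(string $json): array
{
    $health = json_decode($json, true);
    if (!is_array($health) || !isset($health['status'])) {
        // An unreadable response is treated as the worst case.
        return ['severity' => 'critical', 'reason' => 'health response could not be parsed'];
    }

    $unassigned = (int) ($health['unassigned_shards'] ?? 0);

    switch ($health['status']) {
        case 'green':
            return ['severity' => 'ok', 'reason' => 'all primary and replica shards allocated'];
        case 'yellow':
            return ['severity' => 'warning', 'reason' => "{$unassigned} shard(s) unassigned, primaries intact"];
        case 'red':
            return ['severity' => 'critical', 'reason' => "{$unassigned} shard(s) unassigned, including primaries"];
        default:
            return ['severity' => 'critical', 'reason' => 'unknown status: ' . $health['status']];
    }
}

// A trimmed-down response in the shape shown above, with made-up values.
$body = '{"cluster_name":"elasticsearch","status":"yellow","unassigned_shards":2}';
$result = classifyClusterHealth($body);

echo $result['severity'], ': ', $result['reason'], PHP_EOL;
// warning: 2 shard(s) unassigned, primaries intact
```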

Node Stats and Shard Allocation Monitoring

To dive deeper, monitor individual node statistics and shard allocation. The Node Stats API (_nodes/stats) provides detailed metrics on CPU usage, memory, disk I/O, network traffic, and JVM statistics for each node.

curl -X GET "localhost:9200/_nodes/stats?pretty"
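The full node stats response is large, so it usually helps to pull out a few fields per node. The sketch below, in plain PHP, reads JVM heap usage and disk usage from an already decoded response. summarizeNodes is a hypothetical helper, and the sample data is made up and trimmed to the fields it reads (name, jvm.mem.heap_used_percent, and the byte totals under fs.total).

```php
<?php

/**
 * Reduce a decoded _nodes/stats response to heap and disk usage per node.
 * (Hypothetical helper; reads only a handful of fields from the response.)
 */
function summarizeNodes(array $stats): array
{
    $summary = [];

    foreach ($stats['nodes'] ?? [] as $id => $node) {
        $total = $node['fs']['total']['total_in_bytes'] ?? 0;
        $available = $node['fs']['total']['available_in_bytes'] ?? 0;

        // Key by node name, falling back to the node ID if the name is missing.
        $summary[$node['name'] ?? $id] = [
            'heap_used_percent' => $node['jvm']['mem']['heap_used_percent'] ?? null,
            'disk_used_percent' => $total > 0
                ? round(100 * ($total - $available) / $total, 1)
                : null,
        ];
    }

    return $summary;
}

// Made-up sample with a single node.
$stats = [
    'nodes' => [
        'abc123' => [
            'name' => 'es-node-1',
            'jvm' => ['mem' => ['heap_used_percent' => 71]],
            'fs' => ['total' => [
                'total_in_bytes' => 100000000000,
                'available_in_bytes' => 12000000000,
            ]],
        ],
    ],
];

$summary = summarizeNodes($stats);
print_r($summary);
```

For this sample, es-node-1 comes out at 71 percent heap and 88 percent disk used.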

The cluster allocation explain API (_cluster/allocation/explain) is invaluable for diagnosing why shards are not being allocated (e.g., during a yellow or red cluster status).

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

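Called without a request body, it explains an unassigned shard that it picks for you, helping you identify issues like insufficient disk space, node attribute mismatches, or shard balancing problems. If the cluster has no unassigned shards, the same call returns an error instead of an explanation.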

Monitoring Indexing and Search Performance

Slow indexing or search queries can cripple an application. Use the index stats API, restricted to the indexing and search metrics (_stats/indexing and _stats/search), to identify bottlenecks.

# Indexing stats
curl -X GET "localhost:9200/_stats/indexing?pretty"

# Search stats
curl -X GET "localhost:9200/_stats/search?pretty"

Key metrics to watch include:

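- Indexing: index_total, index_time_in_millis, and throttle_time_in_millis (time spent throttled, which indicates indexing pressure).
- Search: query_total, query_time_in_millis, fetch_total, and fetch_time_in_millis.

These values are cumulative counters, so watch how they change between samples rather than their absolute size. If query_time_in_millis grows faster than query_total, the average query is getting slower.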
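Because the stats are running totals, average latency over a window comes from the difference between two samples. This is a minimal sketch in plain PHP, assuming each sample is the search section of a _stats/search response (found under _all.total.search) taken a short interval apart; averageLatencyMs is a hypothetical helper and the numbers are made up.

```php
<?php

/**
 * Average time per operation between two samples of cumulative counters.
 * Returns null when no operations happened in the window, or when the
 * counters went backwards (for example after a node restart).
 * (Hypothetical helper for illustration.)
 */
function averageLatencyMs(array $previous, array $current, string $countKey, string $timeKey): ?float
{
    $operations = $current[$countKey] - $previous[$countKey];
    $millis = $current[$timeKey] - $previous[$timeKey];

    if ($operations <= 0 || $millis < 0) {
        return null;
    }

    return $millis / $operations;
}

// Two made-up samples of the search stats, taken one minute apart.
$before = ['query_total' => 1000, 'query_time_in_millis' => 5000];
$after = ['query_total' => 1600, 'query_time_in_millis' => 9800];

$latency = averageLatencyMs($before, $after, 'query_total', 'query_time_in_millis');

echo 'Average query latency: ', $latency, ' ms', PHP_EOL;
// Average query latency: 8 ms
```

The same function works for indexing by passing index_total and index_time_in_millis as the keys.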

Leveraging Elasticsearch Monitoring Tools

Elasticsearch offers its own monitoring solution, often integrated with Kibana. This provides a user-friendly dashboard for cluster health, node metrics, index performance, and more.

Enabling Elasticsearch Monitoring:

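1. Set xpack.monitoring.collection.enabled: true, either in each node’s elasticsearch.yml configuration file or dynamically through the cluster settings API.
2. If you edited elasticsearch.yml, restart your Elasticsearch nodes. A cluster settings update takes effect without a restart.
3. Access the “Stack Monitoring” section in Kibana.

Recent Elasticsearch versions deprecate this built-in collection in favor of Metricbeat or Elastic Agent, so check the documentation for the version you run.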

This built-in solution is excellent for a quick overview and alerts. For more advanced, custom metrics and integration with your existing monitoring stack (like Prometheus/Grafana), you can use:

- Prometheus Exporters: Use a community-developed Elasticsearch exporter for Prometheus (e.g., elasticsearch_exporter from the prometheus-community project, also packaged as the prometheus-elasticsearch-exporter Helm chart) to scrape Elasticsearch metrics and feed them into your Prometheus instance.
- Log Aggregation: Forward Elasticsearch logs (including slow logs) to a centralized logging system (like ELK stack or Loki) for analysis and alerting.

Alerting Strategies for Elasticsearch

Set up alerts for critical conditions:

- Cluster Status: Alert immediately if the cluster status is yellow or red.
- Node Health: Monitor CPU, memory, and disk usage on each node. Alert when thresholds are breached (e.g., disk usage > 85%).
- Indexing/Search Latency: Alert on sustained high indexing or search latency.
- Unassigned Shards: Alert on any unassigned shards.
- JVM Heap Usage: Monitor JVM heap usage; high usage can lead to garbage collection pauses and performance degradation.

Configure these alerts within your chosen monitoring system (e.g., Prometheus Alertmanager, Grafana alerting, or Elasticsearch’s alerting features).
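Whichever system you choose, "sustained" conditions are usually expressed as a threshold that must hold for several consecutive samples, similar in spirit to the for clause of a Prometheus alerting rule. The sketch below shows that logic in plain PHP; isSustainedBreach is a hypothetical helper, and the sample values are made up, with 85 percent taken from the disk usage example above.

```php
<?php

/**
 * Return true when the most recent samples have all exceeded the threshold
 * for at least $minConsecutive readings in a row.
 * (Hypothetical helper; samples are ordered oldest to newest.)
 */
function isSustainedBreach(array $samples, float $threshold, int $minConsecutive): bool
{
    $streak = 0;

    // Walk backwards from the newest sample and stop at the first healthy one.
    foreach (array_reverse($samples) as $value) {
        if ($value <= $threshold) {
            break;
        }
        $streak++;
    }

    return $streak >= $minConsecutive;
}

// Disk usage percentages from four consecutive checks (made-up values).
$steadyClimb = [80, 86, 87, 90];
$briefSpike = [86, 87, 84, 90];

var_dump(isSustainedBreach($steadyClimb, 85, 3)); // bool(true)
var_dump(isSustainedBreach($briefSpike, 85, 3));  // bool(false)
```

Requiring consecutive breaches trades a slower alert for fewer false alarms from short spikes, so the count should be tuned per metric.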