Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on OVH

Proactive PHP Application Health Checks

Maintaining the health of a PHP application, especially one running on a cloud provider like OVH, requires more than just basic uptime checks. We need to go deeper, ensuring the application is not only reachable but also responsive and free from internal errors. This involves a multi-layered approach, combining external synthetic monitoring with internal application-level diagnostics.

For external checks, tools like Prometheus with Blackbox Exporter are invaluable. Blackbox Exporter probes endpoints using various protocols (HTTP, TCP, ICMP, DNS) from wherever it is deployed, so running exporters in several geographical locations approximates real user experience. However, a plain probe only tells us whether the *server* is responding. To understand the *application’s* internal state, we need application-specific probes.

Implementing Application-Level Health Endpoints

A common and effective practice is to expose a dedicated health check endpoint within your PHP application. This endpoint should perform critical checks: database connectivity, essential service availability, and even basic application logic validation. The response should be simple and machine-readable, typically JSON.

Consider a basic health check endpoint in a Symfony application:

// src/Controller/HealthCheckController.php
namespace App\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Annotation\Route; // Required for the @Route annotation below
use Doctrine\ORM\EntityManagerInterface; // Assuming Doctrine for DB check

class HealthCheckController extends AbstractController
{
    private EntityManagerInterface $entityManager;

    public function __construct(EntityManagerInterface $entityManager)
    {
        $this->entityManager = $entityManager;
    }

    /**
     * @Route("/health", name="app_health_check", methods={"GET"})
     */
    public function index(): JsonResponse
    {
        $status = 'ok';
        $checks = [];

        // 1. Database Connectivity Check
        try {
            $connection = $this->entityManager->getConnection();
            // Run a trivial query. This opens the connection if needed and
            // throws if the database is unreachable.
            $connection->fetchOne('SELECT 1');
            $checks['database'] = ['status' => 'ok'];
        } catch (\Throwable $e) {
            $status = 'error';
            $checks['database'] = ['status' => 'error', 'message' => $e->getMessage()];
        }

        // 2. Add other critical service checks here (e.g., Redis, external APIs)
        // Example: Redis check (assuming a service is injected or globally available)
        /*
        try {
            // $redisClient->ping(); // Replace with your actual Redis client call
            // $checks['redis'] = ['status' => 'ok'];
        } catch (\Throwable $e) {
            $status = 'error';
            $checks['redis'] = ['status' => 'error', 'message' => $e->getMessage()];
        }
        */

        // 3. Basic Application Logic Check (e.g., can we load a core configuration?)
        try {
            // Example: Check if a critical configuration parameter is accessible
            // $configValue = $this->getParameter('app.critical_setting');
            // if (empty($configValue)) {
            //     throw new \RuntimeException('Critical setting is missing.');
            // }
            $checks['app_logic'] = ['status' => 'ok'];
        } catch (\Throwable $e) {
            $status = 'error';
            $checks['app_logic'] = ['status' => 'error', 'message' => $e->getMessage()];
        }

        $response = new JsonResponse([
            'status' => $status,
            'checks' => $checks,
            'timestamp' => (new \DateTime())->format(\DateTime::ATOM),
        ]);

        if ($status === 'error') {
            $response->setStatusCode(Response::HTTP_SERVICE_UNAVAILABLE);
        } else {
            $response->setStatusCode(Response::HTTP_OK);
        }

        return $response;
    }
}

This endpoint should be monitored by your external monitoring system (e.g., Prometheus Blackbox Exporter, Datadog, New Relic). The Blackbox Exporter probes it over HTTP and, by default, treats any non-2xx status code as a failure, so the 503 returned on error fails the probe on its own. The exporter does not parse JSON, but it can match the raw response body against regular expressions as an additional check. An alert should be triggered whenever the probe fails.
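For monitoring systems that can run custom checks, the evaluation described above can be sketched in a few lines of Python. This is a minimal sketch, not a replacement for a real probe: the function names evaluate_health and probe are illustrative, and the payload shape is assumed to match the controller shown earlier (a top-level status plus a checks map).

```python
import json
import urllib.error
import urllib.request


def evaluate_health(http_status, body):
    """Return (healthy, failed_checks) for one health endpoint response."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False, ["invalid_json"]
    if not isinstance(payload, dict):
        return False, ["invalid_json"]

    checks = payload.get("checks")
    if not isinstance(checks, dict):
        checks = {}  # PHP encodes an empty array as [] rather than {}

    # Collect the names of all checks that did not report "ok"
    failed = sorted(
        name for name, check in checks.items()
        if not isinstance(check, dict) or check.get("status") != "ok"
    )
    healthy = http_status == 200 and payload.get("status") == "ok" and not failed
    return healthy, failed


def probe(url, timeout=10):
    """Fetch the endpoint; a 503 still carries a JSON body worth reading."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return evaluate_health(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        return evaluate_health(err.code, err.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False, ["unreachable"]
```

Calling probe with the URL of your health endpoint returns a boolean plus the names of the failing checks, which is usually enough context for an alert message.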

Integrating with Prometheus

To integrate this with Prometheus, you’d configure the Blackbox Exporter to perform an HTTP probe. The probe configuration in blackbox.yml would look something like this:

modules:
  http_app_health:
    prober: http
    timeout: 10s
    http:
      method: GET
      # Override the Host header only if every target probed with this module shares it
      # headers:
      #   Host: your-app.example.com
      # Basic authentication if your health endpoint requires it
      # basic_auth:
      #   username: "user"
      #   password: "password"
      fail_if_not_ssl: false # Set to true if using HTTPS
      fail_if_ssl: false
      valid_status_codes: [] # Defaults to 2xx, so the 503 returned on error fails the probe
      follow_redirects: true
      # Blackbox Exporter does not parse JSON; it matches the raw body against
      # regular expressions. This fails the probe if any check reports "error".
      fail_if_body_matches_regexp:
        - '"status"\s*:\s*"error"'
      # TLS settings, only needed for HTTPS targets with a private CA or client certificates
      # tls_config:
      #   insecure_skip_verify: false
      #   ca_file: /path/to/ca.crt
      #   cert_file: /path/to/client.crt
      #   key_file: /path/to/client.key
      #   server_name: your-app.example.com

And in your Prometheus configuration (prometheus.yml), you’d scrape the Blackbox Exporter:

scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_app_health]
    static_configs:
      - targets:
        - http://your-app-health-endpoint.example.com:80/health # Replace with your actual health endpoint URL
        - https://another-app-health.example.com:443/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.service.consul:9115 # Replace with your Blackbox Exporter address

This setup ensures that not only is the server responding, but the PHP application itself is reporting a healthy state based on its internal checks. Alerts can then be configured in Alertmanager based on Prometheus metrics like probe_success.
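When debugging a probe before wiring up alerts, it helps to call the exporter's /probe endpoint directly and read the result. The sketch below assumes the exporter address and module name from the configuration above; parse_probe_metrics and run_probe are illustrative names, and the parser deliberately handles only unlabelled samples such as probe_success.

```python
import urllib.parse
import urllib.request


def parse_probe_metrics(text):
    """Parse unlabelled samples from Prometheus text exposition output."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled series
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def run_probe(exporter, module, target, timeout=15):
    """Ask a Blackbox Exporter to probe target and return the parsed samples."""
    query = urllib.parse.urlencode({"module": module, "target": target})
    with urllib.request.urlopen(f"{exporter}/probe?{query}", timeout=timeout) as resp:
        return parse_probe_metrics(resp.read().decode("utf-8"))
```

A probe_success value of 0 in the returned dictionary is the same signal an Alertmanager rule would act on.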

Elasticsearch Cluster Health and Performance Monitoring

Elasticsearch clusters, especially those hosting critical data for PHP applications (e.g., for search, logging, or analytics), demand rigorous monitoring. OVH’s managed Elasticsearch services or self-hosted clusters require attention to cluster health, node status, JVM metrics, and query performance.

Key Elasticsearch Metrics to Track

Cluster Health API: The _cluster/health endpoint provides a high-level overview of the cluster’s status (green, yellow, red), number of nodes, shards, and pending tasks.
Node Stats API: The _nodes/stats endpoint offers detailed metrics per node, including JVM heap usage, CPU utilization, disk I/O, network traffic, and file system usage.
Index Stats API: Useful for understanding performance at the index level, including search and indexing latency, document counts, and segment sizes.
JVM Metrics: Heap usage, garbage collection activity, and thread counts are critical for identifying potential performance bottlenecks.
Query Performance: Slow logs are essential for identifying inefficient queries that can degrade cluster performance.
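As a rough illustration of how the cluster health response can be turned into actionable findings, the following Python sketch classifies a _cluster/health payload. The function names, the expected_nodes parameter, and the pending-task threshold are assumptions made for this example; the field names (status, number_of_nodes, number_of_pending_tasks) come from the Cluster Health API response.

```python
import json
import urllib.request


def summarize_cluster_health(health, expected_nodes=None, max_pending_tasks=50):
    """Return a list of (severity, message) findings for a _cluster/health dict."""
    findings = []
    status = health.get("status")
    if status == "red":
        findings.append(("critical", "at least one primary shard is unassigned"))
    elif status == "yellow":
        findings.append(("warning", "at least one replica shard is unassigned"))
    elif status != "green":
        findings.append(("critical", f"unexpected status {status!r}"))

    nodes = health.get("number_of_nodes")
    if expected_nodes is not None and nodes is not None and nodes < expected_nodes:
        findings.append(("warning", f"{nodes} of {expected_nodes} nodes present"))

    pending = health.get("number_of_pending_tasks", 0)
    if pending > max_pending_tasks:
        findings.append(("warning", f"{pending} pending cluster tasks"))
    return findings


def fetch_cluster_health(base_url, timeout=10):
    """Fetch _cluster/health from an unauthenticated node (an assumption)."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

An empty list means nothing needs attention; anything else can be forwarded to whichever alerting channel you already use.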

Leveraging Metricbeat for Data Collection

Metricbeat, part of the Elastic Stack, is an excellent agent for collecting system and service metrics. It can be deployed on your Elasticsearch nodes (or a dedicated monitoring node) to gather detailed performance data.

Here’s a sample configuration for modules.d/elasticsearch.yml in Metricbeat:

# Metricbeat configuration for Elasticsearch module
- module: elasticsearch
  period: 10s # How often to fetch metrics
  hosts: ["http://localhost:9200"] # Or your Elasticsearch node's IP/hostname
  # Optional: Authentication
  # username: "elastic"
  # password: "changeme"

  # Enable specific metricsets
  metricsets:
    - node
    - node_stats
    - cluster_stats
    - index
    - index_summary
    - shard
    - pending_tasks

  # Output and index settings do not belong in a module file;
  # configure output.elasticsearch or output.logstash in metricbeat.yml.

Ensure your main metricbeat.yml is configured to send data to your Elasticsearch cluster or Logstash instance. A typical output configuration in metricbeat.yml would look like:

#============================== Elasticsearch output ==============================
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["http://your-elasticsearch-host:9200"]

  # Protocol - either http or https
  protocol: "http"

  # Authentication credentials - optional
  # username: "elastic"
  # password: "changeme"

  # Index name for the generated Elasticsearch indices. Overriding the default
  # also requires setup.template.name and setup.template.pattern to be set.
  index: "metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}"

#================================ Logstash output =================================
#output.logstash:
#  # The Logstash hosts
#  hosts: ["localhost:5044"]

Setting Up Alerts with Kibana and Elasticsearch Watcher

Once metrics are flowing into Elasticsearch, Kibana’s monitoring UI provides excellent dashboards. For proactive alerting, Elasticsearch’s Watcher (or Kibana’s Alerting features) is crucial.

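Here’s an example of a Watcher definition to alert on a red or yellow cluster health status. This JSON would be sent in a PUT request to the _watcher/watch/<watch_id> API, where the watch ID is a name you choose. Watcher is a subscription feature, so confirm that your license includes it.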

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          "metricbeat-*"
        ],
        "body": {
          "query": {
            "bool": {
              "must": [
                {
                  "term": {
                    "agent.name": "your-metricbeat-instance-name"
                  }
                },
                {
                  "terms": {
                    "elasticsearch.cluster.stats.status": ["red", "yellow"]
                  }
                },
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-5m/m",
                      "lt": "now/m"
                    }
                  }
                }
              ]
            }
          },
          "aggs": {
            "cluster_health": {
              "terms": {
                "field": "elasticsearch.cluster.stats.status",
                "size": 1
              }
            }
          },
          "_source": false
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "ctx.payload.aggregations.cluster_health.buckets.length > 0 && (ctx.payload.aggregations.cluster_health.buckets[0].key == 'red' || ctx.payload.aggregations.cluster_health.buckets[0].key == 'yellow')",
      "lang": "painless"
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "profile": "default",
        "to": [
          "[email protected]"
        ],
        "subject": "Elasticsearch Cluster Alert: {{ctx.payload.aggregations.cluster_health.buckets.0.key}}",
        "body": {
          "text": "Elasticsearch cluster status '{{ctx.payload.aggregations.cluster_health.buckets.0.key}}' detected. Please investigate immediately. Watch executed at: {{ctx.execution_time}}"
        }
      }
    }
  }
}

Similarly, alerts can be set up for high JVM heap usage, low disk space on nodes, or excessive slow query logs. The key is to define thresholds that are meaningful for your specific workload and cluster size.
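To make the threshold idea concrete, here is a small Python sketch that scans a _nodes/stats response for nodes with high heap usage or low free disk space. The default thresholds (85% heap, 15% free disk) and the function name find_node_problems are illustrative assumptions, not recommendations; the field paths follow the Node Stats API response.

```python
def find_node_problems(node_stats, heap_pct_max=85, disk_free_pct_min=15):
    """Return (node_name, metric, value) tuples for nodes breaching a threshold."""
    problems = []
    for node_id, node in node_stats.get("nodes", {}).items():
        name = node.get("name", node_id)

        # JVM heap usage as a percentage of the configured maximum
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= heap_pct_max:
            problems.append((name, "heap", heap))

        # Free disk space across all data paths on the node
        fs_total = node.get("fs", {}).get("total", {})
        total = fs_total.get("total_in_bytes")
        available = fs_total.get("available_in_bytes")
        if total and available is not None:
            free_pct = round(100.0 * available / total, 1)
            if free_pct <= disk_free_pct_min:
                problems.append((name, "disk_free", free_pct))
    return problems
```

The same function can be run against stored Metricbeat documents or a live API response, as long as the dictionary has the nodes structure shown in the field paths above.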

OVH Specific Considerations and Network Monitoring

When operating on OVH, understanding their network infrastructure and potential bottlenecks is crucial. While OVH provides robust infrastructure, network performance can be a silent killer of application responsiveness.

Network Latency and Bandwidth Monitoring

For applications with high inter-service communication or external API dependencies, monitoring network latency between your OVH instances and other services (including your Elasticsearch cluster if it’s separate) is vital. Tools like ping, traceroute, and more sophisticated network monitoring agents can be deployed.

Consider deploying a simple Python script on one of your OVH instances to periodically ping an external endpoint or another internal service and report the latency. This can be scraped by Prometheus.

# ping_monitor.py
import re
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGET = '8.8.8.8'  # Host to ping, e.g. Google DNS or another internal service

# Linux (iputils) prints a summary such as:
#   rtt min/avg/max/mdev = 9.8/10.2/10.9/0.4 ms
# The capture group picks out the average. Output is OS-dependent.
RTT_PATTERN = re.compile(r'min/avg/max/(?:mdev|stddev) = [\d.]+/([\d.]+)/')


def measure_latency(target):
    """Return the average round-trip time in seconds, or None on failure."""
    try:
        # Adjust count (-c) and timeout (-W) as needed
        result = subprocess.run(
            ['ping', '-c', '1', '-W', '2', target],
            capture_output=True,
            text=True,
            check=True
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    match = RTT_PATTERN.search(result.stdout)
    # Treat unparseable output as a failure
    return float(match.group(1)) / 1000.0 if match else None  # ms to seconds


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/metrics':
            self.send_response(404)
            self.end_headers()
            return

        latency = measure_latency(TARGET)
        lines = [
            '# HELP ping_up Whether the last ping succeeded (1) or failed (0)',
            '# TYPE ping_up gauge',
            f'ping_up{{target="{TARGET}"}} {0 if latency is None else 1}',
            '# HELP ping_latency_seconds Last ping latency in seconds',
            '# TYPE ping_latency_seconds gauge',
        ]
        if latency is not None:
            # Omit the sample on failure instead of reporting a misleading 0
            lines.append(f'ping_latency_seconds{{target="{TARGET}"}} {latency}')

        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; version=0.0.4')
        self.end_headers()
        self.wfile.write(('\n'.join(lines) + '\n').encode())


# Note: 9100 is also node_exporter's default port; change it if both run on this host
def run(server_class=HTTPServer, handler_class=PingHandler, port=9100):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f'Starting httpd on port {port}...')
    httpd.serve_forever()


if __name__ == "__main__":
    run()

This script exposes a /metrics endpoint that Prometheus can scrape. You would then configure Prometheus to scrape this endpoint, and Alertmanager can trigger alerts if latency exceeds a defined threshold or if pings consistently fail.

OVH Firewall and Security Group Configuration

Ensure your OVH firewall rules and security groups are configured to allow necessary monitoring traffic (e.g., from your monitoring probes or agents) while blocking unauthorized access. This is a fundamental security practice that also aids in troubleshooting connectivity issues.

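For instance, if your PHP app is on an OVH Public Cloud instance, you’d configure security groups via the OVHcloud Control Panel or API to allow inbound HTTP/HTTPS traffic to the health endpoint from the host running the Blackbox Exporter. Port 9115 belongs to the exporter itself: it only needs to accept connections from your Prometheus server. In both cases, restrict access to your monitoring infrastructure’s IP addresses.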
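After changing firewall rules, a quick reachability check from the monitoring host confirms that the intended ports are open and nothing else is. The Python sketch below is one way to do that with plain TCP connects; the function names are illustrative, and the hosts and ports you pass in are your own.

```python
import socket


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


def check_ports(targets, timeout=3.0):
    """Check several (host, port) pairs and return a mapping of results."""
    return {
        f"{host}:{port}": check_port(host, port, timeout)
        for host, port in targets
    }
```

Run it from the machine that hosts Prometheus or the Blackbox Exporter, since the result depends on the source address the firewall sees.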

Centralized Logging and Error Aggregation

Effective monitoring is incomplete without robust logging. Centralizing logs from your PHP applications and Elasticsearch cluster provides a single pane of glass for debugging and historical analysis.

PHP Application Logging

Leverage a logging library like Monolog in your PHP application. Configure it to log to files, but more importantly, to a log aggregation system.

# config/packages/monolog.yaml (Symfony example)
monolog:
    handlers:
        # File handler for local debugging (optional)
        main:
            type: stream
            path: '%kernel.logs_dir%/%kernel.environment%.log'
            level: debug
            channels: ['!event', '!doctrine']

        # Elasticsearch handler for centralized logging.
        # Option names follow MonologBundle's elastic_search handler type;
        # verify them against the bundle version you have installed.
        elasticsearch:
            type: elastic_search
            elasticsearch:
                host: 'http://your-elasticsearch-host:9200'
                # Optional: Authentication
                # user: 'elastic'
                # password: 'changeme'
            # The index name is used literally; Logstash-style date patterns are not expanded
            index: 'php-app-logs'
            level: info # Adjust log level as needed

Monolog’s ElasticsearchHandler formats records with its own ElasticsearchFormatter, which produces the structured documents that Elasticsearch indexes, so no separate JSON formatter is configured here. The handler writes directly to Elasticsearch and cannot target Logstash. Ensure you have the handler’s dependencies installed (`composer require monolog/monolog elasticsearch/elasticsearch`).

Elasticsearch Slow Logs and Audit Logs

Configure Elasticsearch to log slow queries and, if necessary, audit logs. These are invaluable for performance tuning and security analysis.

Slow log thresholds are index-level settings, so they are applied through the index settings API rather than elasticsearch.yml. Audit logging is enabled in elasticsearch.yml:

# Slow logs: dynamic, per-index settings (my-index is a placeholder)
# PUT /my-index/_settings
# {
#   "index.search.slowlog.threshold.query.warn": "10s",
#   "index.indexing.slowlog.threshold.index.warn": "10s"
# }

# elasticsearch.yml
# Audit logs (require a subscription that includes the security audit feature)
# xpack.security.audit.enabled: true
# xpack.security.audit.logfile.events.include:
#   - access_denied
#   - authentication_failed
#   - connection_denied

These logs should also be shipped to your central logging system for aggregation and analysis alongside your application logs. Filebeat is the usual choice here, since Metricbeat collects metrics rather than log files.

Conclusion: A Holistic Approach

Effective server monitoring for a PHP application and its Elasticsearch cluster on OVH is not a single tool or configuration. It’s a holistic strategy encompassing application-level health checks, deep infrastructure metrics, network performance, and centralized logging. By implementing these advanced practices, you move from reactive firefighting to proactive system management, ensuring the stability and performance of your critical services.