Server Monitoring Best Practices: Keeping Your PHP App and Elasticsearch Clusters Alive on OVH
Proactive PHP Application Health Checks
Maintaining the health of a PHP application, especially one running on a cloud provider like OVH, requires more than just basic uptime checks. We need to go deeper, ensuring the application is not only reachable but also responsive and free from internal errors. This involves a multi-layered approach, combining external synthetic monitoring with internal application-level diagnostics.
For external checks, tools like Prometheus with Blackbox Exporter are invaluable. Blackbox Exporter probes endpoints over several protocols (HTTP, TCP, ICMP, DNS), and running exporters in different geographical locations approximates what real users experience. However, a bare probe only tells us that the *server* is responding. To understand the *application’s* internal state, we need application-specific probes.
Implementing Application-Level Health Endpoints
A common and effective practice is to expose a dedicated health check endpoint within your PHP application. This endpoint should perform critical checks: database connectivity, essential service availability, and even basic application logic validation. The response should be simple and machine-readable, typically JSON.
Consider a basic health check endpoint in a Symfony application:
<?php
// src/Controller/HealthCheckController.php
namespace App\Controller;

use Doctrine\ORM\EntityManagerInterface; // Assuming Doctrine for DB check
use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Annotation\Route; // Needed for the @Route annotation below

class HealthCheckController extends AbstractController
{
    private EntityManagerInterface $entityManager;

    public function __construct(EntityManagerInterface $entityManager)
    {
        $this->entityManager = $entityManager;
    }

    /**
     * @Route("/health", name="app_health_check", methods={"GET"})
     */
    public function index(): JsonResponse
    {
        $status = 'ok';
        $checks = [];

        // 1. Database Connectivity Check
        try {
            // Running a trivial query opens the connection if needed and
            // throws when the database is unreachable.
            $this->entityManager->getConnection()->executeQuery('SELECT 1')->fetchOne();
            $checks['database'] = ['status' => 'ok'];
        } catch (\Throwable $e) {
            $status = 'error';
            // Consider logging the message instead of returning it if the endpoint is public.
            $checks['database'] = ['status' => 'error', 'message' => $e->getMessage()];
        }

        // 2. Add other critical service checks here (e.g., Redis, external APIs)
        // Example: Redis check (assuming a service is injected or globally available)
        /*
        try {
            // $redisClient->ping(); // Replace with your actual Redis client call
            // $checks['redis'] = ['status' => 'ok'];
        } catch (\Throwable $e) {
            $status = 'error';
            $checks['redis'] = ['status' => 'error', 'message' => $e->getMessage()];
        }
        */

        // 3. Basic Application Logic Check (e.g., can we load a core configuration?)
        try {
            // Example: Check if a critical configuration parameter is accessible
            // $configValue = $this->getParameter('app.critical_setting');
            // if (empty($configValue)) {
            //     throw new \RuntimeException('Critical setting is missing.');
            // }
            $checks['app_logic'] = ['status' => 'ok'];
        } catch (\Throwable $e) {
            $status = 'error';
            $checks['app_logic'] = ['status' => 'error', 'message' => $e->getMessage()];
        }

        $response = new JsonResponse([
            'status' => $status,
            'checks' => $checks,
            'timestamp' => (new \DateTime())->format(\DateTime::ATOM),
        ]);

        // 200 when every check passed, 503 as soon as one failed
        $response->setStatusCode(
            $status === 'error' ? Response::HTTP_SERVICE_UNAVAILABLE : Response::HTTP_OK
        );

        return $response;
    }
}
This endpoint should be monitored by your external monitoring system (e.g., Prometheus Blackbox Exporter, Datadog, New Relic). Blackbox Exporter probes the endpoint over HTTP; it does not parse JSON, but it can check the HTTP status code and match the response body against regular expressions. An alert should fire whenever the status code is not 200 (the endpoint returns 503 when a check fails) or the body reports an error status.
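As a rough sketch of the decision a monitor makes against this endpoint, the Python function below treats a response as healthy only when the HTTP status is 200, the top-level status field is ok, and no individual check reports a failure. The function name evaluate_health and the sample payloads are illustrative assumptions; they are not part of any monitoring tool.

```python
import json


def evaluate_health(status_code, body):
    """Return (healthy, failed_checks) for one health endpoint response."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # A body that is not valid JSON counts as unhealthy.
        return False, ["invalid_json"]
    if not isinstance(payload, dict):
        return False, ["invalid_json"]

    checks = payload.get("checks", {})
    if not isinstance(checks, dict):
        checks = {}
    # Collect the names of all checks whose status is anything other than "ok".
    failed = sorted(
        name for name, result in checks.items()
        if isinstance(result, dict) and result.get("status") != "ok"
    )
    healthy = status_code == 200 and payload.get("status") == "ok" and not failed
    return healthy, failed


if __name__ == "__main__":
    ok_body = '{"status": "ok", "checks": {"database": {"status": "ok"}}}'
    bad_body = '{"status": "error", "checks": {"database": {"status": "error"}}}'
    print(evaluate_health(200, ok_body))
    print(evaluate_health(503, bad_body))
```

Returning the names of the failed checks alongside the boolean makes it easier to route an alert to the team that owns the failing dependency.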
Integrating with Prometheus
To integrate this with Prometheus, you’d configure the Blackbox Exporter to perform an HTTP probe. The probe configuration in blackbox.yml would look something like this:
modules:
  http_app_health:
    prober: http
    timeout: 10s
    http:
      method: GET
      headers:
        Host: your-app.example.com
      # Basic authentication if your health endpoint requires it
      # basic_auth:
      #   username: "user"
      #   password: "password"
      fail_if_not_ssl: false # Set to true if using HTTPS
      fail_if_ssl: false
      valid_status_codes: [] # Default is 2xx
      follow_redirects: true
      tls_config:
        insecure_skip_verify: false
        ca_file: /path/to/ca.crt
        cert_file: /path/to/client.crt
        key_file: /path/to/client.key
        server_name: your-app.example.com
      # Blackbox Exporter cannot parse JSON, so match the raw body instead.
      # The probe fails if any "status" field in the response reports "error".
      fail_if_body_matches_regexp:
        - '"status"\s*:\s*"error"'
And in your Prometheus configuration (prometheus.yml), you’d scrape the Blackbox Exporter:
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_app_health]
    static_configs:
      - targets:
          - http://your-app-health-endpoint.example.com:80/health # Replace with your actual health endpoint URL
          - https://another-app-health.example.com:443/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.service.consul:9115 # Replace with your Blackbox Exporter address
This setup ensures that not only is the server responding, but the PHP application itself is reporting a healthy state based on its internal checks. Alerts can then be configured in Alertmanager based on Prometheus metrics like probe_success.
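To make the relabeling step less opaque, here is a small Python sketch of the request Prometheus ends up sending: each configured target becomes the target query parameter, and the scrape itself goes to the exporter's /probe path. The helper name build_probe_url is hypothetical, and the sketch assumes the exporter is reached over plain HTTP.

```python
from urllib.parse import urlencode


def build_probe_url(exporter_address, module, target):
    """Build the /probe URL that Prometheus requests from the Blackbox Exporter."""
    # urlencode percent-encodes the target so it survives as a query parameter.
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter_address}/probe?{query}"


if __name__ == "__main__":
    print(build_probe_url(
        "blackbox-exporter.service.consul:9115",
        "http_app_health",
        "https://another-app-health.example.com:443/health",
    ))
```

Requesting the same URL by hand (for example with curl) is a quick way to debug a probe, since the exporter answers with the probe metrics, including probe_success.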
Elasticsearch Cluster Health and Performance Monitoring
Elasticsearch clusters, especially those hosting critical data for PHP applications (e.g., for search, logging, or analytics), demand rigorous monitoring. OVH’s managed Elasticsearch services or self-hosted clusters require attention to cluster health, node status, JVM metrics, and query performance.
Key Elasticsearch Metrics to Track
- Cluster Health API: The _cluster/health endpoint provides a high-level overview of the cluster’s status (green, yellow, red), number of nodes, shards, and pending tasks.
- Node Stats API: The _nodes/stats endpoint offers detailed metrics per node, including JVM heap usage, CPU utilization, disk I/O, network traffic, and file system usage.
- Index Stats API: Useful for understanding performance at the index level, including search and indexing latency, document counts, and segment sizes.
- JVM Metrics: Heap usage, garbage collection activity, and thread counts are critical for identifying potential performance bottlenecks.
- Query Performance: Slow logs are essential for identifying inefficient queries that can degrade cluster performance.
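The following Python sketch shows one way to turn a _cluster/health response into actionable findings. It works on an already-decoded response (a dict) so it stays independent of any HTTP client; the function name summarize_cluster_health and the expected_nodes parameter are assumptions made for illustration.

```python
def summarize_cluster_health(health, expected_nodes):
    """Turn a _cluster/health response (as a dict) into a list of findings."""
    findings = []

    status = health.get("status")
    if status == "red":
        findings.append("critical: at least one primary shard is unassigned")
    elif status == "yellow":
        findings.append("warning: at least one replica shard is unassigned")
    elif status != "green":
        findings.append(f"critical: unexpected status {status!r}")

    # A missing node often explains a yellow or red status.
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        findings.append(f"warning: {nodes} of {expected_nodes} nodes present")

    pending = health.get("number_of_pending_tasks", 0)
    if pending > 0:
        findings.append(f"info: {pending} pending cluster tasks")

    return findings


if __name__ == "__main__":
    sample = {"status": "yellow", "number_of_nodes": 2, "number_of_pending_tasks": 0}
    for finding in summarize_cluster_health(sample, expected_nodes=3):
        print(finding)
```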
Leveraging Metricbeat for Data Collection
Metricbeat, part of the Elastic Stack, is an excellent agent for collecting system and service metrics. It can be deployed on your Elasticsearch nodes (or a dedicated monitoring node) to gather detailed performance data.
Here’s a sample configuration for modules.d/elasticsearch.yml in Metricbeat:
# Metricbeat configuration for Elasticsearch module
- module: elasticsearch
  period: 10s # How often to fetch metrics
  hosts: ["http://localhost:9200"] # Or your Elasticsearch node's IP/hostname
  # Optional: Authentication
  # username: "elastic"
  # password: "changeme"
  # Enable specific metricsets
  metricsets:
    - node
    - node_stats
    - index
    - shard
    - cluster_stats
# Outputs and index naming are not set per module; configure them in metricbeat.yml:
# output.elasticsearch:
#   hosts: ["http://localhost:9200"]
#   username: "elastic"
#   password: "changeme"
# Optional: Configure Logstash output if you use it as an intermediary
# output.logstash:
#   hosts: ["localhost:5044"]
Ensure your main metricbeat.yml is configured to send data to your Elasticsearch cluster or Logstash instance. A typical output configuration in metricbeat.yml would look like:
#============================== Elasticsearch output ==============================
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["http://your-elasticsearch-host:9200"]
  # Protocol - either http or https
  protocol: "http"
  # Authentication credentials - optional
  # username: "elastic"
  # password: "changeme"
  # Index name prefix for the generated Elasticsearch indices. If not set, the
  # default value "metricbeat" is used.
  index: "metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}"

#================================ Logstash output =================================
#output.logstash:
#  # The Logstash hosts
#  hosts: ["localhost:5044"]
Setting Up Alerts with Kibana and Elasticsearch Watcher
Once metrics are flowing into Elasticsearch, Kibana’s monitoring UI provides excellent dashboards. For proactive alerting, Elasticsearch’s Watcher (or Kibana’s Alerting features) is crucial.
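Here’s an example of a Watcher definition to alert on a red or yellow cluster health status. This JSON is sent as the body of a PUT request to _watcher/watch/<watch_id>. Field names differ between Metricbeat versions, so check them against the documents in your own metricbeat-* indices.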
{
"trigger": {
"schedule": {
"interval": "1m"
}
},
"input": {
"search": {
"request": {
"indices": [
"metricbeat-*"
],
"body": {
"query": {
"bool": {
"must": [
{
"term": {
"agent.name": "your-metricbeat-instance-name"
}
},
{
"range": {
"@timestamp": {
"gte": "now-5m/m",
"lt": "now/m"
}
}
}
]
}
},
"aggs": {
"cluster_health": {
"terms": {
"field": "elasticsearch.cluster.stats.status",
"size": 1
}
}
},
"_source": false
}
}
}
},
"condition": {
"script": {
"source": "ctx.payload.aggregations.cluster_health.buckets.length > 0 && (ctx.payload.aggregations.cluster_health.buckets[0].key == 'red' || ctx.payload.aggregations.cluster_health.buckets[0].key == 'yellow')",
"lang": "painless"
}
},
"actions": {
"send_email": {
"email": {
"profile": "default",
"to": [
"[email protected]"
],
"subject": "Elasticsearch Cluster Alert: {{ctx.payload.aggregations.cluster_health.buckets.0.key}}",
"body": {
"text": "Elasticsearch cluster status '{{ctx.payload.aggregations.cluster_health.buckets.0.key}}' detected. Please investigate immediately. Watch executed at: {{ctx.execution_time}}"
}
}
}
}
}
Similarly, alerts can be set up for high JVM heap usage, low disk space on nodes, or excessive slow query logs. The key is to define thresholds that are meaningful for your specific workload and cluster size.
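As an illustration of threshold-based checks, the Python sketch below scans a decoded _nodes/stats response for nodes with high JVM heap usage or little free disk space. The function name find_node_alerts and the default thresholds (85% heap, 15% free disk) are assumptions chosen for the example and should be tuned to your workload.

```python
def find_node_alerts(node_stats, heap_limit_pct=85, disk_free_min_pct=15):
    """Return (node_name, metric, value) tuples for nodes breaching a threshold."""
    alerts = []
    for node_id, node in sorted(node_stats.get("nodes", {}).items()):
        name = node.get("name", node_id)

        # JVM heap usage as a percentage of the configured maximum heap.
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= heap_limit_pct:
            alerts.append((name, "jvm_heap", heap))

        # Free disk space across the node's data paths.
        fs_total = node.get("fs", {}).get("total", {})
        total = fs_total.get("total_in_bytes")
        available = fs_total.get("available_in_bytes")
        if total and available is not None:
            free_pct = round(100 * available / total, 1)
            if free_pct < disk_free_min_pct:
                alerts.append((name, "disk_free_pct", free_pct))
    return alerts


if __name__ == "__main__":
    sample = {
        "nodes": {
            "a": {
                "name": "es-1",
                "jvm": {"mem": {"heap_used_percent": 91}},
                "fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 100}},
            },
            "b": {
                "name": "es-2",
                "jvm": {"mem": {"heap_used_percent": 40}},
                "fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 600}},
            },
        }
    }
    for alert in find_node_alerts(sample):
        print(alert)
```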
OVH Specific Considerations and Network Monitoring
When operating on OVH, understanding their network infrastructure and potential bottlenecks is crucial. While OVH provides robust infrastructure, network performance can be a silent killer of application responsiveness.
Network Latency and Bandwidth Monitoring
For applications with high inter-service communication or external API dependencies, monitoring network latency between your OVH instances and other services (including your Elasticsearch cluster if it’s separate) is vital. Tools like ping, traceroute, and more sophisticated network monitoring agents can be deployed.
Consider deploying a simple Python script on one of your OVH instances to periodically ping an external endpoint or another internal service and report the latency. This can be scraped by Prometheus.
# ping_monitor.py
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGET = '8.8.8.8'  # Ping target, e.g., Google DNS


def measure_latency(target):
    """Ping target once and return the average round-trip time in seconds."""
    # Adjust count (-c) and timeout (-W) as needed
    result = subprocess.run(
        ['ping', '-c', '1', '-W', '2', target],
        capture_output=True,
        text=True,
        check=True
    )
    # Parse latency from ping output (this is OS-dependent)
    # Linux summary line: "rtt min/avg/max/mdev = 9.8/10.1/10.5/0.2 ms"
    for line in result.stdout.splitlines():
        if 'min/avg/max' in line:
            values = line.split('=')[1].split()[0].split('/')
            return float(values[1]) / 1000.0  # avg is the second value; convert ms to seconds
    raise ValueError('could not parse ping output')


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/metrics':
            self.send_response(404)
            self.end_headers()
            return
        try:
            sample = str(measure_latency(TARGET))
        except (subprocess.CalledProcessError, ValueError, OSError):
            # Report NaN on failure; 0 would look like a perfect ping
            sample = 'NaN'
        body = (
            '# HELP ping_latency_seconds Last ping latency in seconds\n'
            '# TYPE ping_latency_seconds gauge\n'
            f'ping_latency_seconds{{target="{TARGET}"}} {sample}\n'
        )
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; version=0.0.4')
        self.end_headers()
        self.wfile.write(body.encode())


# Port 9100 is also node_exporter's default; pick another port if both run on this host
def run(server_class=HTTPServer, handler_class=PingHandler, port=9100):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f'Starting httpd on port {port}...')
    httpd.serve_forever()


if __name__ == "__main__":
    run()
This script exposes a /metrics endpoint that Prometheus can scrape. You would then configure Prometheus to scrape this endpoint, and Alertmanager can trigger alerts if latency exceeds a defined threshold or if pings consistently fail.
OVH Firewall and Security Group Configuration
Ensure your OVH firewall rules and security groups are configured to allow necessary monitoring traffic (e.g., from your monitoring probes or agents) while blocking unauthorized access. This is a fundamental security practice that also aids in troubleshooting connectivity issues.
For instance, if your PHP app is on an OVH Public Cloud instance, you’d configure security groups via the OVHcloud Control Panel or API to allow inbound HTTP/HTTPS traffic to the health endpoint from the host running the Blackbox Exporter. The exporter’s own port (9115) only needs to be reachable from your Prometheus server, so restrict it to your monitoring infrastructure’s IP addresses.
Centralized Logging and Error Aggregation
Effective monitoring is incomplete without robust logging. Centralizing logs from your PHP applications and Elasticsearch cluster provides a single pane of glass for debugging and historical analysis.
PHP Application Logging
Leverage a logging library like Monolog in your PHP application. Configure it to log to files, but more importantly, to a log aggregation system.
# config/packages/monolog.yaml (Symfony example)
monolog:
    handlers:
        # File handler for local debugging (optional)
        main:
            type: stream
            path: '%kernel.logs_dir%/%kernel.environment%.log'
            level: debug
            channels: ['!event', '!doctrine']
        # Elasticsearch handler for centralized logging.
        # Handler type and key names vary between MonologBundle versions;
        # check the bundle's configuration reference for the version you run.
        elasticsearch:
            type: elastic_search
            elasticsearch:
                host: 'your-elasticsearch-host'
                port: 9200
                # Optional: Authentication
                # user: 'elastic'
                # password: 'changeme'
            index: 'php-app-logs' # Static name; Monolog does not expand Logstash-style date patterns
            level: info # Adjust log level as needed
Monolog’s Elasticsearch handler formats records with its own Elasticsearch formatter, so logs arrive as structured documents that Elasticsearch can easily index and query; a separate JSON formatter is not needed for this handler. Ensure you have Monolog and the Elasticsearch client installed (`composer require monolog/monolog elasticsearch/elasticsearch`). If you need daily indices, route logs through Logstash or an ingest pipeline, which can derive the index name from the event timestamp.
Elasticsearch Slow Logs and Audit Logs
Configure Elasticsearch to log slow queries and, if necessary, audit logs. These are invaluable for performance tuning and security analysis.
Slow log thresholds are index-level settings, so they are applied per index through the index settings API rather than in elasticsearch.yml. Audit logging is a node-level setting and does belong in elasticsearch.yml:
# Index settings (apply with PUT /<index>/_settings)
# Log queries taking longer than 10 seconds
index.search.slowlog.threshold.query.warn: 10s

# elasticsearch.yml
# Audit logs (requires a license that includes the security audit feature)
# xpack.security.audit.enabled: true
# xpack.security.audit.logfile.events.include: ["access_denied", "authentication_failed", "connection_denied"]
These logs should also be shipped to your central logging system for aggregation and analysis alongside your application logs. Metricbeat collects metrics only, so use Filebeat for this; its elasticsearch module includes filesets for slow logs and audit logs.
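For the slow log side, a minimal Python sketch of the request body for the index settings API might look as follows. The helper name build_slowlog_settings and the threshold values are illustrative assumptions; the setting keys are Elasticsearch's index-level slow log settings.

```python
import json


def build_slowlog_settings(query_warn="10s", query_info="5s", index_warn="10s"):
    """Return the JSON body for PUT /<index>/_settings that sets slow log thresholds."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        "index.indexing.slowlog.threshold.index.warn": index_warn,
    }


if __name__ == "__main__":
    # Send the printed body with PUT /<index>/_settings using your HTTP client of choice.
    print(json.dumps(build_slowlog_settings(), indent=2))
```

Because these settings are dynamic, they can be tightened temporarily while investigating a performance problem and relaxed again afterwards without restarting any node.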
Conclusion: A Holistic Approach
Effective server monitoring for a PHP application and its Elasticsearch cluster on OVH is not a single tool or configuration. It’s a holistic strategy encompassing application-level health checks, deep infrastructure metrics, network performance, and centralized logging. By implementing these advanced practices, you move from reactive firefighting to proactive system management, ensuring the stability and performance of your critical services.