Server Monitoring Best Practices: Keeping Your Perl App and MongoDB Clusters Alive on AWS
Establishing a Robust Monitoring Baseline for Perl Applications on AWS EC2
Maintaining the health and performance of Perl applications deployed on AWS EC2 instances requires a multi-layered monitoring strategy. Beyond basic CPU and memory utilization, we need to delve into application-specific metrics and system-level diagnostics that directly impact user experience and service availability. This section outlines essential checks and configurations.
System-Level Metrics with `collectd` and CloudWatch Agent
While CloudWatch provides fundamental EC2 metrics, a more granular view is often necessary. We’ll leverage `collectd` for detailed system statistics and then push these to CloudWatch for centralized visibility and alarming. The CloudWatch Agent can also be configured to collect logs and custom metrics.
First, install `collectd` on your EC2 instances:
sudo apt-get update && sudo apt-get install collectd collectd-utils -y
# Or for RHEL/CentOS:
# sudo yum install epel-release -y
# sudo yum install collectd collectd-plugins -y
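Next, configure `collectd` to collect relevant metrics. A common configuration involves the `cpu`, `memory`, `disk`, and `interface` plugins. We’ll also load a `write_cloudwatch` output plugin to send data to AWS CloudWatch. Note that upstream `collectd` does not ship a plugin by that name, so this assumes a CloudWatch-capable writer (for example, the AWS Labs `collectd-cloudwatch` plugin) is installed; adjust the plugin name and its options to match whatever you install.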
# /etc/collectd/collectd.conf
LoadPlugin cpu
LoadPlugin memory
LoadPlugin disk
LoadPlugin interface
LoadPlugin write_cloudwatch
<Plugin cpu>
  # Per-core, per-state values, reported as percentages
  ReportByCpu true
  ReportByState true
  ValuesPercentage true
</Plugin>
<Plugin memory>
  ValuesAbsolute true
  ValuesPercentage true
</Plugin>
<Plugin disk>
  # List only the devices that exist on your instance type
  Disk "sda"
  Disk "xvda"
  Disk "nvme0n1"
  IgnoreSelected false
</Plugin>
<Plugin interface>
  Interface "eth0"
  Interface "ens5"
  IgnoreSelected false
</Plugin>
<Plugin write_cloudwatch>
  Region "us-east-1"       # Replace with your AWS region
  Namespace "EC2/PerlApp"  # Custom namespace for your application
  # The IAM role attached to the EC2 instance needs permission
  # for 'cloudwatch:PutMetricData'
</Plugin>
Ensure your EC2 instance has an IAM role attached with permissions to write to CloudWatch. The `write_cloudwatch` plugin will automatically pick up credentials from the instance metadata or environment variables.
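As a rough illustration of the permission involved, the following Python sketch builds a minimal IAM policy document that allows only `cloudwatch:PutMetricData` and narrows it to the custom namespace with the `cloudwatch:namespace` condition key. The function name `build_put_metric_policy` is hypothetical, and the namespace is the one used in the configuration above; treat it as a starting point rather than a vetted policy.

```python
import json


def build_put_metric_policy(namespace):
    """Return an IAM policy document allowing PutMetricData for one namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                # PutMetricData does not support resource-level permissions,
                # so the namespace condition is what narrows the grant.
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"cloudwatch:namespace": namespace}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be pasted into an inline role policy.
    print(json.dumps(build_put_metric_policy("EC2/PerlApp"), indent=2))
```

Scoping by namespace means a compromised instance can publish metrics only under `EC2/PerlApp`, not overwrite metrics belonging to other services.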
Restart `collectd` to apply the changes:
sudo systemctl restart collectd
# Or for older systems:
# sudo service collectd restart
Application-Specific Perl Metrics
For Perl applications, we need to monitor aspects like request latency, error rates, and worker process health. A common approach is to expose these metrics via an HTTP endpoint that `collectd` can scrape. The `httpcsv` plugin named in this section is not part of upstream `collectd`; the stock `curl` plugin, which fetches a URL and extracts values with regular expressions, is the closest equivalent. Alternatively, expose the metrics in the Prometheus text format so that a Prometheus server can scrape them, and forward them to CloudWatch with an exporter if needed.
Here’s a simplified example of a Perl script exposing metrics:
#!/usr/bin/env perl
use strict;
use warnings;

package PerlApp::Server;
use base qw(HTTP::Server::Simple::CGI);
use Time::HiRes qw(time);

# In-process counters; they reset whenever the server restarts.
my %metrics = (
    requests_total => 0,
    errors_total   => 0,
    latency_sum_ms => 0,
    latency_count  => 0,
);

sub handle_request {
    my ($self, $cgi) = @_;

    # Serve the metrics endpoint without counting it as application traffic.
    if ($cgi->path_info eq '/metrics') {
        my $avg = $metrics{latency_count}
            ? $metrics{latency_sum_ms} / $metrics{latency_count}
            : 0;
        # HTTP::Server::Simple::CGI expects the handler to print the status line.
        print "HTTP/1.0 200 OK\r\n";
        print $cgi->header(-type => 'text/plain');
        # Format: metric_name:value, one pair per line
        print "requests_total:$metrics{requests_total}\n";
        print "errors_total:$metrics{errors_total}\n";
        printf "latency_avg_ms:%.2f\n", $avg;
        return;
    }

    my $start_time = time;
    $metrics{requests_total}++;    # count every request, failed or not

    # Simulate application logic with a 5% chance of error
    my ($code, $reason) = (200, 'OK');
    if (rand() < 0.05) {
        $metrics{errors_total}++;
        ($code, $reason) = (500, 'Internal Server Error');
        warn "Simulated error\n";
    }

    my $duration = time - $start_time;
    $metrics{latency_sum_ms} += $duration * 1000;
    $metrics{latency_count}++;

    print "HTTP/1.0 $code $reason\r\n";
    print $cgi->header(-type => 'text/html');
    print $cgi->start_html('Perl App');
    print $cgi->p(sprintf('Request processed in %.2f ms.', $duration * 1000));
    print $cgi->end_html;
}

# Upstream collectd has no "httpcsv" plugin; the stock curl plugin can scrape
# this endpoint instead. Example snippet (requires "LoadPlugin curl"; the type
# and instance names are illustrative):
# <Plugin curl>
#   <Page "perl_app">
#     URL "http://localhost:8080/metrics"
#     <Match>
#       Regex "requests_total:([0-9]+)"
#       DSType "DeriveSet"
#       Type "derive"
#       Instance "requests_total"
#     </Match>
#     <Match>
#       Regex "latency_avg_ms:([0-9.]+)"
#       DSType "GaugeLast"
#       Type "gauge"
#       Instance "latency_avg_ms"
#     </Match>
#   </Page>
# </Plugin>

package main;

# Listen on port 8080; run() blocks and serves requests in the foreground.
PerlApp::Server->new(8080)->run;
To integrate this with `collectd`, add a scraping plugin to your `collectd.conf` and point it at the `/metrics` endpoint of your Perl application. Upstream `collectd` has no `httpcsv` plugin; the stock `curl` plugin can extract each value with a regular expression, which is why the endpoint emits one `name:value` pair per line.
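To make the `name:value` format concrete, here is a minimal Python sketch of how a consumer might parse the endpoint’s output and derive an error rate. The helper names `parse_metrics` and `error_rate` are hypothetical, and the rate calculation assumes `requests_total` counts every request, including failed ones.

```python
def parse_metrics(body):
    """Parse 'name:value' lines into a dict of floats, skipping malformed lines."""
    metrics = {}
    for line in body.splitlines():
        name, sep, raw = line.strip().partition(":")
        if not sep or not name:
            continue  # not a name:value pair
        try:
            metrics[name] = float(raw)
        except ValueError:
            continue  # ignore values that are not numeric
    return metrics


def error_rate(metrics):
    """Errors as a fraction of all requests; 0.0 when nothing has been served."""
    total = metrics.get("requests_total", 0.0)
    return metrics.get("errors_total", 0.0) / total if total else 0.0


if __name__ == "__main__":
    sample = "requests_total:200\nerrors_total:10\nlatency_avg_ms:12.50\n"
    parsed = parse_metrics(sample)
    print(parsed, error_rate(parsed))
```

Skipping malformed lines rather than raising keeps a scraper alive if the application ever prints something unexpected on the metrics page.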
Log Aggregation and Analysis
Application logs are critical for debugging. We’ll use the CloudWatch Agent to collect Perl application logs and send them to CloudWatch Logs. This allows for centralized searching, filtering, and alarming on specific error patterns.
First, ensure the CloudWatch Agent is installed and configured. The agent’s configuration file (typically `/opt/aws/amazon-cloudwatch-agent/bin/config.json`) needs to specify log file locations.
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/perl_app/app.log",
            "log_group_name": "PerlApp/ApplicationLogs",
            "log_stream_name": "{instance_id}/app",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/perl_app/error.log",
            "log_group_name": "PerlApp/ApplicationErrors",
            "log_stream_name": "{instance_id}/errors",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
After updating the agent configuration, restart the agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
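Before reloading the agent, it can help to sanity-check the configuration file. The following is a minimal Python sketch, assuming the log entries live under `logs.logs_collected.files.collect_list` as in the agent’s documented schema; the helper name `list_log_sources` is hypothetical, and the checks are far from a full schema validation.

```python
import json


def list_log_sources(config_text):
    """Return (file_path, log_group_name) pairs from a CloudWatch Agent config.

    Raises ValueError when the logs section is missing or an entry lacks
    a file_path, the field this sketch treats as mandatory.
    """
    config = json.loads(config_text)
    try:
        entries = config["logs"]["logs_collected"]["files"]["collect_list"]
    except KeyError as exc:
        raise ValueError(f"missing section: {exc}") from exc
    sources = []
    for entry in entries:
        if "file_path" not in entry:
            raise ValueError(f"entry without file_path: {entry}")
        sources.append((entry["file_path"], entry.get("log_group_name", "")))
    return sources


if __name__ == "__main__":
    # Path taken from the text above; adjust if your config lives elsewhere.
    with open("/opt/aws/amazon-cloudwatch-agent/bin/config.json") as handle:
        for path, group in list_log_sources(handle.read()):
            print(f"{path} -> {group}")
```

A check like this catches a misnamed section key before the agent silently ignores it.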
Monitoring MongoDB Clusters on AWS with CloudWatch and Percona Monitoring and Management (PMM)
Managing MongoDB clusters, especially in a distributed environment like AWS, demands robust monitoring for performance, availability, and resource consumption. We’ll combine AWS native tools with specialized solutions like Percona Monitoring and Management (PMM) for deep insights.
Leveraging CloudWatch for MongoDB Instance Metrics
AWS provides basic EC2 metrics for instances running MongoDB. However, for specific MongoDB metrics, we need to push custom metrics. The CloudWatch Agent can be configured to collect these.
We can use the `mongostat` and `mongotop` utilities to gather real-time statistics. These can be polled periodically and pushed as custom metrics to CloudWatch. A Python script is well-suited for this task.
import json
import subprocess
from datetime import datetime, timezone

import boto3

# Configure your AWS region and MongoDB connection details
AWS_REGION = "us-east-1"
MONGODB_HOST = "localhost"  # Or your MongoDB instance's IP/hostname
MONGODB_PORT = "27017"
NAMESPACE = "MongoDB/Cluster"

cloudwatch = boto3.client("cloudwatch", region_name=AWS_REGION)

# Multipliers for the size suffixes mongostat prints (e.g. "158b", "39.4k", "82.0M")
SIZE_SUFFIXES = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_count(value):
    """Convert a mongostat counter such as "*0" or "12" to an int."""
    return int(str(value).lstrip("*") or 0)


def parse_pair(value):
    """Split a pair such as "2|0" into two ints."""
    first, _, second = str(value).partition("|")
    return int(first or 0), int(second or 0)


def parse_size(value):
    """Convert a mongostat size such as "39.4k" to a number of bytes."""
    text = str(value).strip().lower()
    if text and text[-1] in SIZE_SUFFIXES:
        return float(text[:-1]) * SIZE_SUFFIXES[text[-1]]
    return float(text or 0)


def get_mongo_stats():
    """Take one mongostat and one mongotop sample; return {name: (value, unit)}."""
    stats = {}
    try:
        # With --json, mongostat prints one object per sample, keyed by host,
        # whose values are strings. Field names can vary between tool versions,
        # so missing keys fall back to zero.
        result = subprocess.run(
            ["mongostat", "--host", MONGODB_HOST, "--port", MONGODB_PORT,
             "--rowcount", "1", "--json"],
            capture_output=True, text=True, check=True, timeout=30,
        )
        sample = json.loads(result.stdout.strip().splitlines()[-1])
        row = next(iter(sample.values()), {})
        for op in ("insert", "query", "update", "delete", "getmore"):
            stats[f"{op}_per_sec"] = (parse_count(row.get(op, 0)), "Count/Second")
        stats["command_per_sec"] = (parse_pair(row.get("command", "0|0"))[0], "Count/Second")
        stats["flushes"] = (parse_count(row.get("flushes", 0)), "Count")
        # Queued (qrw) and active (arw) readers|writers are gauges, not rates
        stats["qr"], stats["qw"] = [(v, "Count") for v in parse_pair(row.get("qrw", "0|0"))]
        stats["ar"], stats["aw"] = [(v, "Count") for v in parse_pair(row.get("arw", "0|0"))]
        stats["net_in_bytes_per_sec"] = (parse_size(row.get("net_in", 0)), "Bytes/Second")
        stats["net_out_bytes_per_sec"] = (parse_size(row.get("net_out", 0)), "Bytes/Second")
        stats["res_bytes"] = (parse_size(row.get("res", 0)), "Bytes")
        stats["dirty_percent"] = (float(str(row.get("dirty", "0")).rstrip("%") or 0), "Percent")

        # mongotop reports time spent per namespace. --rowcount 1 makes it exit
        # after a single one-second sample instead of running forever.
        result_top = subprocess.run(
            ["mongotop", "--host", MONGODB_HOST, "--port", MONGODB_PORT,
             "--json", "--rowcount", "1", "1"],
            capture_output=True, text=True, check=True, timeout=30,
        )
        top = json.loads(result_top.stdout.strip().splitlines()[-1])
        # Simplified: sum the total time across all namespaces
        total_time_ms = sum(
            ns.get("total", {}).get("time", 0) for ns in top.get("totals", {}).values()
        )
        stats["total_op_time_ms"] = (total_time_ms, "Milliseconds")
    except subprocess.CalledProcessError as e:
        print(f"Error running mongostat/mongotop: {e}")
        return None
    except json.JSONDecodeError:
        print("Error decoding JSON output from mongostat/mongotop.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
    return stats


def put_metrics(stats):
    """Publish the collected values to the custom CloudWatch namespace."""
    if not stats:
        return
    timestamp = datetime.now(timezone.utc)
    metric_data = [
        {"MetricName": name, "Value": value, "Unit": unit, "Timestamp": timestamp}
        for name, (value, unit) in stats.items()
    ]
    try:
        cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=metric_data)
        print(f"Successfully put {len(metric_data)} metrics to CloudWatch.")
    except Exception as e:
        print(f"Error putting metrics to CloudWatch: {e}")


if __name__ == "__main__":
    mongo_stats = get_mongo_stats()
    put_metrics(mongo_stats)
This script can be scheduled to run periodically (e.g., via cron) and will push key MongoDB operational metrics to a custom CloudWatch namespace. You can then create CloudWatch Alarms based on these metrics.
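As one way to define such an alarm, the sketch below builds the keyword arguments for CloudWatch’s `put_metric_alarm` call in a plain function, so the parameters can be reviewed or tested without AWS access. The helper names, the `dirty_percent` threshold of 20, and the five-minute evaluation window are illustrative assumptions, not recommendations from the original setup.

```python
def build_alarm_params(metric_name, threshold, namespace="MongoDB/Cluster",
                       period=60, evaluation_periods=5, sns_topic_arn=None):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    params = {
        "AlarmName": f"{namespace}-{metric_name}-high".replace("/", "-"),
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # If the cron job stops publishing, treat the gap as a problem too.
        "TreatMissingData": "breaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def create_alarm(params, region="us-east-1"):
    """Create or update the alarm; needs AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without boto3
    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


if __name__ == "__main__":
    # Illustrative values only: alarm when the dirty cache averages above 20%.
    print(build_alarm_params("dirty_percent", 20))
```

Setting `TreatMissingData` to `breaching` is a deliberate choice here: a cron-driven publisher that dies quietly would otherwise leave the alarm in an `INSUFFICIENT_DATA` state instead of alerting.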
Implementing Percona Monitoring and Management (PMM)
For a more comprehensive and integrated monitoring solution, Percona Monitoring and Management (PMM) is an excellent choice. It provides deep visibility into MongoDB performance, query analysis, and cluster health.
PMM consists of a server component and client agents. The server can be deployed on an EC2 instance or as a container. The client agents are installed on your MongoDB nodes.
PMM Server Deployment (Docker Example)
Deploying PMM Server using Docker on an EC2 instance is straightforward. Ensure your EC2 instance has sufficient resources (CPU, RAM, disk) and security groups configured to allow access.
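# Install Docker and the Compose plugin on your EC2 instance
# (the docker-ce packages come from Docker's own apt repository, which must be configured first)
sudo apt-get update && sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y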
# Or for RHEL/CentOS:
# sudo yum install -y yum-utils
# sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# sudo yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y
# sudo systemctl start docker
# sudo systemctl enable docker
# Create a directory for PMM configuration and data
mkdir pmm-server
cd pmm-server
# Create a docker-compose.yml file
cat <<EOF > docker-compose.yml
version: '3'
services:
  pmm-server:
    # Official PMM Server 2.x image; pin a specific tag for production use
    image: percona/pmm-server:2
    container_name: pmm-server
    restart: always
    ports:
      # The PMM web UI and API are served over HTTP/HTTPS
      - "80:80"
      - "443:443"
    volumes:
      # PMM Server 2.x keeps its state under /srv, so one named volume is enough
      - pmm-data:/srv