Server Monitoring Best Practices: Keeping Your C App and MySQL Clusters Alive on AWS
Proactive C Application Health Checks with Systemd and Prometheus
Maintaining the health of a C application, especially one serving critical traffic, requires more than just basic process monitoring. We need to implement deep health checks that go beyond mere existence and verify functional integrity. For applications running on AWS EC2 instances, leveraging systemd for service management and Prometheus for metrics collection provides a robust, scalable solution.
The core idea is to create a systemd service that periodically executes a health check script. This script will perform application-specific checks and exit with a non-zero status code if any check fails. Prometheus, configured with `node_exporter` and a custom exporter or a direct scrape target, can then query this status.
1. Systemd Service Unit for Health Checks
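First, let’s define a systemd service that runs our health check script, plus a timer that schedules it. The service uses `Type=oneshot`, and its `ExecStart` points to our custom health check script. Scheduling directives such as `OnCalendar` and `Persistent` are only valid in a `.timer` unit, so they go in a separate file rather than in the `[Service]` section.
Create a file named /etc/systemd/system/my_c_app_healthcheck.service:
[Unit]
Description=My C Application Health Check
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/my_c_app_healthcheck.sh
User=appuser
Group=appgroup
StandardOutput=journal
StandardError=journal
Then create /etc/systemd/system/my_c_app_healthcheck.timer, which triggers the service of the same name:
[Unit]
Description=Run My C Application Health Check every 5 minutes

[Timer]
# Run every 5 minutes
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
The schedule only becomes active once the timer itself is enabled (`systemctl enable --now my_c_app_healthcheck.timer`).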
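Next, create the health check script /usr/local/bin/my_c_app_healthcheck.sh. This script should be executable by `appuser`, and `appuser` must be able to write to the log file it uses (for example, create /var/log/my_c_app_healthcheck.log in advance and chown it to `appuser`).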
#!/bin/bash
APP_PORT=8080
APP_HOST="127.0.0.1"
HEALTH_CHECK_PATH="/healthz"
LOG_FILE="/var/log/my_c_app_healthcheck.log"
# --- Application Specific Checks ---
# 1. Check if the application process is running
if ! pgrep -x "my_c_app" > /dev/null
then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: My C application process is not running." | tee -a "$LOG_FILE"
    exit 1
fi
# 2. Check network connectivity to the application port
if ! nc -z "$APP_HOST" "$APP_PORT" &> /dev/null
then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: Cannot connect to My C application on $APP_HOST:$APP_PORT." | tee -a "$LOG_FILE"
    exit 1
fi
# 3. Perform an HTTP health check (if applicable)
# This assumes your C app serves an HTTP endpoint like /healthz
# --max-time bounds the whole request, so a hung application cannot hang the check
if ! curl --fail --silent --connect-timeout 5 --max-time 10 "http://${APP_HOST}:${APP_PORT}${HEALTH_CHECK_PATH}" &> /dev/null
then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: HTTP health check failed for http://${APP_HOST}:${APP_PORT}${HEALTH_CHECK_PATH}." | tee -a "$LOG_FILE"
    exit 1
fi
# 4. Add more application-specific checks here (e.g., database connection, queue depth)
# Example: Check if a critical file exists
# if [ ! -f "/path/to/critical/data.file" ]; then
#     echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: Critical data file missing." | tee -a "$LOG_FILE"
#     exit 1
# fi
echo "$(date '+%Y-%m-%d %H:%M:%S') - SUCCESS: All health checks passed." | tee -a "$LOG_FILE"
exit 0
Make the script executable and owned by `appuser`:
sudo chown appuser:appgroup /usr/local/bin/my_c_app_healthcheck.sh
sudo chmod +x /usr/local/bin/my_c_app_healthcheck.sh
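Reload systemd, then enable and start the timer. Scheduling directives such as `OnCalendar` only take effect in a `.timer` unit, so it is the timer (my_c_app_healthcheck.timer), not the oneshot service, that gets enabled. Starting the service directly is still useful for a one-off manual run:
sudo systemctl daemon-reload
sudo systemctl enable --now my_c_app_healthcheck.timer
sudo systemctl start my_c_app_healthcheck.service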
2. Exposing Health Check Status to Prometheus
We need a way for Prometheus to query the status of our health check. The simplest approach is to use `node_exporter`’s textfile collector. This collector reads `.prom` files from a specified directory and exposes their contents as metrics.
First, ensure `node_exporter` is installed and running on your EC2 instances. Configure it to use a textfile collector directory, e.g., /var/lib/node_exporter/textfile_collector.
Modify your `node_exporter` systemd service file (often found at /etc/systemd/system/node_exporter.service) to include the --collector.textfile.directory flag:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

[Install]
WantedBy=multi-user.target
Reload systemd and restart `node_exporter`:
sudo systemctl daemon-reload
sudo systemctl restart node_exporter
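Now, modify our health check script to write a metric file in the textfile collector directory that indicates the health status. Because the script exits as soon as any check fails, code appended after the checks would never run on failure. An EXIT trap runs on every exit path and can see the status the script is finishing with.
Update /usr/local/bin/my_c_app_healthcheck.sh to include the following right after the variable definitions at the top (the checks below it stay unchanged):
# ... (variable definitions) ...
# Write metric to node_exporter textfile collector
HEALTH_METRIC_FILE="/var/lib/node_exporter/textfile_collector/my_c_app_health.prom"
write_health_metric() {
    local exit_status=$? # Exit status the script is finishing with
    local value=0 # Health check failed
    if [ "$exit_status" -eq 0 ]; then
        value=1 # Health check passed
    fi
    # Write to a temporary file and rename it, so node_exporter never reads a partial file
    local tmp_file="${HEALTH_METRIC_FILE}.$$"
    {
        echo "# HELP my_c_app_health_status Status of the C application health check (1 for healthy, 0 for unhealthy)"
        echo "# TYPE my_c_app_health_status gauge"
        echo "my_c_app_health_status ${value}"
    } > "$tmp_file" && mv "$tmp_file" "$HEALTH_METRIC_FILE"
}
# Runs on every exit, including the `exit 1` of each failed check,
# and leaves the script's exit status untouched so systemd still sees the failure
trap write_health_metric EXIT
# ... (health checks from the previous section) ...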
Ensure `appuser` has write permissions to the textfile collector directory. Adding `appuser` to the `node_exporter` group only helps if the directory is also group-writable, so the two steps go together.
# Example: add appuser to the node_exporter group (if node_exporter runs as the node_exporter user)
sudo usermod -aG node_exporter appuser
# Make the directory writable by that group
sudo chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector
sudo chmod 775 /var/lib/node_exporter/textfile_collector
After the next run of the systemd service, you should be able to query http://<your-ec2-private-ip>:9100/metrics and see lines like:
# HELP my_c_app_health_status Status of the C application health check (1 for healthy, 0 for unhealthy)
# TYPE my_c_app_health_status gauge
my_c_app_health_status 1
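The textfile flow in this section (write a `.prom` file, let `node_exporter` expose it, read the value back) can also be sketched in Python, for example if the health check is later ported away from bash. This is a minimal sketch under stated assumptions: the function names are illustrative and not part of any library, and only the metric and file names come from this article.

```python
import os
import tempfile
from typing import Optional

METRIC_NAME = "my_c_app_health_status"  # metric name used in this article


def render_metric(healthy: bool) -> str:
    """Build the exposition-format text for the health gauge."""
    return (
        f"# HELP {METRIC_NAME} Status of the C application health check "
        "(1 for healthy, 0 for unhealthy)\n"
        f"# TYPE {METRIC_NAME} gauge\n"
        f"{METRIC_NAME} {1 if healthy else 0}\n"
    )


def write_metric_atomically(directory: str, healthy: bool) -> str:
    """Write the .prom file via a temp file plus rename, so a scrape
    never sees a half-written file. Returns the final path."""
    final_path = os.path.join(directory, "my_c_app_health.prom")
    # The temp name must not end in .prom, or the collector could pick it up.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metric(healthy))
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; node_exporter must be able to read it
    os.replace(tmp_path, final_path)
    return final_path


def read_metric(exposition_text: str) -> Optional[int]:
    """Extract the gauge value from /metrics-style text, or None if absent."""
    for line in exposition_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        parts = line.split()
        if len(parts) >= 2 and parts[0] == METRIC_NAME:
            return int(float(parts[1]))
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        path = write_metric_atomically(demo_dir, healthy=True)
        with open(path) as handle:
            print(read_metric(handle.read()))
```

The rename step matters because `node_exporter` may read the directory at any moment; a file that is replaced in one step is either the old version or the new one, never a fragment.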
3. Prometheus Configuration and Alerting
In your Prometheus configuration (prometheus.yml), ensure you have a scrape job targeting your EC2 instances:
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['ec2-instance-1:9100', 'ec2-instance-2:9100', ...] # Use private IPs or DNS names
    # If using service discovery (e.g., EC2 SD), this would be dynamic.
Now, define an alerting rule in Prometheus (e.g., in alerts.yml) to notify you when the application is unhealthy:
groups:
  - name: my_c_app_alerts
    rules:
      - alert: MyCAppUnhealthy
        expr: my_c_app_health_status == 0
        for: 5m # Alert only if unhealthy for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "My C Application is unhealthy on {{ $labels.instance }}"
          description: "The health check for My C Application has failed for 5 minutes on instance {{ $labels.instance }}."
Make sure the rules file is listed under `rule_files` in prometheus.yml, then reload the Prometheus configuration for the new rules to take effect.
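The effect of the `for: 5m` clause is easy to misread: the alert does not fire on the first failed evaluation but only after the expression has stayed true for the whole window, and any healthy evaluation in between resets the clock. The following is a simplified model of that behavior, not Prometheus's actual implementation; the class and method names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Tracks one alert through inactive -> pending -> firing."""

    for_seconds: float
    pending_since: Optional[float] = None

    def evaluate(self, now: float, condition_true: bool) -> str:
        """Return the alert state after one rule evaluation at time `now`."""
        if not condition_true:
            # A single healthy evaluation resets the window.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    alert = PendingAlert(for_seconds=300)
    # Hypothetical evaluations one minute apart while the metric reads 0.
    for second in range(0, 361, 60):
        print(second, alert.evaluate(second, condition_true=True))
```

Since the health check in this article runs every five minutes and the textfile metric keeps its last value between runs, a single failed run can keep the expression true for a full window; lengthen `for` or shorten the timer interval if that is too sensitive.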
Monitoring MySQL Cluster Health with Percona Monitoring and Management (PMM)
For MySQL clusters on AWS, especially those requiring high availability and performance tuning, a dedicated monitoring solution is essential. Percona Monitoring and Management (PMM) is an excellent open-source choice that provides deep insights into MySQL performance, availability, and query analysis.
1. Deploying Percona Monitoring and Management (PMM) Server
The PMM server can be deployed on an EC2 instance. For production, it’s recommended to use the Docker deployment for easier management and upgrades.
Launch an EC2 instance (e.g., `t3.medium` or larger) with sufficient storage. Install Docker and Docker Compose.
Create a docker-compose.yml file. PMM Server 2 serves its UI, Grafana, and API through ports 80 and 443, so those are the only ports that need publishing, and it keeps its data under /srv:
version: '3.7'
services:
  pmm-server:
    image: percona/pmm-server:2
    container_name: pmm-server
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - pmm-server-data:/srv
    environment:
      # Optional: these are read by an nginx-proxy / Let's Encrypt companion placed in front of PMM, not by PMM itself
      - VIRTUAL_HOST=pmm.yourdomain.com
      - LETSENCRYPT_HOST=pmm.yourdomain.com
      - LETSENCRYPT_EMAIL=<your-email-address>
    networks:
      - pmm-network
volumes:
  pmm-server-data:
networks:
  pmm-network:
    driver: bridge
Run PMM Server:
docker-compose up -d
Access the PMM UI at http://<pmm-server-ec2-public-ip>. The default credentials are usually admin/admin, which you’ll be prompted to change.
2. Adding MySQL Instances to PMM
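PMM collects data through the PMM client (`pmm-agent`), which runs exporters such as `mysqld_exporter` for each monitored MySQL service.
For each EC2-hosted MySQL node in your cluster, install the PMM client on the node and register it with the PMM server. RDS instances do not allow installing software, so add them as remote instances instead, either through the PMM UI or from another host that runs the PMM client.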
On each MySQL node, install the PMM client:
# Example for Ubuntu/Debian
wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
sudo apt-get update
sudo apt-get install pmm2-client -y
# Example for CentOS/RHEL
sudo rpm -Uvh https://repo.percona.com/yum/percona-release-latest.noarch.rpm
sudo yum update
sudo yum install pmm2-client -y
Register the client with your PMM server:
pmm-admin config --server-insecure-tls --server-url=https://admin:<admin-password>@<pmm-server-ec2-public-ip>:443
Add your MySQL instance. You’ll need to provide credentials for PMM to connect to MySQL. It’s best practice to create a dedicated monitoring user with minimal privileges.
# Create a monitoring user in MySQL (run this on your MySQL server)
# 'pmm'@'localhost' only matches local connections; use a different host part if the client connects over the network
CREATE USER 'pmm'@'localhost' IDENTIFIED BY 'your_strong_password';
GRANT PROCESS, REPLICATION CLIENT, SELECT, SHOW DATABASES, SHOW VIEW ON *.* TO 'pmm'@'localhost';
FLUSH PRIVILEGES;
# Add the MySQL instance via pmm-admin (run this in a shell, not in the MySQL client)
pmm-admin add mysql --host <mysql-instance-ip> --port 3306 --username pmm --password 'your_strong_password' --service-name "my-mysql-cluster-node-1"
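Repeat this for all nodes in your MySQL cluster, giving each node a unique service name and optionally grouping them with the `--cluster` and `--replication-set` flags. PMM then reports replication status per node, so primary and secondary roles can be read from the replication dashboards.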
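With more than a couple of nodes, typing the `pmm-admin add mysql` command by hand invites typos in service names. A small helper can generate one command per node. This is a sketch under stated assumptions: the node addresses are hypothetical, the helper name is illustrative, and it only prints commands (with a placeholder password) for you to run on each node rather than executing anything.

```python
import shlex
from typing import List


def build_add_mysql_command(host: str, service_name: str, password: str,
                            username: str = "pmm", port: int = 3306) -> List[str]:
    """Return the argument list for one `pmm-admin add mysql` call."""
    return [
        "pmm-admin", "add", "mysql",
        "--host", host,
        "--port", str(port),
        "--username", username,
        "--password", password,
        "--service-name", service_name,
    ]


if __name__ == "__main__":
    # Hypothetical node addresses; replace with your own.
    nodes = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]
    for index, host in enumerate(nodes, start=1):
        argv = build_add_mysql_command(
            host, f"my-mysql-cluster-node-{index}", "<password>"
        )
        # shlex.join quotes each argument so the line is safe to paste into a shell.
        print(shlex.join(argv))
```

Building the command as a list rather than a string keeps quoting correct even when a password contains spaces or shell metacharacters.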
3. Monitoring Cluster Health and Performance
Once instances are added, PMM will start collecting metrics. Navigate to the PMM UI:
- Dashboards: Explore the pre-built dashboards for MySQL, including “MySQL Overview,” “MySQL InnoDB,” and “MySQL Performance Schema.” These provide deep dives into query performance, replication lag, connection usage, buffer pool efficiency, and more.
- Query Analytics: The Query Analytics tab is invaluable for identifying slow or problematic queries. You can filter by host, user, query text, and performance metrics.
- Alerting: PMM integrates with Alertmanager. You can configure alerts for critical conditions like:
  - High replication lag
  - MySQL server down (unreachable)
  - High CPU/Memory usage on the MySQL instance
  - Low disk space
  - Excessive query times
To configure alerts, you’ll typically define rules within PMM’s Grafana instance or directly in Prometheus if you’re using PMM’s exporters but managing alerts externally.
Example Alert Rule (within PMM’s Grafana or Prometheus):
# Alert for high replication lag on a specific node
- alert: MySQLReplicationLagHigh
  expr: mysql_slave_status_seconds_behind_master > 60 # Replica lag in seconds from mysqld_exporter; verify the metric name in your PMM version
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High MySQL replication lag on {{ $labels.instance }}"
    description: "Replication lag on {{ $labels.instance }} has been above 60 seconds for 10 minutes. Current lag: {{ $value }}s."
# Alert for MySQL server being down (if the exporter can't connect)
- alert: MySQLServerDown
  expr: mysql_up == 0 # mysqld_exporter reports 0 when it cannot reach MySQL
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "MySQL server is down on {{ $labels.instance }}"
    description: "The MySQL server on {{ $labels.instance }} has been unreachable for 5 minutes."
Ensure each PMM client can reach the PMM server on port 443, that the host running the exporter can reach MySQL (port 3306 by default), and that the corresponding ports are open in your AWS Security Groups.
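Security Group mistakes usually show up as silent timeouts, so it can save time to test reachability before debugging PMM itself. The following is a minimal sketch using only the standard library; the function names are illustrative and the targets in the demo are placeholders to replace with your own hosts.

```python
import socket
from typing import Dict, List, Tuple


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_targets(targets: List[Tuple[str, int]],
                  timeout: float = 3.0) -> Dict[str, bool]:
    """Check several host/port pairs and report each result by 'host:port'."""
    return {f"{host}:{port}": port_is_open(host, port, timeout)
            for host, port in targets}


if __name__ == "__main__":
    # Placeholder targets: the PMM server's HTTPS port and a MySQL port.
    results = check_targets([("127.0.0.1", 443), ("127.0.0.1", 3306)], timeout=1.0)
    for target, reachable in results.items():
        print(f"{target}: {'open' if reachable else 'unreachable'}")
```

Run it from the machine that initiates the connection (the MySQL node for port 443 on the PMM server, the exporter host for port 3306), since Security Group rules depend on the source.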
Integrating C App and MySQL Monitoring with CloudWatch
While Prometheus and PMM provide deep, granular monitoring, AWS CloudWatch offers centralized logging, metrics, and alarms across your entire AWS infrastructure. Integrating these tools provides a holistic view.
1. Sending C App Logs to CloudWatch Logs
Ensure your C application logs to standard output/error or a designated file. Use the CloudWatch Agent to stream these logs to CloudWatch Logs.
Install the CloudWatch Agent on your EC2 instances. Configure it by running the wizard at /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard or by creating a configuration file.
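Example CloudWatch Agent configuration (config.json). Each of the `cpu`, `disk`, and `mem` sections needs a `measurement` list naming the metrics to collect, the `statsd` address is given as ip:port (the listener is UDP only), and the `cwagent` user must be able to read the listed log files:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_c_app.log",
            "log_group_name": "my-c-app-logs",
            "log_stream_name": "{instance_id}/my_c_app"
          },
          {
            "file_path": "/var/log/my_c_app_healthcheck.log",
            "log_group_name": "my-c-app-healthcheck-logs",
            "log_stream_name": "{instance_id}/healthcheck"
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
        "resources": ["*"]
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["*"]
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "statsd": {
        "service_address": "127.0.0.1:8125"
      }
    }
  }
}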
Start the agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/path/to/your/config.json -s
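A malformed config.json is a common reason the agent fails to start, so a quick pre-flight check before running `fetch-config` can help. The sketch below only verifies that the file is valid JSON and that each log entry carries the three keys this article sets. That key list is this article's convention, not the agent's full schema, and the function name is illustrative.

```python
import json
import sys
from typing import List

# Keys this article sets on every log entry (a convention, not the agent's schema).
EXPECTED_LOG_KEYS = ("file_path", "log_group_name", "log_stream_name")


def find_config_problems(config_text: str) -> List[str]:
    """Return human-readable problems found in a CloudWatch Agent config."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top-level JSON value must be an object"]

    problems = []
    entries = (
        config.get("logs", {})
        .get("logs_collected", {})
        .get("files", {})
        .get("collect_list", [])
    )
    if not entries:
        problems.append("no entries under logs.logs_collected.files.collect_list")
    for index, entry in enumerate(entries):
        for key in EXPECTED_LOG_KEYS:
            if not entry.get(key):
                problems.append(f"collect_list[{index}] is missing '{key}'")
    return problems


if __name__ == "__main__":
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as handle:
            found = find_config_problems(handle.read())
        print("\n".join(found) if found else "no problems found")
    else:
        print("usage: python check_cw_config.py /path/to/your/config.json")
```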
2. Sending MySQL Metrics to CloudWatch
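For RDS instances, CloudWatch is integrated by default. For EC2-hosted MySQL, you can publish custom metrics, either through the CloudWatch Agent’s `statsd` listener or directly through the CloudWatch API.
Note that `statsd` is a push protocol: the agent listens for metrics that something sends to it and does not scrape anything. If using PMM, the exporters it runs expose metrics in Prometheus format, so getting them into CloudWatch requires either the CloudWatch Agent’s Prometheus scraping support or a small bridge that reads the values and pushes them.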
Alternatively, you can use the AWS SDK within a custom script to push specific MySQL metrics (e.g., from `SHOW GLOBAL STATUS`) to CloudWatch Custom Metrics.
import time

import boto3
import pymysql

# AWS Configuration
region_name = 'us-east-1'
namespace = 'MySQLCustomMetrics'
# MySQL Configuration
mysql_host = 'your_mysql_host'
mysql_user = 'your_monitor_user'
mysql_password = 'your_monitor_password'
mysql_db = 'your_database'

cloudwatch = boto3.client('cloudwatch', region_name=region_name)


def get_mysql_status():
    """Return selected SHOW GLOBAL STATUS values as ints, or None on error."""
    conn = None
    try:
        conn = pymysql.connect(host=mysql_host, user=mysql_user,
                               password=mysql_password, database=mysql_db)
        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
            # Each row looks like {'Variable_name': 'Threads_connected', 'Value': '12'}
            cursor.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('Threads_connected', 'Slow_queries')"
            )
            return {row['Variable_name']: int(row['Value']) for row in cursor.fetchall()}
    except Exception as e:
        print(f"Error connecting to MySQL or fetching status: {e}")
        return None
    finally:
        if conn is not None:
            conn.close()


def put_metric(metric_name, value, dimensions=None):
    """Publish a single data point to CloudWatch Custom Metrics."""
    if dimensions is None:
        dimensions = []
    try:
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Count',  # Adjust unit as needed (e.g., 'Seconds', 'Percent')
                    'Dimensions': dimensions
                },
            ]
        )
        print(f"Successfully put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")


if __name__ == "__main__":
    while True:
        status = get_mysql_status()
        if status:
            # Add dimensions for instance ID, cluster name, etc.
            instance_dimensions = [{'Name': 'Instance', 'Value': 'your-mysql-instance-id'}]
            put_metric('ThreadsConnected', status['Threads_connected'], instance_dimensions)
            # Slow_queries is cumulative since server start, not a per-minute count
            put_metric('SlowQueries', status['Slow_queries'], instance_dimensions)
        time.sleep(60)  # Collect metrics every minute
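Because the script loops forever, run it as a long-lived systemd service rather than from cron. If you prefer cron or a systemd timer, remove the `while True` loop and the `time.sleep` call so that each invocation publishes once and exits.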
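One detail worth handling in the collector: `Slow_queries` counts slow queries since the server started, so publishing the raw value produces an ever-growing line that is awkward to alarm on. Publishing the increase since the previous sample is usually more useful. A minimal sketch of that calculation follows; the function name is illustrative, and treating a drop as a counter reset is an assumption about how you want restarts handled.

```python
from typing import Optional


def counter_delta(previous: Optional[int], current: int) -> Optional[int]:
    """Return the increase of a cumulative counter between two samples.

    Returns None for the first sample, when there is nothing to compare with.
    If the counter went down (for example after a MySQL restart), the current
    value is treated as the increase since the reset.
    """
    if previous is None:
        return None
    if current < previous:
        return current
    return current - previous


if __name__ == "__main__":
    # Hypothetical Slow_queries readings taken one minute apart.
    samples = [120, 120, 125, 3, 10]
    last = None
    for value in samples:
        print(value, counter_delta(last, value))
        last = value
```

In the collector loop, keep the previous `Slow_queries` reading in a variable and pass the delta, when it is not `None`, to `put_metric` in place of the raw value.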
3. CloudWatch Alarms and Dashboards
Leverage CloudWatch Alarms to trigger notifications (via SNS) based on metrics from your C app logs (e.g., error counts) or MySQL metrics (e.g., high CPU, low disk space, replication lag from RDS metrics).
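As an illustration, an alarm on the custom `ThreadsConnected` metric published earlier could be created with boto3's `put_metric_alarm`. This is a sketch under stated assumptions: the threshold, alarm name, and SNS topic ARN are placeholders, and the helper names are illustrative. The request is built by a separate function so it can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_alarm_request(instance_id: str, sns_topic_arn: str,
                        threshold: float = 200.0) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"mysql-threads-connected-high-{instance_id}",
        "Namespace": "MySQLCustomMetrics",  # namespace used by the collector script
        "MetricName": "ThreadsConnected",
        "Dimensions": [{"Name": "Instance", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,            # seconds per data point
        "EvaluationPeriods": 5,  # must breach for 5 consecutive periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Treat a silent collector as a problem instead of ignoring the gap.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(request: Dict[str, Any], region_name: str = "us-east-1") -> None:
    """Send the request to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported here so building a request works without boto3 installed

    boto3.client("cloudwatch", region_name=region_name).put_metric_alarm(**request)


if __name__ == "__main__":
    # Placeholder identifiers; replace with your instance and SNS topic.
    request = build_alarm_request(
        "your-mysql-instance-id",
        "arn:aws:sns:us-east-1:123456789012:your-topic",
    )
    print(request)  # call create_alarm(request) once the values are real
```

Setting `TreatMissingData` to `breaching` means the alarm also triggers when the collector stops publishing, which covers the case where the monitoring script itself has died.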
Create CloudWatch Dashboards to visualize key metrics from both your C application (e.g., error rates from logs, health check status) and your MySQL cluster (e.g., RDS metrics, custom metrics). This provides a unified operational view.
By combining the deep, application-specific insights from systemd/Prometheus and PMM with the centralized logging, metrics, and alerting capabilities of CloudWatch, you establish a comprehensive and resilient monitoring strategy for your C application and MySQL clusters on AWS.