Server Monitoring Best Practices: Keeping Your C App and MySQL Clusters Alive on AWS

Proactive C Application Health Checks with Systemd and Prometheus

Maintaining the health of a C application, especially one serving critical traffic, requires more than just basic process monitoring. We need to implement deep health checks that go beyond mere existence and verify functional integrity. For applications running on AWS EC2 instances, leveraging systemd for service management and Prometheus for metrics collection provides a robust, scalable solution.

The core idea is to create a systemd service that periodically executes a health check script. This script will perform application-specific checks and exit with a non-zero status code if any check fails. Prometheus, configured with `node_exporter` and a custom exporter or a direct scrape target, can then query this status.

1. Systemd Service Unit for Health Checks

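First, let’s define a systemd service that runs our health check script. We’ll use `Type=oneshot` so that each run starts the script, waits for it to finish, and records its exit status. The schedule does not belong in the service unit: systemd reads `OnCalendar` and `Persistent` only from the `[Timer]` section of a timer unit, so the schedule lives in a separate timer unit with the same base name.

Create a file named /etc/systemd/system/my_c_app_healthcheck.service:

[Unit]
Description=My C Application Health Check
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/my_c_app_healthcheck.sh
User=appuser
Group=appgroup
StandardOutput=journal
StandardError=journal

Then create /etc/systemd/system/my_c_app_healthcheck.timer. The timer triggers the service above, and it is the unit that has to be enabled:

[Unit]
Description=Run My C Application Health Check every 5 minutes

[Timer]
# Run every 5 minutes
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target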
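
To make the schedule concrete: `OnCalendar=*:0/5` matches every hour at minutes 0, 5, 10, and so on, at second 0. The following Python sketch mimics that arithmetic to show when the next run would be due. `next_run` is a hypothetical helper written for illustration, not part of systemd, and it ignores systemd details such as timer accuracy and time zone changes.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, step_minutes: int = 5) -> datetime:
    """Return the first minute boundary strictly after `now` whose minute
    is a multiple of `step_minutes` (assumes step_minutes divides 60)."""
    base = now.replace(second=0, microsecond=0)
    minutes_to_add = step_minutes - (base.minute % step_minutes)
    return base + timedelta(minutes=minutes_to_add)


# A check that becomes due at 12:03:20 is scheduled for the next 5-minute mark.
print(next_run(datetime(2024, 1, 1, 12, 3, 20)))
```

In this model a failure that starts just after a run is not noticed for up to five minutes, which is worth keeping in mind when choosing alert thresholds.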

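Next, create the health check script /usr/local/bin/my_c_app_healthcheck.sh. The script must be executable by `appuser`, and `appuser` must be able to append to the log file the script writes (/var/log/my_c_app_healthcheck.log), which is not the case by default for files under /var/log.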

#!/bin/bash

APP_PORT=8080
APP_HOST="127.0.0.1"
HEALTH_CHECK_PATH="/healthz"
LOG_FILE="/var/log/my_c_app_healthcheck.log"

# --- Application Specific Checks ---

# 1. Check if the application process is running
if ! pgrep -x "my_c_app" > /dev/null
then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: My C application process is not running." | tee -a $LOG_FILE
    exit 1
fi

# 2. Check network connectivity to the application port
if ! nc -z $APP_HOST $APP_PORT &> /dev/null
then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: Cannot connect to My C application on $APP_HOST:$APP_PORT." | tee -a $LOG_FILE
    exit 1
fi

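# 3. Perform an HTTP health check (if applicable)
#    This assumes your C app serves an HTTP endpoint like /healthz
#    --max-time bounds the whole request, so an application that accepts the
#    connection but never answers cannot stall the check
if ! curl --fail --silent --connect-timeout 5 --max-time 10 "http://${APP_HOST}:${APP_PORT}${HEALTH_CHECK_PATH}" &> /dev/null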
then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: HTTP health check failed for http://${APP_HOST}:${APP_PORT}${HEALTH_CHECK_PATH}." | tee -a $LOG_FILE
    exit 1
fi

# 4. Add more application-specific checks here (e.g., database connection, queue depth)
# Example: Check if a critical file exists
# if [ ! -f "/path/to/critical/data.file" ]; then
#     echo "$(date '+%Y-%m-%d %H:%M:%S') - ERROR: Critical data file missing." | tee -a $LOG_FILE
#     exit 1
# fi

echo "$(date '+%Y-%m-%d %H:%M:%S') - SUCCESS: All health checks passed." | tee -a $LOG_FILE
exit 0

Make the script executable:

sudo chmod +x /usr/local/bin/my_c_app_healthcheck.sh
sudo chown appuser:appgroup /usr/local/bin/my_c_app_healthcheck.sh

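Reload systemd and run the service once to confirm that the script works. Periodic execution comes from a timer unit (systemd ignores `OnCalendar` inside a `.service` file), so enable the timer rather than the service:

sudo systemctl daemon-reload
sudo systemctl start my_c_app_healthcheck.service
# Requires a matching my_c_app_healthcheck.timer unit that carries the OnCalendar schedule
sudo systemctl enable --now my_c_app_healthcheck.timer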
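
If the checks later outgrow what is comfortable in shell, the port and HTTP probes can be expressed with the Python standard library. The sketch below is only an illustration of that idea, not a replacement the article depends on. `port_is_open` and `http_is_healthy` are hypothetical names, and the HTTP probe treats any 2xx response as healthy, which is close to but not identical to `curl --fail`.

```python
import http.client
import socket
import urllib.error
import urllib.request


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, 4xx/5xx responses and bad URLs.
        return False
```

A wrapper script could call `port_is_open("127.0.0.1", 8080)` and `http_is_healthy("http://127.0.0.1:8080/healthz")` and exit non-zero when either returns False, which keeps the contract with systemd unchanged.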

2. Exposing Health Check Status to Prometheus

We need a way for Prometheus to query the status of our health check. The simplest approach is to use `node_exporter`’s textfile collector. This collector reads `.prom` files from a specified directory and exposes their contents as metrics.

First, ensure `node_exporter` is installed and running on your EC2 instances. Configure it to use a textfile collector directory, e.g., /var/lib/node_exporter/textfile_collector.

Modify your `node_exporter` systemd service file (often found at /etc/systemd/system/node_exporter.service) to include the --collector.textfile.directory flag:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

[Install]
WantedBy=multi-user.target

Reload systemd and restart `node_exporter`:

sudo systemctl daemon-reload
sudo systemctl restart node_exporter

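Now, modify our health check script to write a metric file in the textfile collector directory that indicates the health status. The script calls `exit 1` as soon as a check fails and `exit 0` at the end, so code appended after those lines would never run. An `EXIT` trap avoids that problem: it runs on every exit path and sees the script’s exit status.

Update /usr/local/bin/my_c_app_healthcheck.sh to include the following immediately after the variable definitions at the top:

# ... (variable definitions from the script above) ...

# Write metric to node_exporter textfile collector
HEALTH_METRIC_FILE="/var/lib/node_exporter/textfile_collector/my_c_app_health.prom"

write_health_metric() {
    # Write to a temporary file and rename it, so node_exporter never reads a partial file
    local tmp_file="${HEALTH_METRIC_FILE}.$$"
    {
        echo "# HELP my_c_app_health_status Status of the C application health check (1 for healthy, 0 for unhealthy)"
        echo "# TYPE my_c_app_health_status gauge"
        echo "my_c_app_health_status $1"
    } > "$tmp_file" && mv "$tmp_file" "$HEALTH_METRIC_FILE"
}

# Runs on every exit path: status 0 means all checks passed, anything else means a check failed.
# The script still exits with its original status, so systemd sees the failure as well.
trap 'if [ $? -eq 0 ]; then write_health_metric 1; else write_health_metric 0; fi' EXIT

# ... (checks 1-4 and the final "exit 0" from the script above, unchanged) ...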

Ensure that `appuser` can write to the textfile collector directory and that `node_exporter` can read it. Group membership alone is not enough: the directory must also be group-writable.

# Example: make the directory group-writable and add appuser to the node_exporter group
# (assumes node_exporter runs as the node_exporter user and group)
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector
sudo chmod 775 /var/lib/node_exporter/textfile_collector
sudo usermod -aG node_exporter appuser

After the next run of the systemd service, you should be able to query http://<your-ec2-private-ip>:9100/metrics and see a line like:

# HELP my_c_app_health_status Status of the C application health check (1 for healthy, 0 for unhealthy)
# TYPE my_c_app_health_status gauge
my_c_app_health_status 1
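
For readers who generate the metric file from a program rather than from shell, the following Python sketch shows one way to produce exactly the three lines above and to replace the file atomically, so that `node_exporter` never reads a half-written file. The function names are hypothetical, and the file name `my_c_app_health.prom` is taken from the script in this section.

```python
import os
import tempfile

METRIC_NAME = "my_c_app_health_status"


def render_health_metric(healthy: bool) -> str:
    """Render the gauge in the Prometheus text exposition format."""
    value = 1 if healthy else 0
    return (
        f"# HELP {METRIC_NAME} Status of the C application health check "
        "(1 for healthy, 0 for unhealthy)\n"
        f"# TYPE {METRIC_NAME} gauge\n"
        f"{METRIC_NAME} {value}\n"
    )


def write_metric_atomically(directory: str, healthy: bool) -> str:
    """Write the metric to a temp file in `directory`, then rename it into place."""
    final_path = os.path.join(directory, "my_c_app_health.prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_health_metric(healthy))
    # mkstemp creates the file with mode 0600; node_exporter runs as another user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)  # atomic within one filesystem
    return final_path
```

The temporary file ends in `.tmp`, not `.prom`, so the textfile collector ignores it until the rename has happened.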

3. Prometheus Configuration and Alerting

In your Prometheus configuration (prometheus.yml), ensure you have a scrape job targeting your EC2 instances:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      # Use private IPs or DNS names; add one entry per instance
      - targets: ['ec2-instance-1:9100', 'ec2-instance-2:9100']
    # If using service discovery (e.g., EC2 SD), this would be dynamic.

Now, define an alerting rule in Prometheus (e.g., in alerts.yml) to notify you when the application is unhealthy:

groups:
- name: my_c_app_alerts
  rules:
  - alert: MyCAppUnhealthy
    expr: my_c_app_health_status == 0
    for: 5m # Alert only if unhealthy for 5 minutes
    labels:
      severity: critical
    annotations:
      summary: "My C Application is unhealthy on {{ $labels.instance }}"
      description: "The health check for My C Application has failed for 5 minutes on instance {{ $labels.instance }}."

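Reference the rule file from prometheus.yml under `rule_files`, then reload the Prometheus configuration for the new rules to take effect, for example by sending SIGHUP to the Prometheus process or, if Prometheus was started with `--web.enable-lifecycle`, by sending an HTTP POST to /-/reload.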
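
The effect of `for: 5m` is easy to misread, so here is a small Python simulation of the idea: the alert fires only once the expression has matched at every evaluation for at least the `for` duration, and any healthy sample resets the clock. This is a simplified model written for illustration, not Prometheus code; `firing_timestamps` is a hypothetical helper and the samples are made up.

```python
from typing import Iterable, List, Optional, Tuple


def firing_timestamps(
    samples: Iterable[Tuple[int, int]], for_seconds: int = 300
) -> List[int]:
    """Given (timestamp_seconds, my_c_app_health_status) pairs in time order,
    one per rule evaluation, return the timestamps at which the alert fires."""
    firing: List[int] = []
    pending_since: Optional[int] = None
    for timestamp, value in samples:
        if value == 0:  # the rule expression matches
            if pending_since is None:
                pending_since = timestamp  # alert becomes pending
            if timestamp - pending_since >= for_seconds:
                firing.append(timestamp)
        else:
            pending_since = None  # any healthy sample resets the timer
    return firing


# Unhealthy from t=0 onward, evaluated once a minute: fires from t=300.
print(firing_timestamps([(t, 0) for t in range(0, 420, 60)]))
```

Because the metric file is rewritten only when the health check runs, a single failed run leaves the value at 0 until the next run. The relationship between the check interval and the `for` duration therefore determines how many failed runs it takes before the alert fires.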

Monitoring MySQL Cluster Health with Percona Monitoring and Management (PMM)

For MySQL clusters on AWS, especially those requiring high availability and performance tuning, a dedicated monitoring solution is essential. Percona Monitoring and Management (PMM) is an excellent open-source choice that provides deep insights into MySQL performance, availability, and query analysis.

1. Deploying Percona Monitoring and Management (PMM) Server

The PMM server can be deployed on an EC2 instance. For production, it’s recommended to use the Docker deployment for easier management and upgrades.

Launch an EC2 instance (e.g., `t3.medium` or larger) with sufficient storage. Install Docker and Docker Compose.

Create a docker-compose.yml file:

version: '3.7'

services:
  pmm-server:
    image: percona/pmm-server:2
    container_name: pmm-server
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - pmm-server-data:/srv # PMM Server keeps its data under /srv
    environment:
      # Optional: these are read by an nginx-proxy / Let's Encrypt companion placed in front of PMM, not by PMM itself
      - VIRTUAL_HOST=pmm.yourdomain.com
      - LETSENCRYPT_HOST=pmm.yourdomain.com
      - LETSENCRYPT_EMAIL=<your-email-address>
    networks:
      - pmm-network

volumes:
  pmm-server-data:

networks:
  pmm-network:
    driver: bridge

Run PMM Server:

docker-compose up -d

Access the PMM UI at https://<pmm-server-ec2-public-ip>. The bundled certificate is self-signed, so expect a browser warning until you install your own. The default credentials are admin/admin, which you’ll be prompted to change.

2. Adding MySQL Instances to PMM

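PMM collects data through the PMM client. Its `pmm-agent` process runs on the monitored host and manages exporters such as `mysqld_exporter` for you.

For each EC2-hosted MySQL node in your cluster, install the PMM client on the node and register it with the PMM server. You cannot install software on RDS instances, so those are added as remote instances instead, either from the PMM UI or from a PMM client on another host that can reach the RDS endpoint.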

On each MySQL node, install the PMM client:

# Example for Ubuntu/Debian
wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
sudo apt-get update
sudo apt-get install pmm2-client -y

# Example for CentOS/RHEL
sudo rpm -Uvh https://repo.percona.com/yum/percona-release-latest.noarch.rpm
sudo yum update
sudo yum install pmm2-client -y

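# Register this node with the PMM server. PMM 2 expects an HTTPS URL that includes
# credentials; replace admin:admin with the credentials you set in the PMM UI.
sudo pmm-admin config --server-insecure-tls --server-url=https://admin:admin@<pmm-server-ec2-public-ip>:443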

Add your MySQL instance. You’ll need to provide credentials for PMM to connect to MySQL. It’s best practice to create a dedicated monitoring user with minimal privileges. The host part of the account must match the address the PMM client connects from; with 'pmm'@'localhost', the client has to run on the MySQL node itself.

-- Create a monitoring user in MySQL (run this in a MySQL session on your MySQL server)
-- If skip_name_resolve is enabled, create the account as 'pmm'@'127.0.0.1' instead
CREATE USER 'pmm'@'localhost' IDENTIFIED BY 'your_strong_password';
GRANT PROCESS, REPLICATION CLIENT, SELECT, SHOW DATABASES, SHOW VIEW ON *.* TO 'pmm'@'localhost';
FLUSH PRIVILEGES;

# Add the MySQL instance via pmm-admin (run this in a shell on the same node)
sudo pmm-admin add mysql --host 127.0.0.1 --port 3306 --username pmm --password 'your_strong_password' --service-name "my-mysql-cluster-node-1"

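Repeat this for all nodes in your MySQL cluster. To have PMM group the nodes together in its dashboards, pass the same `--cluster` value (and, for the members of one replication topology, the same `--replication-set` value) to each `pmm-admin add mysql` call. Each node’s replication status, including its primary or replica role, is then visible from the metrics that node reports.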

3. Monitoring Cluster Health and Performance

Once instances are added, PMM will start collecting metrics. Navigate to the PMM UI:

- Dashboards: Explore the pre-built dashboards for MySQL, including “MySQL Overview,” “MySQL InnoDB,” and “MySQL Performance Schema.” These provide deep dives into query performance, replication lag, connection usage, buffer pool efficiency, and more.
- Query Analytics: The Query Analytics tab is invaluable for identifying slow or problematic queries. You can filter by host, user, query text, and performance metrics.
- Alerting: PMM integrates with Alertmanager. You can configure alerts for critical conditions like:
  - High replication lag
  - MySQL server down (unreachable)
  - High CPU/Memory usage on the MySQL instance
  - Low disk space
  - Excessive query times

To configure alerts, you’ll typically define rules within PMM’s Grafana instance or directly in Prometheus if you’re using PMM’s exporters but managing alerts externally.

Example alert rules (entries for the `rules:` list of a rule group, within PMM’s Grafana or Prometheus):

# Alert for high replication lag on a specific node
- alert: MySQLReplicationLagHigh
  expr: mysql_slave_status_seconds_behind_master > 60 # Replication lag in seconds, as reported by mysqld_exporter
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High MySQL replication lag on {{ $labels.instance }}"
    description: "Replication lag on {{ $labels.instance }} has been above 60 seconds for 10 minutes. Current lag: {{ $value }}s."

# Alert for MySQL server being down (the exporter can't connect to MySQL)
- alert: MySQLServerDown
  expr: mysql_up == 0 # mysqld_exporter reports mysql_up as 0 when it cannot reach MySQL
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "MySQL server is down on {{ $labels.instance }}"
    description: "The MySQL server on {{ $labels.instance }} has been unreachable for 5 minutes."

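Ensure each PMM client can reach the PMM server over HTTPS (port 443). For instances monitored remotely, such as RDS, the monitoring host must also be able to reach MySQL on port 3306. Open these ports in your AWS Security Groups, restricted to the hosts that need them.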

Integrating C App and MySQL Monitoring with CloudWatch

While Prometheus and PMM provide deep, granular monitoring, AWS CloudWatch offers centralized logging, metrics, and alarms across your entire AWS infrastructure. Integrating these tools provides a holistic view.

1. Sending C App Logs to CloudWatch Logs

Ensure your C application logs to standard output/error or a designated file. Use the CloudWatch Agent to stream these logs to CloudWatch Logs.

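Install the CloudWatch Agent on your EC2 instances. Configure it by running the wizard at /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard or by creating a configuration file by hand. The user named in `run_as_user` must be able to read every log file listed in the configuration.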

Example CloudWatch Agent configuration (config.json):

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my_c_app.log",
            "log_group_name": "my-c-app-logs",
            "log_stream_name": "{instance_id}/my_c_app"
          },
          {
            "file_path": "/var/log/my_c_app_healthcheck.log",
            "log_group_name": "my-c-app-healthcheck-logs",
            "log_stream_name": "{instance_id}/healthcheck"
          }
        ]
      }
    }
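  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
        "resources": ["*"],
        "totalcpu": true
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["*"]
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "statsd": {
        "service_address": "127.0.0.1:8125"
      }
    }
  }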
}

Start the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/path/to/your/config.json -s

2. Sending MySQL Metrics to CloudWatch

For RDS instances, CloudWatch is integrated by default. For EC2-hosted MySQL, you can push metrics to the CloudWatch Agent’s `statsd` listener or publish custom metrics through the CloudWatch API.

If using PMM, note that `statsd` is a push protocol, so the agent cannot scrape PMM through it. The options are the CloudWatch Agent’s Prometheus scraping support, pointed at the exporters on each node, or a dedicated exporter that pushes to CloudWatch.

Alternatively, you can use the AWS SDK within a custom script to push specific MySQL metrics (e.g., from `SHOW GLOBAL STATUS`) to CloudWatch Custom Metrics.

import boto3
import pymysql
import time

# AWS Configuration
region_name = 'us-east-1'
namespace = 'MySQLCustomMetrics'

# MySQL Configuration
mysql_host = 'your_mysql_host'
mysql_user = 'your_monitor_user'
mysql_password = 'your_monitor_password'
mysql_db = 'your_database'

cloudwatch = boto3.client('cloudwatch', region_name=region_name)

def get_mysql_status():
    """Return selected SHOW GLOBAL STATUS counters as integers, or None on failure."""
    try:
        conn = pymysql.connect(host=mysql_host, user=mysql_user, password=mysql_password, database=mysql_db)
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cursor:
                # Each result row has two columns: Variable_name and Value
                cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
                threads_connected = cursor.fetchone()['Value']
                cursor.execute("SHOW GLOBAL STATUS LIKE 'Slow_queries'")
                slow_queries = cursor.fetchone()['Value']
        finally:
            conn.close()  # Close the connection even if a query fails
        return {'Threads_connected': int(threads_connected), 'Slow_queries': int(slow_queries)}
    except Exception as e:
        print(f"Error connecting to MySQL or fetching status: {e}")
        return None

def put_metric(metric_name, value, dimensions=None):
    if dimensions is None:
        dimensions = []
    try:
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Count', # Adjust unit as needed (e.g., 'Seconds', 'Percent')
                    'Dimensions': dimensions
                },
            ]
        )
        print(f"Successfully put metric: {metric_name}={value}")
    except Exception as e:
        print(f"Error putting metric {metric_name}: {e}")

if __name__ == "__main__":
    while True:
        status = get_mysql_status()
        if status:
            # Add dimensions for instance ID, cluster name, etc.
            instance_dimensions = [{'Name': 'Instance', 'Value': 'your-mysql-instance-id'}]
            put_metric('ThreadsConnected', status['Threads_connected'], instance_dimensions)
            put_metric('SlowQueries', status['Slow_queries'], instance_dimensions)
        time.sleep(60) # Collect metrics every minute

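The script above loops forever and sleeps between collections, so run it as a long-lived process, for example under a systemd service with `Restart=on-failure`. If you would rather schedule it with cron or a systemd timer, remove the `while True` loop and the `time.sleep(60)` call so that each invocation collects once and exits.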
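
The script issues one `put_metric_data` request per metric. Since `MetricData` is a list, both values can travel in a single request. The sketch below builds that list as plain data, so it can be inspected without AWS access. `build_metric_data` is a hypothetical helper, and the metric and dimension names are the ones used in the script above.

```python
from typing import Dict, List

# Maps MySQL status variable names to the CloudWatch metric names used above.
METRIC_NAMES = {
    "Threads_connected": "ThreadsConnected",
    "Slow_queries": "SlowQueries",
}


def build_metric_data(status: Dict[str, int], instance_id: str) -> List[dict]:
    """Turn a status dict into a MetricData list for one put_metric_data call."""
    dimensions = [{"Name": "Instance", "Value": instance_id}]
    return [
        {
            "MetricName": METRIC_NAMES[key],
            "Value": float(value),
            "Unit": "Count",
            "Dimensions": dimensions,
        }
        for key, value in status.items()
        if key in METRIC_NAMES  # ignore variables we have no metric name for
    ]


batch = build_metric_data({"Threads_connected": 12, "Slow_queries": 3}, "your-mysql-instance-id")
print([item["MetricName"] for item in batch])
```

With the client from the script, the call would then be `cloudwatch.put_metric_data(Namespace=namespace, MetricData=batch)`. Note that `Slow_queries` is a cumulative counter, so dashboards usually graph its change over time and not the raw value.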

3. CloudWatch Alarms and Dashboards

Leverage CloudWatch Alarms to trigger notifications (via SNS) based on metrics from your C app logs (e.g., error counts) or MySQL metrics (e.g., high CPU, low disk space, replication lag from RDS metrics).
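
As a sketch of the log-based case, the following Python code turns `ERROR` lines in the health check log group into a custom metric and alarms when any appear within five minutes. It assumes the log group name from the agent configuration above. The filter name, metric name, namespace `MyCApp`, alarm name, and SNS topic ARN are placeholders chosen for illustration, and the thresholds should be tuned to your traffic.

```python
def error_metric_filter(log_group: str) -> dict:
    """Arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": "healthcheck-errors",   # placeholder name
        "filterPattern": "ERROR",             # matches log events containing ERROR
        "metricTransformations": [
            {
                "metricName": "HealthCheckErrors",
                "metricNamespace": "MyCApp",
                "metricValue": "1",           # count one per matching event
            }
        ],
    }


def error_alarm(sns_topic_arn: str) -> dict:
    """Arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "my-c-app-healthcheck-errors",
        "Namespace": "MyCApp",
        "MetricName": "HealthCheckErrors",
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",   # no matching log lines means no errors
        "AlarmActions": [sns_topic_arn],
    }


def create_error_alarm(logs_client, cloudwatch_client, log_group: str, sns_topic_arn: str) -> None:
    """Create the metric filter and the alarm using the given boto3 clients."""
    logs_client.put_metric_filter(**error_metric_filter(log_group))
    cloudwatch_client.put_metric_alarm(**error_alarm(sns_topic_arn))
```

In a real run the clients would come from `boto3.client('logs')` and `boto3.client('cloudwatch')`; keeping the arguments in plain functions makes them easy to review before anything is created in AWS.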

Create CloudWatch Dashboards to visualize key metrics from both your C application (e.g., error rates from logs, health check status) and your MySQL cluster (e.g., RDS metrics, custom metrics). This provides a unified operational view.

By combining the deep, application-specific insights from systemd/Prometheus and PMM with the centralized logging, metrics, and alerting capabilities of CloudWatch, you establish a comprehensive and resilient monitoring strategy for your C application and MySQL clusters on AWS.