Server Monitoring Best Practices: Keeping Your Python App and MySQL Clusters Alive on AWS

Proactive Health Checks for Python Applications on EC2

Maintaining the health of Python applications deployed on AWS EC2 instances requires a multi-layered approach to monitoring. Beyond basic CPU and memory utilization, we need to inspect application-specific metrics and ensure critical processes are running. This involves leveraging both AWS native tools and custom instrumentation.

A fundamental check is the status of the Python application process itself. For applications managed by systemd, a small script can query the unit's status and restart it if necessary. Such a script can be run periodically via cron or integrated into a more sophisticated monitoring agent.

Systemd Service Monitoring Script

This Python script checks the status of a systemd service and attempts a restart if it’s found to be inactive. It also logs the outcome for later analysis.

import subprocess
import logging
import sys

# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename='/var/log/app_health_check.log')

SERVICE_NAME = "your-python-app.service" # Replace with your actual service name

def check_service_status(service_name):
    try:
        # `systemctl is-active` exits non-zero for every state other than "active",
        # so check=True would raise on a stopped service. Read the state from stdout instead.
        status_cmd = ["systemctl", "is-active", service_name]
        result = subprocess.run(status_cmd, capture_output=True, text=True)
        status = result.stdout.strip()
        logging.info(f"Service '{service_name}' state: {status or 'unknown'}")
        return status == 'active'
    except FileNotFoundError:
        logging.error("systemctl command not found. Is this running on a systemd-based OS?")
        return False

def restart_service(service_name):
    try:
        logging.warning(f"Attempting to restart service: {service_name}")
        restart_cmd = ["systemctl", "restart", service_name]
        subprocess.run(restart_cmd, capture_output=True, text=True, check=True)
        logging.info(f"Successfully restarted service: {service_name}")
        return True
    except subprocess.CalledProcessError as e:
        logging.error(f"Failed to restart service '{service_name}': {e.stderr.strip()}")
        return False
    except FileNotFoundError:
        logging.error("systemctl command not found. Cannot restart service.")
        return False

if __name__ == "__main__":
    if not check_service_status(SERVICE_NAME):
        logging.warning(f"Service '{SERVICE_NAME}' is not active. Attempting restart.")
        if not restart_service(SERVICE_NAME):
            logging.error(f"Failed to restart service '{SERVICE_NAME}'. Manual intervention may be required.")
            sys.exit(1) # Indicate failure
    else:
        logging.info(f"Service '{SERVICE_NAME}' is running as expected.")
        sys.exit(0) # Indicate success

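To automate this, add a cron job that runs the script every 5 minutes. Use root's crontab, since restarting a system unit and writing to /var/log both require elevated privileges: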

*/5 * * * * /usr/bin/python3 /opt/scripts/app_health_check.py >> /var/log/app_health_check_cron.log 2>&1
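One caveat of a restart-if-not-active rule is that `systemctl is-active` reports more than two states. The sketch below shows one possible way to map those states to an action, so that a unit that is still starting is not restarted in the middle of its startup. The function name `decide_action` and the action labels are illustrative and not part of the script above.

```python
# Sketch: map the state printed by `systemctl is-active` to an action.
# decide_action and the 'ok' / 'wait' / 'restart' labels are hypothetical names.

HEALTHY_STATES = {"active", "reloading"}
TRANSITIONAL_STATES = {"activating", "deactivating"}


def decide_action(state: str) -> str:
    """Return 'ok', 'wait', or 'restart' for a systemd unit state string."""
    state = state.strip().lower()
    if state in HEALTHY_STATES:
        return "ok"
    if state in TRANSITIONAL_STATES:
        # The unit is mid-transition; look again on the next scheduled run.
        return "wait"
    # 'inactive', 'failed', and anything unrecognised are treated as down.
    return "restart"
```

With this in place, the health check would only call its restart routine when `decide_action` returns `'restart'`.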

For more advanced monitoring, consider integrating with CloudWatch. You can use the CloudWatch Agent to collect custom metrics from your application, such as request latency, error rates, or queue depths. This allows for more granular performance analysis and proactive alerting.

Custom Metrics with CloudWatch Agent

First, ensure the CloudWatch Agent is installed and configured on your EC2 instances. You’ll need a configuration file (e.g., /opt/aws/amazon-cloudwatch-agent/bin/config.json) that specifies which metrics to collect. For Python applications, this might involve parsing application logs or exposing an HTTP endpoint with metrics.

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "YourApp/EC2",
    "metrics_collected": {
      "statsd": {
        "service_address": "127.0.0.1:8125",
        "metrics_collection_interval": 60
      },
      "procstat": [
        {
          "exe": "python",
          "measurement": [
            "pid_count",
            "cpu_usage",
            "memory_rss"
          ],
          "metrics_collection_interval": 60
        }
      ]
    }
  }
}

Your Python application can then send custom metrics using a StatsD client. Libraries like statsd (for Python) make this straightforward.

from statsd import StatsClient
import logging
import time
import random

statsd_client = StatsClient('127.0.0.1', 8125)

def simulate_request():
    # Simulate request processing time
    latency = random.uniform(0.05, 0.5)
    time.sleep(latency)
    return latency

if __name__ == "__main__":
    while True:
        try:
            request_latency = simulate_request()
            statsd_client.timing('request.latency', request_latency * 1000) # Send in milliseconds

            if random.random() < 0.05: # 5% chance of error
                statsd_client.incr('request.errors')
                print("Simulated error")
            else:
                statsd_client.incr('request.success')
                print("Simulated success")

            time.sleep(1) # Send metrics every second
        except Exception as e:
            logging.error(f"Error in metric collection: {e}")
            time.sleep(5) # Wait before retrying

With these custom metrics flowing into CloudWatch, you can create alarms based on thresholds (e.g., high latency, increasing error rates) to notify your team via SNS.
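In a real application the simulated loop would give way to instrumentation around actual request handlers. A minimal sketch of that idea is a decorator that reports latency and success/error counters for any function. The name `timed` is hypothetical; `client` is assumed to be any object exposing `incr(name)` and `timing(name, milliseconds)`, which matches the `StatsClient` used above.

```python
import functools
import time


def timed(client, metric_name):
    """Decorator factory: report duration and outcome counters for a function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                client.incr(f"{metric_name}.errors")
                raise
            else:
                client.incr(f"{metric_name}.success")
                return result
            finally:
                # Runs on both paths, so latency is recorded for failures too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                client.timing(f"{metric_name}.latency", elapsed_ms)
        return wrapper
    return decorator
```

Applied as `@timed(statsd_client, 'request')` on a handler, this would emit `request.latency`, `request.success`, and `request.errors`, the same metric names as the simulation.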

Monitoring MySQL Clusters on AWS RDS

For managed MySQL instances on AWS RDS, monitoring shifts from instance-level process checks to database-specific metrics and performance insights. AWS provides extensive CloudWatch metrics for RDS, but deeper inspection often requires querying the database itself.

Key RDS CloudWatch Metrics

CPUUtilization: Percentage of CPU capacity in use.
DatabaseConnections: Number of client connections to the DB instance.
FreeableMemory: Amount of available RAM on the DB instance.
ReadIOPS and WriteIOPS: Average disk read and write operations per second.
ReadLatency and WriteLatency: Average time taken per disk read or write operation.
DiskQueueDepth: Number of outstanding I/O requests waiting to access the disk.
NetworkReceiveThroughput and NetworkTransmitThroughput: Incoming and outgoing network traffic in bytes per second.
Others worth watching: SwapUsage, BinLogDiskUsage (when binary logging is enabled), and ReplicaLag (on read replicas).

Setting up CloudWatch Alarms on these metrics is crucial. For instance, an alarm on DiskQueueDepth staying well above the instance's normal baseline for 5 minutes can indicate an I/O bottleneck. Similarly, sustained high CPUUtilization or rapidly growing BinLogDiskUsage warrants investigation.
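As a rough illustration, such an alarm could be defined with Boto3's `put_metric_alarm`. The sketch below builds the arguments in a plain function so they can be reviewed or tested without touching AWS. The helper names, the 80% threshold, the 15-minute window, and the SNS topic ARN are all assumptions to adapt to your own baseline.

```python
# Sketch: a CPUUtilization alarm for one RDS instance.
# build_cpu_alarm / create_alarm are hypothetical helper names.

def build_cpu_alarm(db_instance_id, sns_topic_arn, threshold_percent=80.0):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{db_instance_id}-high-cpu",
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # three consecutive periods = 15 minutes
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
        "TreatMissingData": "missing",
    }


def create_alarm(alarm_kwargs):
    """Send the alarm definition to CloudWatch (requires AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed
    boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
```

Calling `create_alarm(build_cpu_alarm("your-rds-instance-id", topic_arn))` would register the alarm; the caller needs the `cloudwatch:PutMetricAlarm` permission.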

Deep Dive with Performance Insights

AWS Performance Insights offers a more granular view of database load. It helps identify wait events, SQL queries, and hosts contributing most to database load. Enabling Performance Insights on your RDS instances is highly recommended for troubleshooting performance issues.

To access Performance Insights data programmatically or for custom dashboards, you can use the AWS SDK. Here's a Python example using Boto3 to list the top SQL statements by average database load over a given time range:

import boto3
from datetime import datetime, timedelta, timezone

rds_client = boto3.client('rds')
pi_client = boto3.client('pi') # Performance Insights has its own API, separate from 'rds'

DB_INSTANCE_IDENTIFIER = "your-rds-instance-id" # Replace with your RDS instance ID
HOURS_TO_ANALYZE = 1 # Analyze the last hour

def get_top_sql_queries(instance_id, hours):
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=hours)

    try:
        # Performance Insights is keyed by the DbiResourceId (db-...), not the instance name
        instance = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)['DBInstances'][0]
        response = pi_client.describe_dimension_keys(
            ServiceType='RDS',
            Identifier=instance['DbiResourceId'],
            StartTime=start_time,
            EndTime=end_time,
            Metric='db.load.avg', # Average active sessions
            PeriodInSeconds=3600, # Granularity of data points (1 hour)
            GroupBy={'Group': 'db.sql', 'Limit': 10}, # Top 10 SQL statements by load
        )

        print(f"Top SQL queries for {instance_id} in the last {hours} hour(s):")
        for key in response.get('Keys', []):
            sql_text = key.get('Dimensions', {}).get('db.sql.statement', 'N/A')
            print(f"  - {sql_text}: {key.get('Total', 'N/A')}")

    except Exception as e:
        print(f"Error retrieving Performance Insights data: {e}")

if __name__ == "__main__":
    get_top_sql_queries(DB_INSTANCE_IDENTIFIER, HOURS_TO_ANALYZE)

Custom MySQL Monitoring Queries

For specific application-level database health checks, direct SQL queries are invaluable. These can be executed periodically and their results sent to CloudWatch or used for local alerting.

Consider these common checks:

Replication Lag: Crucial for read replicas.
Long-Running Queries: Identify queries that might be impacting performance.
Connection Count: Monitor active connections against limits.
Table Locks: Detect potential deadlocks or contention.

Here's a Python snippet demonstrating how to check replication status and send a custom metric to CloudWatch:

import pymysql
import boto3
import logging

# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename='/var/log/mysql_replication_check.log')

# RDS connection details (use Secrets Manager for production)
DB_HOST = "your-rds-endpoint.region.rds.amazonaws.com"
DB_USER = "monitor_user"
DB_PASSWORD = "your_monitor_password"
DB_NAME = "mysql" # Or your application database

# CloudWatch details
CLOUDWATCH_NAMESPACE = "YourApp/RDS"
CLOUDWATCH_METRIC_NAME = "ReplicationLagSeconds"

def get_replication_lag(host, user, password, db_name):
    lag_seconds = -1 # Default to -1 if unable to check
    conn = None # Defined up front so the finally block is safe if connect() fails
    try:
        conn = pymysql.connect(host=host, user=user, password=password, database=db_name,
                               connect_timeout=5, cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cursor:
            # Returns no rows on a non-replica. MySQL 8.0.22+ deprecates this statement in favor of
            # SHOW REPLICA STATUS, which renames the columns (e.g. Seconds_Behind_Source).
            cursor.execute("SHOW SLAVE STATUS")
            slave_status = cursor.fetchone()

            if slave_status and slave_status.get('Slave_IO_Running') == 'Yes' and slave_status.get('Slave_SQL_Running') == 'Yes':
                seconds_behind_master = slave_status.get('Seconds_Behind_Master')
                if seconds_behind_master is not None:
                    lag_seconds = int(seconds_behind_master)
                    logging.info(f"Replication lag is {lag_seconds} seconds.")
                else:
                    logging.warning("Seconds_Behind_Master is NULL, cannot determine lag.")
            elif slave_status is None:
                logging.info("This is not a replica instance.")
            else:
                logging.warning(f"Replication is not running. IO: {slave_status.get('Slave_IO_Running')}, SQL: {slave_status.get('Slave_SQL_Running')}")
    except pymysql.Error as e:
        logging.error(f"Database error checking replication: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
    finally:
        if conn is not None and conn.open:
            conn.close()
    return lag_seconds

def put_cloudwatch_metric(namespace, metric_name, value, dimensions=None):
    try:
        cloudwatch = boto3.client('cloudwatch')
        if dimensions is None:
            dimensions = []

        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Value': value,
                    'Unit': 'Seconds',
                    'Dimensions': dimensions
                },
            ]
        )
        logging.info(f"Successfully put metric '{metric_name}' to CloudWatch: {value}")
    except Exception as e:
        logging.error(f"Failed to put metric to CloudWatch: {e}")

if __name__ == "__main__":
    replication_lag = get_replication_lag(DB_HOST, DB_USER, DB_PASSWORD, DB_NAME)

    if replication_lag != -1: # Only send metric if we successfully got a value
        # Add dimensions for instance ID if needed
        metric_dimensions = [{'Name': 'DBInstanceIdentifier', 'Value': DB_HOST.split('.')[0]}]
        put_cloudwatch_metric(CLOUDWATCH_NAMESPACE, CLOUDWATCH_METRIC_NAME, replication_lag, metric_dimensions)
    else:
        logging.warning("Could not determine replication lag, metric not sent to CloudWatch.")

This script should be scheduled via cron or a similar mechanism. Ensure the IAM role or user executing this script has permissions for cloudwatch:PutMetricData.
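The long-running query check from the list above can follow the same pattern. The sketch below is one possible approach: read the session list, then filter it in plain Python so the threshold logic is easy to test. The helper names and the 30-second default are assumptions. The connection is expected to use a dict-style cursor, as in the replication script. On newer MySQL 8.0 releases `information_schema.PROCESSLIST` is deprecated in favor of `performance_schema.processlist`, so the table name may need adjusting.

```python
# Sketch: flag statements that have been running longer than a threshold.
# fetch_processlist / find_long_running are hypothetical helper names.

PROCESSLIST_QUERY = (
    "SELECT ID, USER, DB, COMMAND, TIME, INFO "
    "FROM information_schema.PROCESSLIST"
)


def fetch_processlist(conn):
    """Run the query on an open connection that uses a dict-style cursor."""
    with conn.cursor() as cursor:
        cursor.execute(PROCESSLIST_QUERY)
        return cursor.fetchall()


def find_long_running(rows, threshold_seconds=30):
    """Return non-idle sessions running longer than the threshold, slowest first."""
    offenders = [
        row for row in rows
        if row.get("COMMAND") != "Sleep"   # idle connections are not queries
        and row.get("INFO")                # skip sessions with no statement text
        and int(row.get("TIME") or 0) > threshold_seconds
    ]
    return sorted(offenders, key=lambda row: int(row["TIME"]), reverse=True)
```

The length of the returned list could then be published with `put_cloudwatch_metric`, using a count unit rather than seconds.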

Centralized Logging and Alerting Strategy

Effective server monitoring is incomplete without a robust strategy for log aggregation and alerting. Centralizing logs from your EC2 instances and RDS instances allows for easier debugging and correlation of events across your infrastructure.

Log Aggregation with CloudWatch Logs

The CloudWatch Agent can be configured to stream application logs, system logs, and custom logs to CloudWatch Logs. This provides a searchable, centralized repository for all your log data.

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app_health_check.log",
            "log_group_name": "YourApp/EC2/AppHealth",
            "log_stream_name": "{instance_id}/app_health"
          },
          {
            "file_path": "/var/log/your-python-app.log",
            "log_group_name": "YourApp/EC2/ApplicationLogs",
            "log_stream_name": "{instance_id}/app_logs"
          },
          {
            "file_path": "/var/log/mysql_replication_check.log",
            "log_group_name": "YourApp/RDS/MySQLChecks",
            "log_stream_name": "{instance_id}/replication_check"
          }
        ]
      }
    }
  }
}

Once logs are in CloudWatch Logs, you can create Metric Filters to extract numerical data (e.g., error counts) and then set up CloudWatch Alarms based on these metrics. For example, an alarm can be triggered if the number of "ERROR" messages in your application logs exceeds a certain rate.
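A metric filter of this kind could be created with Boto3's `put_metric_filter`. The following is a minimal sketch, assuming the log lines contain the literal level name `ERROR` (as the logging format used earlier in this article does). The helper names, the filter name, the metric name, and the `YourApp/Logs` namespace are placeholders.

```python
# Sketch: count ERROR lines in a log group as a CloudWatch metric.
# build_error_metric_filter / create_metric_filter are hypothetical names.

def build_error_metric_filter(log_group_name, metric_namespace="YourApp/Logs"):
    """Build keyword arguments for CloudWatch Logs put_metric_filter."""
    return {
        "logGroupName": log_group_name,
        "filterName": "app-error-count",
        "filterPattern": '"ERROR"',        # quoted term = exact phrase match
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": metric_namespace,
                "metricValue": "1",        # add 1 per matching log event
                "defaultValue": 0.0,       # report 0 when nothing matches
            }
        ],
    }


def create_metric_filter(filter_kwargs):
    """Register the filter with CloudWatch Logs (requires AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed
    boto3.client("logs").put_metric_filter(**filter_kwargs)
```

An alarm on the resulting `ErrorCount` metric, summed over a few minutes, then covers the error-rate case described above.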

Alerting with SNS and Lambda

CloudWatch Alarms are typically configured to publish notifications to an Amazon Simple Notification Service (SNS) topic. From there, you can subscribe various endpoints, such as email addresses, SMS numbers, or even trigger AWS Lambda functions for automated remediation.

A common pattern is to have an SNS topic that triggers a Lambda function. This Lambda function can then perform actions like:

Attempting an automated restart of a service (if not already handled by a simpler script).
Creating a ticket in an incident management system (e.g., Jira, PagerDuty).
Scaling up or down specific resources (though auto-scaling groups often handle this more directly).
Sending detailed alerts to Slack or Microsoft Teams.
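To make the pattern concrete, here is a minimal sketch of a Lambda handler subscribed to such an SNS topic. It only parses the alarm notification and produces a one-line summary; forwarding that summary to Slack, Teams, or a ticketing system is left out, and the function name `summarize_alarm` is illustrative.

```python
import json


def summarize_alarm(sns_message: str) -> str:
    """Turn the JSON body of a CloudWatch alarm notification into one line."""
    alarm = json.loads(sns_message)
    return (
        f"[{alarm.get('NewStateValue', 'UNKNOWN')}] "
        f"{alarm.get('AlarmName', 'unnamed alarm')}: "
        f"{alarm.get('NewStateReason', 'no reason given')}"
    )


def lambda_handler(event, context):
    """Entry point: SNS delivers one or more records per invocation."""
    summaries = [
        summarize_alarm(record["Sns"]["Message"])
        for record in event.get("Records", [])
    ]
    for summary in summaries:
        print(summary)  # Lambda forwards stdout to CloudWatch Logs
    return {"alarms": summaries}
```

Keeping the parsing in its own function means the remediation or notification step can be swapped in later without changing how the event is read.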

This layered approach, from basic process checks and custom metrics to deep database insights and centralized logging, provides a comprehensive monitoring strategy for keeping your Python applications and MySQL clusters healthy and available on AWS.