Troubleshooting high CPU load spikes caused by MariaDB index fragmentation on RHEL 9: Table rebuild runbooks

Identifying MariaDB CPU Spikes and Index Fragmentation

High CPU load spikes on RHEL 9 systems, particularly those hosting MariaDB, can often be traced to inefficient query execution exacerbated by index fragmentation. This fragmentation leads to increased I/O operations and CPU cycles spent scanning larger, less efficient index blocks. The first step is to confirm this hypothesis by correlating CPU usage with MariaDB activity and then examining index statistics.

We’ll use `top` or `htop` to identify the runaway process and `SHOW PROCESSLIST` in MariaDB to pinpoint the queries consuming resources. Subsequently, we’ll leverage `SHOW TABLE STATUS` and `SHOW INDEX FROM` to assess fragmentation levels.

Monitoring CPU and MariaDB Activity

During a spike, observe the output of `top`. Look for the `mysqld` process consuming a disproportionate amount of CPU. Note the PID for further investigation.

top -H -p $(pgrep mysqld)

Connect to MariaDB and examine the active queries. Filter for queries that have been running for an extended period or are in a ‘Sending data’ or ‘Sorting result’ state, as these are often indicative of performance bottlenecks.

SHOW FULL PROCESSLIST;

If specific queries are identified as problematic, analyze their execution plans using `EXPLAIN`. This will reveal if full table scans or inefficient index usage is occurring.

EXPLAIN SELECT ... FROM your_table WHERE ...;

Assessing Index Fragmentation

MariaDB (especially with InnoDB) doesn’t have a direct “fragmentation percentage” metric like some other database systems. However, we can infer fragmentation by observing the ratio of data pages to index pages and the average page size. A high ratio of index pages to data pages, or a significant number of pages with a large percentage of unused space, suggests fragmentation.

The `SHOW TABLE STATUS` command provides information about the number of rows, data length, and index length. While not a direct fragmentation metric, a rapidly growing index length relative to data length can be a symptom.

SHOW TABLE STATUS LIKE 'your_table_name';

A more granular approach involves examining the internal structure of indexes. For InnoDB, this often requires tools or a deeper dive into the storage engine’s behavior. However, a common indicator of potential fragmentation is when the `Avg_row_length` is significantly smaller than expected for the data types in your table, or when `Index_length` is disproportionately large.

For MyISAM tables (though less common now), `SHOW INDEX FROM your_table_name` can provide `Cardinality` which, if low, might indicate an index that isn’t effectively filtering data. However, for InnoDB, the primary concern is physical page fragmentation.

Table Rebuild Runbooks

Rebuilding a table effectively defragments its indexes and data, leading to improved query performance. This process can be resource-intensive and requires careful planning to minimize downtime.

Runbook 1: Online Rebuild (Minimal Downtime)

This method uses `pt-online-schema-change` from the Percona Toolkit. It’s the preferred method for production environments as it minimizes downtime by creating a new table, copying data, and then swapping the tables with a trigger-based synchronization.

Prerequisites:

Percona Toolkit installed on the RHEL 9 server.
Root or sufficient privileges to install packages and run `pt-online-schema-change`.
`pt-online-schema-change` requires a `pt-slave-uuid` or `server-id` to be set in your MariaDB configuration for replication awareness.

Steps:

Step 1: Install Percona Toolkit
If not already installed, use `dnf` or `yum` to install it.

sudo dnf install percona-toolkit -y

Step 2: Identify the table to rebuild
Based on previous analysis, select the table with significant fragmentation or impacting queries.

Step 3: Execute `pt-online-schema-change`
The basic command structure is as follows. Adjust connection parameters and options as needed. The `ALTER TABLE` statement specifies the desired state (e.g., `ENGINE=InnoDB` for a conversion, or just `OPTIMIZE TABLE` if you want to rebuild without changing engine). For defragmentation, a simple `ALTER TABLE … ENGINE=InnoDB` is often sufficient if the table is already InnoDB.

pt-online-schema-change \
  --alter "ENGINE=InnoDB" \
  --execute \
  D=your_database_name,t=your_table_name \
  h=localhost,u=your_user,p='your_password' \
  --recursion-method=none \
  --set-vars=innodb_flush_log_at_trx_commit=2,sync_binlog=0

Explanation of Key Options:

--alter "ENGINE=InnoDB": This is the core of the operation. It tells `pt-online-schema-change` to perform an `ALTER TABLE` operation. If your table is already InnoDB, you might use a different `ALTER` statement, or if you want to ensure it’s optimized, you could use `ALTER TABLE your_table_name ENGINE=InnoDB;` as the basis. For pure defragmentation of an InnoDB table, `pt-online-schema-change –alter “MODIFY COLUMN col1 col1 INT” –execute D=db,t=tbl` (where `col1` is an existing column) can trigger a rebuild. A more direct approach for optimization without schema change is often to use `OPTIMIZE TABLE` (see Runbook 2).
--execute: Immediately apply the changes.
D=your_database_name,t=your_table_name: Specifies the database and table to modify.
h=localhost,u=your_user,p='your_password': Connection details for MariaDB.
--recursion-method=none: Prevents the tool from attempting to manage replication cascades, which can be complex. For single-node or simple setups, `none` is appropriate.
--set-vars=innodb_flush_log_at_trx_commit=2,sync_binlog=0: Temporarily reduces the strictness of ACID compliance during the copy phase to speed up the process. Crucially, these settings should be reverted to their original values (typically `innodb_flush_log_at_trx_commit=1`, `sync_binlog=1`) after the operation or when the system is stable.

Monitoring during execution:

Watch the `pt-online-schema-change` output for progress and any errors.
Monitor `SHOW PROCESSLIST` for the `pt-online-schema-change` threads and the trigger threads.
Keep an eye on CPU and I/O usage on the server.

Runbook 2: Offline Rebuild (Downtime Required)

This method involves taking the table offline, performing an `OPTIMIZE TABLE` operation, and then bringing it back online. This is simpler but requires a maintenance window.

Steps:

Step 1: Schedule Downtime
Communicate a maintenance window to stakeholders.

Step 2: Lock the Table
Prevent writes to the table to ensure data consistency during the operation.

FLUSH TABLES WITH READ LOCK;
LOCK TABLES your_table_name WRITE;

Note: `FLUSH TABLES WITH READ LOCK` is a global lock. `LOCK TABLES … WRITE` is table-specific. For a single table rebuild, `LOCK TABLES` is usually sufficient. If you need to ensure no other operations interfere, `FLUSH TABLES WITH READ LOCK` is more aggressive.

Step 3: Perform the Rebuild
Use `OPTIMIZE TABLE` to defragment the table and its indexes.

OPTIMIZE TABLE your_table_name;

Note: For InnoDB, `OPTIMIZE TABLE` essentially performs a `ALTER TABLE … ENGINE=InnoDB`, which rebuilds the table. This can take a significant amount of time and disk space (temporarily). Monitor disk usage.

Step 4: Unlock the Table
Release the locks.

UNLOCK TABLES;

Step 5: Verify and Monitor
Check `SHOW TABLE STATUS` again to see if `Data_free` has decreased and `Index_length` is more reasonable. Monitor CPU and I/O after the table is back online.

Post-Rebuild Verification and Monitoring

After performing a table rebuild, it’s crucial to verify its effectiveness and establish ongoing monitoring to prevent recurrence.

Verifying Rebuild Success

Re-run the diagnostic steps used initially. Specifically, check:

CPU Load: Observe if the previous CPU spikes have subsided.
MariaDB Process List: Look for any long-running or resource-intensive queries that were previously identified.
`SHOW TABLE STATUS`: Compare the `Data_free` and `Index_length` values. A significant reduction in `Data_free` indicates successful defragmentation.
Query Performance: Re-test the problematic queries using `EXPLAIN` to confirm that they are now using indexes efficiently and not performing full table scans.

Establishing Proactive Monitoring

To catch index fragmentation issues before they cause critical CPU spikes, implement proactive monitoring:

Automated `SHOW TABLE STATUS` Checks: Periodically query `SHOW TABLE STATUS` for key tables and alert if `Data_free` exceeds a defined threshold (e.g., 20% of `Data_length`).
Query Performance Monitoring: Utilize tools like Percona Monitoring and Management (PMM), Prometheus with `mysqld_exporter`, or built-in MariaDB slow query logs to identify queries that are degrading in performance.
System Metrics: Continuously monitor server CPU, I/O, and memory usage. Set up alerts for sustained high CPU load or unusual I/O patterns.
Regular Maintenance: Schedule periodic table optimization (e.g., quarterly or semi-annually, depending on write load) for tables known to be prone to fragmentation. This can be done during off-peak hours using the offline method or `pt-online-schema-change` with a less aggressive schedule.

By combining these diagnostic, remediation, and monitoring strategies, you can effectively address MariaDB index fragmentation-induced CPU spikes on RHEL 9 and maintain a stable, high-performing database environment.

Troubleshooting high CPU load spikes caused by MariaDB index fragmentation on RHEL 9: Table rebuild runbooks

Identifying MariaDB CPU Spikes and Index Fragmentation

Monitoring CPU and MariaDB Activity

Assessing Index Fragmentation

Table Rebuild Runbooks

Runbook 1: Online Rebuild (Minimal Downtime)

Runbook 2: Offline Rebuild (Downtime Required)

Post-Rebuild Verification and Monitoring

Verifying Rebuild Success

Establishing Proactive Monitoring

Reader Interactions

Leave a Reply Cancel reply

Recent Posts

Top Categories

Our Products

Our Services