Resolving thread pools deadlock during concurrent ActiveRecord transaction processing Under Peak Event Traffic on OVH

Diagnosing ActiveRecord Thread Pool Deadlocks Under High Load

When your Ruby on Rails application, particularly one leveraging ActiveRecord, experiences a deadlock during peak event traffic, especially on infrastructure like OVH, the root cause often lies in the interaction between database connection pooling, thread management, and transaction isolation levels. This isn’t a theoretical problem; it’s a production-halting scenario demanding immediate, precise diagnosis and resolution. We’ll dissect a common pattern and provide actionable steps.

The scenario typically unfolds when multiple threads attempt to acquire database connections concurrently, and these connections are held within long-running ActiveRecord transactions. If these transactions involve operations that can lead to circular dependencies on locks (e.g., updating related records in a specific order across different threads), a deadlock can occur. The database itself will eventually detect and break the deadlock, but not before your application threads are blocked, potentially leading to cascading failures and timeouts.

Identifying the Deadlock Signature

The first step is to confirm that a deadlock is indeed the culprit. Application logs are your primary source. Look for:

Database Error Messages: Specific error codes and messages from your database (e.g., PostgreSQL’s deadlock_detected, MySQL’s ER_LOCK_DEADLOCK).
Application-Level Timeouts: Threads waiting indefinitely for database operations, leading to request timeouts (e.g., Rack timeouts, Puma worker timeouts).
Thread Dumps: If your application server (like Puma) supports it, a thread dump will show threads stuck in database-related calls, often waiting on locks.

On OVH, especially with managed database services, you might also find deadlock information in the database’s own error logs. For PostgreSQL, this often includes a detailed report of the transactions and locks involved.

Analyzing Database Lock Contention

Once a deadlock is suspected, you need to examine the database’s locking behavior. This requires direct database introspection.

PostgreSQL: pg_locks and pg_stat_activity

PostgreSQL provides powerful views for this. When a deadlock is reported, the database often logs the `transactionid` and `pid` (process ID) of the involved transactions. You can then query pg_stat_activity and pg_locks.

-- Find active queries and their transaction IDs
SELECT
    pid,
    datname,
    usename,
    client_addr,
    state,
    query,
    xact_start,
    query_start,
    wait_event_type,
    wait_event
FROM
    pg_stat_activity
WHERE
    state = 'active' AND query NOT LIKE '%pg_stat_activity%' AND query NOT LIKE '%pg_locks%';

-- Identify locks held and waited upon by specific PIDs (replace PID_1, PID_2 with actual PIDs from deadlock report)
SELECT
    bl.pid AS blocking_pid,
    bl.locktype,
    bl.mode AS blocking_mode,
    bl.granted AS blocking_granted,
    kl.pid AS waiting_pid,
    kl.locktype AS waiting_locktype,
    kl.mode AS waiting_mode,
    kl.granted AS waiting_granted,
    kl.relation::regclass AS relation_name,
    kl.virtualxid,
    kl.transactionid
FROM
    pg_locks bl
JOIN
    pg_locks kl ON bl.locktype = kl.locktype AND bl.database IS NOT DISTINCT FROM kl.database AND bl.relation IS NOT DISTINCT FROM kl.relation AND bl.page IS NOT DISTINCT FROM kl.page AND bl.tuple IS NOT DISTINCT FROM kl.tuple AND bl.virtualxid IS NOT DISTINCT FROM kl.virtualxid AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid AND bl.classid IS NOT DISTINCT FROM kl.classid AND bl.objid IS NOT DISTINCT FROM kl.objid AND bl.objsubid IS NOT DISTINCT FROM kl.objsubid AND bl.pid != kl.pid AND bl.granted AND NOT kl.granted
WHERE
    kl.pid IN (PID_1, PID_2); -- Replace with actual PIDs

The key is to find a situation where PID_1 is waiting for a lock held by PID_2, and simultaneously, PID_2 is waiting for a lock held by PID_1. The relation_name will tell you which tables are involved.

MySQL: SHOW ENGINE INNODB STATUS

MySQL’s InnoDB engine provides a wealth of information in its status output. After a deadlock, the SHOW ENGINE INNODB STATUS; command is invaluable.

SHOW ENGINE INNODB STATUS;

Look for the LATEST DETECTED DEADLOCK section. It will detail the transactions, the SQL statements being executed, and the locks that caused the deadlock. This output is usually quite verbose and self-explanatory in pinpointing the conflicting operations.

Application-Level Strategies for Mitigation

Database-level analysis is crucial for understanding *why* the deadlock occurred, but the fix often involves application-level changes. The primary goal is to reduce the window where transactions hold locks and to ensure consistent lock acquisition order.

1. Optimizing Transaction Scope

Long-running transactions are prime candidates for deadlocks. Minimize the work done within a transaction block. Fetching data, performing complex calculations, or making external API calls should ideally happen *before* starting the transaction or *after* it’s committed.

# Bad: Long operation inside transaction
ActiveRecord::Base.transaction do
  user = User.find(params[:id])
  # Simulate a long external API call
  response = ExternalService.fetch_data(user.email)
  user.update!(external_data: response.data)
  # ... more database operations
end

# Good: Move external call outside transaction
user = User.find(params[:id])
response = ExternalService.fetch_data(user.email) # External call first

ActiveRecord::Base.transaction do
  user.update!(external_data: response.data)
  # ... other database operations
end

2. Consistent Lock Acquisition Order

If multiple records are updated within a transaction, ensure they are always updated in the same order across all threads. This is particularly important when dealing with related entities.

Consider a scenario where Thread A updates `Order` then `LineItem`, and Thread B updates `LineItem` then `Order`. If they interleave, a deadlock is possible. A common pattern is to always acquire locks on records in a deterministic order, often by primary key or a natural ordering.

# Example: Updating multiple related records
# Assume we have Orders and LineItems, and we need to update both.
# To prevent deadlocks, always process in a consistent order.

def update_order_and_items(order_id, new_item_data)
  order = Order.find(order_id)
  line_items = order.line_items.order(:id) # Ensure consistent order

  ActiveRecord::Base.transaction do
    order.update!(status: 'processing') # Update order first

    line_items.each do |item|
      item.update!(new_item_data[item.id]) # Update line items in order
    end
  end
end

For more complex scenarios, you might need to explicitly use database-level locking mechanisms (e.g., SELECT ... FOR UPDATE) and ensure these are also acquired in a consistent order.

# Using SELECT FOR UPDATE with consistent ordering
def process_inventory_and_order(product_id, quantity_to_deduct)
  ActiveRecord::Base.transaction do
    # Acquire lock on Product in a consistent order (e.g., by product_id)
    product = Product.lock('FOR UPDATE').find(product_id)

    if product.stock_quantity >= quantity_to_deduct
      product.update!(stock_quantity: product.stock_quantity - quantity_to_deduct)

      # Now, create or update an Order record, also ensuring consistent ordering if multiple orders are created
      # For simplicity, assuming one order creation here. If multiple, order them by ID.
      Order.create!(product_id: product.id, quantity: quantity_to_deduct, status: 'pending')
    else
      raise ActiveRecord::Rollback, "Insufficient stock for product #{product_id}"
    end
  end
end

3. Connection Pool Management

The default ActiveRecord connection pool size might be insufficient or, more critically, too large for the number of threads actively performing database operations. On OVH, especially with shared database instances or limited network bandwidth, an oversized pool can exacerbate contention.

Puma’s configuration is key here. The number of worker threads per worker process, combined with the ActiveRecord pool size, determines the maximum number of concurrent database connections. A common recommendation is to set the ActiveRecord pool size to be slightly larger than the number of threads that will *actually* be hitting the database concurrently. A good starting point is often pool: ENV.fetch('RAILS_MAX_THREADS') { 5 } if your Puma threads are set to 5.

# config/database.yml
production:
  adapter: postgresql
  encoding: unicode
  database: your_db_name
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %> # Match or slightly exceed Puma threads
  username: your_db_user
  password: your_db_password
  host: your_db_host
  port: 5432

# config/puma.rb
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }.to_i
threads threads_count, threads_count
worker_timeout 60
workers ENV.fetch("WEB_CONCURRENCY") { 2 }.to_i

If you are using a connection pool that is too large, threads might be waiting for connections that are already held by other threads within long transactions, creating a deadlock-like situation even if the database itself isn’t reporting a deadlock. Conversely, a pool that’s too small will lead to threads waiting for connections, impacting performance.

4. Transaction Retry Logic

Since databases will automatically resolve deadlocks by rolling back one of the transactions, your application can be designed to gracefully handle this. Implement a retry mechanism for transactions that fail due to deadlocks.

MAX_RETRIES = 3
RETRY_DELAY = 0.5 # seconds

def with_transaction_retry(&block)
  retries = 0
  loop do
    begin
      ActiveRecord::Base.transaction(&block)
      return # Success
    rescue ActiveRecord::Deadlocked => e
      retries += 1
      if retries >= MAX_RETRIES
        Rails.logger.error "Deadlock detected and max retries exceeded: #{e.message}"
        raise e # Re-raise after max retries
      end

      Rails.logger.warn "Deadlock detected, retrying (#{retries}/#{MAX_RETRIES})... Error: #{e.message}"
      sleep(RETRY_DELAY * retries) # Exponential backoff can be useful here too
    rescue ActiveRecord::StatementInvalid => e
      # Catch other potential transaction-related errors if necessary
      if e.message.include?("deadlock") # Heuristic for other DBs that don't raise ActiveRecord::Deadlocked
        retries += 1
        if retries >= MAX_RETRIES
          Rails.logger.error "Deadlock detected (heuristic) and max retries exceeded: #{e.message}"
          raise e
        end
        Rails.logger.warn "Deadlock detected (heuristic), retrying (#{retries}/#{MAX_RETRIES})... Error: #{e.message}"
        sleep(RETRY_DELAY * retries)
      else
        raise e # Re-raise other StatementInvalid errors
      end
    end
  end
end

# Usage:
with_transaction_retry do
  # Your transactional code here
  user = User.lock('FOR UPDATE').find(1)
  user.update!(status: 'active')
  # ... more operations
end

Note that ActiveRecord::Deadlocked is specific to PostgreSQL. For other databases like MySQL, you might need to inspect the error message string for keywords like “deadlock” if the database driver doesn’t map it to a specific ActiveRecord exception.

Monitoring and Prevention

Beyond immediate fixes, robust monitoring is essential. Implement:

Database Performance Monitoring: Tools that track lock wait times, transaction durations, and query performance.
Application Performance Monitoring (APM): Tools like New Relic, Datadog, or Scout APM can highlight slow transactions and thread contention.
Log Aggregation: Centralized logging to quickly identify and correlate database errors with application behavior.

Regularly review your database query patterns and transaction logic, especially before and after significant traffic increases or feature deployments. Proactive analysis of potential lock contention points can prevent these critical production issues.