Performance Comparison: Running C++ vs Rust Under Heavy Concurrency Benchmarks

Benchmarking Setup: The Concurrency Challenge

To compare C++ and Rust under heavy concurrency, we need a controlled environment that exercises CPU-bound work, thread management, and synchronization. Our benchmark simulates a common scenario: processing a large number of independent requests concurrently. Each request runs a simple “workload” function that performs a fixed amount of computation and then simulates a short, fixed I/O wait using `std::this_thread::sleep_for` in C++ and `std::thread::sleep` in Rust. Keep in mind that the 1 ms sleep is much longer than the computation, so the results mostly reflect sleep and wake-up behavior, thread management, and synchronization overhead rather than raw CPU speed.

The core of the benchmark involves spawning a configurable number of threads, assigning each a portion of the total workload, and measuring the total execution time. We’ll focus on two key metrics: throughput (total operations per second) and latency (average time per operation). For synchronization, each worker increments a shared atomic counter when it completes an operation and appends the operation’s latency to a mutex-protected vector, simulating result aggregation.
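The two metrics can be sketched in a few lines of Rust. This is a minimal illustration, not part of the benchmark itself: the `Summary` struct and the `summarize` function are hypothetical names, and the sketch assumes latencies are collected as `Duration` values.

```rust
use std::time::Duration;

/// Summary statistics for one benchmark run (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct Summary {
    /// Completed operations per second of wall-clock time.
    pub throughput: f64,
    /// Arithmetic mean of the per-operation latencies, in milliseconds.
    pub mean_latency_ms: f64,
}

/// Computes throughput and mean latency from raw samples.
/// Returns `None` when there are no samples or no elapsed time,
/// because both metrics would be undefined.
pub fn summarize(latencies: &[Duration], wall_clock: Duration) -> Option<Summary> {
    if latencies.is_empty() || wall_clock.is_zero() {
        return None;
    }
    let total: Duration = latencies.iter().sum();
    let count = latencies.len() as f64;
    Some(Summary {
        throughput: count / wall_clock.as_secs_f64(),
        mean_latency_ms: total.as_secs_f64() * 1000.0 / count,
    })
}
```

Note that throughput divides by wall-clock time, while mean latency divides the summed per-operation time by the operation count. With many threads running in parallel, the two are therefore not simple reciprocals of each other.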

C++ Implementation: Leveraging Standard Library Concurrency

Our C++ benchmark uses the C++11 `<thread>` and `<mutex>` facilities. We’ll define a `workload` function and then spawn `N` threads, each executing this function repeatedly. A global `std::atomic<size_t>` tracks completed tasks, and a `std::mutex` guards a `std::vector<long long>` where we store individual task completion times for latency analysis.

The compilation will be performed with optimizations enabled. For this example, we’ll use GCC with `-O3` and `-pthread` flags. The number of threads will be dynamically set to match the number of available CPU cores, a common practice for CPU-bound tasks.

C++ Benchmark Code

Here’s the C++ source code for the benchmark:

#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>
#include <atomic>
#include <numeric>
#include <algorithm>
#include <cmath> // std::sin, std::cos

// Configuration
const int NUM_OPERATIONS_PER_THREAD = 100000;
const std::chrono::milliseconds SLEEP_DURATION(1); // Simulate small I/O wait

std::atomic<size_t> completed_operations(0);
std::mutex results_mutex;
std::vector<long long> latencies;

void workload() {
    auto start_time = std::chrono::high_resolution_clock::now();

    // Simulate CPU-bound work
    volatile double result = 0.0;
    for (int i = 0; i < 1000; ++i) {
        result += std::sin(static_cast<double>(i)) * std::cos(static_cast<double>(i));
    }

    // Simulate small I/O wait
    std::this_thread::sleep_for(SLEEP_DURATION);

    auto end_time = std::chrono::high_resolution_clock::now();
    long long elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    {
        std::lock_guard<std::mutex> lock(results_mutex);
        latencies.push_back(elapsed_ms);
    }
    completed_operations.fetch_add(1);
}

int main() {
    unsigned int num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) {
        // hardware_concurrency() may return 0 when the value cannot be determined
        std::cerr << "Could not detect hardware concurrency. Using 4 threads." << std::endl;
        num_threads = 4;
    }
    std::cout << "Using " << num_threads << " threads." << std::endl;

    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    auto start_benchmark = std::chrono::high_resolution_clock::now();

    for (unsigned int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&]() {
            for (int j = 0; j < NUM_OPERATIONS_PER_THREAD; ++j) {
                workload();
            }
        });
    }

    // Wait for all threads to complete
    for (auto& t : threads) {
        t.join();
    }

    auto end_benchmark = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> benchmark_duration = end_benchmark - start_benchmark;

    // Calculate metrics
    size_t total_operations = static_cast<size_t>(num_threads) * NUM_OPERATIONS_PER_THREAD;
    double throughput = static_cast<double>(total_operations) / benchmark_duration.count();

    std::sort(latencies.begin(), latencies.end());
    double avg_latency = 0;
    if (!latencies.empty()) {
        long long sum_latencies = std::accumulate(latencies.begin(), latencies.end(), 0LL);
        avg_latency = static_cast<double>(sum_latencies) / latencies.size();
    }

    std::cout << "--- C++ Benchmark Results ---" << std::endl;
    std::cout << "Total operations: " << total_operations << std::endl;
    std::cout << "Benchmark duration: " << benchmark_duration.count() << " seconds" << std::endl;
    std::cout << "Throughput: " << throughput << " operations/sec" << std::endl;
    std::cout << "Average latency: " << avg_latency << " ms" << std::endl;
    if (!latencies.empty()) {
        std::cout << "99th percentile latency: " << latencies[static_cast<size_t>(latencies.size() * 0.99)] << " ms" << std::endl;
    }

    return 0;
}

C++ Compilation and Execution

To compile this C++ code, use a modern C++ compiler like g++ or clang++:

g++ -std=c++17 -O3 -pthread -o cpp_benchmark cpp_benchmark.cpp
./cpp_benchmark

Rust Implementation: Leveraging `std::thread` and `std::sync` Equivalents

Rust’s concurrency model is built around ownership and borrowing, which lets the compiler reject many unsafe sharing patterns before the program runs. For this benchmark, we’ll use `std::thread` and `std::sync::Mutex` from Rust’s standard library. The structure will mirror the C++ version closely to ensure a fair comparison.

We’ll use `std::sync::atomic::AtomicUsize` for the completed operations counter and `std::sync::Mutex` to protect a `Vec<u128>` storing latencies. The `workload` function will perform the same computation and sleep. Compilation will be done with optimizations (the `--release` flag).
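To make the ownership point concrete, here is a minimal sketch of the same counter-plus-mutex pattern using scoped threads (`std::thread::scope`, available since Rust 1.63) instead of global state. The function name `run_scoped` and its parameters are hypothetical, and this is an alternative structure, not the one used in the benchmark listing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Instant;

/// Runs `work` `ops_per_thread` times on each of `num_threads` scoped threads.
/// Returns (completed operations, per-operation latencies in microseconds).
pub fn run_scoped<F>(num_threads: usize, ops_per_thread: usize, work: F) -> (usize, Vec<u128>)
where
    F: Fn() + Sync,
{
    let completed = AtomicUsize::new(0);
    let latencies = Mutex::new(Vec::with_capacity(num_threads * ops_per_thread));

    // Scoped threads may borrow locals: the scope joins every thread
    // before it returns, so the borrows cannot outlive the data.
    thread::scope(|s| {
        for _ in 0..num_threads {
            s.spawn(|| {
                for _ in 0..ops_per_thread {
                    let start = Instant::now();
                    work();
                    let elapsed = start.elapsed().as_micros();
                    latencies.lock().unwrap().push(elapsed);
                    // Relaxed suffices here: the value is only read after
                    // the scope has joined all threads.
                    completed.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    (completed.into_inner(), latencies.into_inner().unwrap())
}
```

If a thread tried to push to the vector without taking the lock, or to hold a borrow past the end of the scope, the program would fail to compile. That is the kind of error the ownership model catches early.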

Rust Benchmark Code

Here’s the Rust source code:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

// Configuration
const NUM_OPERATIONS_PER_THREAD: usize = 100000;
const SLEEP_DURATION_MS: u64 = 1; // Simulate small I/O wait

// Global state shared by all threads, initialized lazily on first use
lazy_static::lazy_static! {
    static ref COMPLETED_OPERATIONS: AtomicUsize = AtomicUsize::new(0);
    static ref LATENCIES: Mutex<Vec<u128>> = Mutex::new(Vec::new());
}

fn workload() {
    let start_time = Instant::now();

    // Simulate CPU-bound work (same arithmetic as the C++ version)
    let mut result = 0.0_f64;
    for i in 0..1000 {
        result += (i as f64).sin() * (i as f64).cos();
    }
    // Keep the optimizer from discarding the loop, like `volatile` in the C++ version
    std::hint::black_box(result);

    // Simulate small I/O wait
    thread::sleep(Duration::from_millis(SLEEP_DURATION_MS));

    let elapsed_ms = start_time.elapsed().as_millis();

    {
        // The guard unlocks the mutex when it goes out of scope
        let mut latencies_guard = LATENCIES.lock().unwrap();
        latencies_guard.push(elapsed_ms);
    }
    COMPLETED_OPERATIONS.fetch_add(1, Ordering::SeqCst);
}

fn main() {
    let num_threads = num_cpus::get();
    println!("Using {} threads.", num_threads);

    let mut handles = vec![];

    let start_benchmark = Instant::now();

    for _ in 0..num_threads {
        let handle = thread::spawn(move || {
            for _ in 0..NUM_OPERATIONS_PER_THREAD {
                workload();
            }
        });
        handles.push(handle);
    }

    // Wait for all threads to complete
    for handle in handles {
        handle.join().unwrap();
    }

    let benchmark_duration = start_benchmark.elapsed();

    // Calculate metrics
    let total_operations = num_threads * NUM_OPERATIONS_PER_THREAD;
    let throughput = total_operations as f64 / benchmark_duration.as_secs_f64();

    let mut latencies_guard = LATENCIES.lock().unwrap();
    latencies_guard.sort();
    let avg_latency = latencies_guard.iter().sum::<u128>() as f64 / latencies_guard.len() as f64;

    println!("--- Rust Benchmark Results ---");
    println!("Total operations: {}", total_operations);
    println!("Benchmark duration: {:.2} seconds", benchmark_duration.as_secs_f64());
    println!("Throughput: {:.2} operations/sec", throughput);
    println!("Average latency: {} ms", avg_latency);
    if !latencies_guard.is_empty() {
        println!("99th percentile latency: {} ms", latencies_guard[(latencies_guard.len() as f64 * 0.99) as usize]);
    }
}

Rust Dependencies and Compilation

To compile the Rust code, you’ll need the `num_cpus` and `lazy_static` crates. Add them to your `Cargo.toml`:

[dependencies]
num_cpus = "1.13"
lazy_static = "1.4.0"

Then, compile and run the benchmark:

cargo build --release
./target/release/rust_benchmark
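If you would rather avoid the two external crates, the standard library can likely cover both needs on recent toolchains. The sketch below assumes Rust 1.63 or later, where `Mutex::new` is a `const fn`. The helper names `worker_count`, `record`, and `completed` are hypothetical. Be aware that `std::thread::available_parallelism` can report a different number than `num_cpus::get()`, for example when the process is restricted by CPU affinity or container limits.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

// `AtomicUsize::new`, `Mutex::new`, and `Vec::new` are all `const fn`,
// so plain statics work without lazy initialization.
static COMPLETED_OPERATIONS: AtomicUsize = AtomicUsize::new(0);
static LATENCIES: Mutex<Vec<u128>> = Mutex::new(Vec::new());

/// Returns the parallelism reported by the standard library,
/// or `fallback` if it cannot be determined.
pub fn worker_count(fallback: usize) -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(fallback)
}

/// Records one finished operation and its latency in milliseconds.
pub fn record(latency_ms: u128) {
    LATENCIES.lock().unwrap().push(latency_ms);
    COMPLETED_OPERATIONS.fetch_add(1, Ordering::Relaxed);
}

/// Number of operations recorded so far.
pub fn completed() -> usize {
    COMPLETED_OPERATIONS.load(Ordering::Relaxed)
}
```

With these in place, `main` could call `worker_count(4)` where the listing calls `num_cpus::get()`, which also gives the Rust version the same fallback of four threads that the C++ version uses.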

Performance Analysis and Observations

When running these benchmarks on identical hardware (e.g., a modern multi-core CPU), we typically observe the following trends:

Throughput: Rust often shows slightly higher throughput than C++, but the gap is small and should be read with caution. Because each operation sleeps for at least 1 ms, per-thread throughput is capped near 1,000 operations per second in either language, so any difference here mostly reflects sleep wake-up timing and standard-library threading overhead rather than code generation. The arithmetic loop itself typically compiles to comparable machine code under `-O3` and `--release`.
Latency: Latency figures are usually very close. The dominant factor is the simulated 1 ms sleep, with the mutex lock adding comparatively little. Both `std::mutex` and Rust’s `Mutex` are thin layers over OS facilities (on Linux, typically pthread mutexes for C++ and futex-based locks in recent Rust releases), so behavior under heavier contention depends on the platform and standard-library implementation and should be measured rather than assumed.
Memory Usage: Neither language relies on a garbage collector or a large runtime, so memory usage in this benchmark tends to be similar and is dominated by thread stacks and the latency vector. Remaining differences usually come from default thread stack sizes and allocator behavior, which vary by platform and standard-library implementation.
Developer Experience & Safety: While not directly measured in performance, Rust’s compile-time guarantees rule out data races in safe code, which reduces debugging time and improves overall robustness. Deadlocks remain possible in Rust and still have to be avoided by design, for example through consistent lock ordering. C++ requires careful manual management of synchronization primitives, increasing the risk of subtle concurrency bugs.

It’s crucial to note that these results can vary based on the specific compiler versions, optimization flags, operating system, and the exact nature of the “workload.” For instance, if the workload involved significant dynamic memory allocation or complex object lifetimes, the differences might become more pronounced.
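Because single runs can be noisy, one common way to guard against this variance is to repeat the measurement and report a robust statistic such as the median. The following is a minimal sketch under that assumption. The helper names `time_trials` and `median` are hypothetical and not part of the benchmark listings.

```rust
use std::time::{Duration, Instant};

/// Runs `bench` `trials` times and returns the wall-clock durations,
/// sorted in ascending order.
pub fn time_trials<F: FnMut()>(trials: usize, mut bench: F) -> Vec<Duration> {
    let mut samples: Vec<Duration> = (0..trials)
        .map(|_| {
            let start = Instant::now();
            bench();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples
}

/// Median of an ascending slice (the upper median for even lengths).
/// Returns `None` for an empty slice.
pub fn median(sorted: &[Duration]) -> Option<Duration> {
    sorted.get(sorted.len() / 2).copied()
}
```

Reporting the minimum and maximum alongside the median gives a quick sense of how much the runs spread, which helps when judging whether a gap between the two languages is meaningful.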

Advanced Considerations and Further Benchmarking

For a more comprehensive comparison, consider these advanced scenarios:

Asynchronous Programming: Compare Rust’s `tokio` or `async-std` with C++ options such as `asio`, custom coroutine implementations, or the C library `libuv`. This is critical for I/O-bound workloads.
Lock-Free Data Structures: Benchmark custom or library-provided lock-free queues and other concurrent data structures in both languages. This can significantly reduce contention and improve performance in high-throughput scenarios.
NUMA Architectures: Test performance on Non-Uniform Memory Access (NUMA) systems, paying attention to thread affinity and memory allocation strategies.
Different Synchronization Primitives: Evaluate `std::shared_mutex` (C++) vs. `RwLock` (Rust) for read-heavy workloads.
Profiling: Use tools like `perf` (Linux), Instruments (macOS), or VTune (Intel) to identify bottlenecks at a granular level (CPU cache misses, branch mispredictions, lock contention).
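As a starting point for the read-heavy scenario mentioned above, the sketch below shows the shape such a test might take on the Rust side with `std::sync::RwLock`. The function name `read_heavy`, its parameters, and the tiny shared table are all hypothetical, and a real comparison would add timing and a matching `std::shared_mutex` version in C++.

```rust
use std::sync::RwLock;
use std::thread;

/// Spawns `readers` threads that each sum the shared table `reads` times
/// under a read lock, plus one writer that appends `writes` values.
/// Returns (final table length, total completed reads).
pub fn read_heavy(readers: usize, reads: usize, writes: usize) -> (usize, usize) {
    let table = RwLock::new(vec![1u64, 2, 3]);

    let completed_reads = thread::scope(|s| {
        // Writer: takes the exclusive lock briefly for each append.
        s.spawn(|| {
            for i in 0..writes {
                table.write().unwrap().push(i as u64);
            }
        });

        // Readers: several shared locks can be held at the same time.
        let handles: Vec<_> = (0..readers)
            .map(|_| {
                s.spawn(|| {
                    let mut done = 0usize;
                    for _ in 0..reads {
                        let guard = table.read().unwrap();
                        std::hint::black_box(guard.iter().sum::<u64>());
                        done += 1;
                    }
                    done
                })
            })
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    });

    (table.into_inner().unwrap().len(), completed_reads)
}
```

Swapping `RwLock` for `Mutex` in the same harness and comparing timings at different reader-to-writer ratios is one way to see where a reader-writer lock starts to pay off.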

The choice between C++ and Rust for high-concurrency applications often boils down to a trade-off between existing C++ expertise and ecosystem versus Rust’s safety guarantees and modern language features. For new projects, Rust presents a compelling case for building robust, high-performance concurrent systems with fewer runtime surprises.