Question

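While learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers, both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance that I'm seeing: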

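// Excerpt: `Options` (with the fields `count` and `threads`) and `is_prime`
// are defined elsewhere in the program and are not shown here.
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Instant;

fn non_concurrent(options: &Options) {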
    let mut count = 0;
    let mut current = 0;

    let ts = Instant::now();
    while count < options.count {
        if is_prime(current) {
            count += 1;
        }
        current += 1;
    }
    let d = ts.elapsed();
    println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}

fn concurrent_mutex(options: &Options) {
    let count = Arc::new(Mutex::new(0));
    let highest = Arc::new(Mutex::new(0));

    let mut cc = 0;
    let mut current = 0;

    let ts = Instant::now();

    while cc < options.count {
        let mut handles = vec![];
        for x in current..(current + options.threads) {
            let count = Arc::clone(&count);
            let highest = Arc::clone(&highest);
            let handle = thread::spawn(move || {
                if is_prime(x) {
                    let mut c = count.lock().unwrap();
                    let mut h = highest.lock().unwrap();
                    *c += 1;
                    if x > *h {
                        *h = x;
                    }
                }
            });
            handles.push(handle);
        }

        for handle in handles {
            handle.join().unwrap();
        }

        cc = *count.lock().unwrap();
        current += options.threads;
    }

    let d = ts.elapsed();
    println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}

fn concurrent_channel(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let mut highest = 0;

    let ts = Instant::now();

    while count < options.count {
        let (tx, rx) = mpsc::channel();

        for x in current..(current + options.threads) {
            let txc = mpsc::Sender::clone(&tx);

            thread::spawn(move || {
                if is_prime(x) {
                    txc.send(x).unwrap();
                }
            });
        }

        drop(tx);

        for message in rx {
            count += 1;

            if message > highest && count <= options.count {
                highest = message;
            }
        }

        current += options.threads;
    }

    let d = ts.elapsed();
    println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}

Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:

use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;

fn main() {
    let iterations = 100_000;
    non_threaded(iterations);
    threaded(iterations);
}

fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];

    for _ in 0..iterations {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            let mut num = counter.lock().unwrap();
            *num = test(*num);
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }
    let d = tx.elapsed();

    println!("Threaded in {}.", dur_to_string(d));
}

fn non_threaded(iterations: u32) {
    let tx = Instant::now();
    let mut _q = 0;
    for x in 0..iterations {
        _q = test(x + 1);
    }

    let d = tx.elapsed();
    println!("Non-threaded in {}.", dur_to_string(d));
}

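// Formats a duration as "<seconds>.<milliseconds>". The milliseconds are not
// zero-padded, so 9 ms prints as "0.9" rather than "0.009".
fn dur_to_string(d: Duration) -> String {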
    let mut s = d.as_secs().to_string();
    s.push_str(".");
    s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
    s
}

fn test(x: u32) -> u32 {
    x
}

Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?

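Here is an MCVE that shows the same problem. I didn't limit the number of threads it starts at once, as I did in the code above. The point is that threading seems to have a very significant overhead, unless I'm doing something horribly wrong.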

Non-threaded in 0.9.
Threaded in 5.785.

Here are the results on my machine:

{{1}}

Answer 1

"threading seems to have a very significant overhead"

It's not the general concept of "threads" that is expensive; it's creating and destroying lots of threads.

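By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen, because some threads will exit, memory will be reused, and so on. I only saw it use ~400MiB.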
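The worst-case figure is simply the number of threads multiplied by the stack size, and the per-thread stack can be lowered with `std::thread::Builder::stack_size`. The following is a minimal sketch, not part of the original code: the helper names `worst_case_stack_bytes` and `run_with_small_stack` are hypothetical, and the 64 KiB stack is an assumption that only holds for closures with very shallow call depth.

```rust
use std::thread;

/// Upper bound on stack memory if every thread were alive at the same time.
fn worst_case_stack_bytes(threads: u64, stack_size: u64) -> u64 {
    threads * stack_size
}

/// Runs a trivial closure on a thread with an explicitly requested stack size.
/// 64 KiB is an illustrative value; the platform may round it up.
fn run_with_small_stack(x: u32) -> u32 {
    let handle = thread::Builder::new()
        .stack_size(64 * 1024)
        .spawn(move || x + 1)
        .expect("failed to spawn thread");
    handle.join().expect("thread panicked")
}

fn main() {
    let mib = 1024 * 1024;
    // 100_000 threads (the MCVE's iteration count) at 2 MiB each.
    let bytes = worst_case_stack_bytes(100_000, 2 * mib);
    println!("worst case: {} GiB", bytes / (1024 * mib));
    println!("result: {}", run_with_small_stack(41));
}
```

With 100,000 threads at 2MiB each this comes to roughly 195GiB, which matches the ~200GiB estimate. A smaller stack reduces the memory reserved per thread, but it does not remove the cost of creating and destroying the threads themselves.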

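On top of that, inter-thread communication (Mutex, channels, Atomic*) involves overhead compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend not to require much communication. Different communication primitives also take different amounts of time. Atomic variables tend to be faster than the others in many cases, but they aren't as widely usable.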
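As an illustration of the atomic alternative, here is a small sketch that counts with an `AtomicUsize` instead of a `Mutex`. The function name `count_with_atomics` and the thread and iteration counts are made up for the example; it only applies when the shared state is a single integer, which is why atomics are less widely usable than a mutex.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Increments a shared counter `per_thread` times from each of `threads` threads.
fn count_with_atomics(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    // Collect the handles first so that all threads run before any is joined.
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // One atomic read-modify-write; no lock guard is held.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("thread panicked");
    }
    counter.load(Ordering::SeqCst)
}

fn main() {
    println!("{}", count_with_atomics(4, 1_000));
}
```

Note that this sketch spawns only a handful of threads and lets each do many increments, rather than spawning one thread per increment as the MCVE does.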

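Then there are compiler optimizations to account for. Non-threaded code is much easier to optimize than threaded code. For example, running your code in release mode shows: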

Non-threaded in 0.0.
Threaded in 142.775.

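That's right, the non-threaded code took no time. The compiler can see through the code, realize that nothing actually happens, and remove all of it. I don't know how you got 5 seconds for the threaded code as opposed to the 2+ minutes I saw.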
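One way to keep the optimizer from deleting the single-threaded loop is `std::hint::black_box`. This is a sketch under assumptions: `black_box` is only available in much newer compilers than the Rust 1.22.1 mentioned above (it was stabilized in 1.66), and it is a best-effort hint rather than a guarantee. The function below mirrors the MCVE's `non_threaded` but returns the last value so it can be checked.

```rust
use std::hint::black_box;
use std::time::Instant;

// Same identity function as in the MCVE.
fn test(x: u32) -> u32 {
    x
}

/// Variant of the MCVE's `non_threaded` that hides values from the optimizer.
fn non_threaded(iterations: u32) -> u32 {
    let mut q = 0;
    for x in 0..iterations {
        // black_box asks the compiler to treat the input as unknown and the
        // result as used, so the loop body is less likely to be removed.
        q = black_box(test(black_box(x + 1)));
    }
    q
}

fn main() {
    let start = Instant::now();
    let last = non_threaded(100_000);
    println!("last = {}, elapsed = {:?}", last, start.elapsed());
}
```

Even with this change the loop body is trivial, so the measurement mostly reflects loop overhead; the comparison with the threaded version is still dominated by thread creation.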

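Switching to a thread pool will avoid a lot of the unneeded thread creation. We can also use a thread pool that provides scoped threads, which allows us to avoid the Arc as well: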

extern crate scoped_threadpool;

use scoped_threadpool::Pool;

// Replaces `threaded` in the MCVE; the other functions and imports stay as they are.
fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Mutex::new(0);

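    // A fixed pool of 8 worker threads is reused for every iteration.
    let mut pool = Pool::new(8);

    // `scoped` waits for all queued jobs before returning, so the jobs can
    // borrow `counter` directly instead of sharing it through an Arc.
    pool.scoped(|scope| {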
        for _ in 0..iterations {
            scope.execute(|| {
                let mut num = counter.lock().unwrap();
                *num = test(*num);
            });
        }
    });
    let d = tx.elapsed();

    println!("Threaded in {}.", dur_to_string(d));
}

Non-threaded in 0.0.
Threaded in 0.675.
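The same idea can be applied to the prime benchmark: spawn a small, fixed number of threads, give each a large batch of numbers, and combine the results once at the end. The sketch below is not the asker's code. It assumes Rust 1.63 or later for `std::thread::scope`, uses a naive trial-division `is_prime` as a stand-in for the undisclosed one, and counts primes below a limit rather than finding the first N primes. `workers` must be greater than zero.

```rust
use std::thread;

/// Naive trial-division test (a stand-in for the question's `is_prime`).
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

/// Counts the primes in `0..limit` using `workers` threads (`workers` > 0).
fn count_primes(limit: u64, workers: u64) -> u64 {
    // Size of each contiguous chunk, rounded up so the chunks cover `0..limit`.
    let chunk = (limit + workers - 1) / workers;
    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                scope.spawn(move || {
                    let start = w * chunk;
                    let end = (start + chunk).min(limit);
                    // Each worker keeps a local count; nothing is shared.
                    (start..end).filter(|&n| is_prime(n)).count() as u64
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}

fn main() {
    println!("{}", count_primes(100_000, 8));
}
```

Because each worker only returns a single number when it finishes, there is no mutex or channel traffic per candidate. Contiguous chunks are not perfectly balanced, since larger numbers take longer to test, so a work-stealing pool may do better in practice.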

As with most of programming, it's crucial to understand the tools you have and to use them appropriately.

Is it normal to experience large overhead using the 1:1 threading in the standard library?