Question

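I have a multi-core processor (for example, 8 cores), and I want to read a large number of files with a function int read_file(...) while using all cores efficiently. After read_file runs, the returned value should be stored somewhere (perhaps in a vector or a queue). I am considering calling async (from C++11) with the launch policy launch::async in a for loop over all files, and using future to collect the results of read_file. However, this creates a large number of threads during execution, and reading some files may fail. Should I put some kind of limit on the number of threads created during execution?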
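
As a rough illustration of the approach described above, the following sketch launches std::async tasks in batches so that only a bounded number of threads exist at a time, and catches exceptions when collecting each future. The read_file body, the read_all helper, the max_in_flight parameter, and the convention of storing -1 for a failed read are all assumptions made for the example; the real read_file signature is not given.

```cpp
#include <algorithm>
#include <cstddef>
#include <exception>
#include <future>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for read_file: returns the path length and
// throws on an empty path to simulate a failed read.
int read_file(const std::string& path) {
    if (path.empty()) throw std::runtime_error("empty path");
    return static_cast<int>(path.size());
}

// Reads all files while keeping at most max_in_flight async tasks alive.
// A failed read is recorded as -1 (assumed convention) instead of
// aborting the whole batch.
std::vector<int> read_all(const std::vector<std::string>& files,
                          std::size_t max_in_flight) {
    std::vector<int> results(files.size(), -1);
    if (max_in_flight == 0) max_in_flight = 1;

    for (std::size_t begin = 0; begin < files.size(); begin += max_in_flight) {
        const std::size_t end = std::min(files.size(), begin + max_in_flight);

        // Launch one task per file in the current batch.
        std::vector<std::future<int>> batch;
        for (std::size_t i = begin; i < end; ++i) {
            batch.push_back(std::async(std::launch::async, read_file, files[i]));
        }

        // get() rethrows any exception thrown inside read_file.
        for (std::size_t i = begin; i < end; ++i) {
            try {
                results[i] = batch[i - begin].get();
            } catch (const std::exception&) {
                results[i] = -1;
            }
        }
    }
    return results;
}
```

A call such as read_all(files, 8) would keep at most eight reads in flight. Because each batch waits for its slowest file before the next batch starts, a real thread pool usually keeps the cores busier than this sketch does.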

Answer 1

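Reading files is not CPU-intensive, so you are focusing on the wrong thing. It is like asking how to use all of your car engine's power efficiently just to cross the street.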

Answer 2

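I would argue that a Boost.Asio solution might be ideal.

The basic idea is to create a pool of threads that wait for tasks to arrive, and to queue all of the file reads into that pool.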

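//Headers this snippet relies on (std::make_unique requires C++14).
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

boost::asio::io_service service;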
//The work_ptr object keeps the calls to io_service.run() from returning immediately. 
//We could get rid of the object by queuing the tasks before we construct the threads.
//The method presented here is (probably) faster, however.
std::unique_ptr<boost::asio::io_service::work> work_ptr = std::make_unique<boost::asio::io_service::work>(service);

std::vector<YOUR_FILE_TYPE> files = /*...*/;

//Our Thread Pool
std::vector<std::thread> threads;
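//std::thread::hardware_concurrency() gets us the number of logical CPU cores.
//May be twice the number of physical cores, due to Hyperthreading/similar tech.
//It may also return 0 when the count cannot be determined, so fall back to 1.
unsigned int thread_count = std::thread::hardware_concurrency();
if(thread_count == 0) thread_count = 1;
for(unsigned int thread = 0; thread < thread_count; thread++) {
    threads.emplace_back([&]{service.run();});
}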

//The basic functionality: We "post" tasks to the io_service.
std::vector<int> ret_vals;
ret_vals.resize(files.size());
for(size_t index = 0; index < files.size(); index++) {
    service.post([&files, &ret_vals, index]{ret_vals[index] = read_file(files[index], /*...*/);});
}

work_ptr.reset();
for(auto & thread : threads) {
    thread.join();
}

//At this time, all ret_vals values have been filled.
/*...*/
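
For readers without Boost, roughly the same pattern can be sketched with the standard library alone. This is not the Boost.Asio approach above but a swapped-in alternative: a fixed set of threads that claim file indices from a shared atomic counter. The read_file body and the read_all_pooled name are assumptions for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for read_file: returns the path length.
int read_file(const std::string& path) {
    return static_cast<int>(path.size());
}

// Runs read_file over all files using a fixed number of worker threads.
// Each worker repeatedly claims the next unprocessed index.
std::vector<int> read_all_pooled(const std::vector<std::string>& files) {
    std::vector<int> ret_vals(files.size());
    std::atomic<std::size_t> next{0};

    unsigned int thread_count = std::thread::hardware_concurrency();
    if (thread_count == 0) thread_count = 1;  // the count may be unknown

    // Note: an exception escaping read_file here would terminate the
    // program; a real version should catch it inside the loop.
    auto worker = [&] {
        for (;;) {
            const std::size_t index = next.fetch_add(1);
            if (index >= files.size()) return;
            // Each index is written by exactly one thread, so no lock is needed.
            ret_vals[index] = read_file(files[index]);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned int t = 0; t < thread_count; ++t) {
        threads.emplace_back(worker);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return ret_vals;
}
```

As in the Boost version, the results vector is sized up front and each task writes only its own slot, which is what makes the unsynchronized writes safe.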

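One important caveat: reading from disk is orders of magnitude slower than reading from memory. The solution I've provided will scale to virtually any number of threads, but there is little reason to believe that multithreading will improve performance for this task, because you will almost certainly be I/O-bottlenecked, especially if your storage medium is a traditional hard disk rather than a solid-state drive.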

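That is not to say this is automatically a bad idea; after all, if your read_file function does a lot of processing on the data (rather than just reading it), the performance gains may be quite real. But I do suspect that your use case is a case of "premature optimization", which is a death knell for programming productivity.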
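
One way to check whether the multithreaded version actually helps is to time it against a plain sequential loop on your own files and hardware. A minimal sketch of such a timing helper follows; the time_ms name is an assumption for illustration.

```cpp
#include <chrono>
#include <functional>

// Runs task once and returns the elapsed wall-clock time in milliseconds.
// steady_clock is used because it is not affected by system clock changes.
double time_ms(const std::function<void()>& task) {
    const auto start = std::chrono::steady_clock::now();
    task();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You would wrap the sequential loop in one call and the thread-pool version in another, then compare the two numbers. Repeated runs over the same files are likely to be served from the operating system's file cache, so the first run and later runs can differ considerably.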

Answer 3

Asynchronous I/O is usually done with an event-based solution. You can use libevent, libuv, and similar libraries.

Answer 4

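I have written code for this and carried out benchmark studies. Storage subsystem configurations vary: for example, someone may have the files spread across multiple disks, or placed on a single RAID device made up of multiple disks. In my opinion, the best solution is a combination of a robust thread pool and asynchronous I/O, tailored to the system configuration. For example, the number of threads in the thread pool could equal the number of hardware threads, and the number of boost::asio::io_service objects could equal the number of disks.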
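
The idea of matching the work distribution to the disk layout can be sketched with the standard library. This is a deliberate simplification of the configuration described above: instead of one io_service per disk served by a shared thread pool, it uses one worker thread per disk. The FileOnDisk struct, its disk_id field, the read_file body, and the read_grouped_by_disk name are all assumptions for illustration; how to determine which disk a file lives on is platform-specific and not shown.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical description of a file and the disk it is stored on.
struct FileOnDisk {
    std::string path;
    int disk_id;
};

// Hypothetical stand-in for read_file: returns the path length.
int read_file(const std::string& path) {
    return static_cast<int>(path.size());
}

// Groups files by disk and reads each group on its own thread, so that
// reads hitting the same disk are issued sequentially.
std::vector<int> read_grouped_by_disk(const std::vector<FileOnDisk>& files) {
    std::vector<int> ret_vals(files.size());

    // Map each disk to the indices of the files stored on it.
    std::map<int, std::vector<std::size_t>> per_disk;
    for (std::size_t i = 0; i < files.size(); ++i) {
        per_disk[files[i].disk_id].push_back(i);
    }

    std::vector<std::thread> threads;
    for (const auto& entry : per_disk) {
        // The map is not modified after this point, so the pointer stays valid.
        const std::vector<std::size_t>* indices = &entry.second;
        threads.emplace_back([&files, &ret_vals, indices] {
            for (std::size_t index : *indices) {
                ret_vals[index] = read_file(files[index].path);
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return ret_vals;
}
```

If read_file also does heavy processing, the per-disk threads could hand the loaded data to a separate CPU-bound pool sized to the number of hardware threads, which is closer to the combination the answer recommends.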

What is the best way to read a large number of files concurrently in C++?