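I am trying to add multithreading to a C++ program. The target is the for loop inside a function, and the goal is to reduce the program's execution time, which is currently 3.83 seconds.
I tried adding the directive #pragma omp parallel for reduction(+:sum) to the inner loop (before the j for loop), but that was not enough: it still took 1.98 seconds. The goal is to get the time down to 0.5 seconds.
I did some research on improving the speed, and some people recommend vectorization with strip mining for better results. However, I do not yet know how to implement it.
Does anyone know how to do it?
The code is: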
void filter(const long n, const long m, float *data, const float threshold, std::vector &result_row_ind) {
    for (long i = 0; i < n; i++) {
        float sum = 0.0f;
        for (long j = 0; j < m; j++) {
            sum += data[i*m + j];
        }
        if (sum > threshold)
            result_row_ind.push_back(i);
    }
    std::sort(result_row_ind.begin(),
              result_row_ind.end());
}
Many thanks.
Answer 0 (score: 2)
If possible, you probably want to parallelize the outer loop. The simplest way to do that in OpenMP is the following:
#pragma omp parallel for
for (long i = 0; i < n; i++) {
    float sum = 0.0f;
    for (long j = 0; j < m; j++) {
        sum += data[i*m + j];
    }
    if (sum > threshold) {
        #pragma omp critical
        result_row_ind.push_back(i);
    }
}
std::sort(result_row_ind.begin(),
          result_row_ind.end());
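This works, and is probably much faster than parallelizing the inner loop (starting a parallel region is expensive), but it uses a critical section as a lock to prevent data races. The contention can be avoided by declaring a user-defined reduction on vectors and reducing over that loop. That may be slower if the number of threads is large and the number of matching rows is small, but otherwise it can be noticeably faster. The following is not quite right, because the vector's element type is not shown in the question, so it is incomplete (T is a placeholder), but it should be very close: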
#pragma omp declare \
    reduction(CatVec: std::vector<T>: \
        omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
    initializer(omp_priv=std::vector<T>())

#pragma omp parallel for reduction(CatVec: result_row_ind)
for (long i = 0; i < n; i++) {
    float sum = 0.0f;
    for (long j = 0; j < m; j++) {
        sum += data[i*m + j];
    }
    if (sum > threshold) {
        result_row_ind.push_back(i);
    }
}
std::sort(result_row_ind.begin(),
          result_row_ind.end());
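To make the reduction idea above concrete, here is a minimal self-contained sketch. It rests on several assumptions that are not in the question: the element type is taken to be long (the question does not show it), the function is renamed filter_reduction and returns the vector instead of filling a reference parameter, and an inner #pragma omp simd reduction is added as one possible way to get the vectorization the question asks about. The simd reduction reorders the float additions, so sums can differ from the serial version by rounding. Compile with OpenMP enabled (for example -fopenmp with GCC or Clang); without it the pragmas are ignored and the code runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption: row indices are stored as long; the question omits the type.
// Each thread gets its own empty vector (initializer), and the per-thread
// vectors are concatenated into the shared one when the loop finishes.
#pragma omp declare reduction(CatVec : std::vector<long> : \
        omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
    initializer(omp_priv = std::vector<long>())

// Hypothetical variant of the question's filter(): returns the indices of
// all rows whose sum exceeds threshold, in ascending order.
std::vector<long> filter_reduction(const long n, const long m,
                                   const float *data, const float threshold) {
    assert(n >= 0 && m >= 0);
    std::vector<long> result_row_ind;

    #pragma omp parallel for reduction(CatVec : result_row_ind)
    for (long i = 0; i < n; i++) {
        float sum = 0.0f;
        // Ask the compiler to vectorize the row sum (not part of the answer).
        #pragma omp simd reduction(+ : sum)
        for (long j = 0; j < m; j++) {
            sum += data[i * m + j];
        }
        if (sum > threshold) {
            result_row_ind.push_back(i);  // thread-private copy, no lock needed
        }
    }

    // The concatenation order of the per-thread vectors is unspecified.
    std::sort(result_row_ind.begin(), result_row_ind.end());
    return result_row_ind;
}
```

For a 4x3 matrix with row sums 3, 15, 0 and 12 and a threshold of 10, the call filter_reduction(4, 3, data, 10.0f) would return the rows 1 and 3.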
Answer 1 (score: 0)
If you have a C++ compiler that supports execution policies, you can try std::for_each with the execution policy std::execution::par to see whether that helps. Example:
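#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <vector>
#if __has_include(<execution>)
# include <execution>
#elif __has_include(<experimental/execution_policy>)
# include <experimental/execution_policy>
#endif

// Minimal counting iterator over row indices, to use with std::for_each.
// Parallel algorithms need at least a forward iterator, so it also provides
// operator==, post-increment and a signed difference_type.
class iterator {
    std::size_t val;
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const std::size_t*;
    using reference = const std::size_t&;
    iterator(std::size_t value = 0) : val(value) {}
    iterator& operator++() { ++val; return *this; }
    iterator operator++(int) { iterator old(*this); ++val; return old; }
    bool operator==(const iterator& rhs) const { return val == rhs.val; }
    bool operator!=(const iterator& rhs) const { return val != rhs.val; }
    reference operator*() const { return val; }
};
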
std::vector<size_t> filter(const size_t rows, const size_t cols, const float* data, const float threshold) {
    std::vector<size_t> result_row_ind;
    std::vector<float> sums(rows);
    iterator begin(0);
    iterator end(rows);
    std::for_each(std::execution::par, begin, end, [&](const size_t& row) {
        const float* dataend = data + (row+1) * cols;
        float& sum = sums[row];
        for (const float* dataptr = data + row * cols; dataptr < dataend; ++dataptr) {
            sum += *dataptr;
        }
    });
    // pushing moved outside the threaded code to avoid using mutexes
    for (size_t row = 0; row < rows; ++row) {
        if (sums[row] > threshold)
            result_row_ind.push_back(row);
    }
    std::sort(result_row_ind.begin(),
              result_row_ind.end());
    return result_row_ind;
}

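int main() {
    // Note: rows*cols is 2^33 floats, i.e. 32 GiB; reduce the sizes on smaller machines.
    constexpr size_t rows = 1<<15, cols = 1<<18;
    float* data = new float[rows*cols];
    // The index must be size_t: rows*cols does not fit in an int.
    for (size_t i = 0; i < rows*cols; ++i) data[i] = (float)i / 100000000.f;
    std::vector<size_t> res = filter(rows, cols, data, 10.f);
    std::cout << res.size() << "\n";
    delete[] data;
}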
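As a possible simplification of the example above, the hand-written iterator can be replaced by a vector of row indices filled with std::iota. This is only a sketch under assumptions: the name filter_by_index is hypothetical, the row sum uses std::accumulate instead of the pointer loop, and the code falls back to a sequential std::for_each when the standard library does not advertise execution policies. One reason to consider it is that vector iterators are random-access, and some implementations only run parallel algorithms in parallel for random-access iterators. With GCC's libstdc++ the parallel policies are typically backed by TBB, so linking with -ltbb may be required.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>
#if __has_include(<execution>)
#  include <execution>
#endif

// Hypothetical variant of filter(): returns the indices of all rows whose
// sum exceeds threshold, in ascending order.
std::vector<std::size_t> filter_by_index(const std::size_t rows,
                                         const std::size_t cols,
                                         const float* data,
                                         const float threshold) {
    assert(data != nullptr || rows * cols == 0);

    // One entry per row: 0, 1, ..., rows-1.
    std::vector<std::size_t> row_ids(rows);
    std::iota(row_ids.begin(), row_ids.end(), std::size_t{0});

    std::vector<float> sums(rows, 0.0f);
    // Each call writes only sums[row], so no synchronization is needed.
    auto sum_row = [&](const std::size_t row) {
        const float* first = data + row * cols;
        sums[row] = std::accumulate(first, first + cols, 0.0f);
    };

#if defined(__cpp_lib_execution)
    std::for_each(std::execution::par, row_ids.begin(), row_ids.end(), sum_row);
#else
    // Fallback (assumption): no execution policy support, run sequentially.
    std::for_each(row_ids.begin(), row_ids.end(), sum_row);
#endif

    // Rows are visited in order here, so the result is already sorted.
    std::vector<std::size_t> result_row_ind;
    for (std::size_t row = 0; row < rows; ++row) {
        if (sums[row] > threshold)
            result_row_ind.push_back(row);
    }
    return result_row_ind;
}
```

The cost of this design is an extra vector of rows indices, which is small next to the rows*cols input matrix.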