Why is Range-v3 slower than the STL in this example?

Date: 2017-06-07 13:02:45

Tags: c++ performance range-v3

I am using the Range-v3 library to perform a glorified find_if, and I am curious why google-benchmark consistently ranks my Range-v3 code worse than my std::find_if approach. Both g++ and clang show the same pattern with -O3 and #define NDEBUG.

The specific example I have in mind is the following. Using the STL:

std::vector<int> lengths(large_number, random_number);

auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
                              accumulated_length += val;
                              return to_find < accumulated_length;
                          });
auto found_index = std::distance(lengths.begin(), found);    

This is a bit contrived for the purpose of this illustration, but normally there would be a random generator filling the lengths vector with random values, and a random value in the to_find variable.

Using the Range-v3 library, I get the following code:

using namespace ranges;    

std::vector<int> lengths(large_number, random_number);

auto const to_find = accumulate(lengths, 0l) / 2;

auto found_index = distance(lengths | view::partial_sum()
                                    | view::take_while([=](auto const i) {
                                         return i < to_find;
                                      }));

My question is why the Range-v3 version is slower than the STL implementation. I understand this is still an experimental library, but is there something wrong with the code example, or am I misusing the range concepts?

Edit

An example google-benchmark driver (not sure whether it is correct):

#define NDEBUG

#include <numeric>
#include <vector>

#include <benchmark/benchmark.h>

#include <range/v3/all.hpp>

static void stl_search(benchmark::State &state) {

    using namespace ranges;

    std::vector<long> lengths(state.range(0), 1l);

    auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;

    while (state.KeepRunning()) {

        auto accumulated_length = 0l;
        auto const found = std::find_if(lengths.begin(), lengths.end(), [&](auto const& val) {
                               accumulated_length += val;
                               return to_find < accumulated_length;
                           });
        volatile long val = std::distance(lengths.begin(), found);
    }
    state.SetBytesProcessed(int64_t(state.iterations()) *
                            int64_t(state.range(0)) * sizeof(long));
}

static void ranges_search(benchmark::State &state) {

    using namespace ranges;

    std::vector<long> lengths(state.range(0), 1l);

    auto const to_find = accumulate(lengths, 0l) / 2;

    while (state.KeepRunning())
    {
        volatile long val = distance(lengths | view::partial_sum()
                                             | view::take_while([=](auto const& i) {
                                                   return i <= to_find;
                                               }));
    }
    state.SetBytesProcessed(int64_t(state.iterations()) *
                            int64_t(state.range(0)) * sizeof(long));
}

BENCHMARK(ranges_search)->Range(8 << 8, 8 << 16);
BENCHMARK(stl_search)->Range(8 << 8, 8 << 16);

BENCHMARK_MAIN();

This gives

ranges_search/2048          756 ns        756 ns     902091   20.1892GB/s
ranges_search/4096         1495 ns       1494 ns     466681   20.4285GB/s
ranges_search/32768       11872 ns      11863 ns      58902   20.5801GB/s
ranges_search/262144      94982 ns      94892 ns       7364   20.5825GB/s
ranges_search/524288     189870 ns     189691 ns       3688   20.5927GB/s
stl_search/2048             348 ns        348 ns    2000964   43.8336GB/s
stl_search/4096             690 ns        689 ns    1008295   44.2751GB/s
stl_search/32768           5497 ns       5492 ns     126097    44.452GB/s
stl_search/262144         44725 ns      44681 ns      15882   43.7122GB/s
stl_search/524288         91027 ns      90936 ns       7616   42.9563GB/s

with clang 4.0.1, and

ranges_search/2048         2309 ns       2307 ns     298507   6.61496GB/s
ranges_search/4096         4558 ns       4554 ns     154520   6.70161GB/s
ranges_search/32768       36482 ns      36454 ns      19191   6.69726GB/s
ranges_search/262144     287072 ns     286801 ns       2438   6.81004GB/s
ranges_search/524288     574230 ns     573665 ns       1209   6.80928GB/s
stl_search/2048             299 ns        298 ns    2340691   51.1437GB/s
stl_search/4096             592 ns        591 ns    1176783   51.6363GB/s
stl_search/32768           4692 ns       4689 ns     149460   52.0711GB/s
stl_search/262144         37718 ns      37679 ns      18611   51.8358GB/s
stl_search/524288         75247 ns      75173 ns       9244   51.9633GB/s

with gcc 6.3.1. My machine has a Haswell-generation processor. Both were compiled and executed with:

g++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
clang++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out

1 Answer:

Answer 0 (score: 22)

view::partial_sum over a range of int produces a range of int. If to_find > INT_MAX, the internal accumulator will overflow, resulting in UB. In practice, the algorithm most likely traverses the entire input and returns the end iterator.

Conversely, accumulated_length in the non-range-v3 approach is a long. It does not overflow, so it has defined behavior and returns before processing the entire input.

The Range-v3 approach will have correct behavior if you transform the input range of ints into a range of longs before passing it through partial_sum:

auto found_index = distance(lengths | view::transform(convert_to<long>{})
                                    | view::partial_sum()
                                    | view::take_while([=](auto const i) {
                                         return i < to_find;
                                      }));

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#ifdef USE_RV3
#include <range/v3/core.hpp>
#include <range/v3/algorithm.hpp>
#include <range/v3/numeric.hpp>
#include <range/v3/view.hpp>

#else
#include <algorithm>
#include <numeric>
#endif

int main() {
    constexpr size_t large_number = 1UL << 30;

    int random_number = 42;

    std::vector<int> const lengths(large_number, random_number);

    using clock_t = std::chrono::steady_clock;
    auto const start = clock_t::now();

#ifdef USE_RV3
    auto const to_find = ranges::accumulate(lengths, 0l) / 2;
#else
    auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
#endif

    auto const elapsed1 = clock_t::now() - start;

#ifdef USE_RV3
    auto const found_index = ranges::distance(lengths
            | ranges::view::transform(ranges::convert_to<long>{})
            | ranges::view::partial_sum()
            | ranges::view::take_while([=](auto const i) { return !(to_find < i); }));
#else
    auto accumulated_length = 0l;
    auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
                                accumulated_length += val;
                                return to_find < accumulated_length;
                            });
    auto const found_index = std::distance(lengths.begin(), found);
#endif

    auto const elapsed2 = clock_t::now() - start;

    std::cout << "elapsed1: "
        << std::chrono::duration<double, std::milli>(elapsed1).count()
        << " ms, to_find: " << to_find << "\n"
           "elapsed2: "
        << std::chrono::duration<double, std::milli>(elapsed2).count()
        << " ms, result: " << found_index << '\n';
}

Even with this correctness fix, it is still notably slower than the standard algorithm in my testing. Compiling the test program above with

g++-6 -std=c++14 -Ofast -march=native -DNDEBUG rv3.cpp -I ~/dev/range-v3/include -S -o -

both with and without -DUSE_RV3 and diffing the assembly output is interesting. The code generated for the initialization of elapsed1 is identical in both cases. There are notable differences in the intermediate part between the initializations of elapsed1 and elapsed2: gcc does a much better job of optimizing the std version, where the hot loop is all in one run of code with a single branch for the termination condition. The Range-v3 version is uglier and jumps around quite a lot; I speculate that the particulars of the partial_sum implementation need some fiddling to make it more transparent to the optimizer.