Following up on my original question and after considering some of the suggested solutions, I came up with this for C++14:
#include <algorithm>
#include <stdexcept> // std::out_of_range
#include <iterator>
#include <cstddef>
template<class It, class Func>
auto binary_fold(It begin, It end, Func op) -> decltype(op(*begin, *end)) {
std::ptrdiff_t diff = end - begin;
switch (diff) {
case 0: throw std::out_of_range("binary fold on empty container");
case 1: return *begin;
case 2: return op(*begin, *(begin + 1));
default: { // first round to the nearest multiple of 2 and then advance
It mid{begin};
int div = diff/2;
int offset = (div%2 == 1) ? (div+1) : div; // round to the closest multiple of two (upwards)
std::advance(mid, offset);
return op( binary_fold(begin,mid,op), binary_fold(mid,end,op) );
}
}
}
The algorithm recursively applies the binary operation to pairs until a single result remains. E.g.
std::vector<int> v = {1,3,5,6,1};
auto result = mar::binary_fold(v.cbegin(), v.cend(), std::minus<int>());
will resolve as:
(1-3) - ((5-6) - 1) = 0
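The grouping is not consistent, though. In the example above the right-hand half nests again, (1-3) - ((5-6) - 1), so the tree leans to the right, while in the next example it leans to the left: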
std::vector<int> v = {7,4,9,2,6,8};
auto result = mar::binary_fold(v.cbegin(), v.cend(), std::minus<int>());
resulting in:
(7-4) - (9-2) - (6-8) = -2
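One way to check which grouping the recursion really produces is to fold strings instead of numbers, so the result spells out the parentheses. The sketch below repeats the splitting rule from the question under the name `binary_fold_trace`; that name and the helper `show_grouping` are made up for this illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Same splitting rule as binary_fold above, repeated so the sketch is self-contained.
template <class It, class Func>
auto binary_fold_trace(It begin, It end, Func op) -> decltype(op(*begin, *begin)) {
    const std::ptrdiff_t diff = end - begin;
    switch (diff) {
    case 0: throw std::out_of_range("binary fold on empty container");
    case 1: return *begin;
    case 2: return op(*begin, *(begin + 1));
    default: {
        const std::ptrdiff_t half = diff / 2;
        // Round the split point up to an even offset, as in the question.
        const std::ptrdiff_t offset = (half % 2 == 1) ? half + 1 : half;
        It mid = begin;
        std::advance(mid, offset);
        return op(binary_fold_trace(begin, mid, op), binary_fold_trace(mid, end, op));
    }
    }
}

// Hypothetical helper: folds the values as strings so the result shows the grouping.
inline std::string show_grouping(const std::vector<int>& values) {
    std::vector<std::string> terms;
    for (int v : values) terms.push_back(std::to_string(v));
    auto join = [](const std::string& a, const std::string& b) {
        return "(" + a + "-" + b + ")";
    };
    return binary_fold_trace(terms.cbegin(), terms.cend(), join);
}
```

For the two vectors above this yields `((1-3)-((5-6)-1))` and `(((7-4)-(9-2))-(6-8))`, which makes the inconsistent nesting easy to see.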
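I would like to know how to further optimize this algorithm so that:
a. It is definitely left- or right-associative.
b. It is as fast as possible (this will go inside an OpenGL drawing loop, so it has to be very fast).
c. There is a TMP version that computes the offsets at compile time when the size of the container is known (this is not necessary for my application, but I am curious how it can be done).
My initial thought on b. is that an iterative version would probably be faster, and that the offset calculation could be optimized further (maybe with some bitwise magic?). I am stuck nonetheless.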
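Answer 0 (score: 1)
I wrote an "always left-associative" iterative version, along with some timing runs you can use. It performs slightly worse until you turn on compiler optimizations.
Total run time for 10000 iterations, each with 5000 values: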
$ g++ --std=c++11 main.cpp && ./a.out
Recursive elapsed:9642msec
Iterative elapsed:10189msec
$ g++ --std=c++11 -O1 main.cpp && ./a.out
Recursive elapsed:3468msec
Iterative elapsed:3098msec
Iterative elapsed:3359msec # another run
Recursive elapsed:3668msec
$ g++ --std=c++11 -O2 main.cpp && ./a.out
Recursive elapsed:3193msec
Iterative elapsed:2763msec
Recursive elapsed:3184msec # another run
Iterative elapsed:2696msec
$ g++ --std=c++11 -O3 main.cpp && ./a.out
Recursive elapsed:3183msec
Iterative elapsed:2685msec
Recursive elapsed:3106msec # another run
Iterative elapsed:2681msec
Recursive elapsed:3054msec # another run
Iterative elapsed:2653msec
Compilers can usually optimize loops more easily than recursion.
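#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <iostream>
#include <iterator>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>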
template<class It, class Func>
auto binary_fold_rec(It begin, It end, Func op) -> decltype(op(*begin, *end)) {
std::ptrdiff_t diff = end - begin;
switch (diff) {
case 0: throw std::out_of_range("binary fold on empty container");
case 1: return *begin;
case 2: return op(*begin, *(begin + 1));
default: { // first round to the nearest multiple of 2 and then advance
It mid{begin};
int div = diff/2;
int offset = (div%2 == 1) ? (div+1) : div; // round to the closest multiple of two (upwards)
std::advance(mid, offset);
return op( binary_fold_rec(begin,mid,op), binary_fold_rec(mid,end,op) );
}
}
}
// left-associative
template<class It, class Func>
auto binary_fold_it(It begin, It end, Func op) -> decltype(op(*begin, *end)) {
// Allocates enough scratch to begin with that we don't need to mess with again.
std::ptrdiff_t diff = end - begin;
std::vector<decltype(op(*begin, *end))> scratch (static_cast<int>(diff));
auto scratch_current = scratch.begin();
if(diff == 0) {
throw std::out_of_range("binary fold on empty container.");
}
while(diff > 1) {
auto fake_end = (diff & 1) ? end - 1 : end;
while(begin != fake_end) {
(*scratch_current++) = op(*begin, *(begin+1));
begin += 2; // silly c++ can't guarantee ++ order, so increment here.
}
if(fake_end != end) {
*scratch_current++ = *begin;
}
end = scratch_current;
begin = scratch_current = scratch.begin();
diff = end - begin;
};
return scratch[0];
}
void run(std::initializer_list<int> elems, int expected) {
std::vector<int> v(elems);
auto result = binary_fold_it(v.begin(), v.end(), std::minus<int>());
std::cout << result << std::endl;
assert(binary_fold_it(v.begin(), v.end(), std::minus<int>()) == expected);
}
constexpr int rolls = 10000;
constexpr int min_val = -1000;
constexpr int max_val = 1000;
constexpr int num_vals = 5000;
std::vector<int> random_vector() {
// Thanks http://stackoverflow.com/questions/21516575/fill-a-vector-with-random-numbers-c
// for saving me time.
std::uniform_int_distribution<int> distribution(min_val, max_val);
std::default_random_engine generator;
std::vector<int> data(num_vals);
std::generate(data.begin(), data.end(), [&]() { return distribution(generator); });
return data;
}
template<typename It, typename Func>
void evaluate(void(*func)(It, It, Func), const char* message) {
auto start = std::chrono::high_resolution_clock::now();
for(int i=0; i<rolls; i++) {
auto data = random_vector();
func(data.begin(), data.end(), std::minus<int>());
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << message << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() << "msec\n";
}
void evaluate(void(*func)(), const char* message) {
auto start = std::chrono::high_resolution_clock::now();
for(int i=0; i<rolls; i++) {
func();
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << message << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() << "msec\n";
}
void time_it() {
auto data = random_vector();
binary_fold_it(data.begin(), data.end(), std::minus<int>());
}
void time_rec() {
auto data = random_vector();
binary_fold_rec(data.begin(), data.end(), std::minus<int>());
}
int main() {
evaluate(time_rec, "Recursive elapsed:");
evaluate(time_it, "Iterative elapsed:");
return 0;
}
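A note on `binary_fold_it`: it assigns the scratch iterator to `begin` and `end`, so as far as I can tell it only compiles when `It` is convertible from `std::vector<...>::iterator` (a raw pointer or a `std::deque` iterator would not work). A minimal sketch of a variant that avoids this by copying the input into the scratch buffer first and then reducing by index; the name `binary_fold_scratch` is made up here, and the extra copy is a trade-off that has not been benchmarked.

```cpp
#include <cstddef>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Pairwise, round-by-round fold that works for any iterator type.
template <class It, class Func>
auto binary_fold_scratch(It begin, It end, Func op)
    -> std::decay_t<decltype(op(*begin, *begin))> {
    using Result = std::decay_t<decltype(op(*begin, *begin))>;
    if (begin == end) throw std::out_of_range("binary fold on empty container");
    std::vector<Result> scratch(begin, end);  // one up-front copy of the input
    std::size_t n = scratch.size();
    while (n > 1) {
        std::size_t out = 0;
        std::size_t in = 0;
        // Combine neighbours; the write index never overtakes the read index.
        for (; in + 1 < n; in += 2) scratch[out++] = op(scratch[in], scratch[in + 1]);
        if (in < n) scratch[out++] = scratch[in];  // odd element is carried to the next round
        n = out;
    }
    return scratch[0];
}
```

Like the iterative version above, this groups round by round, so for sizes that are not a power of two it can give a different result than the recursive version when the operation is not associative.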
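Answer 1 (score: 0)
I have two TMP versions. Which one is better depends on the data types, I guess:
Solution A:
First, let's find a good offset for the split point (a power of two seems to work well):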
template<std::ptrdiff_t diff, std::ptrdiff_t V = 2>
struct offset
{
static constexpr std::ptrdiff_t value =
(V * 2 < diff - 1) ? offset<diff, V * 2>::value : V;
};
// End recursion
template<std::ptrdiff_t diff>
struct offset<diff, 1<<16>
{
static constexpr std::ptrdiff_t value = 1<<16;
};
// Some special cases
template<>
struct offset<0, 2>
{
static constexpr std::ptrdiff_t value = 0;
};
template<>
struct offset<1, 2>
{
static constexpr std::ptrdiff_t value = 0;
};
template<>
struct offset<2, 2>
{
static constexpr std::ptrdiff_t value = 0;
};
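Since the question targets C++14, the same value can presumably also be computed by a relaxed `constexpr` function instead of recursive template instantiation. The sketch below mirrors the `offset` template, including the cap at `1<<16` and the zero result for sizes up to 2; the name `split_offset` is an assumption of this example.

```cpp
#include <cstddef>

// Doubles v while v * 2 < diff - 1, starting from 2 and capped at 1 << 16.
constexpr std::ptrdiff_t split_offset(std::ptrdiff_t diff) {
    if (diff <= 2) return 0;  // mirrors the three explicit specializations
    std::ptrdiff_t v = 2;
    while (v * 2 < diff - 1 && v < (std::ptrdiff_t{1} << 16)) v *= 2;
    return v;
}

// Usable in constant expressions, e.g. as a template argument.
static_assert(split_offset(6) == 4, "6 elements split after the first 4");
static_assert(split_offset(10) == 8, "10 elements split after the first 8");
```

Because it is an ordinary function, it can also be called at run time with a size that is not known until then.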
With this, we can create a recursive TMP version:
template <std::ptrdiff_t diff, class It, class Func>
auto binary_fold_tmp(It begin, It end, Func op)
-> decltype(op(*begin, *end))
{
assert(end - begin == diff);
switch (diff)
{
case 0:
assert(false);
return 0; // This will never happen
case 1:
return *begin;
case 2:
return op(*begin, *(begin + 1));
default:
{ // split at the compile-time offset and fold both halves
It mid{begin};
std::advance(mid, offset<diff>::value);
auto left = binary_fold_tmp<offset<diff>::value>(begin, mid, op);
auto right =
binary_fold_tmp<diff - offset<diff>::value>(mid, end, op);
return op(left, right);
}
}
}
This can be combined with a non-TMP version, for example like this:
template <class It, class Func>
auto binary_fold(It begin, It end, Func op)
-> decltype(op(*begin, *end))
{
const auto diff = end - begin;
assert(diff > 0);
switch (diff)
{
case 1:
return binary_fold_tmp<1>(begin, end, op);
case 2:
return binary_fold_tmp<2>(begin, end, op);
case 3:
return binary_fold_tmp<3>(begin, end, op);
case 4:
return binary_fold_tmp<4>(begin, end, op);
case 5:
return binary_fold_tmp<5>(begin, end, op);
case 6:
return binary_fold_tmp<6>(begin, end, op);
case 7:
return binary_fold_tmp<7>(begin, end, op);
case 8:
return binary_fold_tmp<8>(begin, end, op);
default:
if (diff <= 16) // <= so that diff == 16 leaves a non-empty tail
return op(binary_fold_tmp<8>(begin, begin + 8, op),
binary_fold(begin + 8, end, op));
else if (diff <= 32) // likewise for diff == 32
return op(binary_fold_tmp<16>(begin, begin + 16, op),
binary_fold(begin + 16, end, op));
else
return op(binary_fold_tmp<32>(begin, begin + 32, op),
binary_fold(begin + 32, end, op));
}
}
Solution B:
This computes the pairwise results, stores them in a buffer, and then calls itself on the buffer:
template <std::ptrdiff_t diff, class It, class Func, size_t... Is>
auto binary_fold_pairs_impl(It begin,
It end,
Func op,
const std::index_sequence<Is...>&)
-> decltype(op(*begin, *end))
{
std::decay_t<decltype(*begin)> pairs[diff / 2] = {
op(*(begin + 2 * Is), *(begin + 2 * Is + 1))...};
if (diff == 2)
return pairs[0];
else
return binary_fold_pairs_impl<diff / 2>(
&pairs[0],
&pairs[0] + diff / 2,
op,
std::make_index_sequence<diff / 4>{});
}
template <std::ptrdiff_t diff, class It, class Func>
auto binary_fold_pairs(It begin, It end, Func op) -> decltype(op(*begin, *end))
{
return binary_fold_pairs_impl<diff>(
begin, end, op, std::make_index_sequence<diff / 2>{});
}
This template function requires diff to be a power of 2. But of course you can combine it with a non-template version:
template <class It, class Func>
auto binary_fold_mix(It begin, It end, Func op) -> decltype(op(*begin, *end))
{
const auto diff = end - begin;
assert(diff > 0);
switch (diff)
{
case 1:
return *begin;
case 2:
return binary_fold_pairs<2>(begin, end, op);
case 3:
return op(binary_fold_pairs<2>(begin, begin + 2, op),
*(begin + (diff - 1)));
case 4:
return binary_fold_pairs<4>(begin, end, op);
case 5:
return op(binary_fold_pairs<4>(begin, begin + 4, op),
*(begin + (diff - 1)));
case 6:
return op(binary_fold_pairs<4>(begin, begin + 4, op),
binary_fold_pairs<2>(begin + 4, begin + 6, op)); // only 2 elements remain
case 7:
return op(binary_fold_pairs<4>(begin, begin + 4, op),
binary_fold_mix(begin + 4, begin + 7, op));
case 8:
return binary_fold_pairs<8>(begin, end, op);
default:
if (diff <= 16)
return op(binary_fold_pairs<8>(begin, begin + 8, op),
binary_fold_mix(begin + 8, end, op));
else if (diff <= 32)
return op(binary_fold_pairs<16>(begin, begin + 16, op),
binary_fold_mix(begin + 16, end, op));
else
return op(binary_fold_pairs<32>(begin, begin + 32, op),
binary_fold_mix(begin + 32, end, op));
}
}
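Because `binary_fold_pairs` silently assumes a power-of-two size, a compile-time guard might be worth adding. A minimal sketch, with the hypothetical name `is_valid_pairs_size`; it would probably have to be used in the outer `binary_fold_pairs` only, because the recursion in `binary_fold_pairs_impl` still instantiates smaller sizes in its `else` branch.

```cpp
#include <cstddef>

// True for 2, 4, 8, ...: a value with exactly one bit set, and at least one pair.
constexpr bool is_valid_pairs_size(std::ptrdiff_t n) {
    return n >= 2 && (n & (n - 1)) == 0;
}

// Intended use (assumption), as the first line of binary_fold_pairs<diff>:
//   static_assert(is_valid_pairs_size(diff), "diff must be a power of two");
static_assert(is_valid_pairs_size(8), "8 is accepted");
static_assert(!is_valid_pairs_size(6), "6 is rejected");
```

The `n & (n - 1)` test clears the lowest set bit, so it is zero exactly when only one bit was set.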
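I measured with the same program as MtRoad. On my machine, the differences are not as large as the ones MtRoad reported. With -O3, solutions A and B seem to be slightly faster than MtRoad's version, but really, you need to test with your own types and data.
Remark: I have not tested my versions rigorously.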