Question

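I am writing a program with Armadillo 4.500.0, and I am finding that in-place calculations such as s += v * v.t() * q; are significantly slower than the equivalent s = s + v * v.t() * q;, where s, v, and q are vectors of appropriate size.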

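When I run the following code, the in-place version turns out to be many times slower than the other version: about 480 times slower for 500 elements (5.13 seconds versus 0.011 seconds), with aggressive optimization enabled (-O3 or -Ofast; Apple LLVM version 6.0 (clang-600.0.54)).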

#include <iostream>
#include <armadillo>
#include <sys/time.h>

using namespace arma;
using namespace std;

#define N_ELEM 500
#define REP 10000

int main(int argc, const char * argv[]) {
    timeval start;
    timeval end;
    double tInplace, tNormal;
    vec s = randu<vec>(N_ELEM);
    vec v = randu<vec>(N_ELEM);
    vec q = randu<vec>(N_ELEM);

    gettimeofday(&start, NULL);

    for(int i = 0; i < REP; ++i) {
        s += v * v.t() * q;
    }

    gettimeofday(&end, NULL);

    tInplace = (end.tv_sec - start.tv_sec + ((end.tv_usec - start.tv_usec) / 1e6));

    gettimeofday(&start, NULL);

    for(int i = 0; i < REP; ++i) {
        s = s + v * v.t() * q;
    }

    gettimeofday(&end, NULL);

    tNormal = (end.tv_sec - start.tv_sec + ((end.tv_usec - start.tv_usec) / 1e6));

    cout << "Inplace: " << tInplace << "; Normal: " << tNormal << " --> " << "Normal is " << tInplace / tNormal << " times faster" << endl;

    return 0;
}

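Can anyone explain why the in-place operator performs so much worse, even though it can reuse memory that is already allocated and so should not need to copy anything?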

Answer 1

Putting parentheses around v.t() * q solves the problem:

for(int i = 0; i < REP; ++i) {
    s += v * (v.t() * q);
}

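The parentheses force the order of evaluation. The expression (v.t() * q) evaluates to a scalar (technically a 1x1 matrix), which is then used to multiply the vector v. The parentheses also prevent v * v.t() from being turned into an explicit outer product, which for N elements would be a full N x N matrix.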
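To make the cost difference between the two evaluation orders concrete, here is a minimal library-independent sketch in plain C++. It is not how Armadillo implements these expressions internally; the names outer_first and dot_first are hypothetical and exist only for this illustration. The first function mimics the left-to-right order (explicit outer product, then a matrix-vector product), and the second mimics the parenthesized order (dot product, then a scaling of v).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical helper: evaluates (v * v^T) * q left to right.
// Builds the full n x n outer product first, so it needs
// O(n^2) memory and O(n^2) multiplications.
Vec outer_first(const Vec& v, const Vec& q) {
    assert(v.size() == q.size());
    const std::size_t n = v.size();
    std::vector<double> outer(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            outer[i * n + j] = v[i] * v[j];

    Vec result(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            result[i] += outer[i * n + j] * q[j];
    return result;
}

// Hypothetical helper: evaluates v * (v^T * q).
// Computes the scalar v^T * q first, then scales v,
// so it needs only O(n) multiplications and no temporary matrix.
Vec dot_first(const Vec& v, const Vec& q) {
    assert(v.size() == q.size());
    const std::size_t n = v.size();
    double d = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        d += v[i] * q[i];

    Vec result(n);
    for (std::size_t i = 0; i < n; ++i)
        result[i] = v[i] * d;
    return result;
}
```

For v = {1, 2, 3} and q = {4, 5, 6}, both functions return {32, 64, 96}, but only outer_first allocates and fills a 3 x 3 temporary. With 500 elements that temporary has 250,000 entries and is rebuilt on every iteration of the loop, which is consistent with the large slowdown reported in the question.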

With the expression s = s + v * v.t() * q, Armadillo can work this out automatically, but with the in-place operator += it (currently) needs more hints.
