Question

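I am processing point cloud data (150k points per cloud). For each (x, y) point, I want to compute the distance and the azimuth to a reference point O: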

for each point p in points
    dx = p.x - ox
    dy = p.y - oy
    d = hypot(dx, dy)
    az = atan2(dy, dx)
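For reference, the pseudo-code above could be written as the following scalar C++ baseline. This is only a sketch: the Point layout (x, y plus two unused floats) and the names Polar and to_polar are assumptions made for illustration.

```cpp
#include <cmath>
#include <vector>

// Hypothetical point layout: x, y plus two unused floats (4 floats per point).
struct Point { float x, y, z, w; };

// Distance and azimuth of one point relative to the reference point.
struct Polar { float d, az; };

// Scalar baseline: one hypot and one atan2 per point, no explicit SIMD.
std::vector<Polar> to_polar(const std::vector<Point>& points, float ox, float oy) {
    std::vector<Polar> out;
    out.reserve(points.size());
    for (const Point& p : points) {
        const float dx = p.x - ox;
        const float dy = p.y - oy;
        out.push_back({std::hypot(dx, dy), std::atan2(dy, dx)});
    }
    return out;
}
```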

I have a hand-written SSE implementation. I was hoping to make the code clearer by using Eigen:

ArrayXf x(points.size()), y(points.size());
for(unsigned i=0; i<points.size(); ++i) {
    x[i] = points[i].x;
    y[i] = points[i].y;
}
const ArrayXf dx = x - ox, dy = y - oy; // offsets from the reference point O
const ArrayXf d = (dx.square() + dy.square()).sqrt();
// implement a polynomial approximation to atan (same as the SSE)

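However, judging from my timing experiments, this does not seem to vectorize at all: the time is the same as for the baseline implementation. I know that SSE2 is enabled, because I compile some SSE2 code in the same file.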

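Yet according to the documentation, Eigen does take advantage of SSE2 when it is supported (and of AVX in 3.3). Is that only the case for vector and matrix operations?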

EDIT: I looked at the generated assembly code and it does contain some SSE instructions. But it is still slow.

EDIT: Here is more timing information. I loop over 100 frames of about 150k points each.

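Naive implementation without atan2: 150ms
SSE implementation (processing points 4 at a time and discarding the last few that do not fill a whole packet): 30ms
Eigen implementation using Eigen maps: 90ms (diff: 36ms, hypot: 16ms, index: 17ms)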

Here is my Eigen code:

const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > px(&(points[0].x), points.size());
const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > py(&(points[0].y), points.size());

// difference with the origin (ox and oy are floats)
const Eigen::ArrayXf dx = px - ox, dy = py - oy;

// distance and index
const Eigen::ArrayXf d = sqrt(dx.square() + dy.square());

static const float r_res_mult = 1.0f / r_res; //2x faster than div
const Eigen::ArrayXi didx = (d * r_res_mult).cast<int>();

Answer 1

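Your main problem is that your data is not in a SIMD-friendly format. You are using an array of structures (xyxyxyxyxy...), and then, in order to vectorize your code, you run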

for(unsigned i=0; i<points.size(); ++i) {
    x[i] = points[i].x;
    y[i] = points[i].y;
}

which converts the data to a structure of arrays (xxxxxxxx....yyyyyyy...). This conversion is expensive.

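A better solution is to store your points as a structure of arrays in the first place. An even better solution is a hybrid layout, the array of structures of arrays (AoSoA). For SSE with single-precision floats, the data would then be laid out as xxxxyyyyxxxxyyyy....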
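To make the three layouts concrete, here is a minimal sketch in plain C++. The type names (PointAoS, PointsSoA, Block4) and the helper to_aosoa are hypothetical, and the block width of four assumes 128-bit SSE registers holding floats. The repacking helper only pays off if the data can be produced in this layout directly, or repacked once and then reused.

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): x y x y x y ...
struct PointAoS { float x, y; };

// Structure of arrays (SoA): x x x x ... y y y y ...
struct PointsSoA { std::vector<float> x, y; };

// Array of structures of arrays (AoSoA): x x x x y y y y  x x x x y y y y ...
// Four floats fill one 128-bit SSE register.
struct Block4 { float x[4]; float y[4]; };

// Repack AoS into AoSoA; unused lanes of the last block stay zero.
std::vector<Block4> to_aosoa(const std::vector<PointAoS>& pts) {
    std::vector<Block4> blocks((pts.size() + 3) / 4, Block4{});
    for (std::size_t i = 0; i < pts.size(); ++i) {
        blocks[i / 4].x[i % 4] = pts[i].x;
        blocks[i / 4].y[i % 4] = pts[i].y;
    }
    return blocks;
}
```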

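Next, I suggest you use a SIMD math library. Intel offers SVML, which is expensive and closed source. AMD offers libm, which is free but also closed source. Neither of these libraries plays well on the competitor's hardware. The best SIMD library is Agner Fog's Vector Class Library (VCL). It is open source, free, and optimized for both Intel and AMD processors. Like Eigen, it is header-only, so you do not have to compile and link a library; you just include the header files. Here is how you would do it with SSE or AVX for float (VCL emulates AVX on systems without AVX).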

//    g++ -O3 -Ivectorclass -msse4.2 foo.cpp
// or g++ -O3 -Ivectorclass -mavx foo.cpp
#include <vectorclass.h>
#include <vectormath_trig.h>

struct Point2DBlock {
    float x[8];
    float y[8];
};

int main(void) {
    const int nblocks = 10; //each block contains eight points
    Point2DBlock aosoa[nblocks]; //xxxxxxxxyyyyyyyy xxxxxxxxyyyyyyyy ...
    float ox = 0.0f, oy = 0.0f;
    Vec8f vox = ox, voy = oy;
    for(int i=0; i<nblocks; i++) {
        Vec8f dx = Vec8f().load(aosoa[i].x) - vox;
        Vec8f dy = Vec8f().load(aosoa[i].y) - voy;
        Vec8f d  = sqrt(dx*dx + dy*dy);
        Vec8f az = atan2(dy,dx);
    } 
}

If you really need hypot, you can build one on top of VCL from the pseudo-code on Wikipedia.

static inline Vec8f hypot(Vec8f const &x, Vec8f const &y) {
    Vec8f t;
    Vec8f ax = abs(x), ay = abs(y);
    t  = min(ax,ay);
    ax = max(ax,ay);
    t  = t/ax;
    return ax*sqrt(1+t*t);
}
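A scalar sketch of the same rescaling may help show why it is used: dividing by the larger magnitude keeps t*t within [0, 1], so the intermediate square cannot overflow. The name scaled_hypot is hypothetical. The zero check is an addition of this sketch; without it, the input (0, 0) leads to 0/0 and therefore NaN.

```cpp
#include <algorithm>
#include <cmath>

// Scalar version of the scaled hypot: factor out the larger magnitude so
// that the squared term stays in [0, 1] and cannot overflow.
inline float scaled_hypot(float x, float y) {
    const float ax = std::fabs(x), ay = std::fabs(y);
    const float hi = std::max(ax, ay);
    const float lo = std::min(ax, ay);
    if (hi == 0.0f) return 0.0f;  // guard added here: 0/0 would yield NaN
    const float t = lo / hi;
    return hi * std::sqrt(1.0f + t * t);
}
```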

Edit:

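Here is a method that works on the array of structures directly. It requires some shuffling, but that is probably negligible compared with the other calculations. VCL uses template metaprogramming to determine an efficient way to do the shuffle.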

#include <vectorclass.h>
#include <vectormath_trig.h>

int main(void) {
    const int npoints=80;
    float points[2*npoints]; //xyxyxyxyxyxy...
    float ox = 0.0, oy = 0.0;
    Vec8f vox = ox, voy = oy;
    for(int i=0; i<2*npoints; i+=16) { // i indexes floats: 2*npoints in total, 16 floats (8 points) per iteration
        Vec8f l1 = Vec8f().load(&points[i+0]);
        Vec8f l2 = Vec8f().load(&points[i+8]);
        Vec8f dx = blend8f<0, 2, 4, 6, 8, 10, 12, 14>(l1,l2) - vox;
        Vec8f dy = blend8f<1, 3, 5, 7, 9, 11, 13, 15>(l1,l2) - voy;
        Vec8f d  = sqrt(dx*dx + dy*dy);
        Vec8f az = atan2(dy,dx);
    } 
}
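For readers unfamiliar with the blend indices, the following scalar sketch shows which elements the two blends select from the 16 loaded floats. It is an emulation for illustration only, not VCL code, and the name deinterleave8 is hypothetical.

```cpp
// Scalar emulation of the two blends: from 16 interleaved floats (xyxy...),
// even positions go to x and odd positions go to y.
inline void deinterleave8(const float* xy, float* x, float* y) {
    for (int k = 0; k < 8; ++k) {
        x[k] = xy[2 * k];      // indices 0, 2, ..., 14
        y[k] = xy[2 * k + 1];  // indices 1, 3, ..., 15
    }
}
```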

Answer 2

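The copy takes a lot of time, as much as or more than the computation itself. You do not have to copy the data like that; it is verbose and probably slower. You can use a Map instead, and even do the computation directly on the Map. I wrote a quick demo: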

int sz = 15000000;
std::vector<Point> points(sz);

Eigen::Map<ArrayXd, Unaligned, InnerStride<2>> mapX(&(points[0].x), sz);
Eigen::Map<ArrayXd, Unaligned, InnerStride<2>> mapY(&(points[0].y), sz);

mapX = ArrayXd::Random(sz);
mapY = ArrayXd::Random(sz);

auto cpstart = std::chrono::high_resolution_clock::now();
ArrayXd x = mapX;
ArrayXd y = mapY;
ArrayXd d;
auto cpend = std::chrono::high_resolution_clock::now();

auto mpSumstart = std::chrono::high_resolution_clock::now();

d = (mapX.square() + mapY.square()).sqrt().eval();

auto mpSumend = std::chrono::high_resolution_clock::now();

std::cout << d.mean() << "\n";

auto arStart = std::chrono::high_resolution_clock::now();

d = (x.square() + y.square()).sqrt().eval();

auto arEnd = std::chrono::high_resolution_clock::now();

std::cout << d.mean() << "\n";

auto elapsed = cpend - cpstart;
std::cout << "Copy: " <<  elapsed.count() << '\n';
std::cout << "Map: " <<  (mpSumend - mpSumstart).count() << '\n';
std::cout << "Array: " <<  (arEnd - arStart).count() << '\n';

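The timings I got are for an array 100 times as long; I was just too lazy to write a loop for better testing. On my system (VS2012, /Ox build with -DNDEBUG) the copy takes about 90 ms, the mapped version about 185 ms, and the copied-array version about 90 ms. A factor of roughly two makes sense for SIMD operations, because the mapped version has to skip every other double. If you had a structure of arrays instead of an array of structures, the performance of the map should match that of the copied arrays.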

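EDIT: I defined EIGEN_DONT_VECTORIZE, and the time for the copied arrays (almost) doubled, as expected. The time for the map, however, stayed the same. Curious. It might have to do with the map being unaligned. Or it may simply be that a packet only has room for two doubles, and every other double belongs to the wrong map.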

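EDIT 2: For the specific problem stated in the question, a silly idea struck me. You can treat the x, y values as a std::complex<double> and load them as a single block, without a memory copy: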

Eigen::Map<ArrayXcd> mapC((std::complex<double>*)(&(points[0].x)), sz);
//...
cd = mapC.cwiseAbs2().sqrt().eval();

On my machine the time is only slightly longer than for the pre-copied arrays. You can also subtract the origin as a complex number:

cd = (mapC - std::complex<double>(ox, oy)).cwiseAbs2().sqrt().eval();
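The complex-number view can also be sketched with the standard library alone, which shows why it fits the question: for p - origin, the modulus is the distance and the argument is the azimuth. The sketch below assumes the points are already stored as std::complex<double> (so no cast is involved); the names PolarD and polar_from_complex are hypothetical, and it says nothing about how well this vectorizes.

```cpp
#include <cmath>
#include <complex>
#include <vector>

struct PolarD { double d, az; };

// Treat each point as x + i*y. For (p - origin), std::abs gives the
// distance and std::arg gives atan2(imag, real), i.e. the azimuth.
std::vector<PolarD> polar_from_complex(const std::vector<std::complex<double>>& pts,
                                       std::complex<double> origin) {
    std::vector<PolarD> out;
    out.reserve(pts.size());
    for (const std::complex<double>& p : pts) {
        const std::complex<double> rel = p - origin;
        out.push_back({std::abs(rel), std::arg(rel)});
    }
    return out;
}
```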
