Question

Please consider the following code fragment:

double *x, *id;
int i, n; // = vector size

// allocate and zero x
// set id to 0:n-1

for(i=0; i<n; i++) {  
  long iid = (long)id[i];
  if(iid>=0 && iid<n && (double)iid==id[i]){
    x[iid] = 1;
  } else break;
}

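The code uses the values in the vector id, which is of type double, as indices into the vector x. For the indices to be valid I verify that they are greater than or equal to 0, less than the vector size n, and that the doubles stored in id are in fact integers. In this example id stores the integers from 0 to n-1, so all vectors are accessed linearly and branch prediction for the if statement should always succeed.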

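For n=1e8 the code takes 0.21 s on my computer. Since it looks to me like a computationally lightweight loop, I expect it to be limited by memory bandwidth. Based on the benchmarked memory bandwidth, I expect it to run in 0.15 s. I calculate the memory footprint as 8 bytes per id value and 16 bytes per x value (it has to be both written and read from memory, since I assume SSE streaming stores are not used). That gives a total of 24 bytes per vector entry.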
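
As a rough check of these figures, here is a minimal sketch of the arithmetic. The helper name implied_bandwidth_gbs is hypothetical, and it assumes decimal gigabytes (1e9 bytes) and the 24 bytes per entry estimated above.

```c
#include <stdio.h>

/* Bandwidth in GB/s (1 GB = 1e9 bytes, an assumption) implied by moving
   bytes_per_entry * n bytes in the given number of seconds. */
static double implied_bandwidth_gbs(double bytes_per_entry, double n, double seconds)
{
    return bytes_per_entry * n / seconds / 1e9;
}

/* Prints the bandwidth implied by the expected and the measured run time. */
static void print_bandwidths(void)
{
    printf("expected: %.1f GB/s\n", implied_bandwidth_gbs(24, 1e8, 0.15));
    printf("measured: %.1f GB/s\n", implied_bandwidth_gbs(24, 1e8, 0.21));
}
```

Under these assumptions the expected 0.15 s corresponds to about 16 GB/s, while the measured 0.21 s corresponds to roughly 11.4 GB/s.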

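Questions:

Am I wrong in saying that this code should be limited by memory bandwidth, and that it can be improved?
If not, do you know a way to improve the performance so that it runs at the speed of the memory?
Or maybe everything is fine and I cannot easily improve it other than by running it in parallel?

Changing the type of id is not an option - it has to be double. Also, in the general case id and x have different sizes and must be kept as separate arrays - they come from different parts of the program. In short, I wonder whether it is possible to write the bound checks and the type cast/integer validation in a more efficient manner.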

For convenience, the entire code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static struct timeval tb, te;

void tic()
{
  gettimeofday(&tb, NULL);
}

void toc(const char *idtxt)
{
  long s,u;
  gettimeofday(&te, NULL);
  s=te.tv_sec-tb.tv_sec;
  u=te.tv_usec-tb.tv_usec;
  printf("%-30s%10li.%.6li\n", idtxt, 
     (s*1000000+u)/1000000, (s*1000000+u)%1000000);
}

int main(int argc, char *argv[])
{
  double *x  = NULL;
  double *id = NULL;
  int i, n;

  // vector size is a command line parameter
  if(argc < 2) {
    fprintf(stderr, "usage: %s n\n", argv[0]);
    return 1;
  }
  n = atoi(argv[1]);
  printf("x size %i\n", n);

  // not included in timing in MATLAB
  x = calloc(n, sizeof(double));
  memset(x, 0, sizeof(double)*n);

  // create index vector
  tic();
  id  = malloc(sizeof(double)*n);
  for(i=0; i<n; i++) id[i] = i;
  toc("id = 1:n");

  // use id to index x and set all entries to 1
  tic();
  for(i=0; i<n; i++) {  
    long iid = (long)id[i];
    if(iid>=0 && iid<n && (double)iid==id[i]){
      x[iid] = 1;
    } else break;
  }
  toc("x(id) = 1");
}

Answer 1

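EDIT: Disregard this if you can't split up the arrays!

I think it can be improved by taking advantage of a common cache concept. You can make data accesses close either in time or in location. With tight for loops, you can achieve a better cache hit rate by shaping your data structures like your for loop. In this case you access two different arrays, usually at the same index in each. Your machine loads chunks of both arrays on each iteration of the loop. To get more use out of each load, create a structure that holds one element of each array, and create a single array of that structure: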

struct my_arrays
{
    double x;
    int id;
};

struct my_arrays *arr = malloc(sizeof(struct my_arrays) * n);

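Now, each time data is loaded into the cache, you use everything that was loaded, because the elements of the two arrays sit next to each other.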
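
A minimal sketch of how the loop might look with this layout. The function name mark_interleaved is hypothetical, and it assumes the indices are stored as int, as in the struct above, which the question rules out for the real program.

```c
#include <stdlib.h>

struct my_arrays
{
    double x;
    int id;
};

/* Walks the interleaved array, using each id as an index into the same array.
   Stops at the first out-of-range id and returns how many entries were handled. */
static int mark_interleaved(struct my_arrays *arr, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int iid = arr[i].id;
        if (iid < 0 || iid >= n)
            break;              /* invalid index: stop, as in the original loop */
        arr[iid].x = 1;
    }
    return i;
}
```

Note that on typical ABIs the struct is padded to 16 bytes, so part of every loaded cache line is padding rather than data.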

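EDIT: Since your intent is to check for an integer value, and you explicitly assume that the values are small enough to be represented exactly as a double with no loss of precision, I think your comparison is fine.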
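
To illustrate that assumption, here is a small sketch with a hypothetical helper, round_trips. It assumes IEEE 754 doubles, where every integer up to 2^53 in magnitude is exactly representable, and it assumes |k| is at most 2^62 so that the conversion back to an integer cannot overflow.

```c
#include <stdint.h>

/* Returns 1 if the integer k survives a conversion to double and back unchanged.
   Precondition (assumed): |k| <= 2^62, so the cast back to int64_t is defined. */
static int round_trips(int64_t k)
{
    double d = (double)k;       /* may round if k needs more than 53 significant bits */
    return (int64_t)d == k;
}
```

For example, round_trips(100000000) holds for the n=1e8 used in the question, while round_trips(9007199254740993), that is 2^53 + 1, does not, because that value is rounded to 2^53 when converted to double.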

My previous answer included a warning to beware of comparing large doubles after an implicit cast, and I referenced this: What is the most effective way for float and double comparison?

Answer 2

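It might be worth examining the representation of the double type.

For example, the following code shows how to check whether a double greater than 1 is an integer below 999: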

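#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// x is assumed to be >= 1; returns true if x is an integer below 999
bool check(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          // well-defined way to read the bit pattern
    uint32_t hi = (uint32_t)(bits >> 32);    // sign, exponent, upper 20 bits of fraction
    uint32_t lo = (uint32_t)bits;            // lower 32 bits of fractional part
    uint32_t exp = ((hi >> 20) & 0x7ff) - 1023; // unbiased exponent
    if (exp > 9)                             // x >= 1024, or x < 1 (wraps around)
        return false;
    uint32_t fraction1 = hi << (12 + exp);   // upper bits of fractional part
    if (lo != 0 || fraction1 != 0)
        return false;                        // x is not an integer
    return hi < 0x408f3800;                  // this is the representation of 999
}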

This looks like a lot of code, but it might be easy to vectorize (using e.g. SSE), and if your bound is a power of 2, the code might simplify further.
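
A minimal sketch of the power-of-2 case, under the same assumption of IEEE 754 doubles. The function name is_int_below_pow2 and the parameter k are hypothetical: it tests whether x is an integer in the range 1 to 2^k - 1, so an index of 0 would still need separate handling.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns true if x is an integer with 1 <= x < 2^k.
   Precondition (assumed): 1 <= k <= 52. */
static bool is_int_below_pow2(double x, unsigned k)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);     /* read the bit pattern of x */
    /* Sign and biased exponent minus the bias. Negative values, values
       below 1, infinities and NaNs all end up with e >= k here. */
    uint64_t e = (bits >> 52) - 1023;
    if (e >= k)
        return false;
    /* Drop sign, exponent and the e integer bits of the mantissa;
       whatever remains is the fractional part and must be zero. */
    return (bits << (12 + e)) == 0;
}
```

With a power-of-2 bound the range test reduces to a single comparison on the exponent, and no comparison against the bit pattern of the bound is needed.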

Efficient index bound checks and double to integer conversion
