Question

在进行一些性能测试时，我遇到了一些我似乎无法解释的情况。

我写了以下C代码：

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

我使用gcc将它与测试驱动程序一起编译成单个二进制文件。我还使用gcc将它自己编译成一个共享对象，我通过p / invoke从C＃调用它。目的是衡量从C＃调用本机代码的性能开销。

在C和C＃中，我创建等长的随机值输入数组，然后测量multi_arr运行所需的时间。在C＃和C中，我使用POSIX clock_gettime（）调用进行计时。我已经在调用multi_arr之前和之后定位了定时调用，因此输入准备时间等不会影响结果。我运行100次迭代并报告平均时间和最小时间。

即使C和C＃正在执行完全相同的功能，C＃也会在大约50％的时间内提前出现，通常是大量的。例如，对于1,048,576的len，C＃的最小值为768,400 ns，而C的最小值为1,344,105。 C＃的平均值为1,018,865，而C的1,852,880。我在这个图表中添加了一些不同的数字（记住日志比例）：

这些结果对我来说似乎非常错误，但工件在多个测试中是一致的。我检查了asm和IL来验证是否正确。比特是一样的。我不知道在这个程度上可能会影响性能。我把最小的复制示例放在here上。

这些测试全部在Linux（KDE neon，基于Ubuntu Xenial）上运行，使用dotnet-core 2.0.0和gcc 5.0.4。

有没有人见过这个？

Answer 1

它依赖于对齐，正如您已经怀疑的那样。返回内存，以便编译器可以将其用于在存储或检索数据类型（如双精度或整数）时不会导致不必要的错误的结构，但它不会对内存块如何适应缓存做出承诺。 / p>

这种变化取决于您测试的硬件。假设你在这里谈论的是x86_64，这意味着英特尔或AMD处理器及其与主存储器访问相比的缓存相对速度。

您可以通过各种路线测试来模拟这一点。

这是我拼凑在一起的示例程序。在我的i7上，我看到了很大的变化，但第一个最不对齐的访问速度比更对齐的版本慢得多。

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

uint64_t getnsec()
{
  struct timespec n;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &n);
  return (uint64_t) n.tv_sec * 1000000000 + n.tv_nsec;
}

#define CACHE_SIZE (16 * 1024 * 1024 / sizeof(int32_t))
int main()
{
  int32_t *memory;
  int32_t *unaligned;
  int32_t *x;
  int32_t *y;
  int count;
  uint64_t start, elapsed;
  int32_t len = 1024 * 16;
  int64_t aligned = 1;

  memory = calloc(sizeof(int32_t), 4 * CACHE_SIZE);

  // make unaligned as unaligned as possible, e.g. to 0b11111111111111100

  unaligned = (int32_t *) (((intptr_t) memory + CACHE_SIZE) & ~(CACHE_SIZE - 1));
  printf("memory starts at %p, aligned %p\n", memory, unaligned);
  unaligned = (int32_t *) ((intptr_t) unaligned | (CACHE_SIZE - 1));
  printf("memory starts at %p, unaligned %p\n", memory, unaligned);

  for (aligned = 1; aligned < CACHE_SIZE; aligned <<= 1)
  {
    x = (int32_t *) (((intptr_t) unaligned + CACHE_SIZE) & ~(aligned - 1));

    start = getnsec();
    for (count = 1; count < 1000; count++)
    {
      multi_arr(x, x + len, x + len + len, len);
    }
    elapsed = getnsec() - start;
    printf("memory starts at %p, aligned %08"PRIx64" to > cache = %p elapsed=%"PRIu64"\n", unaligned, aligned - 1, x, elapsed);
  }

  exit(0);
}

调用本机代码的C＃比本机调用本机代码更快

1 个答案: