Question

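Today, while I was writing a pool allocator, a question came to mind:
Is it possible to beat the compiler?

By beating the compiler, I mean writing code that performs memory allocation faster (in fewer clock cycles) than the simplest version, which is declaring variables on the stack one after another.

So I came up with a very simple BytePool: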

template <size_t PoolSize>
class BytePool
{
public:
    template <typename T>
    T& At(size_t p_index)
    {
        return (T&)m_data[p_index * sizeof(T)];
    }

private:
    std::byte m_data[PoolSize];
};

This simple code lets me allocate a byte array on the stack once and then access it as if it were an array of T.
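As a side note, casting a std::byte buffer to T& the way At does relies on the buffer being suitably aligned for T and sidesteps the object-lifetime rules. Below is a minimal sketch of a variant that copies values in and out with std::memcpy instead, assuming T is trivially copyable and default constructible. The names CopyingBytePool, Store and Load are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical variant of BytePool: values are copied in and out with
// std::memcpy, so no T& ever aliases the raw byte buffer.
template <std::size_t PoolSize>
class CopyingBytePool
{
public:
    template <typename T>
    void Store(std::size_t p_index, const T& p_value)
    {
        static_assert(std::is_trivially_copyable_v<T>);
        assert((p_index + 1) * sizeof(T) <= PoolSize); // slot must fit in the pool
        std::memcpy(m_data + p_index * sizeof(T), &p_value, sizeof(T));
    }

    template <typename T>
    T Load(std::size_t p_index) const
    {
        static_assert(std::is_trivially_copyable_v<T>);
        assert((p_index + 1) * sizeof(T) <= PoolSize); // slot must fit in the pool
        T value;
        std::memcpy(&value, m_data + p_index * sizeof(T), sizeof(T));
        return value;
    }

private:
    std::byte m_data[PoolSize] = {}; // zero-initialized so Load never reads garbage
};
```

With this variant, a write looks like pool.Store<int>(1, 42) and a read like pool.Load<int>(1). It gives up the pseudo-variable syntax of the macros in exchange for well-defined behavior.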

To manipulate this array, I made a macro:

#define is(type, slot) bytePool.At<type>(slot)

This macro lets me write: #define a is (int, 0x0000), so that a becomes a pseudo-variable referring to bytePool[sizeof(int) * 0x0000].

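With this macro, I wrote a simple piece of code that performs basic operations on a few numbers (some defined at compile time and some at run time, such as b and c):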

BytePool<sizeof(int) * 6> bytePool;

#define is(type, slot) bytePool.At<type>(slot)

#define a is (int, 0x0000)
#define b is (int, 0x0001)
#define c is (int, 0x0002)
#define d is (int, 0x0003)
#define e is (int, 0x0004)
#define f is (int, 0x0005)

a = 0;
b = (int)time(nullptr);
c = (int)__rdtsc();
d = 2 * b;
e = c - 3;
f = 18 ^ 2;

a = ~(b * c) * d + e / f;

#undef a
#undef b
#undef c
#undef d
#undef e
#undef f

Fun! This code looks as if I were manually assigning memory slots to my variables.

The equivalent without the BytePool would be:

int a;
int b;
int c;
int d;
int e;
int f;

a = 0;
b = (int)time(nullptr);
c = (int)__rdtsc();
d = 2 * b;
e = c - 3;
f = 18 ^ 2;

a = ~(b * c) * d + e / f;

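The question I asked myself at this point was:
Which approach is better?

Allocating sizeof(int) on the stack 6 times
Allocating sizeof(int) * 6 on the stack once

Naturally, I was sure that allocating the memory once would be faster, so I assumed my BytePool approach was the faster one.

Now, let's hear what the compiler has to say. I wrote some benchmark code: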

#include <cstddef>   // std::byte, size_t
#include <cstdlib>   // system
#include <ctime>     // time
#include <intrin.h>  // __rdtsc (MSVC)
#include <iostream>

template <size_t PoolSize>
class BytePool
{
public:
    template <typename T>
    T& At(size_t p_index)
    {
        return (T&)m_data[p_index * sizeof(T)];
    }

private:
    std::byte m_data[PoolSize];
};

void Stack()
{
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;
}

void Pool()
{
    BytePool<sizeof(int) * 6> bytePool;

    #define is(type, slot) bytePool.At<type>(slot)

    #define a is (int, 0x0000)
    #define b is (int, 0x0001)
    #define c is (int, 0x0002)
    #define d is (int, 0x0003)
    #define e is (int, 0x0004)
    #define f is (int, 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
}

void FastPool()
{
    int fastBytePool[6];

    #define a   *(fastBytePool)
    #define b   *(fastBytePool + 0x0001)
    #define c   *(fastBytePool + 0x0002)
    #define d   *(fastBytePool + 0x0003)
    #define e   *(fastBytePool + 0x0004)
    #define f   *(fastBytePool + 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
}

void FastHeapPool()
{
    int* fastBytePool = new int[6];

    #define a   *(fastBytePool)
    #define b   *(fastBytePool + 0x0001)
    #define c   *(fastBytePool + 0x0002)
    #define d   *(fastBytePool + 0x0003)
    #define e   *(fastBytePool + 0x0004)
    #define f   *(fastBytePool + 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f

    delete[] fastBytePool;
}

size_t Benchmark(void (p_function)(), size_t p_iterations)
{
    size_t cycleSum = 0;

    for (size_t it = 0; it < p_iterations; ++it)
    {
        size_t startCycles = __rdtsc();
        p_function();
        cycleSum += __rdtsc() - startCycles;
    }

    return cycleSum / p_iterations;
}

int main()
{
    const size_t iterations = 100000;

    while (true)
    {
        std::cout << "Stack():        \t" << Benchmark(Stack, iterations)           <<  "\tcycles\n";
        std::cout << "Pool():         \t" << Benchmark(Pool, iterations)            <<  "\tcycles\n";
        std::cout << "FastPool():     \t" << Benchmark(FastPool, iterations)        <<  "\tcycles\n";
        std::cout << "FastHeapPool(): \t" << Benchmark(FastHeapPool, iterations)    <<  "\tcycles\n";

        std::cin.get();

        system("CLS");
    }

    return 0;
}

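The 4 tests are:

Stack (the classic way)
Pool (a byte pool preallocated on the stack)
FastPool (a byte pool preallocated on the stack, with no class abstraction and no method calls)
FastHeapPool (a byte pool preallocated on the heap, with no class abstraction and no method calls)

Here are the results with MSVC v142 using C++17: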

Debug: [benchmark results screenshot]

Release: [benchmark results screenshot]

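Well... that is not what I expected!

FastPool turns out to be equivalent to the classic way. That means 6 allocations make no significant difference compared to 1 big allocation.
The simple Pool (using the BytePool class) is very slow, I guess because of the method calls, which seem to get optimized away in release mode.
FastHeapPool is a disaster: even in release mode, heap allocation and access appear to be very slow (which I expected).

So now, my questions are:

Is there a way to beat the classic approach (6 allocations on the stack), and why does allocating sizeof(int) 6 times cost the same as allocating sizeof(int) * 6 once?

I am only talking about memory here, not about optimizing the operations.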

Answer 1

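Your test is fundamentally flawed. The functions Stack(), Pool() and FastPool() boil down to a NOP (they don't do anything!): none of them produces a result that is ever observed, so the optimizer is free to remove their bodies entirely. new/delete, however, may have side effects, which would account for the performance difference in release. Now, you probably need to learn what a stack allocation actually does! If a stack-allocated variable is used within a function, it will very likely live in a register (unless it is a non-POD type with side effects), and whatever crazy scheme you build to mimic that with memory will simply be orders of magnitude slower, due to latency, cache misses and so on.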
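To make that point concrete, here is a minimal sketch of a benchmark whose work cannot be discarded, because every result is stored to a volatile object. It is an illustration under stated assumptions rather than a rigorous harness: g_sink, Compute and AverageNanoseconds are hypothetical names, std::chrono::steady_clock replaces the MSVC-specific __rdtsc, p_iterations is assumed to be greater than zero, and the inputs are kept small so the signed arithmetic cannot overflow.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical sink. A store to a volatile object is an observable side
// effect, so the compiler has to produce the value that is stored.
volatile int g_sink = 0;

// Same arithmetic as in the question, but the result is returned
// instead of being dropped.
int Compute(int b, int c)
{
    const int d = 2 * b;
    const int e = c - 3;
    const int f = 18 ^ 2; // note: ^ is XOR in C++, so f is 16
    return ~(b * c) * d + e / f;
}

// Times p_iterations calls of p_function and returns the average
// duration of one call in nanoseconds.
template <typename Fn>
double AverageNanoseconds(Fn p_function, std::size_t p_iterations)
{
    const auto start = std::chrono::steady_clock::now();

    for (std::size_t it = 0; it < p_iterations; ++it)
    {
        // Inputs vary per iteration and stay in [0, 255] to avoid overflow.
        const int b = static_cast<int>(it & 0xFF);
        const int c = static_cast<int>((it >> 8) & 0xFF);
        g_sink = p_function(b, c);
    }

    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(p_iterations);
}
```

Calling AverageNanoseconds(Compute, 100000) gives a per-call average. Even with a sink, the optimizer may still inline and simplify the arithmetic, so the generated assembly remains the final authority on what is being measured.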

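Back in the old days, we used the register keyword to distinguish a stack-allocated variable from one held in a register. Not anymore, because it was basically pointless. These days a stack allocation only happens when you run out of registers and a register value has to be spilled to stack space.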

Answer 2

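I'm going to ignore your code, because I can't tell which version should be faster...

In any case, you seem to have a misunderstanding of how compilers work. No modern compiler translates a program line by line. They all build a so-called abstract syntax tree (AST), a representation of what the program does. That syntax tree is then heavily transformed to give you the best possible optimizations (unrolling loops, precomputing values, ...). Finally, the compiler's backend generates an executable from the syntax tree that is optimized for your system (using machine-specific instructions where available).

Because of all these stages, it is hard to guess what machine code your C++ will produce. In many cases the compiler can even generate the same machine code from completely different programming approaches. So in your example it is impossible to say which code runs faster without looking at the binary.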
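As an illustration, the two hypothetical functions below (WithLocals and WithArray are names made up for this sketch) express the computation from the question once with named locals and once with array slots. They return the same value, and an optimizing compiler will typically reduce both to the same handful of register operations, but that is an expectation to verify in the generated assembly, not a guarantee.

```cpp
#include <cassert>

// Named local variables, as in Stack() from the question.
int WithLocals(int b, int c)
{
    int a = 0;
    int d = 2 * b;
    int e = c - 3;
    int f = 18 ^ 2; // XOR, so f is 16

    a = ~(b * c) * d + e / f;
    return a;
}

// One array with a slot per variable, as in FastPool() from the question.
int WithArray(int b, int c)
{
    int slots[6];

    slots[0] = 0;
    slots[1] = b;
    slots[2] = c;
    slots[3] = 2 * slots[1];
    slots[4] = slots[2] - 3;
    slots[5] = 18 ^ 2;

    slots[0] = ~(slots[1] * slots[2]) * slots[3] + slots[4] / slots[5];
    return slots[0];
}
```

Pasting both functions into a tool that shows compiler output, with optimizations enabled, is the direct way to check whether they really compile to the same instructions on a given compiler.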

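Because of the way they are written, your "fast" versions are likely to be slow. Compilers like simple code. Your versions are written in a complicated way, which makes them hard for the compiler to optimize.

If you are interested in compilers and optimization, you should check out:

Matt Godbolt's Compiler Explorer: a website where you can actually compare the output of different compilers
Matt Godbolt's talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid": he goes into detail about the amazing optimizations modern compilers can perform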

Can we beat the compiler?
