Question

Consider the following header file "tls.h":

#include <stdint.h>

// calling this function is expensive
uint64_t foo(uint64_t x);

extern __thread uint64_t cache;

static inline uint64_t
get(uint64_t x)
{
    // if cache is not valid
    if (cache == UINT64_MAX)
        cache = foo(x);

    return cache + x;
}

and the source file "tls.c":

#include "tls.h"

__thread uint64_t cache = {0};

uint64_t foo(uint64_t x)
{
    // imagine some calculations are performed here
    return 0;
}

Here is an example of how get() is used in "main.c":

#include "tls.h"

uint64_t t = 0;

int main()
{
    uint64_t x = 0;

    for(uint64_t i = 0; i < 1024UL * 1024 * 1024; i++){
        t += get(i);
        x++;
    }
}

The files are compiled as follows:

gcc -c -O3 tls.c
gcc -c -O3 main.c
gcc -O3 main.o tls.o

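Checking the performance of the loop in "main.c" shows that the compiler optimizes it very poorly. Disassembling the binary makes it clear that the TLS variable is accessed on every iteration. The execution time on my machine is 1.7 seconds.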

However, if I remove the cache validity check from the get() function, it looks like this:

static inline uint64_t
get(uint64_t x)
{
    return cache + x;
}

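The compiler is now able to generate much faster code: it eliminates the loop entirely and emits just a single "add" instruction. The execution time is about 0.02 s.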

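Why can't the compiler optimize the first case? A TLS variable cannot be changed by other threads, so the compiler should be able to optimize this, right?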

Is there another way to optimize the get() function?
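For reference, one direction that might be worth trying is to hoist the TLS read out of the loop by hand. The following is only a minimal sketch, not a verified fix: cache_value() and sum_hoisted() are hypothetical names that are not part of the files above, foo() is a local stub, and cache starts out invalid so that the check is exercised. It assumes that foo() never returns UINT64_MAX and that nothing invalidates cache while the loop runs; if either assumption fails, the behavior differs from the original get().

```c
#include <stdint.h>

/* Stub standing in for the expensive function defined in tls.c. */
static uint64_t foo(uint64_t x)
{
    (void)x;
    return 0;
}

/* UINT64_MAX marks the cache as "not valid" (assumption: the sketch
   starts with an invalid cache, unlike tls.c, which starts at 0). */
static __thread uint64_t cache = UINT64_MAX;

/* Hypothetical helper: perform the validity check once and return the
   cached value so the caller can keep it in a local variable. */
static inline uint64_t cache_value(uint64_t x)
{
    if (cache == UINT64_MAX)
        cache = foo(x);
    return cache;
}

/* Hypothetical caller: computes the sum of get(i) for i in [0, n),
   reading the TLS variable only once before the loop. */
uint64_t sum_hoisted(uint64_t n)
{
    if (n == 0)
        return 0;

    const uint64_t c = cache_value(0); /* single TLS access */
    uint64_t t = 0;

    for (uint64_t i = 0; i < n; i++)
        t += c + i; /* loop body no longer touches cache or calls foo() */

    return t;
}
```

With the stub above, sum_hoisted(4) evaluates to 6 (0 + 1 + 2 + 3), because the cached value is 0. Whether the compiler then collapses the loop the way it does in the check-free version is something that would have to be confirmed by looking at the disassembly.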

Why can't the compiler optimize reads from TLS?

0 answers: