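My goal is to measure the effect of the (different) caches using simple code. I'm following this article, specifically pages 20 and 21: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
I'm working on 64-bit Linux. The L1d cache is 32K, L2 is 256K, and L3 is 25M.
This is my code (I compile this code with g++ with no flags):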
    #include <cstddef>
    #include <iostream>

    // ***********************************
    // This is for measuring CPU clocks
    #if defined(__i386__)
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned long long int x;
        __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
        return x;
    }
    #elif defined(__x86_64__)
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
    }
    #endif
    // ***********************************

    static const int ARRAY_SIZE = 100;
    // NPAD and pad are used below but not defined in the code as posted.
    // Assumed here: the struct layout from the linked paper, with no padding
    // (a zero-length array is a GNU extension that g++ accepts).
    static const int NPAD = 0;

    struct MyStruct {
        struct MyStruct *n;
        long int pad[NPAD];
    };

    int main() {
        MyStruct myS[ARRAY_SIZE];
        unsigned long long cpu_start, cpu_finish;

        // Initializing the array of structs, each element pointing to the next
        for (int i = 0; i < ARRAY_SIZE - 1; i++) {
            myS[i].n = &myS[i + 1];
            for (int j = 0; j < NPAD; j++)
                myS[i].pad[j] = (long int) i;
        }
        myS[ARRAY_SIZE - 1].n = NULL; // the last one
        for (int j = 0; j < NPAD; j++)
            myS[ARRAY_SIZE - 1].pad[j] = (long int) (ARRAY_SIZE - 1);

        // Filling the cache
        MyStruct *current = &myS[0];
        while ((current = current->n) != NULL)
            ;

        // Sequential access
        current = &myS[0];

        // For CPU usage in terms of clocks (ticks)
        cpu_start = rdtsc();
        while ((current = current->n) != NULL)
            ;
        cpu_finish = rdtsc();

        unsigned long long avg_cpu_clocks = (cpu_finish - cpu_start) / ARRAY_SIZE;
        std::cout << "Avg CPU Clocks: " << avg_cpu_clocks << std::endl;
        return 0;
    }
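I have two questions:
1- I varied ARRAY_SIZE from 1 to 1,000,000 (so the size of my array ranges between 2B and 2MB), but the average CPU clocks is always 10.
According to that PDF (Figure 3-10 on page 21), I would expect 3-5 clocks when the array fits entirely in L1, and higher numbers (9 cycles) once it exceeds L1's size.
2- If I increase ARRAY_SIZE beyond 1,000,000, I get a segmentation fault (core dumped), which is due to stack overflow. My question is whether using dynamic allocation (MyStruct *myS = new MyStruct[ARRAY_SIZE]) incurs any performance penalty.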
Answer 0 (score: 3)
This is my code (I compile this code with g++ with no flags)
If you don't pass -O3, then while ((current = current->n) != NULL) will be compiled into multiple memory accesses, not a single load instruction. By passing -O3, the loop will be compiled into:

    .L3:
        mov rax, QWORD PTR [rax]
        test rax, rax
        jne .L3

This will run at 4 cycles per iteration, as you are expecting.
Note that you can use the __rdtsc compiler intrinsic instead of inline assembly. See: Get CPU cycle count?.
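As a rough sketch of what the intrinsic-based version could look like, assuming GCC or Clang on x86, where `<x86intrin.h>` declares `__rdtsc()`: the names `Node`, `WalkResult`, `read_ticks`, `walk` and `measure_walk` are illustrative and not from the question. The list is placed in a `std::vector` so that large sizes do not overflow the stack, and the `steady_clock` fallback for non-x86 builds is an assumption added only so the sketch compiles elsewhere (it then reports clock ticks, not CPU cycles).

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // GCC/Clang header declaring __rdtsc()
#endif

// Same shape as the question's struct: a bare "next" pointer.
struct Node {
    Node *n;
};

// Hypothetical result type for this sketch.
struct WalkResult {
    std::size_t nodes_visited;     // nodes seen by the timed walk
    std::uint64_t ticks_per_node;  // elapsed ticks divided by node count
};

// Reads the time-stamp counter on x86; elsewhere falls back to a steady
// clock (assumption: only there so the sketch builds on other targets).
static inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Follows the chain from head to the end and returns how many nodes it saw.
// Returning the count gives the loop a result the caller actually uses.
static std::size_t walk(const Node *head) {
    std::size_t visited = 0;
    for (const Node *current = head; current != nullptr; current = current->n)
        ++visited;
    return visited;
}

// Builds a sequential list of `count` heap-allocated nodes, walks it once
// to fill the cache, then times a second walk.
static WalkResult measure_walk(std::size_t count) {
    WalkResult result = {0, 0};
    if (count == 0)
        return result;

    std::vector<Node> nodes(count);  // heap storage, not the stack
    for (std::size_t i = 0; i + 1 < count; ++i)
        nodes[i].n = &nodes[i + 1];
    nodes[count - 1].n = nullptr;    // the last one

    const std::size_t warm = walk(&nodes[0]);  // filling the cache

    const std::uint64_t start = read_ticks();
    const std::size_t timed = walk(&nodes[0]);
    const std::uint64_t finish = read_ticks();

    result.nodes_visited = (warm == timed) ? timed : 0;
    result.ticks_per_node = (finish - start) / count;
    return result;
}
```

Calling `measure_walk(n)` for a range of `n` and printing `ticks_per_node` gives the same kind of number as the question's program. Building with `g++ -O3` matters here for the reason given above, and the exact figures will depend on the machine.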