Question

I wrote a piece of code that counts the frequency of each value between 0 and 255.

unsigned char arr[4096]; //aligned 64 bytes, filled with random characters

short counter[256]; //aligned 32 bytes

register int i;

for(i = 0; i < 4096; i++)
    ++counter[arr[i]];

It takes a lot of time to execute; the random access to the counter array is very expensive.

Does anyone have an idea for making the access sequential, or any other approach I could use?

Answer 1

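What makes you think that random access to the counter array is expensive? Have you profiled it? Try Valgrind, which has a cache profiling tool called "cachegrind". Profiling also tells you whether the code is actually slow, or whether you only think it is slow because it ought to be.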

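This is a very simple piece of code, and before optimizing it is important to know whether or not it is memory bound (with respect to the input data, not the histogram table). I can't answer that off the top of my head. Try comparing it against a simple algorithm that just sums the entire input: if both run at about the same speed, then your algorithm is memory bound and you are done.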
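A minimal sketch of that comparison, assuming hypothetical function names `sum_bytes` and `histogram_bytes` (neither appears in the question). Both loops read every input byte exactly once; only the second one also touches the table, so timing them against each other separates the cost of streaming the input from the cost of the histogram updates.

```c
#include <stddef.h>

/* Baseline: reads every input byte once, with no table lookups. */
unsigned long sum_bytes(const unsigned char *data, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Loop under test: one read-modify-write of the table per input byte. */
void histogram_bytes(const unsigned char *data, size_t n, unsigned counts[256])
{
    for (size_t v = 0; v < 256; v++)
        counts[v] = 0;
    for (size_t i = 0; i < n; i++)
        ++counts[data[i]];
}
```

To compare them, call each one many times over the same 4096-byte buffer and time the runs, for example with `clock()` from `<time.h>`. Use the results (print the sum or a few counts) so the optimizer cannot discard the loops.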

My best guess is that the main issue that could be slowing you down is this:

   Registers                      RAM
1.  <-- read data[i] ---------------
2.  <-- read histogram[data[i]] ----
3. increment
4.  --- write histogram[data[i]] -->
5.  <-- read data[i] ---------------
6.  <-- read histogram[data[i]] ----

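The compiler and the processor are not allowed to reorder most of the instructions here (except #1 and #5, which can be done ahead of time), so you are basically going to be limited by the smaller of two things: the bandwidth of your L1 cache (which is where the histogram lives) and the bandwidth of your main RAM, each multiplied by some unknown constant factor. (Note: the compiler can only move #1/#5 around if it unrolls the loop, but the processor might be able to move them around anyway.)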

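This is why you profile before you try to get clever: if your L1 cache has enough bandwidth, then you will always be starved for input data, and there is nothing you can do about it.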
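If profiling does point at the read-increment-write chain on the table rather than at the input stream, one commonly used workaround is to count into several independent tables and merge them at the end. This is not part of the answer above, only a hedged sketch: the function name `histogram_split` and the choice of four partial tables are assumptions, and whether it helps at all depends on the CPU and has to be measured.

```c
#include <stddef.h>

/* Count into four independent tables so that consecutive increments
 * never read a slot that the previous increment has just written,
 * then merge the partial tables into the caller's array. */
void histogram_split(const unsigned char *data, size_t n, unsigned counts[256])
{
    unsigned partial[4][256] = {{0}};
    size_t i = 0;

    /* Main loop: four input bytes per iteration, one table each. */
    for (; i + 4 <= n; i += 4) {
        ++partial[0][data[i]];
        ++partial[1][data[i + 1]];
        ++partial[2][data[i + 2]];
        ++partial[3][data[i + 3]];
    }

    /* Tail: leftover bytes when n is not a multiple of 4. */
    for (; i < n; i++)
        ++partial[0][data[i]];

    /* Merge the partial tables. */
    for (size_t v = 0; v < 256; v++)
        counts[v] = partial[0][v] + partial[1][v] + partial[2][v] + partial[3][v];
}
```

The extra tables cost 4 KB of stack with a 4-byte `unsigned`, which still fits comfortably in a typical L1 data cache.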

Footnote:

This code:

register int i;
for(i = 0; i < 4096; i++)
    ++counter[arr[i]];

generates the same assembly as this code:

int i;
for(i = 0; i < 4096; i++)
    counter[arr[i]]++;

But the second version is easier to read.

Answer 2

More idiomatic:

// make sure you actually fill this with random chars
// if this is declared in a function, it _might_ have stack garbage
// if it's declared globally, it will be zeroed (which makes for a boring result)
unsigned char arr[4096]; 
// since you're counting bytes in an array, the array can't have more
// bytes than the current system memory width, so then size_t will never overflow
// for this usage
size_t counter[256] = {0}; // counters must start at zero

for(size_t i = 0; i < sizeof(arr)/sizeof(*arr); ++i)
    ++counter[arr[i]];

Now the key is to compile as C99, with some serious optimization flags:

cc mycode.c -O3 -std=c99

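With any sort of optimization, a simple loop like this will be extremely fast. Don't waste more time trying to make something like this faster.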

Answer 3

First of all, I completely agree with Dietrich: please first prove (to yourself and to us) where the real bottleneck is.

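The only possible improvement I can see is in your short. The size of the table will not be the problem here, I guess, but the promotions and the overflow are. Use a type that handles this by default, namely unsigned.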

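In any case, counters should always be unsigned (or even better, size_t); those are the semantics of cardinalities. As an extra advantage, unsigned types do not overflow but wrap around in a controlled way, so the compiler does not have to emit an additional instruction for that.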

Also, arithmetic in C is performed at a width of at least that of int. The result then has to be converted back to short.
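The two points about promotion and wraparound can be illustrated with a small sketch. The helper names `bump_short` and `bump_unsigned` are hypothetical and exist only to make the types visible.

```c
#include <limits.h>

/* With a short counter, the operand is promoted to int, the addition
 * is done in int, and the result is converted back to short. */
short bump_short(short c)
{
    return (short)(c + 1);   /* c + 1 has type int */
}

/* With an unsigned counter there is no narrowing step, and the
 * arithmetic is defined to wrap: UINT_MAX + 1u yields 0. */
unsigned bump_unsigned(unsigned c)
{
    return c + 1u;
}
```

For the 4096-byte input in the question a short cannot actually overflow, since no count can exceed 4096, so the difference here is about the conversion back to short rather than about correctness.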

Answer 4

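The code takes data of size 4k... it adds up every 3 consecutive bytes and stores the result in a temporary buffer of size 4k. The temporary buffer is then used to generate the histogram.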

The addition of 3 consecutive bytes can be vectorized using SIMD instructions.
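A scalar sketch of that pre-processing step, under two assumptions the post does not confirm: the window slides by one byte at a time, and each sum is truncated to one byte so that it stays in the 0 to 255 range the histogram expects. The name `sum3` is hypothetical.

```c
#include <stddef.h>

/* Assumed behavior: out[i] = in[i] + in[i+1] + in[i+2], reduced modulo 256.
 * Writes n - 2 results; does nothing when n < 3. */
void sum3(const unsigned char *in, size_t n, unsigned char *out)
{
    if (n < 3)
        return;
    for (size_t i = 0; i + 2 < n; i++)
        out[i] = (unsigned char)(in[i] + in[i + 1] + in[i + 2]);
}
```

No iteration depends on the result of another, which is what makes this step a candidate for SIMD; an optimizing compiler may vectorize a loop of this shape on its own at higher optimization levels. The histogram update that follows has no such independence.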

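Following Dietrich's suggestion, if I simply add up the values in the temporary buffer instead of generating the histogram, it executes very fast. The generation of the histogram is the part that takes time. I profiled the code using cachegrind... the output is: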

==11845== 
==11845== I   refs:      212,171
==11845== I1  misses:        842
==11845== LLi misses:        827
==11845== I1  miss rate:    0.39%
==11845== LLi miss rate:    0.38%
==11845== 
==11845== D   refs:       69,179  (56,158 rd   + 13,021 wr)
==11845== D1  misses:      2,905  ( 2,289 rd   +    616 wr)
==11845== LLd misses:      2,470  ( 1,895 rd   +    575 wr)
==11845== D1  miss rate:     4.1% (   4.0%     +    4.7%  )
==11845== LLd miss rate:     3.5% (   3.3%     +    4.4%  )
==11845== 
==11845== LL refs:         3,747  ( 3,131 rd   +    616 wr)
==11845== LL misses:       3,297  ( 2,722 rd   +    575 wr)
==11845== LL miss rate:      1.1% (   1.0%     +    4.4%  )

and the full output is:

I1 cache:         65536 B, 64 B, 2-way associative
D1 cache:         65536 B, 64 B, 2-way associative
LL cache:         1048576 B, 64 B, 16-way associative
Command:          ./a.out
Data file:        cachegrind.out.11845
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
     Ir I1mr ILmr     Dr  D1mr  DLmr     Dw D1mw DLmw 
--------------------------------------------------------------------------------
212,171  842  827 56,158 2,289 1,895 13,021  616  575  PROGRAM TOTALS

--------------------------------------------------------------------------------
    Ir I1mr ILmr     Dr  D1mr  DLmr     Dw D1mw DLmw  file:function
--------------------------------------------------------------------------------
97,335  651  642 26,648 1,295 1,030 10,883  517  479  ???:???
59,413   13   13 13,348   886   829     17    1    0  ???:_dl_addr
40,023    7    7 12,405    10     8    223   18   17  ???:core_get_signature
 5,123    2    2  1,277    64    19    256   64   64  ???:core_get_signature_parallel
 3,039   46   44    862     9     4    665    8    8  ???:vfprintf
 2,344   11   11    407     0     0    254    1    1  ???:_IO_file_xsputn
   887    7    7    234     0     0    134    1    0  ???:_IO_file_overflow
   720    9    7    250     5     2    150    0    0  ???:__printf_chk
   538    4    4    104     0     0    102    2    2  ???:__libc_memalign
   507    6    6    145     0     0    114    0    0  ???:_IO_do_write
   478    2    2     42     1     1      0    0    0  ???:strchrnul
   350    3    3     80     0     0     50    0    0  ???:_IO_file_write
   297    4    4     98     0     0     23    0    0  ???:_IO_default_xsputn

Answer 5

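Well, Richard is certainly right. This is because the compiler has to convert the array to a pointer, which takes some time and therefore adds to the execution time. For example, try this: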

for(i = 0; i < 4096; i++)
     ++*(counter+*(arr+i));

Answer 6

Consider using a pointer into arr instead of indexing.

unsigned char *p = arr; /* pointer to the first element of arr */
for (i = 4096-1; 0 <= i; --i)
  ++counter[*p++];

Making C code run faster
