Question

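As far as I know, a CPU performs best with data that is aligned on a boundary equal to the size of that data. For example, if every int is 4 bytes in size, then the address of every int must be a multiple of 4 to make the CPU happy; the same goes for 2-byte short data and 8-byte double data. For this reason, the new operator and the malloc function always return an address that is a multiple of 8, and therefore a multiple of 4 and 2.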

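In my program, some time-critical algorithms that process large byte arrays can stride through the data by converting every 4 consecutive bytes into an unsigned int, and in this way the arithmetic can be done faster. However, the address of the byte array is not guaranteed to be a multiple of 4, because only part of a byte array may need to be processed.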

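As far as I know, Intel CPUs operate on unaligned data correctly, but at the expense of speed. If operating on unaligned data is slow enough, the algorithms in my program will need to be redesigned. In this connection I have two questions, the first of which is supported by the following code: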

// the address of array0 is a multiple of 4:
unsigned char* array0 = new unsigned char[4];
array0[0] = 0x00;
array0[1] = 0x11;
array0[2] = 0x22;
array0[3] = 0x33;
// the address of array1 is a multiple of 4 too:
unsigned char* array1 = new unsigned char[5];
array1[0] = 0x00;
array1[1] = 0x00;
array1[2] = 0x11;
array1[3] = 0x22;
array1[4] = 0x33;
// OP1: the address of the 1st operand is a multiple of 4,
// which is optimal for an unsigned int:
unsigned int anUInt0 = *((unsigned int*)array0) + 1234;
// OP2: the address of the 1st operand is not a multiple of 4:
unsigned int anUInt1 = *((unsigned int*)(array1 + 1)) + 1234;

So the questions are:

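How much slower is OP2 than OP1 on x86, x86-64, and Itanium processors (ignoring the cost of the type cast and the address increment)?
When writing cross-platform portable code, which kinds of processors should I be concerned about with respect to misaligned data access? (I already know about the RISC ones.)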

Answer 1

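There are too many processors on the market to give a general answer. The only thing that can be said with certainty is that some processors cannot perform unaligned accesses at all; this may or may not matter to you if your program is intended to run in a homogeneous environment, e.g. Windows.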
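For code that may end up on a processor that cannot perform unaligned accesses, one commonly used approach is to copy the bytes into an aligned local variable with std::memcpy instead of casting the byte pointer. The following is only a sketch under that assumption: load_u32 and sum_words are hypothetical names that do not appear in the question, and the loaded value is in the host's byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the question): reads 4 bytes starting at p,
// which may have any alignment. The result is in host byte order.
inline std::uint32_t load_u32(const unsigned char* p) {
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);  // defined behavior for any address
    return value;
}

// Hypothetical helper: adds up n_words consecutive 4-byte groups starting at p,
// mirroring the question's idea of treating 4 bytes as one unsigned int.
// The sum wraps around modulo 2^32, as unsigned arithmetic does.
inline std::uint32_t sum_words(const unsigned char* p, std::size_t n_words) {
    assert(p != nullptr || n_words == 0);
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n_words; ++i) {
        total += load_u32(p + 4 * i);
    }
    return total;
}
```

With the question's array1, load_u32(array1 + 1) would take the place of the cast in OP2. On x86, compilers typically reduce such a fixed-size memcpy to a single load instruction, so the portable form need not be slower there, though that is worth confirming by measurement.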

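On modern high-speed processors, the speed of an unaligned access may be affected more by its cache alignment than by its address alignment. On today's x86 processors the cache line size is 64 bytes.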
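To make the distinction between address alignment and cache alignment concrete, here is a small sketch that classifies an access by address. It assumes the 64-byte cache line mentioned above; kCacheLine, crosses_cache_line, and is_aligned are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size, taken from the 64-byte figure above;
// other processors may use a different size.
constexpr std::uintptr_t kCacheLine = 64;

// Hypothetical helper: true if the `size` bytes starting at `addr`
// have their first and last byte in different cache lines.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 && (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// Hypothetical helper: true if `addr` is a multiple of `alignment`.
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    return addr % alignment == 0;
}
```

For example, a 4-byte read at address 1 is not 4-byte aligned but stays inside one cache line, whereas a 4-byte read at address 61 covers bytes 61 to 64 and so touches two lines. For a real pointer p, the address can be obtained with reinterpret_cast<std::uintptr_t>(p).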

There is a Wikipedia article that may provide some general guidance: http://en.wikipedia.org/wiki/Data_structure_alignment

Speed of operations on unaligned data
