Suppose I have something like this:
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
unsigned int i;
for(i=0; i<len; i++)
{
dest[i] = src[i] & mask[i];
}
}
On machines that allow unaligned access (such as x86), I can speed this up by writing something like:
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
unsigned int i;
unsigned int wordlen = len >> 2;
for(i=0; i<wordlen; i++)
{
((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; // this raises SIGBUS on SPARC and other archs that require aligned access.
}
for(i=wordlen<<2; i<len; i++){
dest[i] = src[i] & mask[i];
}
}
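For comparison, here is a word-at-a-time variant that does not depend on the target's alignment rules at all. It is a commonly used alternative, not something from the question: copying each 4-byte chunk into a local uint32_t with memcpy is valid at any address, so it cannot raise SIGBUS, and compilers typically turn a fixed-size 4-byte memcpy into a single load or store on targets that permit unaligned access. The name mask_bytes_memcpy is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable variant: memcpy to/from a local uint32_t is
   well-defined at any alignment, so this never traps, and it also avoids
   the strict-aliasing problems of casting unsigned char * to uint32_t *. */
void mask_bytes_memcpy(unsigned char *dest, const unsigned char *src,
                       const unsigned char *mask, size_t len)
{
    size_t i = 0;

    /* Whole 32-bit chunks. */
    for (; len - i >= sizeof(uint32_t); i += sizeof(uint32_t)) {
        uint32_t s, m, d;
        memcpy(&s, src + i, sizeof s);
        memcpy(&m, mask + i, sizeof m);
        d = s & m;
        memcpy(dest + i, &d, sizeof d);
    }
    /* Remaining 0..3 bytes. */
    for (; i < len; i++) {
        dest[i] = src[i] & mask[i];
    }
}
```

Whether the memcpy calls really collapse into single instructions depends on the compiler and optimization level, so it is worth checking the generated code on each target you care about.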
However, this code has to build on several architectures, so I would like to do something like:
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
unsigned int i;
unsigned int wordlen = len >> 2;
#if defined(__ALIGNED2__) || defined(__ALIGNED4__) || defined(__ALIGNED8__)
// go slow
for(i=0; i<len; i++)
{
dest[i] = src[i] & mask[i];
}
#else
// go fast
for(i=0; i<wordlen; i++)
{
// the following line will raise SIGBUS on SPARC and other archs that require aligned access.
((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i];
}
for(i=wordlen<<2; i<len; i++){
dest[i] = src[i] & mask[i];
}
#endif
}
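But I can't find any good information on compiler-defined macros (like my hypothetical __ALIGNED4__ above) that specify the alignment requirement, or on any clever way to use the preprocessor to determine the target architecture's alignment. I could just test defined(__SVR4) && defined(__sun), but I would prefer something that Just Works(TM) on other architectures that require aligned memory access.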
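One way to get behaviour that is safe on architectures you have never heard of is to invert the test: enable the fast path only on targets known to tolerate unaligned loads, and fall back to the byte loop everywhere else. The sketch below assumes that whitelisting is acceptable; MASK_BYTES_FAST_UNALIGNED and mask_bytes_uses_fast_path are hypothetical names, while __i386__/__x86_64__ (GCC, Clang) and _M_IX86/_M_X64 (MSVC) are the compilers' own predefined architecture macros.

```c
/* Hypothetical opt-in switch: only whitelisted x86 targets get the
   word-at-a-time path; every other target, known or unknown, gets the
   safe byte loop by default. */
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
#define MASK_BYTES_FAST_UNALIGNED 1
#else
#define MASK_BYTES_FAST_UNALIGNED 0
#endif

/* Hypothetical helper so the choice can be inspected at run time. */
int mask_bytes_uses_fast_path(void)
{
    return MASK_BYTES_FAST_UNALIGNED;
}
```

In mask_bytes, #if MASK_BYTES_FAST_UNALIGNED would then replace the #if defined(__ALIGNED2__) ... test (with the branches swapped). The cost of a whitelist is that some architectures which could use the fast path won't until someone adds them.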
Answer 0 (score: 5)
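While x86 silently fixes up unaligned accesses, that is not optimal for performance. It is usually better to assume a certain alignment and do the fixups yourself: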
#include <stddef.h> /* size_t */
#include <stdint.h> /* uintptr_t */
unsigned int const alignment = 8; /* or 16, or sizeof(long) */
/* Named copy_aligned rather than memcpy: defining your own memcpy clashes with the standard library function. */
void copy_aligned(char *dst, char const *src, size_t size) {
if (((uintptr_t)dst % alignment) != ((uintptr_t)src % alignment)) {
/* no common alignment, copy as bytes or shift around */
} else {
if ((uintptr_t)dst % alignment) {
/* copy bytes at the beginning */
}
/* copy words in the middle */
if (((uintptr_t)dst + size) % alignment) {
/* copy bytes at the end */
}
}
}
Also, look into SIMD instructions.
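Applied to mask_bytes, the head/middle/tail scheme from this answer might look like the sketch below. Word accesses happen only at 4-byte-aligned addresses, so it should not trap on strict-alignment targets such as SPARC; if the three pointers have different offsets within a word, it simply falls back to bytes. The function name mask_bytes_aligned is hypothetical, and the sketch assumes the buffers may be accessed through uint32_t * (for example, they were malloc'd or declared as uint32_t arrays, or the code is built with -fno-strict-aliasing).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte-wise head until dest is word-aligned,
   aligned 32-bit words in the middle, byte-wise tail.
   Assumption: the buffers may legally be accessed as uint32_t. */
void mask_bytes_aligned(unsigned char *dest, const unsigned char *src,
                        const unsigned char *mask, size_t len)
{
    const size_t word = sizeof(uint32_t);
    size_t i = 0;

    /* Words are only usable if all three pointers share the same
       offset within a word; otherwise every byte goes through the tail loop. */
    if ((uintptr_t)dest % word == (uintptr_t)src % word &&
        (uintptr_t)dest % word == (uintptr_t)mask % word) {
        /* Head: bytes until dest + i is word-aligned. */
        while (i < len && (uintptr_t)(dest + i) % word != 0) {
            dest[i] = src[i] & mask[i];
            i++;
        }
        /* Middle: whole aligned words. */
        for (; len - i >= word; i += word) {
            *(uint32_t *)(dest + i) =
                *(const uint32_t *)(src + i) & *(const uint32_t *)(mask + i);
        }
    }
    /* Tail (or the whole buffer if the alignments differ). */
    for (; i < len; i++) {
        dest[i] = src[i] & mask[i];
    }
}
```

Falling back to bytes when the offsets differ keeps the sketch short; the "shift around" option from the answer (aligned loads combined with shifts) would recover word speed in that case at the cost of more code.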
Answer 1 (score: 2)
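The standard way to handle this is to have a configure script compile and run a program that tests for alignment problems. If the test program does not crash, the configure script defines a macro in the generated config header that enables the faster implementation; the safer implementation is the default.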
void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
unsigned int i;
unsigned int wordlen = len >> 2;
#if defined(UNALIGNED)
// go fast
for(i=0; i<wordlen; i++)
{
// the following line will raise SIGBUS on SPARC and other archs that require aligned access.
((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i];
}
for(i=wordlen<<2; i<len; i++){
dest[i] = src[i] & mask[i];
}
#else
// go slow
for(i=0; i<len; i++)
{
dest[i] = src[i] & mask[i];
}
#endif
}
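The test program such a configure script runs could be as small as the sketch below: it performs one deliberately misaligned 32-bit load. The function name unaligned_probe is hypothetical, and the configure side (calling it from main, checking that the process exits normally, then writing UNALIGNED into the config header) is assumed rather than shown. Note that the probe intentionally does something the C standard leaves undefined, so a passing run only shows that this compiler and CPU tolerated it.

```c
#include <stdint.h>

/* Hypothetical probe: one deliberately misaligned 32-bit load.
   On strict-alignment targets (e.g. SPARC) this is expected to raise
   SIGBUS; on x86 it simply returns the four bytes. */
uint32_t unaligned_probe(void)
{
    /* The union forces bytes[] to uint32_t alignment, so bytes + 1 is
       guaranteed to be misaligned for a uint32_t access. */
    static union { uint32_t words[2]; unsigned char bytes[8]; } u;
    volatile uint32_t *p;
    int k;

    for (k = 0; k < 8; k++)
        u.bytes[k] = (unsigned char)(k + 1);
    p = (volatile uint32_t *)(u.bytes + 1);
    return *p; /* volatile keeps the compiler from splitting or eliding the load */
}
```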
Answer 2 (score: 1)
(I find it odd that you have src and mask when the two actually commute. I would rename mask_bytes to memand. But anyway...)
Another option is to use separate functions that take advantage of C's types. For example:
#include <stddef.h> /* size_t */
void memand_bytes(char *dest, const char *src1, const char *src2, size_t len)
{
size_t i;
for (i = 0; i < len; i++)
dest[i] = src1[i] & src2[i];
}
/* len counts ints here, not bytes */
void memand_ints(int *dest, const int *src1, const int *src2, size_t len)
{
size_t i;
for (i = 0; i < len; i++)
dest[i] = src1[i] & src2[i];
}
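That way you let the programmer decide. An int array is always suitably aligned for int access, so memand_ints is safe on every architecture.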