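I'm working on an algorithm that applies a global threshold to an 8-bit grayscale image and produces a 1-bit monochrome image (bit-packed, so that 1 byte holds 8 pixels). Each pixel in the grayscale image can have a luminance value from 0 to 255.
My environment is Win32 in Microsoft Visual Studio C++.
Out of curiosity, I'm interested in optimizing the algorithm as much as possible. The 1-bit image will be turned into a TIFF. Currently I set the FillOrder to MSB2LSB (most significant bit to least significant bit) just because the TIFF spec suggests it (it doesn't necessarily need to be MSB2LSB).
Just some background for those who don't know:
MSB2LSB orders the pixels from left to right within a byte, in the same direction the pixels run in the image as the X coordinate increases. If you traverse the grayscale image from left to right on the X axis, this obviously requires you to think "backwards" when packing the bits into the current byte. With that said, let me show you what I currently have (this is in C; I have not tried ASM or compiler intrinsics yet, just because I have very little experience with them, but that would be a possibility).
Because the monochrome image has 8 pixels per byte, the width of the monochrome image in bytes will be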
(grayscaleWidth+7)/8;
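To make the rounding explicit, here is a minimal sketch of that calculation. The helper name `packed_stride` is made up for illustration and is not part of the code shown later. Integer division truncates, so adding 7 first rounds the result up to a whole byte.

```c
#include <assert.h>

/* Hypothetical helper: number of bytes needed to hold `width` 1-bit pixels. */
static int packed_stride(int width)
{
    assert(width >= 0);
    return (width + 7) / 8;   /* add 7 so a partial last byte is counted */
}
```

For example, a width of 9 pixels needs 2 bytes, and the 6000-pixel maximum assumed in this question needs 750 bytes per scanline.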
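FYI, I assume my maximum image width is 6000 pixels.
The first thing I do (before any image is processed) is
1) calculate a lookup table of the amount I need to shift within a specific byte, given an X coordinate from my grayscale image: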
int _shift_lut[6000];
for( int x = 0 ; x < 6000; x++)
{
_shift_lut[x] = 7-(x%8);
}
Using this lookup table I can pack a monochrome bit value into the current byte I'm working on, like so:
monochrome_pixel |= 1 << _shift_lut[ grayX ];
which ends up being a speed increase over doing
monochrome_pixel |= 1 << (7-(x%8));
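As a cross-check on what the table holds, the sketch below computes the same shift inline. The function names are hypothetical and only for illustration. For non-negative `x`, `x % 8` and `x & 7` give the same result, so `7 - (x & 7)` reproduces the table value without a memory load. Whether that is actually faster than the lookup depends on the compiler and CPU.

```c
/* Shift for pixel x within its byte under MSB2LSB: x = 0 -> 7, x = 7 -> 0. */
static int msb_shift(int x)
{
    return 7 - (x & 7);   /* same as 7 - (x % 8) for x >= 0 */
}

/* Fills a table the same way _shift_lut is filled, so both can be compared. */
static void fill_shift_lut(int *lut, int n)
{
    for (int x = 0; x < n; x++)
        lut[x] = 7 - (x % 8);
}
```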
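The second lookup table I calculate tells me the X index of the byte in my monochrome image, given an X pixel in the grayscale image. This very simple LUT is calculated like so: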
int xOffsetLut[6000];
int element_size=8; //8 bits
for( int x = 0; x < 6000; x++)
{
xOffsetLut[x]=x/element_size;
}
This LUT allows me to do things like
monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels
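For reference, the byte index and bit position can also be derived directly from X. The following is a small sketch with hypothetical helper names (`set_packed_pixel`, `get_packed_pixel`), assuming an MSB2LSB scanline and non-negative X. It is handy for spot-checking the packed output rather than as a fast path.

```c
/* Sets pixel x in an MSB2LSB bit-packed scanline: byte x/8, bit 7-(x%8). */
static void set_packed_pixel(unsigned char *row, int x)
{
    row[x >> 3] |= (unsigned char)(0x80u >> (x & 7));
}

/* Reads pixel x back from the same layout; returns 0 or 1. */
static int get_packed_pixel(const unsigned char *row, int x)
{
    return (row[x >> 3] >> (7 - (x & 7))) & 1;
}
```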
My grayscale image is a simple unsigned char *, and so is my monochrome image.
This is how I initialize the monochrome image:
int bitPackedScanlineStride = (grayscaleWidth+7)/8;
int bitpackedLength=bitPackedScanlineStride * grayscaleHeight;
unsigned char * bitpack_image = new unsigned char[bitpackedLength];
memset(bitpack_image,0,bitpackedLength);
Then I call my binarize function like so:
binarize(
gray_image.DataPtr(),
bitpack_image,
globalFormThreshold,
grayscaleWidth,
grayscaleHeight,
bitPackedScanlineStride,
bitpackedLength,
_shift_lut,
xOffsetLut);
Here is my binarize function (as you can see, I did some loop unrolling, which may or may not help):
void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[] )
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char pel2=0;
unsigned char pel3=0;
unsigned char pel4=0;
unsigned char pel5=0;
unsigned char pel6=0;
unsigned char pel7=0;
unsigned char pel8=0;
int checkX=grayscaleWidth;
int checkY=grayscaleHeight;
for ( int by = 0 ; by < checkY; by++)
{
yoff=by*grayscaleWidth;
byoff=by*bitPackedScanlineStride;
for( int bx = 0; bx < checkX; bx+=32)
{
bitpackPel = 0;
//pixel 1 in bitpack image
pel1=grayImage[yoff+bx];
pel2=grayImage[yoff+bx+1];
pel3=grayImage[yoff+bx+2];
pel4=grayImage[yoff+bx+3];
pel5=grayImage[yoff+bx+4];
pel6=grayImage[yoff+bx+5];
pel7=grayImage[yoff+bx+6];
pel8=grayImage[yoff+bx+7];
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]);
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );
bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;
//pixel 2 in bitpack image
pel1=grayImage[yoff+bx+8];
pel2=grayImage[yoff+bx+9];
pel3=grayImage[yoff+bx+10];
pel4=grayImage[yoff+bx+11];
pel5=grayImage[yoff+bx+12];
pel6=grayImage[yoff+bx+13];
pel7=grayImage[yoff+bx+14];
pel8=grayImage[yoff+bx+15];
bitpackPel = 0;
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );
bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;
//pixel 3 in bitpack image
pel1=grayImage[yoff+bx+16];
pel2=grayImage[yoff+bx+17];
pel3=grayImage[yoff+bx+18];
pel4=grayImage[yoff+bx+19];
pel5=grayImage[yoff+bx+20];
pel6=grayImage[yoff+bx+21];
pel7=grayImage[yoff+bx+22];
pel8=grayImage[yoff+bx+23];
bitpackPel = 0;
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );
bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;
//pixel 4 in bitpack image
pel1=grayImage[yoff+bx+24];
pel2=grayImage[yoff+bx+25];
pel3=grayImage[yoff+bx+26];
pel4=grayImage[yoff+bx+27];
pel5=grayImage[yoff+bx+28];
pel6=grayImage[yoff+bx+29];
pel7=grayImage[yoff+bx+30];
pel8=grayImage[yoff+bx+31];
bitpackPel = 0;
bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );
bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
}
}
}
I know this algorithm can potentially miss some trailing pixels in each row, but don't worry about that.
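For completeness, one way the trailing pixels could be handled is sketched below. This is a plain, non-unrolled version meant only to show the logic, and the function name `binarize_row` is made up for illustration. It writes the complete bytes first, then packs any leftover pixels into one final byte whose unused low bits stay 0.

```c
/* Sketch: threshold one grayscale row into an MSB2LSB bit-packed row,
   including a final partial byte when width is not a multiple of 8. */
static void binarize_row(const unsigned char *gray, unsigned char *packed,
                         int width, int threshold)
{
    int full = width / 8;   /* number of complete output bytes */
    int rest = width % 8;   /* leftover pixels at the end of the row */

    for (int i = 0; i < full; i++) {
        unsigned char b = 0;
        for (int k = 0; k < 8; k++)
            b |= (unsigned char)((gray[i * 8 + k] <= threshold) << (7 - k));
        packed[i] = b;
    }
    if (rest) {
        unsigned char b = 0;   /* bits past the last pixel remain 0 */
        for (int k = 0; k < rest; k++)
            b |= (unsigned char)((gray[full * 8 + k] <= threshold) << (7 - k));
        packed[full] = b;
    }
}
```

The output row must have room for `(width + 7) / 8` bytes, which matches the scanline stride computed earlier.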
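As you can see, for each monochrome byte I process 8 grayscale pixels.
Where you see pel8<=threshold is a neat little trick that resolves to 0 or 1 and is much faster than if {} else {}.
For each increment of X, I pack a bit into the next lower bit position than the previous X (the first pixel goes into the most significant bit).
So for the first set of 8 pixels in the grayscale image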
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
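this is what the bits in the byte look like (obviously each numbered bit is just the threshold result of processing the correspondingly numbered pixel, but you get the idea)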
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
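To check this layout by eye, a tiny debugging helper can print a packed byte with the most significant bit first, so the characters appear in the same left-to-right order as the pixels. This is only a sketch, and the name `byte_to_bits` is hypothetical.

```c
/* Writes the bits of b, MSB first, as '0'/'1' characters into out.
   out must have room for 9 chars (8 bits plus the terminating NUL). */
static void byte_to_bits(unsigned char b, char out[9])
{
    for (int k = 0; k < 8; k++)
        out[k] = ((b >> (7 - k)) & 1) ? '1' : '0';
    out[8] = '\0';
}
```

A byte in which only pixel 1 passed the threshold would come out with a single '1' in the leftmost position.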
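PHEW, that should be it. Feel free to have some fun with nifty tricks that would squeeze more juice out of this algorithm.
With compiler optimizations enabled, this function takes an average of 16 milliseconds on a roughly 5000 x 2200 pixel image on a Core 2 Duo machine.
EDIT:
R..'s suggestion was to remove the shift LUT and just use constants, which is actually perfectly logical... I have modified the OR'ing of each pixel as follows: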
void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[] )
{
int yoff;
int byoff;
unsigned char bitpackPel=0;
unsigned char pel1=0;
unsigned char pel2=0;
unsigned char pel3=0;
unsigned char pel4=0;
unsigned char pel5=0;
unsigned char pel6=0;
unsigned char pel7=0;
unsigned char pel8=0;
int checkX=grayscaleWidth-32;
int checkY=grayscaleHeight;
for ( int by = 0 ; by < checkY; by++)
{
yoff=by*grayscaleWidth;
byoff=by*bitPackedScanlineStride;
for( int bx = 0; bx < checkX; bx+=32)
{
bitpackPel = 0;
//pixel 1 in bitpack image
pel1=grayImage[yoff+bx];
pel2=grayImage[yoff+bx+1];
pel3=grayImage[yoff+bx+2];
pel4=grayImage[yoff+bx+3];
pel5=grayImage[yoff+bx+4];
pel6=grayImage[yoff+bx+5];
pel7=grayImage[yoff+bx+6];
pel8=grayImage[yoff+bx+7];
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]);
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel;
//pixel 2 in bitpack image
pel1=grayImage[yoff+bx+8];
pel2=grayImage[yoff+bx+9];
pel3=grayImage[yoff+bx+10];
pel4=grayImage[yoff+bx+11];
pel5=grayImage[yoff+bx+12];
pel6=grayImage[yoff+bx+13];
pel7=grayImage[yoff+bx+14];
pel8=grayImage[yoff+bx+15];
bitpackPel = 0;
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel;
//pixel 3 in bitpack image
pel1=grayImage[yoff+bx+16];
pel2=grayImage[yoff+bx+17];
pel3=grayImage[yoff+bx+18];
pel4=grayImage[yoff+bx+19];
pel5=grayImage[yoff+bx+20];
pel6=grayImage[yoff+bx+21];
pel7=grayImage[yoff+bx+22];
pel8=grayImage[yoff+bx+23];
bitpackPel = 0;
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel;
//pixel 4 in bitpack image
pel1=grayImage[yoff+bx+24];
pel2=grayImage[yoff+bx+25];
pel3=grayImage[yoff+bx+26];
pel4=grayImage[yoff+bx+27];
pel5=grayImage[yoff+bx+28];
pel6=grayImage[yoff+bx+29];
pel7=grayImage[yoff+bx+30];
pel8=grayImage[yoff+bx+31];
bitpackPel = 0;
/*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24] );
bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25] );
bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] );
bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] );
bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] );
bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] );
bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] );
bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );*/
bitpackPel |= ( (pel1<=threshold) << 7);
bitpackPel |= ( (pel2<=threshold) << 6 );
bitpackPel |= ( (pel3<=threshold) << 5 );
bitpackPel |= ( (pel4<=threshold) << 4 );
bitpackPel |= ( (pel5<=threshold) << 3 );
bitpackPel |= ( (pel6<=threshold) << 2 );
bitpackPel |= ( (pel7<=threshold) << 1 );
bitpackPel |= ( (pel8<=threshold) );
bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel;
}
}
}
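I'm testing on an Intel Xeon 5670 using GCC 4.1.2. Under these specs, the hard-coded bit shifts are about 4 ms slower than my original LUT algorithm: with the Xeon and GCC, the LUT algorithm takes 8.61 ms on average and the hard-coded bit shifts take 12.285 ms on average.
Answer 0 (score: 2)
Try something like: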
unsigned i, w8=w>>3, x;
for (i=0; i<w8; i++) {
x = thres-src[0]>>1&0x80;
x |= thres-src[1]>>2&0x40;
x |= thres-src[2]>>3&0x20;
x |= thres-src[3]>>4&0x10;
x |= thres-src[4]>>5&0x08;
x |= thres-src[5]>>6&0x04;
x |= thres-src[6]>>7&0x02;
x |= thres-src[7]>>8&0x01;
out[i] = x;
src += 8;
}
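You can work out the extra code for the remainder at the end of each line when the width is not a multiple of 8, or you can pad/align the source to ensure the width is a multiple of 8.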
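The subtraction in the loop above replaces the comparison: for values in 0..255, `thres - src[k]` is negative exactly when the pixel is greater than the threshold, and bit 8 of the difference is then 1. Note that this sets a bit for pixels above the threshold, which is the opposite polarity of `pel <= threshold` in the question. Here is a minimal sketch of the idea, with hypothetical function names, using unsigned arithmetic so that no negative value is right-shifted:

```c
/* Returns 1 when pixel > thres, 0 otherwise, without a comparison.
   Both inputs must be in 0..255. When pixel > thres the unsigned
   difference wraps around and its bit 8 is 1; otherwise the difference
   is at most 255 and bit 8 is 0. */
static unsigned above_threshold(unsigned thres, unsigned pixel)
{
    return ((thres - pixel) >> 8) & 1u;
}

/* Opposite polarity, matching (pel <= threshold) from the question. */
static unsigned at_or_below_threshold(unsigned thres, unsigned pixel)
{
    return above_threshold(thres, pixel) ^ 1u;
}
```

If the question's polarity is wanted in the packed loop, one option is to invert each finished byte (`out[i] = ~x;`) rather than each bit.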
Answer 1 (score: 1)
You can do this quite easily with SSE, processing 16 pixels at a time.
Example code using SSE intrinsics (warning: untested!):
#include <emmintrin.h> // SSE2 intrinsics
#include <limits.h>    // CHAR_BIT
#include <stdint.h>

void threshold_and_pack(
    const uint8_t * in_image, // input image, 16 byte aligned, height rows x width cols, width = multiple of 16
    uint8_t * out_image, // output image, 2 byte aligned, height rows x width/8 cols, row length = multiple of 2 bytes
    const uint8_t threshold, // threshold
    const int width,
    const int height)
{
    const __m128i vThreshold = _mm_set1_epi8((char)threshold);
    int i, j;
    for (i = 0; i < height; ++i)
    {
        const __m128i * p_in = (const __m128i *)&in_image[i * width];
        uint16_t * p_out = (uint16_t *)&out_image[i * width / CHAR_BIT];
        for (j = 0; j < width; j += 16)
        {
            __m128i v = _mm_load_si128(p_in);
            uint16_t b;
            // unsigned compare: min(v, t) == v gives 0xFF where pixel <= threshold
            // (a wrapping add of 255 - threshold does not set the sign bit correctly)
            v = _mm_cmpeq_epi8(_mm_min_epu8(v, vThreshold), v);
            b = (uint16_t)_mm_movemask_epi8(v); // use PMOVMSKB to pack sign bits into 16 bit word (pixel 0 lands in bit 0)
            *p_out = b;
            p_in++;
            p_out++;
        }
    }
}
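One thing to keep in mind with this approach: PMOVMSKB puts the first pixel of each group into bit 0, so the bytes it produces are in LSB2MSB order, not the MSB2LSB order the question uses. The TIFF could be written with FillOrder set to 2 (LSB2MSB), or each output byte could be bit-reversed afterwards. The following is a portable sketch of the second option. The function names are hypothetical, and a 256-entry table would be an equally reasonable way to do the reversal.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverses the bit order of one byte (bit 0 <-> bit 7, and so on). */
static uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0u) >> 4) | ((b & 0x0Fu) << 4));  /* swap nibbles */
    b = (uint8_t)(((b & 0xCCu) >> 2) | ((b & 0x33u) << 2));  /* swap bit pairs */
    b = (uint8_t)(((b & 0xAAu) >> 1) | ((b & 0x55u) << 1));  /* swap neighbours */
    return b;
}

/* Converts a buffer packed LSB-first into MSB-first order, in place. */
static void lsb_to_msb_fill_order(uint8_t *packed, size_t n)
{
    for (size_t i = 0; i < n; i++)
        packed[i] = reverse_bits8(packed[i]);
}
```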