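I am trying to optimize the rotation of very large images, the smallest being 4096x4096, or ~16 million pixels.
Rotation is always about the center of the image, and images are not necessarily square, but their dimensions are always powers of 2.
I have access to MKL/TBB, where MKL is an optimized BLAS for my target platforms. I don't know whether this operation exists in BLAS at all.
My best attempts so far take around 17-25 ms for 4096x4096 images (very inconsistent for the same image size, which means I am probably stomping all over the cache). Matrices are 16-byte aligned.
Now, the destination cannot be resized. So, clipping should and can occur. For example, a square matrix rotated by 45 degrees will certainly clip at the corners, and the value at those locations should be zero.
Right now, my best attempts use a tiled approach; no elegance has yet been put into tile sizes or loop unrolling.
Here is my algorithm as it stands, using TBB (http://threadingbuildingblocks.org/):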
    //- cosa = cos of the angle
    //- sina = sin of angle
    //- for those unfamiliar with TBB, this is giving me blocks of rows or cols that
    //- are unique per thread
    void operator() ( const tbb::blocked_range2d<size_t, size_t> r ) const
    {
        double xOffset;
        double yOffset;
        int lineOffset;
        int srcX;
        int srcY;
        for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
        {
            const size_t colBegin = r.cols().begin();
            xOffset = -(row * sina) + xHelper + (cosa * colBegin);
            yOffset = (row * cosa) + yHelper + (sina * colBegin);
            lineOffset = ( row * rowSpan ); //- all col values are offsets of this row
            for( size_t col = colBegin; col != r.cols().end(); ++col, xOffset += cosa, yOffset += sina )
            {
                srcX = xOffset;
                srcY = yOffset;
                if( srcX >= 0 && srcX < colSpan && srcY >= 0 && srcY < rowSpan )
                {
                    destData[col + lineOffset] = srcData[srcX + ( srcY * rowSpan )];
                }
            }
        }
    }
I call this function like so:
    double sina = sin(angle);
    double cosa = cos(angle);
    double centerX = (colSpan) / 2;
    double centerY = (rowSpan) / 2;
    //- Adding .5 for rounding
    const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + .5;
    const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + .5;
    tbb::parallel_for( tbb::blocked_range2d<size_t, size_t>( 0, rowSpan, 0, colSpan ), DoRotate( sina, cosa, xHelper, yHelper, rowSpan, colSpan, (fcomplex *)pDestData, (fcomplex *)pSrcData ) );
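For readers checking where xHelper and yHelper come from, the following derivation is a reconstruction from the code, not something stated in the question. It assumes the loop maps each destination pixel (col, row) back to a source pixel by rotating about the center (c_x, c_y) = (centerX, centerY) by the angle a, and it leaves out the extra 0.5 that the code adds for rounding.

```latex
\begin{aligned}
x_{\mathrm{src}} &= \cos a\,(col - c_x) - \sin a\,(row - c_y) + c_x \\
                 &= col\,\cos a - row\,\sin a
                    + \underbrace{\left(c_x - c_x\cos a + c_y\sin a\right)}_{\texttt{xHelper} - 0.5} \\[4pt]
y_{\mathrm{src}} &= \sin a\,(col - c_x) + \cos a\,(row - c_y) + c_y \\
                 &= col\,\sin a + row\,\cos a
                    + \underbrace{\left(c_y - c_x\sin a - c_y\cos a\right)}_{\texttt{yHelper} - 0.5}
\end{aligned}
```

Stepping one column to the right therefore adds cos a to the source x and sin a to the source y, which is what the inner loop's `xOffset += cosa, yOffset += sina` does.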
fcomplex is just an in-house representation of complex numbers. It is defined as:
    struct fcomplex
    {
        float real;
        float imag;
    };
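So, I want to rotate a matrix of complex values about its center, by an arbitrary angle, as fast as possible for very large images.
Update
Based on the great feedback, I have updated to the following, which gives about a 40% speed increase. I am wondering whether anything else can be done: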
    void operator() ( const tbb::blocked_range2d<size_t, size_t> r ) const
    {
        float xOffset;
        float yOffset;
        int lineOffset;
        __m128i srcXints;
        __m128i srcYints;
        __m128 dupXOffset;
        __m128 dupYOffset;
        for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
        {
            const size_t colBegin = r.cols().begin();
            xOffset = -(row * sina) + xHelper + (cosa * colBegin);
            yOffset = (row * cosa) + yHelper + (sina * colBegin);
            lineOffset = ( row * rowSpan ); //- all col values are offsets of this row
            for( size_t col = colBegin; col != r.cols().end(); col+=4, xOffset += dupOffsetsX.m128_f32[3], yOffset += dupOffsetsY.m128_f32[3] )
            {
                dupXOffset = _mm_load1_ps(&xOffset); //- duplicate the x offset 4 times into a 4 float field
                dupYOffset = _mm_load1_ps(&yOffset); //- duplicate the y offset 4 times into a 4 float field
                srcXints = _mm_cvtps_epi32( _mm_add_ps( dupOffsetsX, dupXOffset ) );
                srcYints = _mm_cvtps_epi32( _mm_add_ps( dupOffsetsY, dupYOffset ) );
                if( srcXints.m128i_i32[0] >= 0 && srcXints.m128i_i32[0] < colSpan && srcYints.m128i_i32[0] >= 0 && srcYints.m128i_i32[0] < rowSpan )
                {
                    destData[col + lineOffset] = srcData[srcXints.m128i_i32[0] + ( srcYints.m128i_i32[0] * rowSpan )];
                }
                if( srcXints.m128i_i32[1] >= 0 && srcXints.m128i_i32[1] < colSpan && srcYints.m128i_i32[1] >= 0 && srcYints.m128i_i32[1] < rowSpan )
                {
                    destData[col + 1 + lineOffset] = srcData[srcXints.m128i_i32[1] + ( srcYints.m128i_i32[1] * rowSpan )];
                }
                if( srcXints.m128i_i32[2] >= 0 && srcXints.m128i_i32[2] < colSpan && srcYints.m128i_i32[2] >= 0 && srcYints.m128i_i32[2] < rowSpan )
                {
                    destData[col + 2 + lineOffset] = srcData[srcXints.m128i_i32[2] + ( srcYints.m128i_i32[2] * rowSpan )];
                }
                if( srcXints.m128i_i32[3] >= 0 && srcXints.m128i_i32[3] < colSpan && srcYints.m128i_i32[3] >= 0 && srcYints.m128i_i32[3] < rowSpan )
                {
                    destData[col + 3 + lineOffset] = srcData[srcXints.m128i_i32[3] + ( srcYints.m128i_i32[3] * rowSpan )];
                }
            }
        }
    }
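Update 2: I have posted a solution below as an answer. It takes into account the suggestions I received and fixes a bug that occurred when rotating rectangles.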
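Answer 0 (score: 3)
You can optimize quite a few things if you first do a simple coarse rotation (90/180/270 degrees) and then the final rotation between 0 and 90 degrees. For example, you could then optimize the if( srcX >= 0 && srcX < colSpan && srcY >= 0 && srcY < rowSpan ) test, and the access pattern would be more cache friendly. I'd bet your profiling shows that a 91-degree rotation is a lot slower than a 1-degree rotation.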
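The two-step idea can be sketched as follows. This is a minimal illustration, not code from the answer: `AngleSplit`, `splitAngle` and `rotateQuarterTurnCW` are hypothetical names, and the clockwise convention is an assumption. Note that a quarter turn of a non-square image swaps its dimensions, which conflicts with the fixed-size destination described in the question, so a real implementation would still need to handle clipping.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: an exact number of quarter turns plus a remainder.
struct AngleSplit
{
    int quarterTurns;        // 0..3 exact 90-degree steps
    double residualDegrees;  // remaining rotation, in [0, 90)
};

// Splits an arbitrary angle into quarter turns and a residual in [0, 90).
inline AngleSplit splitAngle(double degrees)
{
    double wrapped = std::fmod(degrees, 360.0);
    if (wrapped < 0.0) wrapped += 360.0;
    if (wrapped >= 360.0) wrapped = 0.0;  // guards against rounding up to 360
    const int quarterTurns = static_cast<int>(wrapped / 90.0);
    return AngleSplit{quarterTurns, wrapped - 90.0 * quarterTurns};
}

// Exact clockwise quarter turn of a row-major rows x cols image.
// The result is cols x rows, so its row stride is `rows`.
// No bounds test and no trigonometry are needed for this step.
template <typename T>
std::vector<T> rotateQuarterTurnCW(const std::vector<T>& src,
                                   std::size_t rows, std::size_t cols)
{
    assert(src.size() == rows * cols);
    std::vector<T> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + (rows - 1 - r)] = src[r * cols + c];
    return dst;
}
```

With this split, a 91-degree request becomes one quarter turn followed by a 1-degree rotation. Whether the extra pass over memory pays off depends on the machine.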
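Answer 1 (score: 1)
There is not much left to optimize. Your algorithm is sound. You are writing to destData row by row (which is good for caching/memory), forcing sequential writes per thread.
The only thing left is to unroll your inner for loop ~4x (for a 32-bit system) or 8x (for a 64-bit system). That will probably net you around a 10-20% improvement. Because of the nature of the problem (forced random reads from srcData), you will always have variance in timing.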
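A minimal sketch of the 4x unrolling structure, assuming the column range may not be a multiple of four (the question says dimensions are powers of 2, so the tail may never run there). `forEachColumnUnrolled` and its two callbacks are hypothetical names used only for illustration: `body4(col)` handles columns col..col+3 and `body1(col)` handles a single column.

```cpp
#include <cassert>
#include <cstddef>

// Walks [begin, end) four columns at a time, then finishes any remainder
// one column at a time so nothing is read or written past `end`.
template <typename Body4, typename Body1>
void forEachColumnUnrolled(std::size_t begin, std::size_t end,
                           Body4 body4, Body1 body1)
{
    assert(begin <= end);
    std::size_t col = begin;
    for (; col + 4 <= end; col += 4)
        body4(col);   // unrolled (or vectorized) body for 4 columns
    for (; col < end; ++col)
        body1(col);   // scalar tail for the last 0..3 columns
}
```

Using `col + 4 <= end` rather than `col != end` as the loop condition keeps the loop from overshooting when a block's width is not divisible by four.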
I will ponder this further...
Your inner for loop is a strong target for vectorization. Consider the static vectors:
    // SSE instructions MOVDDUP (move and duplicate) MULPD (multiply packed double)
    double[] vcosa = [cosa, cosa, cosa, cosa] * [1.0, 2.0, 3.0, 4.0]
    double[] vsina = [sina, sina, sina, sina] * [1.0, 2.0, 3.0, 4.0]
    vxOffset = [xOffset, xOffset, xOffset, xOffset]
    vyOffset = [yOffset, yOffset, yOffset, yOffset]
    // SSE instructions ADDPD (add packed double) and CVTPD2DQ (convert packed double to signed integer)
    vsrcX = vcosa + vxOffset
    vsrcY = vsina + vyOffset
The SSE instructions on x86 are well suited to this kind of data, including the conversion from doubles to integers. AVX instructions, which allow 256-bit vectors (4 doubles), are an even better fit.
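The pseudocode above might translate into SSE2 intrinsics roughly as follows. This is a sketch under stated assumptions, not the answerer's code: it uses single-precision floats so that four lanes fit in one 128-bit register, it uses lane offsets 0..3 so that lane 0 matches the scalar loop's current column, and `sourceCoords4` is a hypothetical helper name. The lanes are stored to plain arrays because member access such as `.m128i_i32` is specific to the MSVC headers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Computes truncated source coordinates for four consecutive destination
// columns, where (xOffset, yOffset) is the source position of the first one.
inline void sourceCoords4(float xOffset, float yOffset,
                          float cosa, float sina,
                          std::int32_t outX[4], std::int32_t outY[4])
{
    // Lane i holds i * cosa (respectively i * sina) for i = 0..3.
    const __m128 stepX = _mm_setr_ps(0.0f, cosa, 2.0f * cosa, 3.0f * cosa);
    const __m128 stepY = _mm_setr_ps(0.0f, sina, 2.0f * sina, 3.0f * sina);

    // Broadcast the starting offsets and add the per-lane steps.
    const __m128 x = _mm_add_ps(_mm_set1_ps(xOffset), stepX);
    const __m128 y = _mm_add_ps(_mm_set1_ps(yOffset), stepY);

    // Convert with truncation toward zero, like the scalar float-to-int cast.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(outX), _mm_cvttps_epi32(x));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(outY), _mm_cvttps_epi32(y));
}
```

The bounds test and the gather from srcData remain scalar here, since SSE2 has no gather instruction.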
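Answer 2 (score: 0)
Taking the suggestions offered into account, I have arrived at this solution. I also fixed a bug in my original implementation that caused problems with rectangular images.
The suggestion of rotating by 90 degrees first (using affine transformations and threading) and then rotating by the remaining smaller angle proved to be slower, because it has to iterate over the matrix twice. Of course, many factors come into play, and memory bandwidth most likely skews things further. So, for the machine I am testing and optimizing for, this solution proved to be the best I could offer.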
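The rectangular-image fix comes down to the row stride: in a row-major image, element (row, col) lives at row * colSpan + col, so indexing with rowSpan only works when the image is square. A minimal scalar reference, useful for checking a vectorized version against, might look like this. It is a sketch, not the code I benchmarked: `rotateNearestReference` is a hypothetical name, it uses std::vector instead of raw pointers, and it uses std::floor rather than a truncating cast.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct fcomplex
{
    float real;
    float imag;
};

// Nearest-neighbour rotation about the centre. Source and destination share
// the same rows x cols shape, so pixels that fall outside are clipped.
inline void rotateNearestReference(const std::vector<fcomplex>& src,
                                   std::vector<fcomplex>& dst,
                                   int rows, int cols, double radians)
{
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    dst.assign(src.size(), fcomplex{0.0f, 0.0f});  // clipped pixels stay zero

    const double sina = std::sin(radians);
    const double cosa = std::cos(radians);
    const double centerX = cols / 2.0;
    const double centerY = rows / 2.0;
    const double xHelper = centerX - centerX * cosa + centerY * sina + 0.5;
    const double yHelper = centerY - centerX * sina - centerY * cosa + 0.5;

    for (int row = 0; row < rows; ++row)
    {
        for (int col = 0; col < cols; ++col)
        {
            const int srcX = static_cast<int>(std::floor(col * cosa - row * sina + xHelper));
            const int srcY = static_cast<int>(std::floor(col * sina + row * cosa + yHelper));
            if (srcX >= 0 && srcX < cols && srcY >= 0 && srcY < rows)
            {
                // The row stride is cols (the length of one row), not rows.
                dst[static_cast<std::size_t>(row) * cols + col] =
                    src[static_cast<std::size_t>(srcY) * cols + srcX];
            }
        }
    }
}
```

Rotating a small rectangular image by an angle of zero should reproduce the input exactly, which makes a quick sanity check for the stride.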
Using 16x16 tiles:
    class DoRotate
    {
        const double sina;
        const double cosa;
        const double xHelper;
        const double yHelper;
        const int rowSpan;
        const int colSpan;
        mutable fcomplex *destData;
        const fcomplex *srcData;
        const float *offsetsX;
        const float *offsetsY;
        __m128 dupOffsetsX;
        __m128 dupOffsetsY;
    public:
        void operator() ( const tbb::blocked_range2d<size_t, size_t> r ) const
        {
            float xOffset;
            float yOffset;
            int lineOffset;
            __m128i srcXints;
            __m128i srcYints;
            __m128 dupXOffset;
            __m128 dupYOffset;
            for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
            {
                const size_t colBegin = r.cols().begin();
                xOffset = -(row * sina) + xHelper + (cosa * colBegin);
                yOffset = (row * cosa) + yHelper + (sina * colBegin);
                lineOffset = ( row * colSpan ); //- all col values are offsets of this row
                for( size_t col = colBegin; col != r.cols().end(); col+=4, xOffset += (4 * cosa), yOffset += (4 * sina) )
                {
                    dupXOffset = _mm_load1_ps(&xOffset); //- duplicate the x offset 4 times into a 4 float field
                    dupYOffset = _mm_load1_ps(&yOffset); //- duplicate the y offset 4 times into a 4 float field
                    srcXints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsX, dupXOffset ) );
                    srcYints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsY, dupYOffset ) );
                    if( srcXints.m128i_i32[0] >= 0 && srcXints.m128i_i32[0] < colSpan && srcYints.m128i_i32[0] >= 0 && srcYints.m128i_i32[0] < rowSpan )
                    {
                        destData[col + lineOffset] = srcData[srcXints.m128i_i32[0] + ( srcYints.m128i_i32[0] * colSpan )];
                    }
                    if( srcXints.m128i_i32[1] >= 0 && srcXints.m128i_i32[1] < colSpan && srcYints.m128i_i32[1] >= 0 && srcYints.m128i_i32[1] < rowSpan )
                    {
                        destData[col + 1 + lineOffset] = srcData[srcXints.m128i_i32[1] + ( srcYints.m128i_i32[1] * colSpan )];
                    }
                    if( srcXints.m128i_i32[2] >= 0 && srcXints.m128i_i32[2] < colSpan && srcYints.m128i_i32[2] >= 0 && srcYints.m128i_i32[2] < rowSpan )
                    {
                        destData[col + 2 + lineOffset] = srcData[srcXints.m128i_i32[2] + ( srcYints.m128i_i32[2] * colSpan )];
                    }
                    if( srcXints.m128i_i32[3] >= 0 && srcXints.m128i_i32[3] < colSpan && srcYints.m128i_i32[3] >= 0 && srcYints.m128i_i32[3] < rowSpan )
                    {
                        destData[col + 3 + lineOffset] = srcData[srcXints.m128i_i32[3] + ( srcYints.m128i_i32[3] * colSpan )];
                    }
                }
            }
        }
        DoRotate( const double pass_sina, const double pass_cosa, const double pass_xHelper, const double pass_yHelper,
                  const int pass_rowSpan, const int pass_colSpan, const float *pass_offsetsX, const float *pass_offsetsY,
                  fcomplex *pass_destData, const fcomplex *pass_srcData ) :
            sina(pass_sina), cosa(pass_cosa), xHelper(pass_xHelper), yHelper(pass_yHelper),
            rowSpan(pass_rowSpan), colSpan(pass_colSpan),
            destData(pass_destData), srcData(pass_srcData)
        {
            dupOffsetsX = _mm_load_ps(pass_offsetsX); //- load the offset X array into one aligned 4 float field
            dupOffsetsY = _mm_load_ps(pass_offsetsY); //- load the offset Y array into one aligned 4 float field
        }
    };
And then the calling code:
    double sina = sin(radians);
    double cosa = cos(radians);
    double centerX = (colSpan) / 2;
    double centerY = (rowSpan) / 2;
    //- Adding .5 for rounding to avoid periodicity
    const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + .5;
    const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + .5;
    float *offsetsX = (float *)_aligned_malloc( sizeof(float) * 4, 16 );
    offsetsX[0] = 0.0f;
    offsetsX[1] = cosa;
    offsetsX[2] = cosa * 2.0;
    offsetsX[3] = cosa * 3.0;
    float *offsetsY = (float *)_aligned_malloc( sizeof(float) * 4, 16 );
    offsetsY[0] = 0.0f;
    offsetsY[1] = sina;
    offsetsY[2] = sina * 2.0;
    offsetsY[3] = sina * 3.0;
    //- tiled approach. Works better, but not by much. A little more stays in cache
    tbb::parallel_for( tbb::blocked_range2d<size_t, size_t>( 0, rowSpan, 16, 0, colSpan, 16 ), DoRotate( sina, cosa, xHelper, yHelper, rowSpan, colSpan, offsetsX, offsetsY, (fcomplex *)pDestData, (fcomplex *)pSrcData ) );
    _aligned_free( offsetsX );
    _aligned_free( offsetsY );
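I am in no way 100% positive that this is the best answer. But it is the best I was able to come up with, so I figured I would pass my solution on to the community.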