继续我对optimizing finite-difference code的追求...... 我已经设法通过使用多线宏来制作用于求和相邻细胞差异的通用算法。以前我使用过仿函数,但性能很差。
// T0, T: double[T_size_x*T_size_y*T_size_z]
template <bool periodic_z, bool periodic_y, bool periodic_x>
void process_T(double *T0, double *T, const int &T_size_x, const int &T_size_y, const int &T_size_z) {
double sum, base;
const int dy = (T_size_y-1)*T_size_x;
const int dz = (T_size_z-1)*T_size_x*T_size_y;
int pos = 0; //T_size_x * j;
struct _Diff {
inline void operator() (const int &pos) { sum += T0[pos] - base; }
inline void edge(const int &pos) { sum += 0.5*(T0[pos] -base); }
_Diff (double &s, double &b, double *t0) : sum(s), base(b), T0(t0) {}
private:
double ∑
double &base;
double *T0;
} Diff (sum, base, T0);
#define BigOne(i, periodic, T_size, left, right, start, end) \
if (i > 0) Diff(left); \
if (i < T_size-1) Diff(right); \
if (!periodic) { \
if (i == T_size-1 || i == 1) Diff.edge(left); \
if (i == T_size-2 || i == 0) Diff.edge(right); \
} else { \
if (i == 0) Diff(end); \
if (i == T_size-1) Diff(start); \
}
for (int k = 0; k < T_size_z; ++k) {
for (int j = 0; j < T_size_y; ++j) {
for (int i = 0; i < T_size_x; ++i, ++pos) {
sum = 0;
base = T0[pos];
// process x direction
BigOne(i, periodic_x, T_size_x, pos-1, pos+1, pos - i, pos + T_size_x-1)
// process y direction
BigOne(j, periodic_y, T_size_y, pos-T_size_x, pos+T_size_x, pos - dy, pos + dy)
// process z direction
BigOne(k, periodic_z, T_size_z, pos-T_size_x*T_size_y, pos+T_size_x*T_size_y, pos - dz, pos + dz)
T[pos] = T0[pos] + sum * 0.08; // where 0.08 is some magic number
}
}
}
}
为了使代码更高效,我考虑将仿函数转换为宏。
#define Diff(pos) sum += T0[pos] - base
#define Diff_edge(pos) sum += 0.5*(T0[pos]-base)
// Note: change Diff.edge in BigOne to Diff_edge as well
使用g++
(4.8.2)进行编译而没有进行优化时,正如预期的那样,宏代码运行速度更快。但是当我使用-O2
或-O3
进行编译时,上面的第一个示例突然产生了更好的结果(在15000次迭代中,函数完成在22.7s中,宏在23.7s中)。
为什么会这样?仿函数是否以某种方式作为编译器缓存指令的提示?