Question

我编写了一个小的MPI程序，用2D域上的迭代方案计算：我为每个单元格值（5点模板）应用滤镜。

在我的代码中，我根据进程数将全局域拆分为多个子域。我采用xcell和ycell每个子域的大小。

一旦计算达到所需的收敛，我将每个子域的值存储到 1D数组（参见下面 xtemp数组）之后，我使用{{ 1}}将所有这些子域收集到1D最终数组（MPI_Gather数组）中。在下面的分配中（进入循环）xfinal 表示我要复制到x0 array xtemp的2D子域。

现在，我想知道如果我应用良好的编码规则和优化 （我的意思是从运行时的角度来看，获得最佳性能的方法）在每个版本（语言C和Fortran90）中填充“1D array”数组。 xtemp是当前MPI流程的排名，me是每个子域的坐标。我们还必须注意，我们在语言C和Fortran90版本中使用数组xs,xe,ys,ye。

我做了：

语言C：

the convention (i,j) = (rows,columns)

正如您所看到的，我认为内部索引必须位于列（/* Fill all 1D xtemp array before MPI_Gather all subdomains : inner loop on columns index (second index) to optimize since C is row major */ j=1; for (i=xs[me];i<=xe[me];i++) { for (k=0;k<ycell;k++) xtemp[(j-1)*ycell+k] = x0[i][ys[me]+k]; j=j+1; }）上，因为C语言是行专业。

Fortran90：

ys[me]+k

在这里，我认为内部作业必须在行（! Fill all 1D xtemp array before MPI_Gather all subdomains : ! inner loop on rows index (first index) ! to optimize since F90 is column major i = 1 do j=ys(me),ye(me) xtemp((i-1)*xcell+1:i*xcell) = x0(xs(me):xe(me),j) i = i+1 end do）上，因为F90语言是专栏。

您同意这两个版本的xs(me):xe(me)数组填充，即从优化的角度来看（获得两个版本的最小运行时间，F90和C语言）？< / p>

更新1：

根据我收到的不同评论，我提供了更多信息：

1。语言C版：

填充xtemp 1D数组的

For the language C version：我使用连续的xtemp 2D数组（采用约定x0）：

[i,j]=[row,column]

根据您的指示，以下块有效填写double **x0; x0 = malloc(size_total_x*sizeof(*x0)); x0[0] = malloc(size_total_x*size_total_y*sizeof(**x0)); for (i=1;i<size_total_x;i++) { x0[i] = x0[0] + i*size_total_y; } 1D数组：

xtemp

好像你确认了，不是吗？

2。 Fortran F90版本：

j=1; for (i=xs[me];i<=xe[me];i++) { for (k=0;k<ycell;k++) xtemp[(j-1)*ycell+k] = x0[i][ys[me]+k]; j=j+1; }填充For the F90 version 1D数组：我还使用了连续的xtemp 2D数组（采用约定x0）：

[i,j]=[row,column]

关于上面的这个块，是否是在Fortran90中分配2D连续数组的正确方法？

最后，我没有看到使用Fortran90版本填充double precision, allocatable :: x0(:,:) allocate(x0(0:size_total_x-1,0:size_total_y-1)) 1D数组的评论，即：

xtemp

做得更好（从一开始就从正确的优化角度来看）：

 i = 1                                                                                 
   do j=ys(me),ye(me)
     xtemp((i-1)*xcell+1:i*xcell) = x0(xs(me):xe(me),j)                                  
     i = i+1
   end do

与否 ???

我意识到，在这两个解决方案之间，输出j = 1 do i=xs(me),xe(me) xtemp((j-1)*ycell+1:j*ycell) = x0(i,ys(me):ye(me)) j = j+1 end do 将以不同方式填充（第一个xtemp块和xcell第二个）但我不介意：我只想使用最优化的方式。

感谢您的帮助

Answer 1

以下是对此问题的各种评论的摘要

在Fortran中，或者如果您的数组是C连续数组，另一个选项是使用派生数据类型，因此不需要＆＃34;打包/解压缩＆＃34;来自临时缓冲区的数据。由于数据非常规则（例如2D数组的子集），因此MPI库应该非常有效地处理它。

在Fortran，

i = 1                                                                                 
do j=ys(me),ye(me)
  xtemp((i-1)*xcell+1:i*xcell) = x0(xs(me):xe(me),j)                                  
  i = i+1
end do

可能是将x0数组打包到xtemp临时缓冲区中的最有效方法。数组赋值语法可以看作是一个隐式循环，它迭代第一个索引（例如Fortran中的列），这是最优的。

另一方面，在C，

j=1;
for (i=xs[me];i<=xe[me];i++) {                                                        
  for (k=0;k<ycell;k++)                                                              
    xtemp[(j-1)*ycell+k] = x0[i][ys[me]+k];                                         
  j=j+1;
}

可能是最优的，因为（显式）最内层循环迭代在C中最优的第二索引（例如行）。注意x0数组被声明为锯齿状数组，并且认为是手动构建为连续数组，编译器不知道它。将x0声明为二维数组（例如double x0[xcell][ycell];）可以带来更好的表现。

如果您的应用程序是混合MPI + OpenMP，那么在打包/解包x0到xtemp时，可以通过使用OpenMP获得额外的提升。在这种情况下，它可能比使用派生数据类型更快，因为在大多数MPI实现中处理派生数据类型是一个串行操作。

Answer 2

前一段时间我写了一些预处理器代码来任意分配基于指针的2D和3D数组。我这样做是为了避免访问成员所需的隐式乘法。请注意，您可以使用此方法以线性方式和维度方式访问数组成员。我希望它有所帮助：

// allocate a 2D array: 'var' is the variable it will be assigned to, 'type' is like float, double, char *, etc
// your var should be declared as type **
// array contents are calloc()ed
// you can linearly address members by &a[0][0];
#define new2DArray(var,type,xDim,yDim) \
do { \
      type *arrayPtr; \
      type **x;      \
      int i; \
      type *p; \
\
      arrayPtr=(type *)calloc(xDim*yDim,sizeof(type)); \
\
      if(arrayPtr==NULL) \
      {  \
         var=NULL; \
         break; \
      }  \
\
      x=(type **)malloc(xDim*sizeof(type **)); \
\
      if(x==NULL)  \
      {  \
        free(arrayPtr);  \
         var=NULL; \
         break;  \
      }  \
\
      for(i=0,p=arrayPtr;i<xDim;i++,p+=yDim) \
         x[i]=p;  \
\
      var=x; \
} while(0)


#define free2DArray(array) do \
{  \
  if(array!=NULL) \
  {  \
    if(array[0]!=NULL) \
      free(array[0]);  \
   free(array);  \
  }  \
} while(0)


#define free3DArray(array) do\
{ \
   if(array!=NULL) \
   {  \
      if(array[0]!=NULL)  \
         free2DArray(array[0]); \
      free(array); \
   } \
} while(0)

// allocate a 3d array: 'var' is the variable it will be assigned to, 'type' is a valid C variable type
// your var should be declared as type ***
// array contents are calloc()ed
// you can linearly address members by &a[0][0][0];
#define new3DArray(var,type,xDim,yDim,zDim) do\
{ \
   type *arrayPtr; \
   type ***xy;  \
   int x,y;  \
   type *p;  \
\
   arrayPtr=(type *)calloc(xDim*yDim*zDim,sizeof(type)); \
   if(arrayPtr==NULL) \
      break; \
   new2DArray(xy,type *,xDim,yDim);  \
   if(xy==NULL)  \
   {  \
      free(arrayPtr);  \
      var=NULL; \
      break; \
   }  \
   p=arrayPtr;  \
   for(x=0;x<xDim;x++) \
      for(y=0;y<yDim;y++,p+=zDim) \
         xy[x][y]=p; \
\
   var=xy; \
} while(0)

版本C和Fortran90：在MPI_Gather之前填充子阵列的优化

2 个答案: