Question

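My program segfaults when I try to send an MPI derived datatype with "large" arrays (2 arrays of 100 000 floats each). It runs normally with smaller arrays.

Below is a small reproducible example. This program segfaults with the following MPI implementations: IntelMPI, BullXMPI. It works fine with OpenMPI and PlatformMPI. Here is a log with a sample backtrace: http://pastebin.com/FMBpCuj2

Changing mpi_send to mpi_ssend does not help. However, an mpi_send of a single, larger array of 2 * 100 000 floats works fine. To me, this points to a problem with the derived datatype.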

program struct 
include 'mpif.h' 

type Data
  integer :: id
  real, allocatable :: ratio(:)
  real, allocatable :: winds(:)
end type 

type (Data) :: test
integer :: datatype, oldtypes(3), blockcounts(3) 
integer :: offsets(3)
integer :: numtasks, rank, i,  ierr 
integer :: n, status(mpi_status_size)

call mpi_init(ierr) 
call mpi_comm_rank(mpi_comm_world, rank, ierr) 
call mpi_comm_size(mpi_comm_world, numtasks, ierr) 

if (numtasks /= 2) then
  write (*,*) "Needs 2 procs"
  call exit(1)
endif

n = 100000
allocate(test%ratio(n))
allocate(test%winds(n))
if (rank == 0) then
  test%ratio = 6
  test%winds = 7
  test%id = 2
else
  test%id = 0
  test%ratio = 0
  test%winds = 0
endif

call mpi_get_address(test%id, offsets(1), ierr)
call mpi_get_address(test%ratio, offsets(2), ierr)
call mpi_get_address(test%winds, offsets(3), ierr)

do i = 2, size(offsets)
  offsets(i) = offsets(i) - offsets(1)
end do
offsets(1) = 0

oldtypes = (/mpi_integer, mpi_real, mpi_real/)
blockcounts = (/1, n, n/)

call mpi_type_struct(3, blockcounts, offsets, oldtypes, datatype, ierr) 
call mpi_type_commit(datatype, ierr) 

if (rank == 0) then 
  !call mpi_ssend(test, 1, datatype, 1, 0,  mpi_comm_world, ierr) 
  call mpi_send(test, 1, datatype, 1, 0,  mpi_comm_world, ierr) 
else
  call mpi_recv(test, 1, datatype, 0, 0,  mpi_comm_world, status, ierr) 
end if

print *, 'rank= ',rank
print *, 'data= ',test%ratio(1:5),test%winds(1:5)

deallocate (test%ratio)
deallocate (test%winds)
call mpi_finalize(ierr) 


end

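Note: the comparison between MPI implementations is not entirely objective, because the tests were not all run on the same machine (some of them are supercomputers). Still, I don't think that should make a difference.

Edit: the code works with static arrays. This is Fortran 90.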

Answer 1

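May I suggest you use a debugger? I just tried your example in Allinea DDT and saw the problem within two minutes. You need a debugger here: your code "looks correct", so it is time to watch how it behaves in practice.

I switched on memory debugging (a way of forcing some hidden errors to show themselves), and your example then crashed every time with OpenMPI. The crash is in the sender.

So I started stepping through the program in DDT, with DDT's memory debugging turned on.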

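First, you call MPI_Get_address to fill an array of offsets. Look at those offsets! The address of the integer is positive, and the offsets of the allocatable arrays are negative: a bad sign. The addresses have overflowed.

The addresses of allocated data lie in a very different region of memory from the statically allocated integer. If you use 32-bit arithmetic to manipulate 64-bit pointers (MPI_Get_address warns about this), all bets are off. With static arrays it did not crash because their addresses were close enough to the integer's address that nothing overflowed.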
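To make the range argument concrete, here is a minimal sketch that needs no MPI. The three addresses are hypothetical values picked only for illustration (they are not taken from the backtrace), and the sketch assumes the default INTEGER is 32 bits wide, which is common but not guaranteed. It checks whether each offset relative to the integer component would fit in a 32-bit integer.

```fortran
program offset_overflow_sketch
  use iso_fortran_env, only: int32, int64
  implicit none

  ! Hypothetical addresses, chosen only for illustration:
  ! the integer component, a static array right next to it,
  ! and a heap allocation in a distant region of memory.
  integer(int64), parameter :: id_addr     = 140737488347136_int64
  integer(int64), parameter :: static_addr = id_addr + 64_int64
  integer(int64), parameter :: heap_addr   = 47000000000000_int64

  call report("static array     ", static_addr - id_addr)
  call report("allocatable array", heap_addr - id_addr)

contains

  ! Print the offset and say whether a 32-bit integer can hold it.
  subroutine report(label, offset)
    character(len=*), intent(in) :: label
    integer(int64), intent(in) :: offset

    if (abs(offset) <= int(huge(0_int32), int64)) then
      print *, label, ": offset ", offset, " fits in 32 bits"
    else
      print *, label, ": offset ", offset, " does NOT fit in 32 bits"
    end if
  end subroutine report

end program offset_overflow_sketch
```

With these made-up values the nearby static array passes the check and the distant heap allocation does not, which matches the observation that only the allocatable version crashes.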

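You then pass this incorrect offsets array on to MPI_Send, which reads data from where it should not (look at the offsets buffer again to convince yourself) and so segfaults.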

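The real fix here is:

With MPI_Get_address, declare the offsets as INTEGER(KIND=MPI_ADDRESS_KIND), so that 64-bit code gets 64-bit integers.
Replace MPI_Type_struct with MPI_Type_create_struct: the former is deprecated and does not take its offsets as MPI_ADDRESS_KIND integers, only as 4-byte integers, and is therefore flawed.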
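The two changes above can be sketched as a revised version of the question's program. This is one possible rewrite under stated assumptions, not a tested drop-in fix: the program name struct_fixed is made up, and the use of mpi_abort, mpi_type_free, and test%id as the send and receive buffer are additions that the answer does not ask for.

```fortran
program struct_fixed
  implicit none
  include 'mpif.h'

  type Data
    integer :: id
    real, allocatable :: ratio(:)
    real, allocatable :: winds(:)
  end type

  type (Data) :: test
  integer :: datatype, oldtypes(3), blockcounts(3)
  ! Change 1: address-sized integers instead of default integers
  integer(kind=mpi_address_kind) :: offsets(3)
  integer :: numtasks, rank, i, ierr
  integer :: n, status(mpi_status_size)

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, rank, ierr)
  call mpi_comm_size(mpi_comm_world, numtasks, ierr)

  if (numtasks /= 2) then
    write (*,*) "Needs 2 procs"
    ! Assumption: abort through MPI rather than the nonstandard exit()
    call mpi_abort(mpi_comm_world, 1, ierr)
  endif

  n = 100000
  allocate(test%ratio(n))
  allocate(test%winds(n))
  if (rank == 0) then
    test%id = 2
    test%ratio = 6
    test%winds = 7
  else
    test%id = 0
    test%ratio = 0
    test%winds = 0
  endif

  call mpi_get_address(test%id, offsets(1), ierr)
  call mpi_get_address(test%ratio, offsets(2), ierr)
  call mpi_get_address(test%winds, offsets(3), ierr)

  ! Make the displacements relative to test%id
  do i = 2, size(offsets)
    offsets(i) = offsets(i) - offsets(1)
  end do
  offsets(1) = 0

  oldtypes = (/mpi_integer, mpi_real, mpi_real/)
  blockcounts = (/1, n, n/)

  ! Change 2: mpi_type_create_struct takes MPI_ADDRESS_KIND displacements
  call mpi_type_create_struct(3, blockcounts, offsets, oldtypes, datatype, ierr)
  call mpi_type_commit(datatype, ierr)

  ! Assumption: pass test%id as the buffer, because the displacements
  ! were computed relative to its address
  if (rank == 0) then
    call mpi_send(test%id, 1, datatype, 1, 0, mpi_comm_world, ierr)
  else
    call mpi_recv(test%id, 1, datatype, 0, 0, mpi_comm_world, status, ierr)
  end if

  print *, 'rank= ', rank
  print *, 'data= ', test%ratio(1:5), test%winds(1:5)

  call mpi_type_free(datatype, ierr)
  deallocate (test%ratio)
  deallocate (test%winds)
  call mpi_finalize(ierr)
end program struct_fixed
```

Because the displacements depend on where each process allocated its arrays, both ranks build the datatype from their own addresses, as in the original program. Only the width of the offsets and the type constructor change.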

With these changes, your code runs.
