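I am writing a simple program to learn how to define an MPI_Datatype and use it together with MPI_Gatherv. I wanted to make sure I could combine variable-length, dynamically allocated arrays of structured data on one process, and this seems to work fine until I call MPI_Finalize(). I have confirmed that this is where the problem shows up, using print statements and the Eclipse PTP debugger (the backend is gdb-mi). My main question is: how do I get rid of the segmentation fault?
The segfault does not occur every time I run the code. For example, it has not happened with 2 or 3 processes, but it tends to happen regularly when I run with about 4 or more processes.
Also, when I run this code under valgrind, the segmentation fault does not occur. I do get error messages from valgrind, but I find the output hard to understand when MPI functions are involved, even with a large number of targeted suppressions. I am also concerned that if I add more suppressions, I will silence a useful error message.
I compile the normal code with these flags, so I am using the C99 standard in both cases: -ansi -pedantic -Wall -O2 -march=barcelona -fomit-frame-pointer -std=c99 and the debug build with: -ansi -pedantic -std=c99 -Wall -g
Both use the gcc 4.4 mpicc compiler and run on a cluster with Red Hat Linux and Open MPI v1.4.5. Please let me know if I have left out any other important information. Here is the code, and thanks in advance: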
//#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
//#include <limits.h>
#include "mpi.h"
#define FULL_PROGRAM 1
struct CD{
    int int_ID;
    double dbl_ID;
};
int main(int argc, char *argv[]) {
    int numprocs, myid, ERRORCODE;
#if FULL_PROGRAM
    struct CD *myData=NULL; //Each process contributes an array of data, comprised of 'struct CD' elements
    struct CD *allData=NULL; //root will dynamically allocate this array to store all the data from rest of the processes
    int *p_lens=NULL, *p_disp=NULL; //p_lens stores the number of elements in each process' array, p_disp stores the displacements in bytes
    int MPI_CD_size; //stores the size of the MPI_Datatype that is defined to allow communication operations using 'struct CD' elements
    int mylen, total_len=0; //mylen should be the length of each process' array
    //MAXlen is the maximum allowable array length
    //total_len will be the sum of mylen across all processes
    // ============ variables related to defining new MPI_Datatype at runtime ====================================================
    struct CD sampleCD = {.int_ID=0, .dbl_ID=0.0};
    int blocklengths[2]; //this describes how many blocks of identical data types will be in the new MPI_Datatype
    MPI_Aint offsets[2]; //this stores the offsets, in bytes(bits?), of the blocks from the 'start' of the datatype
    MPI_Datatype block_types[2]; //this stores which built-in data types the blocks are comprised of
    MPI_Datatype myMPI_CD; //just the name of the new datatype
    MPI_Aint myStruct_address, int_ID_address, dbl_ID_address, int_offset, dbl_offset; //useful place holders for filling the arrays above
    // ===========================================================================================================================
#endif
    // =================== Initializing MPI functionality ============================
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    // ===============================================================================
#if FULL_PROGRAM
    // ================== This part actually formally defines the MPI datatype ===============================================
    MPI_Get_address(&sampleCD, &myStruct_address); //starting point of struct CD
    MPI_Get_address(&sampleCD.int_ID, &int_ID_address); //starting point of first entry in CD
    MPI_Get_address(&sampleCD.dbl_ID, &dbl_ID_address); //starting point of second entry in CD
    int_offset = int_ID_address - myStruct_address; //offset from start of first to start of CD
    dbl_offset = dbl_ID_address - myStruct_address; //offset from start of second to start of CD
    blocklengths[0]=1; blocklengths[1]=1; //array telling it how many blocks of identical data types there are, and the number of entries in each block
    //This says there are two blocks of identical data-types, and both blocks have only one variable in them
    offsets[0]=int_offset; offsets[1]=dbl_offset; //the first block starts at int_offset, the second block starts at dbl_offset (from 'myData_address'
    block_types[0]=MPI_INT; block_types[1]=MPI_DOUBLE; //the first block contains MPI_INT, the second contains MPI_DOUBLE
    MPI_Type_create_struct(2, blocklengths, offsets, block_types, &myMPI_CD); //this uses the above arrays to define the MPI_Datatype...an MPI-2 function
    MPI_Type_commit(&myMPI_CD); //this is the final step to defining/reserving the data type
    // ========================================================================================================================
    mylen = myid*2; //each process is told how long its array should be...I used to define that randomly but that just makes things messier
    p_lens = (int*) calloc((size_t)numprocs, sizeof(int)); //allocate memory for the number of elements (p_lens) and offsets from the start of the recv buffer(d_disp)
    p_disp = (int*) calloc((size_t)numprocs, sizeof(int));
    myData = (struct CD*) calloc((size_t)mylen, sizeof(struct CD)); //allocate memory for each process' array
    //if mylen==0, 'a unique pointer to the heap is returned'
    if(!p_lens) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
    if(!p_disp) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
    if(!myData) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
    for (int k=0; k<numprocs; ++k) {
        if(myid==k) {
            //printf("\t ID %d has %d entries: { ", myid, mylen);
            for(int i=0; i<mylen; ++i) {
                myData[i]= (struct CD) {.int_ID=myid*(i+1), .dbl_ID=myid*(i+1)}; //fills data elements with simple pattern
                //printf("%d: (%d,%lg) ", i, myData[i].int_ID, myData[i].dbl_ID);
            }
            //printf("}\n");
        }
    }
    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
    MPI_Gather(&mylen, 1, MPI_INT, p_lens, 1, MPI_INT, 0, MPI_COMM_WORLD); //Each process sends root the length of the vector they'll be sending
#if 1
    MPI_Type_size(myMPI_CD, &MPI_CD_size); //gets the size of the MPI_Datatype for p_disp
#else
    MPI_CD_size = sizeof(struct CD); //using this doesn't change things too much...
#endif
    for(int j=0;j<numprocs;++j) {
        total_len += p_lens[j];
        if (j==0) { p_disp[j] = 0; }
        else { p_disp[j] = p_disp[j-1] + p_lens[j]*MPI_CD_size; }
    }
    if (myid==0) {
        allData = (struct CD*) calloc((size_t)total_len, sizeof(struct CD)); //allocate array
        if(!allData) { MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE); }
    }
    MPI_Gatherv(myData, mylen, myMPI_CD, allData, p_lens, p_disp, myMPI_CD, 0, MPI_COMM_WORLD); //each array sends root process their array, which is stored in 'allData'
    // ============================== OUTPUT CONFIRMING THAT COMMUNICATIONS WERE SUCCESSFUL=========================================
    if(myid==0) {
        for(int i=0;i<numprocs;++i) {
            printf("\n\tElements from %d on MASTER are: { ",i);
            for(int k=0;k<p_lens[i];++k) { printf("%d: (%d,%lg) ", k, (allData+p_disp[i]+k)->int_ID, (allData+p_disp[i]+k)->dbl_ID); }
            if(p_lens[i]==0) printf("NOTHING ");
            printf("}\n");
        }
        printf("\n"); //each data element should appear as two identical numbers, counting upward by the process ID
    }
    // ==========================================================================================================
    if (p_lens) { free(p_lens); p_lens=NULL; } //adding this in didn't get rid of the MPI_Finalize seg-fault
    if (p_disp) { free(p_disp); p_disp=NULL; }
    if (myData) { free(myData); myData=NULL; }
    if (allData){ free(allData); allData=NULL; } //the if statement ensures that processes not allocating memory for this pointer don't free anything
    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
    printf("ID %d: I have reached the end...before MPI_Type_free!\n", myid);
    // ====================== CLEAN UP ================================================================================
    ERRORCODE = MPI_Type_free(&myMPI_CD); //this frees the data type...not always necessary, but a good practice
    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD); //purely for keeping the output organized by give a delay in time
    if(ERRORCODE!=MPI_SUCCESS) { printf("ID %d...MPI_Type_free was not successful\n", myid); MPI_Abort(MPI_COMM_WORLD, 911); exit(EXIT_FAILURE); }
    else { printf("ID %d...MPI_Type_free was successful, entering MPI_Finalize...\n", myid); }
#endif
    ERRORCODE=MPI_Finalize();
    for(double temp=0.0;temp<1e7;++temp) temp += exp(-10.0); //NO MPI_Barrier AFTER MPI_Finalize!
    if(ERRORCODE!=MPI_SUCCESS) { printf("ID %d...MPI_Finalize was not successful\n", myid); MPI_Abort(MPI_COMM_WORLD, 911); exit(EXIT_FAILURE); }
    else { printf("ID %d...MPI_Finalize was successful\n", myid); }
    return EXIT_SUCCESS;
}
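Answer 0 (score: 3)
The outer loop over k is bogus, but it is not technically wrong; it is just useless.
The real problem is that the displacements you pass to MPI_GATHERV are wrong. If you run the program through valgrind, you will see something like this: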
==28749== Invalid write of size 2
==28749== at 0x4A086F4: memcpy (mc_replace_strmem.c:838)
==28749== by 0x4C69614: unpack_predefined_data (datatype_unpack.h:41)
==28749== by 0x4C6B336: ompi_generic_simple_unpack (datatype_unpack.c:418)
==28749== by 0x4C7288F: ompi_convertor_unpack (convertor.c:314)
==28749== by 0x8B295C7: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:216)
==28749== by 0x935723C: mca_btl_sm_component_progress (btl_sm_component.c:426)
==28749== by 0x51D4F79: opal_progress (opal_progress.c:207)
==28749== by 0x8B225CA: opal_condition_wait (condition.h:99)
==28749== by 0x8B22718: ompi_request_wait_completion (request.h:375)
==28749== by 0x8B231E1: mca_pml_ob1_recv (pml_ob1_irecv.c:104)
==28749== by 0x955E7A7: mca_coll_basic_gatherv_intra (coll_basic_gatherv.c:85)
==28749== by 0x9F7CBFA: mca_coll_sync_gatherv (coll_sync_gatherv.c:46)
==28749== Address 0x7b1d630 is not stack'd, malloc'd or (recently) free'd
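This indicates that MPI_GATHERV was given bad information somehow.
(There are other valgrind warnings that come from libltdl in Open MPI, which are unfortunately unavoidable because of a bug in libltdl, and another from PLPA, which is also unavoidable because it does this intentionally, for reasons that are not interesting to discuss here.)
Looking at your displacement computation, I see: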
total_len += p_lens[j];
if (j == 0) {
    p_disp[j] = 0;
} else {
    p_disp[j] = p_disp[j - 1] + p_lens[j] * MPI_CD_size;
}
But MPI gather displacements are counted in units of the receive datatype (multiples of its extent), not in bytes. So it should be:
p_disp[j] = total_len;
total_len += p_lens[j];
After I made this change, the MPI_GATHERV valgrind warning went away.
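To make the difference concrete, here is a minimal sketch in plain C (no MPI needed) that compares the two displacement formulas. The helper names `element_displs` and `question_displs` are hypothetical and do not appear in the original code. The type size of 12 bytes used in the comments is an assumption: it is what MPI_Type_size would typically report for a 4-byte int plus an 8-byte double, since it counts data bytes only and excludes padding.

```c
#include <assert.h>

/* Hypothetical helper: fills displs[] with displacements counted in
 * elements of the receive datatype, which is the unit MPI_Gatherv
 * expects, and returns the total number of elements. */
int element_displs(const int *counts, int n, int *displs)
{
    int total = 0;
    for (int j = 0; j < n; ++j) {
        assert(counts[j] >= 0);  /* receive counts must be non-negative */
        displs[j] = total;       /* rank j starts where the earlier ranks end */
        total += counts[j];
    }
    return total;
}

/* Reproduces the formula from the question, for comparison only.
 * type_size plays the role of MPI_CD_size (assumed to be 12 here). */
void question_displs(const int *counts, int n, int type_size, int *displs)
{
    for (int j = 0; j < n; ++j) {
        if (j == 0) {
            displs[j] = 0;
        } else {
            displs[j] = displs[j - 1] + counts[j] * type_size;
        }
    }
}
```

With 4 processes and mylen = 2*rank, the counts are {0, 2, 4, 6} and the receive buffer holds 12 elements. The element-based displacements are {0, 0, 2, 6}, while the formula from the question yields {0, 24, 72, 144} under the 12-byte assumption, so every non-empty contribution lands far beyond the end of allData. Note that the print loop in the question already indexes allData with p_disp[i]+k, so it becomes consistent once the displacements are in elements.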
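Answer 1 (score: 1)
This outer loop over 'k' is just bogus. Its body is executed only for k == myid, which is a constant within each running process. k is never referenced inside the loop, except in the comparison with the effectively constant myid.
Also, the line mylen = myid*2; is frowned upon. I suggest you change it to a constant. Here is the loop in question: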
for (int k=0; k<numprocs; ++k) {
    if(myid==k) {
        //printf("\t ID %d has %d entries: { ", myid, mylen);
        for(int i=0; i<mylen; ++i) {
            myData[i]= (struct CD) {.int_ID=myid*(i+1), .dbl_ID=myid*(i+1)}; //fills data elements with simple pattern
            //printf("%d: (%d,%lg) ", i, myData[i].int_ID, myData[i].dbl_ID);
        }
        //printf("}\n");
    }
}
So, given that myid is always in the range 0 to numprocs-1, this whole silly construct can be reduced to:
for(int i=0; i<mylen; ++i) {
    myData[i].int_ID=myid*(i+1);
    myData[i].dbl_ID=myid*(i+1);
}
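If you want to convince yourself that the two versions are equivalent, a small sketch like the following can be checked without MPI. The function names `fill_with_k_loop` and `fill_direct` are hypothetical wrappers introduced here for illustration; the loop bodies are taken from the snippets above.

```c
#include <assert.h>
#include <stddef.h>

struct CD {
    int int_ID;
    double dbl_ID;
};

/* The construct from the question: the body runs only when k == myid,
 * which happens exactly once as long as 0 <= myid < numprocs. */
void fill_with_k_loop(struct CD *data, int len, int myid, int numprocs)
{
    for (int k = 0; k < numprocs; ++k) {
        if (myid == k) {
            for (int i = 0; i < len; ++i) {
                data[i] = (struct CD){ .int_ID = myid * (i + 1),
                                       .dbl_ID = myid * (i + 1) };
            }
        }
    }
}

/* The simplified version: same pattern, no outer loop. */
void fill_direct(struct CD *data, int len, int myid)
{
    assert(data != NULL || len == 0);  /* an empty array needs no storage */
    for (int i = 0; i < len; ++i) {
        data[i].int_ID = myid * (i + 1);
        data[i].dbl_ID = myid * (i + 1);
    }
}
```

Both functions write the same values for any valid rank, so dropping the outer loop changes nothing except readability.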