mpi4py: Internal Error: invalid error code 409e0e (Ring ids do not match)

Date: 2014-09-03 00:37:40

Tags: parallel-processing mpi mpi4py

I'm coding in Python and using mpi4py to run some optimization in parallel. I'm doing ordinary least squares (OLS), and my data is too large to fit on one processor, so I have a master process that spawns child processes. Each child process imports the slice of the data that it works with throughout the optimization.

I use scipy.optimize.minimize for the optimization: the child processes receive a coefficient guess from the parent process and report the sum of squared errors (SSE) back to the parent, and scipy.optimize.minimize iterates, trying to find the minimizer of the SSE. After each iteration of the minimizer, the parent broadcasts a new coefficient guess to the child processes, which then compute the SSE again. In the child processes this algorithm sits inside a while loop; in the parent process I simply call scipy.optimize.minimize.
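For reference, the serial version of the inner OLS step looks like this (a minimal self-contained sketch with synthetic data, not my actual code; in my setup the SSE computation inside `sse` is what the child processes do in parallel on their data slices):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-ins for my D and ZX matrices (noise-free, so the SSE
# minimum is exactly true_beta).
rng = np.random.default_rng(0)
ZX = rng.normal(size=(50, 3))
true_beta = np.array([1.0, -2.0, 0.5])
D = ZX @ true_beta

def sse(beta):
    # Sum of squared errors -- the quantity each child computes on its
    # data slice and the parent accumulates with Reduce.
    return float(((D - ZX @ beta) ** 2).sum())

res = minimize(sse, np.zeros(3), method='Powell')
```

Powell's method evaluates `sse` many times per iteration, which is why the child processes' while loops have to keep spinning until the parent explicitly tells them to stop.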

In the part that's giving me trouble, I'm doing a nested optimization, i.e. an optimization inside an optimization. The inner optimization is the OLS regression described above, and the outer optimization minimizes another function that uses the coefficients from the inner optimization (the OLS regression).

So in my parent process I have two functions to minimize, where the second function calls the first and runs a fresh inner optimization on every iteration of the outer one. The child processes have two nested while loops, one per optimization.
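To illustrate the nesting without any MPI, here is a self-contained toy version of the structure (synthetic data and an arbitrary inner-coefficient target of 2.0; not my actual model):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))

def inner_coef(y):
    # Inner optimization: OLS fit of W(y) on X via Powell, returning one
    # fitted coefficient (like delta_RF in my code).
    W = X @ np.array([y, 1.0])          # toy "data" that depends on y
    res = minimize(lambda b: float(((W - X @ b) ** 2).sum()),
                   np.zeros(2), method='Powell')
    return res.x[0]

def outer_obj(yv):
    # Outer optimization: drive the inner coefficient toward the target.
    return abs(inner_coef(np.ravel(yv)[0]) - 2.0)

y_minned = minimize(outer_obj, np.array([0.0]), method='Powell')
```

Every evaluation of `outer_obj` runs a complete inner Powell minimization, which in the MPI version means a complete inner while loop on every child per outer iteration.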

Hopefully that all makes sense. If more information is needed, please let me know.

Here is the relevant code from the parent process:

from mpi4py import MPI
import numpy as np
import scipy.optimize
import sys

comm = MPI.COMM_SELF.Spawn(sys.executable, args=['IVQTparallelSlave_cdf.py'], maxprocs=processes)

# First stage: reg D on Z, X
def OLS(betaguess):
    comm.Bcast([betaguess,MPI.DOUBLE], root=MPI.ROOT)
    SSE = np.array([0],dtype='d')
    comm.Reduce(None,[SSE,MPI.DOUBLE], op=MPI.SUM, root = MPI.ROOT)
    comm.Bcast([np.array([1],'i'),MPI.INT], root=MPI.ROOT)
    return SSE


# Here is the CDF function.
def CDF(yguess, delta_FS, tau):
    # Calculate W(y) in the slave process
    # Solving the Reduced form after every iteration: reg W(y) on Z, X
    comm.Bcast([yguess,MPI.DOUBLE], root=MPI.ROOT)
    betaguess = np.zeros(94).astype('d')
    ###########
    # This calculates the reduced form coefficient
    coeffs_RF = scipy.optimize.minimize(OLS, betaguess, method='Powell')
    # This little block is to get the slave processes to stop
    comm.Bcast([betaguess,MPI.DOUBLE], root=MPI.ROOT)
    SSE = np.array([0],dtype='d')
    comm.Reduce(None,[SSE,MPI.DOUBLE], op=MPI.SUM, root = MPI.ROOT)
    cont = np.array([0],'i')
    comm.Bcast([cont,MPI.INT], root=MPI.ROOT)
    ###########
    contCDF = np.array([1],'i')
    comm.Bcast([contCDF,MPI.INT], root=MPI.ROOT) # This is to keep the outer while loop going

    delta_RF = coeffs_RF.x[1]

    return abs(delta_RF/delta_FS - tau)

########### This one finds Y(1) ##############

betaguess = np.zeros(94).astype('d')

######### First Stage: reg D on Z, X ######### 
coeffs_FS = scipy.optimize.minimize(OLS, betaguess, method='Powell')

print(coeffs_FS)

# This little block is to get the slave processes' while loops to stop
comm.Bcast([betaguess,MPI.DOUBLE], root=MPI.ROOT)
SSE = np.array([0],dtype='d')
comm.Reduce(None,[SSE,MPI.DOUBLE], op=MPI.SUM, root = MPI.ROOT)
cont = np.array([0],'i')
comm.Bcast([cont,MPI.INT], root=MPI.ROOT)

delta_FS = coeffs_FS.x[1]

######### CDF Function ######### 
yguess = np.array([3340],'d')
CDF1 = lambda yguess: CDF(yguess, delta_FS, tau)
y_minned_1 = scipy.optimize.minimize(CDF1, yguess, method='Powell')

And here is the relevant code from the child processes:

# IVQTparallelSlave_cdf.py
from mpi4py import MPI
import numpy as np

comm = MPI.Comm.Get_parent()
rank = comm.Get_rank()

.
.
.
# Importing data. The data is the matrices D, and ZX
.
.
.
########### This one finds Y(1) ##############
######### First Stage: reg D on Z, X ######### 
cont = np.array([1],'i')
betaguess = np.zeros(94).astype('d')

# This corresponds to "coeffs_FS = scipy.optimize.minimize(OLS, betaguess, method='Powell')" in the parent process
while cont[0]:
    comm.Bcast([betaguess,MPI.DOUBLE], root=0)

    SSE = np.array(((D - np.dot(ZX,betaguess).reshape(local_n,1))**2).sum(),'d')

    comm.Reduce([SSE,MPI.DOUBLE],None, op=MPI.SUM, root = 0)
    comm.Bcast([cont,MPI.INT], root=0)

if rank == 0: print('1st Stage OLS regression done')

######### CDF Function ######### 
cont = np.array([1],'i')
betaguess = np.zeros(94).astype('d')
contCDF = np.array([1],'i')
yguess = np.array([0],'d')

# This corresponds to "y_minned_1 = scipy.optimize.minimize(CDF1, yguess, method='Powell')" in the parent process
while contCDF[0]:
    comm.Bcast([yguess,MPI.DOUBLE], root=0)
    # This calculates the reduced form coefficient
    while cont[0]: 
        comm.Bcast([betaguess,MPI.DOUBLE], root=0)

        W = 1*(Y<=yguess)*D
        SSE = np.array(((W - np.dot(ZX,betaguess).reshape(local_n,1))**2).sum(),'d')    

        comm.Reduce([SSE,MPI.DOUBLE],None, op=MPI.SUM, root = 0)
        comm.Bcast([cont,MPI.INT], root=0)
        #if rank==0: print cont
    comm.Bcast([contCDF,MPI.INT], root=0)
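Since MPI collectives have to be executed by the parent and by every child in matching order, here is the sequence of collective calls that one outer (CDF) iteration is supposed to make on each side, written out by hand as a bookkeeping sketch (no MPI involved; `k` and `m` are hypothetical call counts, not values from my runs):

```python
def parent_ops(k):
    # One outer iteration as executed by the parent, assuming the inner
    # Powell minimize evaluates OLS() k times.
    ops = ['bcast_yguess']
    ops += ['bcast_beta', 'reduce_sse', 'bcast_cont'] * k  # k calls to OLS()
    ops += ['bcast_beta', 'reduce_sse', 'bcast_cont']      # the stop block
    ops += ['bcast_contCDF']
    return ops

def child_ops(m):
    # The same iteration as executed by a child, assuming its inner while
    # loop body runs m times before cont[0] becomes 0.
    ops = ['bcast_yguess']
    ops += ['bcast_beta', 'reduce_sse', 'bcast_cont'] * m
    ops += ['bcast_contCDF']
    return ops
```

The two sequences only line up when `m == k + 1`, i.e. the child's inner loop has to run once per objective evaluation plus once for the parent's stop block; any mismatch leaves a Bcast on one side paired with a different collective on the other.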

My problem is that after just one iteration through the outer minimization, it spits out the following error:

Internal Error: invalid error code 409e0e (Ring ids do not match) in MPIR_Bcast_impl:1328
Traceback (most recent call last):
  File "IVQTparallelSlave_cdf.py", line 100, in <module>
    if rank==0: print 'CDF iteration'
  File "Comm.pyx", line 406, in mpi4py.MPI.Comm.Bcast (src/mpi4py.MPI.c:62117)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Bcast(1478).....: MPI_Bcast(buf=0x2409f50, count=1, MPI_INT, root=0, comm=0x84000005) failed
MPIR_Bcast_impl(1328): 

I haven't been able to find any information about this "ring id" error or how to fix it. Any help would be much appreciated. Thanks!

0 Answers