Hi all. I am running into a problem when launching a distributed program with "mpi" as the backend. The program is as follows:
import os
import socket

import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size, hostname):
    print("I am {} of {} in {}".format(rank, size, hostname))
    tensor = torch.zeros(1)
    group = dist.new_group([0, 1, 2])
    if rank == 0:
        # Source rank provides one tensor per process in the group.
        scatter_list = [torch.zeros(1) for _ in range(3)]
        dist.scatter(tensor=tensor, src=0, scatter_list=scatter_list, group=group)
        print("Master has completed Scatter")
    else:
        tensor += 1
        # Non-source ranks: this is the call that raises the error below.
        dist.scatter(tensor=tensor, src=0, group=group)
        print("worker has completed scatter")
    print('Rank', rank, 'has data', tensor[0])


def init_process(rank, size, hostname, fn, backend='tcp'):
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size, hostname)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    hostname = socket.gethostname()
    p = Process(target=init_process,
                args=(world_rank, world_size, hostname, run, 'mpi'))
    p.start()
    p.join()
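For context, I launch the script with Open MPI, roughly like mpirun -np 3 python mpi_test.py, so that OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK are set for each process (the -np 3 matches the group [0, 1, 2] above; the exact host setup is omitted here).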
However, whenever the program starts, it raises the following error:
File "mpi_test.py", line 17, in run
dist.scatter(tensor= tensor, src=0, group=group)
TypeError: scatter() missing 1 required positional argument: 'scatter_list'
The error is raised by ranks 1 and 2, which should not need the 'scatter_list' argument. I have tried many things, but none of them worked. Does anyone know why this happens? Thanks for reading.
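One variant I am considering is to also pass an explicit empty scatter_list on the worker ranks, but I am not sure whether that is the intended usage for non-source ranks:

    # Guess: give non-source ranks an explicit empty scatter_list so the
    # required positional argument is satisfied; only rank 0 supplies real data.
    dist.scatter(tensor=tensor, src=0, scatter_list=[], group=group)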