I am working with the PyTorch framework and need to build a custom synchronization process. The best approach I could come up with is to spawn a thread that waits for the sender nodes to deliver their respective tensor values. The problem is that the sending node gets stuck inside the dist.send() call.

Here is a trimmed-down version of the code I am testing; it represents what the real code does:
import torch
import torch.distributed as dist

def run(rank_local, rank, world_size):
    print("I WAS SPAWNED ", rank_local, " OF ", rank)

    tensor_1 = torch.zeros(1)
    tensor_1 += 1

    # Receive loop (runs forever in this trimmed example)
    while True:
        print("I am spawn of: ", rank, "and my tensor value before receive: ", tensor_1[0])
        # dist.recv blocks until a tensor arrives and returns the sender's rank
        nt = dist.recv(tensor_1)
        print("I am spawn of: ", rank, "and my tensor value after receive from", nt, " is: ", tensor_1[0])


def communication(tensor, rank):
    if rank != 0:
        tensor += (100 + rank)
        dist.send(tensor, dst=0)
    else:
        tensor -= 1000
        dist.send(tensor, dst=1)
    print("I AM DONE WITH MY SENDS NOW WHAT?: ", rank)


if __name__ == '__main__':
    # Initialize the process group with the MPI backend
    dist.init_process_group(backend="mpi", group_name="main")

    # Get current process information
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    #torch.cuda.set_device(rank%2)

    # Establish local rank and spawn a receiver thread on this node
    # (ThreadWithReturnValue is a custom threading.Thread subclass)
    p = ThreadWithReturnValue(target=run, args=(0, rank, world_size))  # mp.Process(target=run, args=(0, rank, world_size))
    p.start()

    tensor = torch.zeros(1)
    communication(tensor, rank)
    p.join()
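ThreadWithReturnValue is not defined in the snippet above. For completeness, here is a minimal sketch of what it is assumed to look like (a common pattern: a threading.Thread subclass whose join() hands back the target function's return value); the actual class in my code may differ slightly:

import threading

class ThreadWithReturnValue(threading.Thread):
    # Assumed helper: wraps a target callable and exposes its return value via join()
    def __init__(self, group=None, target=None, name=None, args=(), kwargs=None, daemon=None):
        super().__init__(group=group, name=name, daemon=daemon)
        self._target_fn = target
        self._args = args
        self._kwargs = kwargs if kwargs is not None else {}
        self._return = None

    def run(self):
        # Store the target's result so join() can return it
        if self._target_fn is not None:
            self._return = self._target_fn(*self._args, **self._kwargs)

    def join(self, timeout=None):
        super().join(timeout)
        return self._return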
I don't know how to resolve this. Note that the code works fine when I remove the line torch.cuda.set_device(rank%2), but I do want to run my model on my GPUs. Any ideas?
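For context, since the script uses the MPI backend, I am assuming it is launched with mpirun so that each rank runs as a separate process, roughly like this (the rank count is just an example):

mpirun -np 2 python script.py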