My code is:
def add_outlinks(arr, source):
    for dest in arr:
        if int(dest) in _local_dict:
            _local_dict[dest].in_links.append(int(source))

rpc.init_rpc(
    my_name,
    rank=rank,
    world_size=size,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=10000),
    ),
)  # initial_rpc

# CALL rpc TO OTHER RANKS
if rank == 0:
    print("add-link...")
try:
    array_rpc = list(range(0, size))
    count = 0
    for it in _local_dict:
        count = count + 1
        arr_send = []
        for i in range(0, size):
            arr_send.append([])
        u = _local_dict[it]
        source = u.vertexId
        for i in u.links:
            arr_send[int(i) % size].append(int(i))
        for i in array_rpc:
            my_target = "worker" + str(i)
            if len(arr_send[i]) > 0:
                rpc.rpc_async(my_target, add_outlinks, args=(arr_send[i], source))
except:
    print("rank ", rank, " run ", count, "/", len(_local_dict))
rpc.api._wait_all_workers()
print("shutdown.... rpc... ", rank)
rpc.api._wait_all_workers()
rpc.shutdown()
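The partitioning step in the loop above can be checked on its own, without any RPC involved. The following is a minimal sketch of that step as a standalone function; the name `bucket_by_owner` is hypothetical (it does not appear in the program above), and it assumes, as the loop does, that vertex `v` is owned by worker `v % size`.

```python
def bucket_by_owner(links, size):
    """Group destination vertex ids by the worker assumed to own them.

    Assumption: vertex v lives on worker (v % size), as in the loop above.
    """
    buckets = [[] for _ in range(size)]
    for v in links:
        buckets[int(v) % size].append(int(v))
    return buckets


# Example with 4 workers: vertices 5 and 9 both map to worker 1,
# and worker 3 receives nothing, so no call would be sent to it.
print(bucket_by_owner([5, 8, 9, 2], 4))  # [[8], [5, 9], [2], []]
```

Empty buckets correspond to the `len(arr_send[i]) > 0` check, which skips the call for that worker.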
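1. How many objects had been processed after 10000 seconds?
-> The output is shown below. worker0 does not appear in it. I tried to print the value of count, but nothing was printed for the count variable.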
....
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:45462
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:44970
....
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:19635
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:27553
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:44970
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:47501: Connection reset by peer
Traceback (most recent call last):
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "pagerank.py", line 380, in init_process
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:44931: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
...
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [2001:700:4a01:10::38]:22942: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 240, in shutdown
_wait_all_workers()
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:5848: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:50095: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:29331
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:57022: Connection reset by peer
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 165, in _wait_all_workers
args=(sequence_id, self_worker_name,),
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:15236
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:2720
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:23214
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:38547
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:50607
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
...
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:12173
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 554, in rpc_sync
return fut.wait()
RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:57022: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:3715
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:2693
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
During handling of the above exception, another exception occurred:
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:9877
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:3715
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
Traceback (most recent call last):
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [2001:700:4a01:10::38]:22942: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:36780: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:47501: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:44931: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:3213: Connection reset by peer
Traceback (most recent call last):
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:13723
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "pagerank.py", line 380, in init_process
print("shutdown.... rpc... ", rank)
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:9140
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 240, in shutdown
_wait_all_workers()
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 165, in _wait_all_workers
args=(sequence_id, self_worker_name,),
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 554, in rpc_sync
return fut.wait()
RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:9889: Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
KeyboardInterrupt
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:34716
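I changed the code to print the count like this:
for it in _local_dict:
    if rank == 0:
        count = count + 1
        print(count)
-> Every element in _local_dict is processed. However, the program then stops because of the RPC timeout. This means that all RPCs were issued and the program was waiting for them to complete.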
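2. Are you also using distributed autograd / optimizer in this case?
-> Not yet. Is that the one described at https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html ?
3. Are any of these CPUs on the same machine? (If so, shm might help.)
-> I am using 32 CPUs on this machine, which is all of the CPUs it has.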
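I tried with a smaller data set (about 60 elements in _local_dict) and it worked. Maybe the problem is the large number of RPC calls?
For the real data, _local_dict has nearly 190000 elements and there are 32 workers. For each element, I call rpc(...) up to 32 times (once per worker that has a non-empty list).
So each worker issues up to 190000 * 32 rpc(...) calls. With 32 workers, that is up to 190000 * 32 * 32 RPC calls in total.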
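To put a number on that estimate, here is the arithmetic written out as a small script. It is an upper bound only, since the loop skips destination workers whose list is empty; the function name is hypothetical.

```python
def rpc_call_upper_bound(elements_per_worker, num_workers):
    """Upper bound on rpc_async calls: one per (element, destination worker)."""
    per_worker = elements_per_worker * num_workers
    total = per_worker * num_workers
    return per_worker, total


per_worker, total = rpc_call_upper_bound(190000, 32)
print(per_worker)  # 6080000 calls issued by each worker
print(total)       # 194560000 calls across all 32 workers
```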
Is there a problem with PyTorch RPC, or with the dictionary data type in Python?
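One way to test the "too many calls" hypothesis would be to batch the work so that each worker sends at most one message per destination instead of one per element. The sketch below is untested against the real data and shows only the pure-Python batching step. The name `batch_links` and the input layout (a dict mapping each source vertex id to its outgoing link ids) are assumptions for illustration; the receiving function would also need to accept a list of (dest, source) pairs rather than a single source.

```python
def batch_links(local_links, size):
    """Collect (dest, source) pairs per destination worker.

    local_links: dict of source vertex id -> iterable of outgoing link ids
    (hypothetical layout). Assumes vertex v is owned by worker (v % size).
    """
    batches = [[] for _ in range(size)]
    for source, links in local_links.items():
        for dest in links:
            batches[int(dest) % size].append((int(dest), int(source)))
    return batches


# Two workers: vertex 0 links to 1 and 2, vertex 4 links to 1.
print(batch_links({0: [1, 2], 4: [1]}, 2))
# [[(2, 0)], [(1, 0), (1, 4)]]
```

With this layout, each worker would make at most `size` RPC calls (one per non-empty batch), at the cost of larger messages.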
Thanks.