PyTorch with a large amount of data

Time: 2020-06-11 05:36:29

Tags: python multithreading pytorch bigdata

My code is:

import datetime
import torch.distributed.rpc as rpc


def add_outlinks(arr, source):
    # Runs on the callee: record `source` as an in-link of each local vertex in `arr`.
    for dest in arr:
        if int(dest) in _local_dict:
            _local_dict[int(dest)].in_links.append(int(source))


# Excerpt from init_process(); my_name, rank, size and _local_dict are set up earlier.
rpc.init_rpc(
    my_name, rank=rank, world_size=size,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=10000)))  # initialize RPC

# Call RPC on the other ranks
if rank == 0:
    print("add-link...")
try:
    array_rpc = list(range(0, size))
    count = 0
    for it in _local_dict:
        count = count + 1
        # One bucket of destination vertex ids per rank
        arr_send = [[] for _ in range(size)]
        u = _local_dict[it]
        source = u.vertexId

        for i in u.links:
            arr_send[int(i) % size].append(int(i))
        for i in array_rpc:
            my_target = "worker" + str(i)
            if len(arr_send[i]) > 0:
                rpc.rpc_async(my_target, add_outlinks, args=(arr_send[i], source))
except:
    print("rank ", rank, " run ", count, "/", len(_local_dict))
rpc.api._wait_all_workers()
print("shutdown.... rpc... ", rank)
rpc.api._wait_all_workers()
rpc.shutdown()

  • arr_send[i] is sent to rank i. The elements of _local_dict can be processed in parallel.
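
As a rough illustration of the bucketing described above, here is a minimal standalone sketch. It is not the code from pagerank.py: `partition_by_rank` is a hypothetical helper name, and the sample ids are made up. It only assumes, as in the posted loop, that a vertex id belongs to rank `id % size` and that workers are named "worker0", "worker1", and so on.

```python
def partition_by_rank(links, size):
    """Group vertex ids into one bucket per rank, using id % size as the owner."""
    buckets = [[] for _ in range(size)]
    for link in links:
        buckets[int(link) % size].append(int(link))
    return buckets


# Made-up sample ids; the real ones come from u.links.
buckets = partition_by_rank(["0", "5", "33", "64"], size=32)

# Only non-empty buckets would lead to a call to the matching worker.
targets = {"worker" + str(rank): ids for rank, ids in enumerate(buckets) if ids}
print(targets)
```

With these sample ids only three of the 32 buckets are non-empty, so only three workers would be contacted for this vertex.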

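1. How many objects have been processed after 10000 seconds?

-> The output is shown below. I cannot see worker0 in the output. I tried to print the value of "count", but there is no output for the "count" variable.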

....
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:45462
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:44970
....

[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:19635
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:27553
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:44970
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:47501: Connection reset by peer
Traceback (most recent call last):
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "pagerank.py", line 380, in init_process
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:44931: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
...
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [2001:700:4a01:10::38]:22942: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 240, in shutdown
    _wait_all_workers()
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:5848: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:50095: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:29331
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:57022: Connection reset by peer
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 165, in _wait_all_workers
    args=(sequence_id, self_worker_name,),
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:15236
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:2720
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:23214
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:38547
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:50607
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
...
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:12173
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 554, in rpc_sync
    return fut.wait()
RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:57022: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:3715
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:2693
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure

During handling of the above exception, another exception occurred:

[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:9877
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:3715
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
Traceback (most recent call last):
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [2001:700:4a01:10::38]:22942: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:36780: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:47501: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:44931: Connection reset by peer

[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:3213: Connection reset by peer
Traceback (most recent call last):
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:13723
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "pagerank.py", line 380, in init_process
    print("shutdown.... rpc... ", rank)
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:9140
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 240, in shutdown
    _wait_all_workers()
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 165, in _wait_all_workers
    args=(sequence_id, self_worker_name,),
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
    return func(*args, **kwargs)
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 554, in rpc_sync
    return fut.wait()

RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:9889: Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
  File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
KeyboardInterrupt
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:34716



I changed the code to print the count as follows:

for it in _local_dict:
    if rank == 0:
        count = count + 1
        print(count)

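-> The loop runs over all elements of _local_dict. However, the program then stops because of the RPC timeout. This means that all the RPCs are issued and the program is waiting for them to finish.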
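
To illustrate why the loop can finish (and print every count) long before the calls themselves complete, here is a small sketch of the submit-then-wait pattern. It uses `concurrent.futures` from the standard library as a stand-in, not `torch.distributed.rpc`; `slow_task` and `submit_then_wait` are hypothetical names, and the sleep simply simulates a slow remote call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(x):
    # Stand-in for a remote call that takes a while to complete.
    time.sleep(0.01)
    return x * 2


def submit_then_wait(items, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submitting returns a future right away; the work is still pending.
        futures = [pool.submit(slow_task, x) for x in items]
        submitted = len(futures)
        # Completion is only observed when the futures are waited on.
        results = [f.result() for f in futures]
    return submitted, results


submitted, results = submit_then_wait([1, 2, 3])
print(submitted, results)
```

In the posted code the values returned by rpc.rpc_async are discarded, so the only place where completion is awaited is the final wait and shutdown step, which is where the traceback above ends (in fut.wait()).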

2. Are you also using distributed autograd / optimizer in this case?

-> Not yet. Do you mean the one described at https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html ?

3. Are any of these CPUs on the same machine? (If so, shm might help.) -> I am using 32 CPUs on this one machine (all the CPUs the machine has).

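I tried with smaller data (about 60 elements in _local_dict) and it worked. Maybe the problem is the large number of RPC calls?

For the real data, _local_dict has close to 190000 elements on each of the 32 workers. For each element, I call rpc(...) up to 32 times (once per rank with a non-empty arr_send[i]).

So each worker has to call rpc(...) up to 190000 * 32 times. We have 32 workers, so in total there are up to 190000 * 32 * 32 RPC calls.

Is there a problem with RPC in PyTorch, or with the dictionary data type in Python?

Thanks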
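
For a sense of scale, the estimate above can be worked out directly. This is only the arithmetic from the figures in this post; `max_rpc_calls` is a hypothetical helper name, and the result is an upper bound because ranks with an empty arr_send[i] are skipped.

```python
def max_rpc_calls(elements_per_worker, workers):
    """Upper bound on RPC calls when every element may call every rank once."""
    per_worker = elements_per_worker * workers
    total = per_worker * workers
    return per_worker, total


per_worker, total = max_rpc_calls(190_000, 32)
print(per_worker, total)  # 6080000 194560000
```

That is roughly 6 million calls issued by each worker and about 195 million calls overall.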

0 answers:

No answers