Stuck in an infinite error loop when using dask map_partitions

Time: 2018-01-15 14:57:43

Tags: python pandas amazon-ec2 python-multiprocessing dask

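I am trying to parallelize a function over a pandas DataFrame with dask. On my local Windows machine (Python 3) it runs without problems, but when I run it on my remote AWS machine (Ubuntu 16, with either Python 2 or Python 3), I get the error shown further down.

dask version: 0.16.1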

import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get

# existing_df and process_wav_file are defined elsewhere in my code.

def process_frame(x):
    # extract and process wav files
    return process_wav_file(x)
    # or even return 0

dfdd = dd.from_pandas(existing_df, npartitions=4)
results = dfdd.map_partitions(
    lambda df: df.wav_file.apply(process_frame),
    meta=('x', 'f8'),
).compute(get=get)
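
The wav_file column in the code above holds the paths to the wav files. The error I get is shown below; it keeps repeating as if in an infinite loop: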

Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/site-packages/dask/multiprocessing.py", line 192, in initialize_worker_process
    np.random.seed()
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Process ForkPoolWorker-2:
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
TypeError: 'int' object is not callable
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Process ForkPoolWorker-3:
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/site-packages/dask/multiprocessing.py", line 192, in initialize_worker_process
    np.random.seed()
Traceback (most recent call last):
TypeError: 'int' object is not callable
Process ForkPoolWorker-4:
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/site-packages/dask/multiprocessing.py", line 192, in initialize_worker_process
    np.random.seed()
Traceback (most recent call last):
Process ForkPoolWorker-5:
TypeError: 'int' object is not callable
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "/home/ubuntu/anaconda2/envs/cx/lib/python3.6/site-packages/dask/multiprocessing.py", line 192, in initialize_worker_process
    np.random.seed()
Traceback (most recent call last):
Process ForkPoolWorker-6:
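
A note on reading the traceback: every worker fails on the same call, np.random.seed(), with "'int' object is not callable". One possible explanation (an assumption, not something the post confirms) is that np.random.seed was rebound to an integer somewhere in the parent process, for example by writing np.random.seed = 42 instead of np.random.seed(42). Forked workers on Linux inherit the parent's already-imported modules, whereas Windows starts fresh interpreters, which could account for the difference between the two machines. The sketch below reproduces the error message with a hypothetical stand-in object (broken_np) rather than real numpy, and defines a small helper, seed_is_callable, which is a name invented here for illustration.

```python
from types import SimpleNamespace


def seed_is_callable(np_module):
    """Return True if np_module.random.seed can still be called."""
    return callable(np_module.random.seed)


# Hypothetical stand-in for a numpy module whose seed function was
# overwritten, e.g. by `np.random.seed = 42` instead of `np.random.seed(42)`.
broken_np = SimpleNamespace(random=SimpleNamespace(seed=42))

try:
    broken_np.random.seed()  # same call that fails in the traceback
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(seed_is_callable(broken_np))
print(error_message)
```

To apply the same check to the real session, one could call seed_is_callable(numpy) in the main process just before .compute(...); if it returns False, searching the code base for an assignment to np.random.seed would be the next step.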

0 answers:

No answers