Client.map(): "Cannot pickle files that are not opened for reading"

Date: 2019-04-11 16:14:35

Tags: python dask dask-distributed

I am trying to use dask.distributed to concurrently update a Postgresql database from the contents of multiple CSV files. Ideally, the CSV files are distributed among N workers, and each worker inserts its file's contents into the database. However, a `Cannot pickle files that are not opened for reading` exception is raised when Client.map() distributes the tasks to the workers.

Here is a stripped-down version of the code:

import csv
from pathlib import Path

from dask.distributed import Client, as_completed


def _work(csv_path):
    db = Database()  # encapsulates interaction w/ postgresql database
    db.open()

    count = 0

    with csv_path.open('r') as csv_file:
        reader = csv.DictReader(csv_file)

        for record in reader:
            db.insert(record)
            count += 1

    db.close()

    return count


client = Client(processes=False)

csv_files = Path('/data/files/').glob('*.csv')

csv_futures = client.map(_work, csv_files)  # error occurs here

for finished in as_completed(csv_futures):
    count = finished.result()
    print(count)

Based on related Stack Overflow and GitHub issues, I verified that cloudpickle can successfully serialize and deserialize both the function and its arguments:

cloudpickle.loads(cloudpickle.dumps(_work))
Out[69]: <function _work(csv_path)>

files = list(Path('/data/files/').glob('*.csv'))
files
Out[73]: 
[PosixPath('/data/files/208.csv'),
 PosixPath('/data/files/332.csv'),
 PosixPath('/data/files/125.csv'),
 PosixPath('/data/files/8.csv')]
cloudpickle.loads(cloudpickle.dumps(files))
Out[74]: 
[PosixPath('/data/files/208.csv'),
 PosixPath('/data/files/332.csv'),
 PosixPath('/data/files/125.csv'),
 PosixPath('/data/files/8.csv')]

So the problem lies elsewhere.
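When the function and the mapped arguments both round-trip cleanly, the offender is usually some other object the function drags along (a global or a closure variable). One way to narrow it down is to try pickling the candidates individually. This is a debugging sketch, not part of the original question; it uses the stdlib pickle, which rejects the same kinds of objects (such as open file handles) that cloudpickle does:

```python
import os
import pickle
import tempfile


def find_unpicklable(objects):
    """Try to pickle each named object; collect the ones that fail."""
    failures = {}
    for name, obj in objects.items():
        try:
            pickle.dumps(obj)
        except Exception as exc:
            failures[name] = repr(exc)
    return failures


# A write- or append-mode file handle (e.g. a logger's) is a typical offender.
log_file = open(os.path.join(tempfile.mkdtemp(), "example.log"), "a")
suspects = {"count": 0, "log_file": log_file}
print(find_unpicklable(suspects))  # only 'log_file' fails to pickle
log_file.close()
```

Feeding this helper the function's globals or closure contents points straight at the object the scheduler is choking on.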

1 Answer:

Answer 0 (score: 1)

The exact exception was:

File "/Users/may/anaconda/envs/eagle-i/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 841, in save_file
    raise pickle.PicklingError("Cannot pickle files that are not opened for reading: %s" % obj.mode)
_pickle.PicklingError: Cannot pickle files that are not opened for reading: a

Stepping through with the debugger, I was curious what obj was, and here it is:

<_io.TextIOWrapper name='/tmp/logs/ei_sched.log' mode='a' encoding='UTF-8'>

In the stripped-down snippet above, I elided a call to a logger, and that is what obj was complaining about. The logging was a vestigial artifact of an earlier attempt to parallelize this function with dask. Once I removed the logging calls from the function passed to Client.map(), everything worked as expected.

As an aside, this was a good catch by cloudpickle, since the workers should not all be logging to a single file anyway.