I am trying to use dask.distributed to concurrently update a Postgresql database from the contents of multiple CSV files. Ideally, we distribute the CSV files among N workers, and each worker inserts its CSV file's contents into the database. However, a Cannot pickle files that are not opened for reading exception is raised when the tasks are handed to the workers with Client.map().
Here is a stripped-down version of the code:
import csv
from pathlib import Path

from dask.distributed import Client, as_completed

def _work(csv_path):
    db = Database()  # encapsulates interaction w/ postgresql database
    db.open()
    count = 0
    with csv_path.open('r') as csv_file:
        reader = csv.DictReader(csv_file)
        for record in reader:
            db.insert(record)
            count += 1
    db.close()
    return count

client = Client(processes=False)
csv_files = Path('/data/files/').glob('*.csv')
csv_futures = client.map(_work, csv_files)  # error occurs here

for finished in as_completed(csv_futures):
    count = finished.result()
    print(count)
Based on related Stack Overflow and GitHub issues, I verified that I could successfully use cloudpickle to serialize and deserialize both the function and its arguments:
cloudpickle.loads(cloudpickle.dumps(_work))
Out[69]: <function _work(csv_path)>
and
files = list(Path('/data/files/').glob('*.csv'))
files
Out[73]:
[PosixPath('/data/files/208.csv'),
PosixPath('/data/files/332.csv'),
PosixPath('/data/files/125.csv'),
PosixPath('/data/files/8.csv')]
cloudpickle.loads(cloudpickle.dumps(files))
Out[74]:
[PosixPath('/data/files/208.csv'),
PosixPath('/data/files/332.csv'),
PosixPath('/data/files/125.csv'),
PosixPath('/data/files/8.csv')]
So the problem lies elsewhere.
Answer (score: 1)
The exact exception was:
  File "/Users/may/anaconda/envs/eagle-i/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 841, in save_file
    raise pickle.PicklingError("Cannot pickle files that are not opened for reading: %s" % obj.mode)
_pickle.PicklingError: Cannot pickle files that are not opened for reading: a
Stepping through with the debugger, I was curious what obj was, and this is it:

<_io.TextIOWrapper name='/tmp/logs/ei_sched.log' mode='a' encoding='UTF-8'>

In the sample code snippet given above, I had elided the calls to the logger, and that append-mode file handle is exactly what the pickler was complaining about. Logging was a vestigial artifact of an earlier attempt to parallelize this function with dask. Once I removed the logging calls from the function passed to Client.map(), everything worked as expected.
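For anyone hitting the same error, the behavior is easy to reproduce outside of dask. The sketch below (file name hypothetical) shows cloudpickle refusing to serialize a closure that captures a file handle opened for appending, and the fix of opening the file inside the task instead:

```python
import os
import pickle
import tempfile

import cloudpickle

log_path = os.path.join(tempfile.gettempdir(), 'demo.log')

def make_bad_work(handle):
    # the returned closure captures an open, append-mode file handle,
    # so cloudpickle must serialize the handle along with the function
    def bad_work(x):
        handle.write('processing %s\n' % x)
        return x
    return bad_work

log_file = open(log_path, 'a')
bad_work = make_bad_work(log_file)

try:
    cloudpickle.dumps(bad_work)
    pickled_ok = True
except (pickle.PicklingError, TypeError) as exc:
    # older cloudpickle raises PicklingError; newer versions raise TypeError
    pickled_ok = False
    print('refused to pickle:', exc)

def good_work(x):
    # open (and close) the file inside the task, so the serialized
    # function carries only the path string, never a file handle
    with open(log_path, 'a') as f:
        f.write('processing %s\n' % x)
    return x

payload = cloudpickle.dumps(good_work)  # serializes fine
log_file.close()
```

The same principle applies to any unpicklable resource (database connections, sockets, loggers with file handlers): construct it inside the function that the workers run, not in a scope the function closes over.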
As an aside, this was a good catch by cloudpickle, since distributed workers should not be logging to a single shared file anyway.