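What is the best way to distribute a task across a dataset when the computation relies on a resource or object that is relatively expensive to create?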
# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))
# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
I plan to use this with dask_jobqueue and an SGECluster.
Answer 0 (score: 1)
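import dask
foo = dask.delayed(Foo)() # create your expensive thing on the workers instead of locally
def do(row, foo):
    return foo.do(row)
# df is a dask DataFrame here; dask only supports row-wise apply, so axis=1 must be given
df.apply(do, axis=1, foo=foo) # include it as an explicit argument, not a closure within a lambda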