Dask: How do I parallelize my code with dask.delayed?

Date: 2017-03-02 08:42:39

Tags: multithreading python-3.x parallel-processing python-multiprocessing dask

This is my first attempt at parallel processing. I've been looking into Dask, but I'm having trouble with the actual coding.

I've looked at their examples and documentation, and I think dask.delayed is the best fit. I've tried wrapping my functions with delayed(function_name) and adding an @delayed decorator, but I can't seem to get it working properly. I prefer Dask over other methods because it's made in Python and for its (supposed) simplicity. I know Dask doesn't parallelize the for loop itself, but they say it can work inside a loop.
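For reference, here is a minimal, self-contained sketch of the two usage patterns mentioned above; the process function is just a placeholder, not part of the real code:

from dask import delayed

def process(path):
    # placeholder for a real preprocessing function
    return len(path)

# Pattern 1: wrap at the call site -- returns a lazy Delayed object
task_a = delayed(process)('1.csv')

# Pattern 2: decorate the function so every call is lazy
@delayed
def process_decorated(path):
    return len(path)

task_b = process_decorated('2.csv')

# Nothing has run yet; compute() triggers the actual work
print(task_a.compute(), task_b.compute())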

My code passes files through a function that contains inputs to other functions, and looks like this:

from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', etc. etc. ]
for count, name in enumerate(filenames):
    name = name.split('.')[0]
    ....

Then it does some preprocessing, e.g.:

    preprocess1, preprocess2 = delayed(read_files_and_do_some_stuff)(name)

Then I call a constructor and pass the pre_results into the function calls:

    fc = FunctionCalls()
    Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
                             input_data=pre_result1, model1=pre_result2)

What I'm doing here is passing the files into a for loop, doing some preprocessing, and then passing them into two models.

Any thoughts or tips on how to parallelize this? I started getting strange errors and I had no idea how to fix the code. The code does work as written. I use a bunch of pandas DataFrames, Series, and numpy arrays, and I'd rather not go back and change everything to work with dask.dataframe etc.

The code in my comments may be difficult to read. Here it is in a more properly formatted way.

In the code below, when I type print(mean_squared_error) I get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')

from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']

for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = delayed(mse)(observed, prediction)
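For context on that Delayed(...) output: delayed(mse)(...) only records the task, and the value is only produced once it is computed. A minimal continuation of the snippet above:

# mean_squared_error is a lazy placeholder until the task graph is run
actual_mse = mean_squared_error.compute()
print(actual_mse)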

2 answers:

Answer 0: (score: 26)

You need to call dask.compute to eventually compute the result. See the dask.delayed documentation.

Sequential code

import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]

results = []
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)  # isn't this already a dataframe?
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = mse(observed, prediction)  
    results.append(mean_squared_error)

Parallel code

import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]

delayed_results = []
for count, name in enumerate(filenames):
    df = dask.delayed(pd.read_csv)(name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = dask.delayed(mse)(observed, prediction)
    delayed_results.append(mean_squared_error)

results = dask.compute(*delayed_results)
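As a side note beyond the original answer: in recent Dask versions dask.compute accepts a scheduler argument, so if the default threaded scheduler gives little speedup (for example when the per-file work is mostly pure Python and holds the GIL), a process-based run can be tried. A small sketch, assuming the delayed_results list from above:

# Same task graph, run with an explicitly chosen scheduler
results = dask.compute(*delayed_results, scheduler="threads")    # default for delayed
results = dask.compute(*delayed_results, scheduler="processes")  # separate worker processes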
