Why are computations much slower inside a Dask/Distributed worker?

Time: 2017-12-12 16:10:16

Tags: python distributed dask

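My computation runs much slower inside a Dask/Distributed worker than when I run it locally. I can reproduce this without any I/O involved, so I can rule out that it has anything to do with transferring data. The following code is a minimal reproducing example: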

import time
import pandas as pd
import numpy as np
from dask.distributed import Client, LocalCluster


def gen_data(N=5000000):
    """ Dummy data generator """
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(100), size=len(df))
    return df


def do_something_on_df(df):
    """ Dummy computation that contains inplace mutations """
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42


def run_test():
    """ Test computation """
    df = gen_data()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)


class TimedContext(object):
    def __enter__(self):
        self.t1 = time.time()

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.t2 = time.time()
        print(self.t2 - self.t1)

if __name__ == "__main__":
    client = Client("tcp://10.0.2.15:8786")

    with TimedContext():
        run_test()

    with TimedContext():
        client.submit(run_test).result()

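Running the test computation locally takes about 10 seconds, but it takes about 30 seconds within Dask/Distributed. I also noticed that the Dask/Distributed workers output a lot of log messages like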

distributed.core - WARNING - Event loop was unresponsive for 1.03s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.25s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.91s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.99s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.50s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.90s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 2.23s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...

This is surprising, because it is not clear what would be holding the GIL in this example.

Why is there such a big performance difference? And what can I do to get the same performance?

Disclaimer: Self-answering for documentation purposes...

1 Answer:

Answer 0: (score: 12)

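This is the result of a very surprising behavior of pandas. By default, the pandas __setitem__ handlers perform checks to detect chained assignment, which is what leads to the famous SettingWithCopyWarning. When dealing with a copy, these checks issue a call to gc.collect. As a result, code that uses __setitem__ heavily triggers an excessive number of gc.collect calls. This can have a significant impact on performance in general, but the problem is much worse inside a Dask/Distributed worker, because the garbage collector has to deal with many more Python data structures than in a standalone run. Most likely the hidden garbage collection calls are also the source of the GIL-holding warnings.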
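One way to check whether hidden collections are involved is to count garbage collector runs around a suspicious block. The following is a minimal sketch using only the standard library's gc.callbacks hook; the helper name count_collections is hypothetical, and the explicit gc.collect() calls merely stand in for the calls that the answer attributes to pandas.

```python
import gc
from collections import Counter
from contextlib import contextmanager


@contextmanager
def count_collections():
    """Count garbage collector runs per generation while the block executes.

    Hypothetical helper for illustration; not part of pandas or Dask.
    """
    counts = Counter()

    def on_gc_event(phase, info):
        # The interpreter invokes registered callbacks at the start and
        # at the end of every collection; count each run once.
        if phase == "start":
            counts[info["generation"]] += 1

    gc.callbacks.append(on_gc_event)
    try:
        yield counts
    finally:
        # Always unregister, even if the measured block raises.
        gc.callbacks.remove(on_gc_event)


with count_collections() as counts:
    for _ in range(3):
        gc.collect()  # stand-in for the hidden calls described above

total_collections = sum(counts.values())
print(dict(counts), total_collections)
```

Wrapping the body of run_test in this context manager, once locally and once inside a worker, should show whether the number of full collections grows with the number of __setitem__ calls on the pandas version in use.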

The solution is therefore to avoid these excessive gc.collect calls. There are two ways to do that:

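  • Avoid using __setitem__ on copies: arguably the best solution, but it requires some understanding of where copies are generated. In the example above, this can be achieved by changing the function call to do_something_on_df(group_df.copy()).
  • Disable the chained assignment check: simply placing pd.options.mode.chained_assignment = None at the beginning of the computation also disables the gc.collect calls.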
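The first option can be sketched as follows. This is an illustration under assumptions, not the exact code from the question: run_test_with_copies and small_df are hypothetical names, and the frame is deliberately tiny so the snippet runs quickly.

```python
import numpy as np
import pandas as pd


def do_something_on_df(df):
    """Same dummy in-place mutation as in the question."""
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42


def run_test_with_copies(df):
    """Pass each group as an explicit, independent copy (hypothetical helper)."""
    results = []
    for key, group_df in df.groupby("id"):
        # .copy() yields a frame that pandas does not treat as a
        # possible view of df, so the copy check is not triggered.
        results.append(do_something_on_df(group_df.copy()))
    return results


# Small illustrative frame with 10 groups, far smaller than in the question.
small_df = pd.DataFrame({
    "0": np.random.uniform(size=1000),
    "id": np.arange(1000) % 10,
})
results = run_test_with_copies(small_df)
print(len(results))
```

Because every group is mutated on its own copy, small_df itself is left unchanged. The second option needs no structural change: the single assignment named in the list above goes at the start of the computation, inside the function that is submitted to the worker.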

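In both cases the test computation runs much faster than before, taking about 3.5 seconds both locally and under Dask/Distributed. This also gets rid of the GIL-holding warnings.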

Note: I have filed an issue for this on GitHub.