I'm running some tests with dask.bag in preparation for a large text-processing job over millions of text files. Right now, on test sets ranging from a few dozen to a few hundred thousand text files, I'm seeing dask run 5 to 6 times slower than a straightforward single-threaded text-processing function.
Can someone explain where I would actually see a speed advantage from running dask over a large number of text files? How many files do I need to process before it starts to get faster? Are 150,000 small text files simply too few? What performance parameters should I be tuning to speed dask up when processing files? What could cause a 5x slowdown compared with plain single-threaded text processing?
Here's an example of the code I'm using to test dask. It is run against a test data set from Reuters:
http://www.daviddlewis.com/resources/testcollections/reuters21578/
This data isn't exactly the same as the data I'm working with. In my other case it's a bunch of individual text files, one document per file, but I see roughly the same performance degradation. Here's the code:
import dask.bag as db
from collections import Counter
import string
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    # Plain single-threaded baseline: read each file and count its words.
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

# Same word count, but through dask.bag.
start = datetime.datetime.now()
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(str([w for w in wordcount]))
print(str(datetime.datetime.now() - start))
Here are my results:
[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]
0:00:02.958721
[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]
0:00:17.877077
Answer (score: 0)
Dask incurs roughly 1 ms of overhead per task. By default, the dask.bag.read_text
function creates one task per filename, so tens of thousands of single-file tasks add tens of seconds of pure scheduling time. I suspect that you're just being swamped by overhead.
The solution here is probably to process several files in one task. The read_text function doesn't give you any options to do this, but you could switch to dask.delayed, which provides a bit more flexibility, and then convert to a dask.bag later if preferred.
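A minimal sketch of that idea, reusing the same ./reuters/*.ascii pattern from the question; the read_batch helper and the batch size of 1,000 files per task are illustrative choices, not part of the original answer:

import glob
from dask import delayed
import dask.bag as db

my_files = sorted(glob.glob("./reuters/*.ascii"))

def read_batch(paths):
    # Read and tokenize an entire batch of files inside one task,
    # returning a single list of words that becomes one bag partition.
    words = []
    for path in paths:
        with open(path, "r") as f:
            words.extend(f.read().split())
    return words

# Group ~1,000 files per task so the ~1 ms per-task overhead is amortized
# over many files instead of being paid once per file.
batch_size = 1000
batches = [my_files[i:i + batch_size] for i in range(0, len(my_files), batch_size)]

b = db.from_delayed([delayed(read_batch)(batch) for batch in batches])
wordcount = b.frequencies().topk(5, lambda x: x[1])
print(wordcount.compute())

With far fewer, larger tasks, the per-task scheduling cost should stop dominating the run time.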