I have a long list of tweets (2 million of them), and I use regular expressions to search and replace text in these tweets.

I run this with a joblib.Parallel map (joblib is the parallel backend used by scikit-learn).

My problem is that in Windows' Task Manager I can see that my script does not use 100% of each CPU. It doesn't use 100% of the RAM or the disk either, so I don't understand why it won't go faster.

There are probably synchronization delays somewhere, but I can't find where.
The code:
# file main.py
import re
from joblib import delayed, Parallel

def make_tweets():
    tweets = load_from_file()  # this is a list of strings

    regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
    mydict = {'aa': 'A', 'bb': 'B'}

    def handler(match):
        return mydict[match[0].replace(' ', '')]

    def replace_in(tweet):
        return re.sub(regex, handler, tweet)

    # -1 means all cores
    # I have 6 cores that can run 12 threads
    with Parallel(n_jobs=-1) as parallel:
        tweets2 = parallel(delayed(replace_in)(tweet) for tweet in tweets)

    return tweets2
Here is the Task Manager:
EDIT: final word
The answer was that joblib synchronization was slowing down the workers: joblib sends the tweets to the workers in small messages (one by one?), which keeps them waiting. Using multiprocessing.Pool.map with a chunksize of len(tweets)/cpu_count() gets the workers to 100% CPU.

With joblib, the run takes about 12 min. With multiprocessing it takes 4 min. With multiprocessing, each worker consumes about 50 MB of memory.
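For reference, a minimal sketch of that fix, reusing the regex, handler and replace_in from the code above; load_from_file is the same placeholder as in the question, and the max(1, ...) guard on the chunksize is my addition:

# sketch of the multiprocessing fix described above
import re
from multiprocessing import Pool, cpu_count

regex = re.compile(r'a *a|b *b')
mydict = {'aa': 'A', 'bb': 'B'}

# handler and replace_in live at module level so the worker processes can pickle them
def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

def make_tweets():
    tweets = load_from_file()  # same placeholder as in the question: a list of strings
    chunksize = max(1, len(tweets) // cpu_count())  # roughly one big chunk per core
    with Pool() as pool:
        return pool.map(replace_in, tweets, chunksize=chunksize)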
Answer 0 (score: 2)
After playing around with this a bit, I think it's because joblib spends all its time coordinating the parallel running of everything and no time actually doing any useful work. At least that's how it looks for me under OSX and Linux; I don't have any MS Windows machines.
I started by loading the packages, pulling out your code, and generating a dummy file:
from random import choice
import re
from multiprocessing import Pool
from joblib import delayed, Parallel
regex = re.compile(r'a *a|b *b') # of course more complex IRL, with lookbehind/forward
mydict = {'aa': 'A', 'bb': 'B'}
def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)
examples = [
"Regex replace isn't that computationally expensive... I would suggest using Pandas, though, rather than just a plain loop",
"Hmm I don't use pandas anywhere else, but if it makes it faster, I'll try! Thanks for the suggestion. Regarding the question: expensive or not, if there is no reason for it to use only 19%, it should use 100%"
"Well, is tweets a generator, or an actual list?",
"an actual list of strings",
"That might be causing the main process to have the 419MB of memory, however, that doesn't mean that list will be copied over to the other processes, which only need to work over slices of the list",
"I think joblib splits the list in roughly equal chunks and sends these chunks to the worker processes.",
"Maybe, but if you use something like this code, 2 million lines should be done in less than a minute (assuming an SSD, and reasonable memory speeds).",
"My point is that you don't need the whole file in memory. You could type tweets.txt | python replacer.py > tweets_replaced.txt, and use the OS's native speeds to replace data line-by-line",
"I will try this",
"no, this is actually slower. My code takes 12mn using joblib.parallel and for line in f_in: f_out.write(re.sub(..., line)) takes 21mn. Concerning CPU and memory usage: CPU is same (17%) and memory much lower (60Mb) using files. But I want to minimize time spent, not memory usage.",
"I moved this to chat because StackOverflow suggested it",
"I don't have experience with joblib. Could you try the same with Pandas? pandas.pydata.org/pandas-docs/…",
]
with open('tweets.txt', 'w') as fd:
    for i in range(2_000_000):
        print(choice(examples), file=fd)
(see if you can guess where I got the example lines from!)
As a baseline, I tried the naive solution:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    for l in fin:
        fout.write(replace_in(l))
This takes 14.0s (wall-clock time) on my OSX laptop and 5.15s on my Linux desktop. Note that changing your definition of replace_in to use regex.sub(handler, tweet) instead of re.sub(regex, handler, tweet) brings the above down to 8.6s on my laptop, but I won't use that change below.
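For reference, that faster variant just calls sub on the already-compiled pattern, reusing the regex and handler defined above:

def replace_in(tweet):
    # compiled pattern: avoids re.sub's per-call pattern lookup
    return regex.sub(handler, tweet)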
I then tried your joblib version:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Parallel(n_jobs=-1) as parallel:
        for l in parallel(delayed(replace_in)(tweet) for tweet in fin):
            fout.write(l)
which takes 1 min 16 s on my laptop and 34.2 s on my desktop. CPU utilization was pretty low, as the child/worker tasks spent most of their time just waiting for the coordinator to send them work.
I then tried the multiprocessing package:
with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        for l in pool.map(replace_in, fin, chunksize=1024):
            fout.write(l)
which took 5.95s on my laptop and 2.60s on my desktop. I also tried a chunk size of 8, which took 22.1s and 8.29s respectively. The chunksize lets the pool send large batches of work to its children, so it spends less time coordinating and more time getting useful work done.
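A possible variation, if holding the whole result list built by pool.map in memory is a concern, is Pool.imap, which also accepts a chunksize but yields results lazily (not timed here):

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        # imap yields results in order as chunks complete, instead of
        # materialising the whole output list first
        for l in pool.imap(replace_in, fin, chunksize=1024):
            fout.write(l)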
I'd therefore hazard a guess that joblib isn't particularly useful for this sort of usage, as it doesn't seem to have a notion of chunksize.
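(For completeness: joblib's Parallel does take a batch_size argument in more recent releases, which groups tasks before dispatching them to the workers; whether it would close the gap for this workload is untested here. A rough sketch:)

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    # batch_size batches tasks before they are sent to the workers (untested here)
    with Parallel(n_jobs=-1, batch_size=1024) as parallel:
        for l in parallel(delayed(replace_in)(tweet) for tweet in fin):
            fout.write(l)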