Question

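I have a rather ordinary producer/consumer scenario, but with a twist.

I need to read lines of text from a multi-gigabyte input stream (it could be a file or an HTTP stream); process each line with a slow, CPU-intensive algorithm that outputs one line of text for each input line; and then write the output lines to another stream. The twist is that I need to write the output lines in the same order as the input lines that produced them.

The usual approach for these scenarios is to use a multiprocessing pool to run the CPU-intensive algorithm, with one queue feeding in lines (actually batches of lines) from the reader process, and another queue leading from the pool to the writer process: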

                       / [Pool] \    
  [Reader] --> InQueue --[Pool]---> OutQueue --> [Writer]
                       \ [Pool] /

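But how do I make sure the output lines (or batches) come out in the right order?

One simple answer is, "just write them to a temporary file, then sort the file and write it to the output". I may end up doing that, but I would really like to start streaming output lines as soon as possible, rather than waiting for the whole input stream to be processed from start to finish.

I could easily write my own implementation of multiprocessing.Queue that would sort its items internally using a dictionary (or a circular buffer list), a Lock, and two Conditions (plus maybe an integer counter). However, I would need to get all these objects from a Manager, and I'm afraid that using shared state like this between multiple processes would kill my performance. So, is there an appropriately Pythonic way of solving this problem?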

Answer 1

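Maybe I am missing something, but it seems there is a basic answer to your question.

Let's take a simple example: we just want to reverse the lines of a text. Here are the lines we want to reverse: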

INPUT = ["line {}".format(i)[::-1] for i in range(30)]

That is:

['0 enil', '1 enil', '2 enil', ..., '92 enil']

Here is how to reverse these lines:

import random
import time

def process_line(line):
    time.sleep(1.5*random.random()) # simulation of a heavy computation
    return line[::-1]

The lines come from a source:

def source():
    for s in INPUT:
        time.sleep(0.5*random.random()) # simulation of the network latency
        yield s

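We can use multiprocessing to increase the speed:

from multiprocessing import Pool

if __name__ == "__main__": # guard needed where worker processes are spawned (e.g. Windows)
    with Pool(3) as p:
        for line in p.imap_unordered(process_line, source()):
            print(line)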

But our lines are not in the expected order:

line 0
line 2
line 1
line 5
line 3
line 4
...
line 27
line 26
line 29
line 28

To get the lines in the expected order, you can:

index the lines,
process them, and
collect them in the expected order.

First, index the lines:

def wrapped_source():
    for i, s in enumerate(source()):
        yield i, s

Second, process the line, but keep the index:

def wrapped_process_line(args):
    i, line = args
    return i, process_line(line)
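
Since the question mentions feeding batches of lines rather than single lines, the same indexing idea can be applied per batch. This is only a sketch: indexed_batches, wrapped_process_batch and batch_size are names made up for illustration, and process_line here is a trivial stand-in for the real CPU-heavy function.

```python
from itertools import islice

def process_line(line):
    # Stand-in for the real, CPU-intensive processing (assumption).
    return line[::-1]

def indexed_batches(lines, batch_size):
    """Yield (batch_index, list_of_lines) from any iterable of lines."""
    iterator = iter(lines)
    batch_index = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # source exhausted
            return
        yield batch_index, batch
        batch_index += 1

def wrapped_process_batch(args):
    """Process every line of one batch and keep the batch index."""
    batch_index, batch = args
    return batch_index, [process_line(line) for line in batch]
```

Larger batches mean fewer items sent between processes, at the cost of each output arriving in coarser chunks.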

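Third, collect the lines in the expected order. The idea is to use a counter and a heap. The counter is the expected index of the next line.

Take the next pair (index, processed line):

If the index equals the counter, just yield the processed line and increment the counter.
If not, store the pair (index, processed line) in the heap.

Then, while the smallest index in the heap equals the counter, pop the smallest element, yield its line, and increment the counter.

Loop until the source is empty, then flush the heap.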

from heapq import heappush, heappop

if __name__ == "__main__":
    h = [] # processed lines that arrived ahead of their turn

    with Pool(3) as p:
        expected_i = 0 # index of the next line to output
        for i, line in p.imap_unordered(wrapped_process_line, wrapped_source()):
            if i == expected_i: # lucky!
                print(line)
                expected_i += 1
            else: # unlucky!
                heappush(h, (i, line)) # store the processed line

            while h: # look for the line "expected_i" in the heap
                i_smallest, line_smallest = h[0] # the smallest element
                if i_smallest == expected_i:
                    heappop(h)
                    print(line_smallest)
                    expected_i += 1
                else:
                    break # the line "expected_i" was not processed yet.

        while h: # flush the heap (non-empty only if an index went missing)
            print(heappop(h)[1]) # the line

Now our lines are in the expected order:

line 0
line 1
line 2
line 3
line 4
line 5
...
line 26
line 27
line 28
line 29

There is no additional delay: if the next expected line has not been processed yet, we have to wait, but as soon as it arrives, we yield it.
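
If you want to reuse the collection step, it can be pulled out into a generator that accepts any iterable of (index, item) pairs. This is a sketch of the same counter-and-heap idea, not a different algorithm; the name in_order and its start parameter are my own.

```python
import heapq

def in_order(indexed_results, start=0):
    """Yield items from (index, item) pairs in index order, as early as possible."""
    pending = []      # min-heap of (index, item) pairs that arrived too early
    expected = start  # index of the next item to emit
    for index, item in indexed_results:
        heapq.heappush(pending, (index, item))
        # Emit every buffered item that is now next in line.
        while pending and pending[0][0] == expected:
            yield heapq.heappop(pending)[1]
            expected += 1
    # Source exhausted: flush what is left (non-empty only if indices had gaps).
    while pending:
        yield heapq.heappop(pending)[1]
```

With this, the loop becomes "for line in in_order(p.imap_unordered(wrapped_process_line, wrapped_source())): print(line)".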

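The main drawback is that you have to handle potential gaps manually (timeouts, new requests, and so on): once the lines are indexed, if you lose a line (for whatever reason), the loop will wait for it until the source is exhausted, and only then flush the heap. In that case you may run out of memory.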
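
One possible way to at least notice such a gap, instead of silently buffering everything, is to cap the number of pending results. This is only a sketch: GapError, in_order_bounded and max_pending are hypothetical names, and what to do when the cap is hit (skip the index, retry, abort) depends on your application.

```python
import heapq

class GapError(RuntimeError):
    """Raised when too many results are buffered while one index is still missing."""

def in_order_bounded(indexed_results, max_pending=1000, start=0):
    """Like the counter-and-heap loop, but fail fast if the buffer grows too large."""
    pending = []      # min-heap of (index, item) pairs waiting for their turn
    expected = start  # index of the next item to emit
    for index, item in indexed_results:
        heapq.heappush(pending, (index, item))
        while pending and pending[0][0] == expected:
            yield heapq.heappop(pending)[1]
            expected += 1
        if len(pending) > max_pending:
            raise GapError(
                f"index {expected} still missing with {len(pending)} results buffered"
            )
    # Source exhausted: flush the remainder in index order.
    while pending:
        yield heapq.heappop(pending)[1]
```

Note that this only detects the problem. It does not slow down the reader, so it is not a substitute for real backpressure.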

Implementing a sorted producer/consumer queue with multiprocessing
