Writer function fed by a queue fails to write the files when they have to be gzip-compressed

Date: 2017-03-13 12:17:09

Tags: python gzip python-multiprocessing

I switched from open to gzip.open in some Python code that uses multiprocessing, and files that were non-empty in the previous version of the code now end up empty. Is this a known issue? I was not able to find information about similar problems on the internet.

My code is inspired by this solution, except that I use imap_unordered on a multiprocessing.Pool instead of setting up the workers with apply_async in a loop.

I will try to make a minimal working example and add it to this post if necessary, but in case the problem is well known, here is a verbal description of the situation:

I have a working Python program in which computations are performed by workers via a pool's imap_unordered method.

These computations have to write data to some common files. This is achieved by communicating through a multiprocessing.Manager.Queue. The worker function takes this queue as an argument and sends information to it using the queue's put method.

A "writer" function takes the queue as an argument, together with a bunch of file paths. It opens the files in "w" mode and, depending on the information received through the queue's get method, writes things to one of the files.

The "writer" function and its argument list are passed to the pool's apply_async method.

All this seems to work correctly, and I get files with content written in them.

Now I want to write the files in compressed format using gzip. I simply use gzip.open instead of open and open the files in "wb" mode. Apart from that, and the fact that I add a ".gz" suffix to the file paths, everything is the same.

The program runs without error messages, but I end up with empty files.

Is the gzip module unable to work with multiprocessing?

Edit: code example

#!/usr/bin/env python3

from multiprocessing import Manager, Pool, cpu_count
import time
from gzip import open as gzopen

def writer(queue, path1, path2):
    with gzopen(path1, "wb") as f1, gzopen(path2, "wb") as f2:
        while True:
            (where, what) = queue.get()
            print("I'm a writer. I have to write:\n%s to %s" % (what, where))
            if where == "out1":
                f1.write(what)
            elif where == "out2":
                f2.write(what)
            else:
                print("flushing files")
                f1.flush()
                f2.flush()
            break

def do_divmod(num_and_queue):
    (num, queue) = num_and_queue
    q, r = divmod(num, 2)
    time.sleep(1)
    if r:
        queue.put(("out2", "q: %d\n" % q))
    else:
        queue.put(("out1", "q: %d\n" % q))
    time.sleep(1)
    return (num, q, r)

def main():
    with Manager() as mgr, Pool(processes=cpu_count() - 2) as pool:
        write_queue = mgr.Queue()
        pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        for (n, q, r) in pool.imap_unordered(
                do_divmod,
                ((number, write_queue) for number in range(25))):
            print("%d %% 2 = %d" % (n, r))
            print("%d / 2 = %d" % (n, q))
        write_queue.put(("", ""))

if __name__ == "__main__":
    main()

Running the above code results in empty /tmp/out1.txt.gz and /tmp/out2.txt.gz.

I have to say that I also have problems with the non-gzip version of this code: in both cases, the print("I'm a writer. I have to write:\n%s to %s" % (what, where)) seems to be executed only once:

$ ./test_multiprocessing.py
I'm a writer. I have to write:
q: 0 to out2
1 % 2 = 1
1 / 2 = 0
10 % 2 = 0
10 / 2 = 5
3 % 2 = 1
3 / 2 = 1
6 % 2 = 0
6 / 2 = 3
7 % 2 = 1
7 / 2 = 3
4 % 2 = 0
4 / 2 = 2
5 % 2 = 1
5 / 2 = 2
0 % 2 = 0
0 / 2 = 0
11 % 2 = 1
11 / 2 = 5
8 % 2 = 0
8 / 2 = 4
12 % 2 = 0
12 / 2 = 6
9 % 2 = 1
9 / 2 = 4
2 % 2 = 0
2 / 2 = 1
13 % 2 = 1
13 / 2 = 6
14 % 2 = 0
14 / 2 = 7
15 % 2 = 1
15 / 2 = 7
16 % 2 = 0
16 / 2 = 8
17 % 2 = 1
17 / 2 = 8
18 % 2 = 0
18 / 2 = 9
19 % 2 = 1
19 / 2 = 9
20 % 2 = 0
20 / 2 = 10
21 % 2 = 1
21 / 2 = 10
22 % 2 = 0
22 / 2 = 11
23 % 2 = 1
23 / 2 = 11
24 % 2 = 0
24 / 2 = 12


But at least, when the non-gzip version says it is writing something, the files do end up with content in them.

1 Answer:

Answer 0 (score: 0)

I tried some modifications based on examples from the documentation of the multiprocessing module. It seems I can force the writing of the gzip-compressed files by calling the get method of the result object returned by apply_async:

#!/usr/bin/env python3

from multiprocessing import Manager, Pool, cpu_count
import time
from gzip import open as gzopen

def writer(queue, path1, path2):
    with gzopen(path1, "wb") as f1, gzopen(path2, "wb") as f2:
        while True:
            (where, what) = queue.get()
            print("I'm a writer. I have to write:\n%s to %s" % (what, where))
            # The encode seems necessary when things are actually written
            # (not necessary with files obtained with the normal open)
            if where == "out1":
                f1.write(what.encode())
            elif where == "out2":
                f2.write(what.encode())
            else:
                print("flushing files")
                f1.flush()
                f2.flush()
            # NB: this break is outside the else branch, so the loop
            # exits after the first message received from the queue
            break

def do_divmod(num_and_queue):
    (num, queue) = num_and_queue
    q, r = divmod(num, 2)
    time.sleep(1)
    if r:
        queue.put(("out2", "q: %d\n" % q))
    else:
        queue.put(("out1", "q: %d\n" % q))
    time.sleep(1)
    return (num, q, r)

def main():
    with Manager() as mgr, Pool(processes=cpu_count() - 2) as pool:
        write_queue = mgr.Queue()
        # getting the "result object"
        writing = pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        for (n, q, r) in pool.imap_unordered(
                do_divmod,
                ((number, write_queue) for number in range(25))):
            print("%d %% 2 = %d" % (n, r))
            print("%d / 2 = %d" % (n, q))
        write_queue.put(("", ""))
        # Magic command to force the writing of the gzipped files
        # (not necessary when files are obtained through normal open)
        writing.get(timeout=1)

if __name__ == "__main__":
    main()

I don't know why this works, nor what the object returned by apply_async actually is. The documentation doesn't say much about this "result object". More explanations are welcome.
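One thing that is documented: apply_async returns an AsyncResult, and its get method waits for the task and re-raises any exception that occurred inside the worker. A minimal sketch (my addition, not from the original answer) showing how an error that would otherwise stay silent becomes visible once get is called:

```python
from multiprocessing import Pool

def boom():
    # Raises inside the worker; apply_async itself reports nothing
    raise ValueError("failed in worker")

def run_and_catch():
    with Pool(1) as pool:
        res = pool.apply_async(boom)   # returns an AsyncResult immediately
        try:
            res.get(timeout=10)        # waits, then re-raises the worker's error
            return None
        except ValueError as e:
            return str(e)

if __name__ == "__main__":
    print("caught:", run_and_catch())
```

So calling writing.get(timeout=1) does two things at once: it blocks until the writer task has finished (or raises), and it surfaces any exception the writer swallowed.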

Note that the above code still has a bug: it solves the initial problem of the gzipped files being empty, but not the problem of the writer apparently doing only one write.
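For reference, that remaining bug comes from the placement of the break statement, which sits outside the else branch and therefore runs on every iteration. A sketch (my addition, not part of the original answer) of the corrected loop, with the break moved into the sentinel branch so the writer drains the whole queue:

```python
import queue

def drain(q, out1, out2):
    # Corrected writer loop: keep consuming messages until the
    # ("", "") sentinel arrives, instead of breaking unconditionally.
    while True:
        (where, what) = q.get()
        if where == "out1":
            out1.append(what)
        elif where == "out2":
            out2.append(what)
        else:
            break  # stop only on the sentinel

q = queue.Queue()
for i in range(4):
    q.put(("out1" if i % 2 == 0 else "out2", "q: %d\n" % (i // 2)))
q.put(("", ""))  # sentinel
lines1, lines2 = [], []
drain(q, lines1, lines2)
print(len(lines1), len(lines2))  # → 2 2
```

The same indentation fix applies to the writer function in both code listings.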

Edit: further tests

It turns out that taking the result of pool.apply_async and running get on it is actually not needed to get the gzip-compressed files written: changing what to what.encode() alone also forces the writing:

#!/usr/bin/env python3

from multiprocessing import Manager, Pool, cpu_count
import time
from gzip import open as gzopen

def writer(queue, path1, path2):
    with gzopen(path1, "wb") as f1, gzopen(path2, "wb") as f2:
        while True:
            (where, what) = queue.get()
            print("I'm a writer. I have to write:\n%s to %s" % (what, where))
            if where == "out1":
                # The encode method call seems to force the writing
                f1.write(what.encode())
                #f1.write(what)
            elif where == "out2":
                # The encode method call seems to force the writing
                f2.write(what.encode())
                #f2.write(what)
            else:
                print("flushing files")
                f1.flush()
                f2.flush()
            # NB: this break is outside the else branch, so the loop
            # exits after the first message received from the queue
            break

def do_divmod(num_and_queue):
    (num, queue) = num_and_queue
    q, r = divmod(num, 2)
    time.sleep(1)
    if r:
        queue.put(("out2", "q: %d\n" % q))
    else:
        queue.put(("out1", "q: %d\n" % q))
    time.sleep(1)
    return (num, q, r)

def main():
    with Manager() as mgr, Pool(processes=cpu_count() - 2) as pool:
        write_queue = mgr.Queue()
        #writing = pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        for (n, q, r) in pool.imap_unordered(
            do_divmod,
            ((number, write_queue) for number in range(25))):
            print("%d %% 2 = %d" % (n, r))
            print("%d / 2 = %d" % (n, q))
        write_queue.put(("", ""))
        #writing.get(timeout=1)

if __name__ == "__main__":
    main()

I still have no clue about what is going on...
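A plausible explanation (my note, not part of the original post): a gzip file opened in "wb" mode only accepts bytes, so writing a str raises TypeError. Inside a task started with apply_async, that exception stays invisible until get is called on the result, so the writer dies silently and leaves an empty file. A small sketch demonstrating the behavior:

```python
import gzip
import os
import tempfile

def write_lines(path, lines):
    # Try writing each line as str first (fails in binary mode),
    # then fall back to bytes; count how many str writes were rejected.
    rejected = 0
    with gzip.open(path, "wb") as f:
        for line in lines:
            try:
                f.write(line)            # str -> TypeError in "wb" mode
            except TypeError:
                rejected += 1
                f.write(line.encode())   # bytes are accepted
    return rejected

path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
n = write_lines(path, ["q: 0\n", "q: 1\n"])
with gzip.open(path, "rb") as f:
    content = f.read().decode()
print(n, repr(content))  # → 2 'q: 0\nq: 1\n'
```

This would account for both observations: .encode() fixes the writes directly, while writing.get() surfaces the hidden TypeError (a regular open in "w" mode accepts str, which is why the non-gzip version never hit this).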