I switched from open to gzip.open in some Python code using multiprocessing, and files that were non-empty with previous versions of the code now end up empty. Is this a known issue? I could not find information about similar problems on the internet.
My code is inspired by this solution, but instead of setting up the workers with apply_async in a loop, I use the imap_unordered method of a multiprocessing.Pool.

I will try to make a minimal working example and add it to this post if necessary, but in case the problem is well known, here is already a verbal description of the situation:
I have a working Python program in which computations are made in multiprocessing.Pool workers using the pool's imap_unordered method.

These computations have to write data to some common files. This is achieved through communication via a multiprocessing.Manager.Queue: the worker function takes this queue as an argument and sends information to it using the queue's put method.

A "writer" function takes the queue and a bunch of file paths as arguments, opens the files in "w" mode and, depending on the information received through the queue's get method, writes things to one of the files.

The "writer" function and its argument list are passed to the pool's apply_async method.

All this seems to work correctly, and I obtain files with stuff written inside.
Now I want to write the data in compressed format using gzip. I simply use gzip.open instead of open, opening the files in "wb" mode:

#!/usr/bin/env python3
from multiprocessing import Manager, Pool, cpu_count
import time
from gzip import open as gzopen


def writer(queue, path1, path2):
    with gzopen(path1, "wb") as f1, gzopen(path2, "wb") as f2:
        while True:
            (where, what) = queue.get()
            print("I'm a writer. I have to write:\n%s to %s" % (what, where))
            if where == "out1":
                f1.write(what)
            elif where == "out2":
                f2.write(what)
            else:
                print("flushing files")
                f1.flush()
                f2.flush()
                break


def do_divmod(num_and_queue):
    (num, queue) = num_and_queue
    q, r = divmod(num, 2)
    time.sleep(1)
    if r:
        queue.put(("out2", "q: %d\n" % q))
    else:
        queue.put(("out1", "q: %d\n" % q))
    time.sleep(1)
    return (num, q, r)


def main():
    with Manager() as mgr, Pool(processes=cpu_count() - 2) as pool:
        write_queue = mgr.Queue()
        pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        for (n, q, r) in pool.imap_unordered(
                do_divmod,
                ((number, write_queue) for number in range(25))):
            print("%d %% 2 = %d" % (n, r))
            print("%d / 2 = %d" % (n, q))
        write_queue.put(("", ""))


if __name__ == "__main__":
    main()
Apart from that, and the fact that I added a ".gz" suffix to the file paths, everything is the same.

The program runs without any error message, but I end up with empty files. Is the gzip module unable to work with multiprocessing?
Running the above code results in empty /tmp/out1.txt.gz and /tmp/out2.txt.gz.

I have to say that I also have problems with the non-gzip version: in both cases, the print("I'm a writer. I have to write:\n%s to %s" % (what, where)) line seems to be executed only once:

$ ./test_multiprocessing.py
I'm a writer. I have to write:
q: 0
to out2
1 % 2 = 1
1 / 2 = 0
10 % 2 = 0
10 / 2 = 5
3 % 2 = 1
3 / 2 = 1
6 % 2 = 0
6 / 2 = 3
7 % 2 = 1
7 / 2 = 3
4 % 2 = 0
4 / 2 = 2
5 % 2 = 1
5 / 2 = 2
0 % 2 = 0
0 / 2 = 0
11 % 2 = 1
11 / 2 = 5
8 % 2 = 0
8 / 2 = 4
12 % 2 = 0
12 / 2 = 6
9 % 2 = 1
9 / 2 = 4
2 % 2 = 0
2 / 2 = 1
13 % 2 = 1
13 / 2 = 6
14 % 2 = 0
14 / 2 = 7
15 % 2 = 1
15 / 2 = 7
16 % 2 = 0
16 / 2 = 8
17 % 2 = 1
17 / 2 = 8
18 % 2 = 0
18 / 2 = 9
19 % 2 = 1
19 / 2 = 9
20 % 2 = 0
20 / 2 = 10
21 % 2 = 1
21 / 2 = 10
22 % 2 = 0
22 / 2 = 11
23 % 2 = 1
23 / 2 = 11
24 % 2 = 0
24 / 2 = 12
But at least, when the non-gzip version says it is writing something to the files, there actually is some content in them.
Answer 0 (score: 0):
I tried some modifications based on the examples in the documentation of the multiprocessing module. It seems I can force the writing of the gzipped files by calling the get method of the result object returned by apply_async:
#!/usr/bin/env python3
from multiprocessing import Manager, Pool, cpu_count
import time
from gzip import open as gzopen


def writer(queue, path1, path2):
    with gzopen(path1, "wb") as f1, gzopen(path2, "wb") as f2:
        while True:
            (where, what) = queue.get()
            print("I'm a writer. I have to write:\n%s to %s" % (what, where))
            # The encode seems necessary when things are actually written
            # (not necessary with files obtained with the normal open)
            if where == "out1":
                f1.write(what.encode())
            elif where == "out2":
                f2.write(what.encode())
            else:
                print("flushing files")
                f1.flush()
                f2.flush()
                break


def do_divmod(num_and_queue):
    (num, queue) = num_and_queue
    q, r = divmod(num, 2)
    time.sleep(1)
    if r:
        queue.put(("out2", "q: %d\n" % q))
    else:
        queue.put(("out1", "q: %d\n" % q))
    time.sleep(1)
    return (num, q, r)


def main():
    with Manager() as mgr, Pool(processes=cpu_count() - 2) as pool:
        write_queue = mgr.Queue()
        # getting the "result object"
        writing = pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        for (n, q, r) in pool.imap_unordered(
                do_divmod,
                ((number, write_queue) for number in range(25))):
            print("%d %% 2 = %d" % (n, r))
            print("%d / 2 = %d" % (n, q))
        write_queue.put(("", ""))
        # Magic command to force the writing of the gzipped files
        # (not necessary when files are obtained through normal open)
        writing.get(timeout=1)


if __name__ == "__main__":
    main()
I don't know why this works, nor what apply_async actually returns. The documentation does not say much about this result object. More detailed explanations are welcome.
Note that the above code is still buggy: it solves the initial problem of the gzipped files ending up empty, but not the problem of the writer apparently performing only one write.
It turns out that getting the result of pool.apply_async and calling get on it is actually not the only way to get the gzipped files written: just changing what to what.encode() also forces the writing:
#!/usr/bin/env python3
from multiprocessing import Manager, Pool, cpu_count
import time
from gzip import open as gzopen


def writer(queue, path1, path2):
    with gzopen(path1, "wb") as f1, gzopen(path2, "wb") as f2:
        while True:
            (where, what) = queue.get()
            print("I'm a writer. I have to write:\n%s to %s" % (what, where))
            if where == "out1":
                # The encode method call seems to force the writing
                f1.write(what.encode())
                #f1.write(what)
            elif where == "out2":
                # The encode method call seems to force the writing
                f2.write(what.encode())
                #f2.write(what)
            else:
                print("flushing files")
                f1.flush()
                f2.flush()
                break


def do_divmod(num_and_queue):
    (num, queue) = num_and_queue
    q, r = divmod(num, 2)
    time.sleep(1)
    if r:
        queue.put(("out2", "q: %d\n" % q))
    else:
        queue.put(("out1", "q: %d\n" % q))
    time.sleep(1)
    return (num, q, r)


def main():
    with Manager() as mgr, Pool(processes=cpu_count() - 2) as pool:
        write_queue = mgr.Queue()
        #writing = pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        pool.apply_async(writer, (write_queue, "/tmp/out1.txt.gz", "/tmp/out2.txt.gz"))
        for (n, q, r) in pool.imap_unordered(
                do_divmod,
                ((number, write_queue) for number in range(25))):
            print("%d %% 2 = %d" % (n, r))
            print("%d / 2 = %d" % (n, q))
        write_queue.put(("", ""))
        #writing.get(timeout=1)


if __name__ == "__main__":
    main()
I still have no clue about what is going on here...