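I am trying to read a very large bz2 file from a URL in chunks, in parallel, and decompress each chunk separately. When I decompress the chunks outside the worker function, it works fine. However, when a child process tries to decompress the same chunks, it raises an OSError: Invalid data stream exception.
The complete code is below. I am running Python 3.5.2.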
import bz2
import urllib3
import multiprocessing as mp


def parse_chunk():
    decompressor = bz2.BZ2Decompressor()
    global q
    while True:
        chunk = q.get()
        if chunk is None:
            break
        # Decompression here fails
        decompressed_chunk = decompressor.decompress(chunk).decode("utf-8")


decompressor_main = bz2.BZ2Decompressor()
http = urllib3.PoolManager()
r = http.request(
    'GET',
    'https://url_to_file.bz2',
    preload_content=False)
last_line = False
q = mp.Queue(maxsize=5)
pool = mp.Pool(5, initializer=parse_chunk)
for chunk in r.stream(1024*100):
    # Decompression here works
    decompressed_chunk = decompressor_main.decompress(chunk).decode("utf-8")
    q.put(chunk)
q.put(None)
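
One plausible reading of the failure, offered as a sketch rather than a confirmed diagnosis: mp.Pool(5, initializer=parse_chunk) starts five workers, each with its own BZ2Decompressor, and each takes whichever chunk is next on the shared queue. No single decompressor then sees the stream from its header onward, whereas decompressor_main does. The snippet below tries to reproduce that without any network access or multiprocessing. The names make_sample, split_chunks, fails_mid_stream and decompress_in_order are hypothetical helpers, and the generated sample data stands in for the real file.

```python
import bz2
import codecs
import hashlib


def make_sample(n_lines=4096):
    """Build deterministic ASCII sample data (a stand-in for the real file)."""
    return b"".join(
        hashlib.sha256(str(i).encode()).hexdigest().encode() + b"\n"
        for i in range(n_lines)
    )


def split_chunks(data, size):
    """Cut a bytes object into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fails_mid_stream(chunk):
    """Return True if a fresh decompressor raises OSError on this chunk."""
    try:
        bz2.BZ2Decompressor().decompress(chunk)
    except OSError:
        return True
    return False


def decompress_in_order(chunks):
    """Feed every chunk, in order, to one decompressor and yield decoded text."""
    decompressor = bz2.BZ2Decompressor()
    # Decompressed output need not end on a UTF-8 character boundary,
    # so decode incrementally instead of calling .decode() per chunk.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(decompressor.decompress(chunk))
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


raw = make_sample()
chunks = split_chunks(bz2.compress(raw), 16 * 1024)
restored = "".join(decompress_in_order(chunks))
```

If this reading is right, one option is to keep decompression sequential in a single process and hand only the decompressed text to the workers for parsing. Note also that the original loop puts a single None on the queue, so at most one of the five workers would ever see the stop signal.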