Processing a file with Python multiprocessing, then writing the results to disk

Date: 2019-03-25 23:56:36

Tags: python python-multiprocessing

I want to do the following:

  • Read data from a csv file
  • Process each row of that csv (assume this is a long network operation)
  • Write the results to another file

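I tried combining the answers from this and this, but with little success. The code for the second queue is never called, so nothing is written to disk. How do I make the processes aware of the second queue?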

Note that I am not tied to any particular approach. If multiprocessing / async works better, I am all for it.

My code so far:

[code block missing from this copy]

1 answer:

Answer 0 (score: 2)

The first problem I ran into when trying to execute your code was:

An attempt has been made to start a new process before the current process has finished 
its bootstrapping phase. This probably means that you are not using fork to start your 
child processes and you have forgotten to use the proper idiom in the main module

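I had to wrap all module-scope statements in the if __name__ == '__main__': idiom (see "Safe importing of main module" in the multiprocessing programming guidelines).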
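As a minimal sketch of that idiom (the function names square and main are made up for illustration and are not from the question): the worker lives at module level so child processes can import it, and the pool is only created when the file is run as a script.

```python
import multiprocessing


def square(x):
    # Worker functions must be defined at module level so that
    # child processes can import them by name.
    return x * x


def main():
    # Creating the pool starts child processes, so this must only run
    # when the file is executed as a script, not when it is imported.
    with multiprocessing.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # prints [1, 4, 9]


if __name__ == '__main__':
    main()
```

Without the guard, a child process that re-imports the main module (as happens with the spawn start method) would try to create its own pool, which is what triggers the error quoted above.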

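Since your goal is to iterate over the lines of a file, Pool.imap() seems like a good fit. The imap() docs refer to the map() docs; the difference is that imap() lazily pulls the next items from the iterable (in your case, the csv file), which helps if your csv file is large. So, from the map() docs:

  "This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks."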

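imap() returns an iterator, so you can loop over the results produced by the worker processes and do whatever you need with them (in your example, write them to a file).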
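One property worth knowing when looping over that iterator: imap() yields results in the order of the input, even if the tasks finish in a different order, while imap_unordered() yields them as they complete. The sketch below illustrates this; slow_double and collect are hypothetical helper names, and the sleep durations are arbitrary.

```python
import multiprocessing
import time


def slow_double(x):
    # Earlier items sleep longer, so later items tend to finish first.
    time.sleep(0.05 * (3 - x))
    return x * 2


def collect(pool, items):
    # imap() yields results in input order, whatever order the tasks finish in.
    return list(pool.imap(slow_double, items))


if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        print(collect(pool, range(3)))  # prints [0, 2, 4]
        # imap_unordered() yields each result as soon as it is ready,
        # so the order of this list may differ from the input order.
        print(list(pool.imap_unordered(slow_double, range(3))))
```

If the output file does not need to match the input order, imap_unordered() avoids holding back finished results behind one slow row.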

Here is a working example:

import multiprocessing
import os
import time


def worker_main(item):
    print(os.getpid(), "got", item)
    time.sleep(1)  # simulate long network processing
    print(os.getpid(), "done", item)
    # return the processed item so the parent process can write it to disk
    return "processed:" + str(item)


if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        with open('out.txt', 'w') as file:
            # range(5) simulating a 5 row csv file.
            for proc_row in pool.imap(worker_main, range(5)):
                file.write(proc_row + '\n')

# printed output:
# 1368 got 0
# 9228 got 1
# 12632 got 2
# 1368 done 0
# 1368 got 3
# 9228 done 1
# 9228 got 4
# 12632 done 2
# 1368 done 3
# 9228 done 4

out.txt looks like this:

processed:0
processed:1
processed:2
processed:3
processed:4

Note that I did not have to use any queues either.
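To adapt the example to real csv files, something along these lines should work. This is only a sketch under assumptions: process_row is a hypothetical stand-in for the long network operation (here it just upper-cases each field), and in.csv / out.csv are placeholder file names.

```python
import csv
import multiprocessing


def process_row(row):
    # Hypothetical stand-in for the long network operation:
    # upper-case every field of the row.
    return [field.upper() for field in row]


def process_file(in_path, out_path, pool):
    # Feed rows from the input csv to the pool and write each result
    # as imap() yields it (results arrive in input order).
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for result in pool.imap(process_row, csv.reader(src)):
            writer.writerow(result)


if __name__ == '__main__':
    # Create a tiny placeholder input file so the sketch runs as-is.
    with open('in.csv', 'w', newline='') as f:
        csv.writer(f).writerows([['a', 'b'], ['c', 'd']])
    with multiprocessing.Pool(3) as pool:
        process_file('in.csv', 'out.csv', pool)
```

Only the parent process touches the output file, so no queue or lock is needed for the disk writes.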