Question

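I'm new to Python, and I need to implement multithreading in my code.

I have a huge .csv file (millions of rows) as my input. I read a line, make a REST request for each line, do some processing on each line, and write the output to another file. The order of lines in the input/output files matters. Right now I do this line by line. I want to run the same code in parallel, i.e. read 20 lines of input from the .csv file and make the REST calls in parallel so that my program runs faster.

I've been reading http://docs.python.org/2/library/queue.html, but I also read about the Python GIL issue, which says the code won't run any faster even with multithreading. Is there another simple way to achieve multithreading?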

Answer 1

Can you split the .csv file into multiple smaller files? If so, you can have another program run multiple instances of your processor.
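As a rough illustration of the splitting step, here is a minimal sketch that cuts a file into fixed-size chunks named file1, file2, and so on. The function name `split_file` and its parameters are hypothetical, and the sketch assumes Python 3, one record per physical line (no quoted fields containing newlines), and no header row that needs repeating in each chunk.

```python
import os

def split_file(path, lines_per_chunk, out_dir):
    """Split `path` into out_dir/file1, out_dir/file2, ... and return the chunk paths."""
    chunk_paths = []
    out = None
    # newline="" keeps the original line endings untouched.
    with open(path, "r", newline="") as src:
        for line_number, line in enumerate(src):
            # Start a new chunk every `lines_per_chunk` lines.
            if line_number % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, "file%d" % (len(chunk_paths) + 1))
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w", newline="")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Because the chunks are numbered in input order, concatenating the per-chunk outputs in the same order should preserve the original row order.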

Say the files are all named file1, file2, etc., and your processor takes the filename as an argument. You could have:

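import subprocess
import os
import signal

numfiles = 4  # assumption: number of input files (file1 ... file4); adjust as needed

for i in range(1, numfiles + 1):  # +1 so that the last file is included
    # Popen takes the whole command line as one list of arguments
    program = subprocess.Popen(['python', 'processer.py', 'file' + str(i)])
    pid = program.pid

    # if you need to kill the process:
    # os.kill(pid, signal.SIGINT)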
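The loop above starts the processes but never waits for them. A minimal sketch of one way to do that, assuming Python 3, is to keep every `Popen` handle and call `wait()` on each one afterwards. The helper name `run_all` is hypothetical.

```python
import subprocess
import sys

def run_all(commands):
    """Start every command at once, then wait for each one.

    Returns the exit codes in the same order as `commands`.
    """
    # All processes are started before any wait() call, so they run concurrently.
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in processes]
```

For the files in this answer the call might look like `run_all([[sys.executable, "processer.py", "file%d" % i] for i in range(1, numfiles + 1)])`; a non-zero exit code in the returned list would indicate a chunk that failed.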

Answer 2

Python releases the GIL on I/O. If most of the time is spent making REST requests, you can use threads to speed up processing:

try:
    from gevent.pool import Pool # $ pip install gevent
    import gevent.monkey; gevent.monkey.patch_all() # patch stdlib
except ImportError: # fallback on using threads
    from multiprocessing.dummy import Pool

import urllib2    

def process_line(url):
    try:
        return urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return None, e

with open('input.csv', 'rb') as file, open('output.txt', 'wb') as outfile:
    pool = Pool(20) # use 20 concurrent connections
    for result, error in pool.imap_unordered(process_line, file):
        if error is None:
            outfile.write(result)

If the input/output order should be the same, you can use imap instead of imap_unordered.
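To illustrate the ordering guarantee, here is a small standard-library sketch, written for Python 3 (the answer's own code targets Python 2). The names `slow_echo` and `ordered_results` are hypothetical, and `time.sleep` stands in for the REST call. Earlier items are made to finish last, yet `imap` still yields the results in input order.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

def slow_echo(item):
    # item is (index, delay); sleeping simulates a slow network request.
    index, delay = item
    time.sleep(delay)
    return index

def ordered_results(delays, workers=4):
    pool = Pool(workers)
    try:
        # imap yields results in the order of the inputs,
        # regardless of which task finishes first.
        return list(pool.imap(slow_echo, enumerate(delays)))
    finally:
        pool.close()
        pool.join()
```

With `imap_unordered` in place of `imap`, the same call would typically return the fastest tasks first, which is why it does not suit the question's ordering requirement.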

If your program is CPU-bound, you can use multiprocessing.Pool(), which creates multiple processes instead.

See also Python Interpreter blocks Multithreaded DNS requests?

This answer shows how to create a thread pool manually using threading + Queue modules
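For readers who want to see the shape of such a manual pool, here is a minimal sketch, assuming Python 3 (where the module is named `queue` rather than `Queue`). It is not the code from the linked answer. The names `process_line` and `run_pool` are hypothetical, and `process_line` is a placeholder for the real per-line work. Each task carries its input index, so results land in input order even though the workers finish in arbitrary order.

```python
import queue
import threading

def process_line(line):
    # Placeholder for the real work (e.g. a REST request plus processing).
    return line.strip().upper()

def run_pool(lines, num_workers=4):
    tasks = queue.Queue()
    results = [None] * len(lines)

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this thread
                break
            index, line = item
            # Each index is written by exactly one thread.
            results[index] = process_line(line)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for index, line in enumerate(lines):
        tasks.put((index, line))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

This sketch holds all results in memory and does not handle exceptions raised by `process_line`; for a file with millions of rows, a bounded queue and writing results out as they become contiguous would likely be needed.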

Multithreading in Python using queues
