Question

I have a huge file that I need to read and process.

with open(source_filename) as source, open(target_filename, "w") as target:
    for line in source:
        target.write(do_something(line))

    do_something_else()

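Can this be sped up with threads? If I spawn one thread per line, will that create a huge overhead?

Edit: To keep this question from turning into a discussion, what should the code look like?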

with open(source_filename) as source, open(target_filename, "w") as target:
   ?

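@Nicoretti: In each iteration I need to read one line, which is several KB of data.

Update 2: The file may be a bz2 archive, so Python may have to wait for the decompression: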

$ bzip2 -dc country.osm.bz2 | ./my_script.py
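
An alternative to the pipe is to let Python decompress the file while reading it. Below is a minimal sketch using `bz2.open` from the standard library. It assumes the file is UTF-8 text; the function name `process_bz2_lines` and the `do_something` parameter are hypothetical and only there for illustration.

```python
import bz2

def process_bz2_lines(source_filename, target_filename, do_something):
    """Stream a bz2-compressed text file line by line without unpacking it to disk."""
    # Mode "rt" decompresses on the fly and yields str lines.
    with bz2.open(source_filename, "rt", encoding="utf-8") as source, \
            open(target_filename, "w", encoding="utf-8") as target:
        for line in source:
            target.write(do_something(line))
```

Decompression then runs inside the Python process, so whether this beats the external `bzip2` pipe is something to measure rather than assume.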

Answer 1

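You could use three threads: one for reading, one for processing, and one for writing. The possible advantage is that processing can take place while waiting for I/O, but you need to take some timings yourself to see whether there is an actual benefit in your situation.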

import threading
import queue  # this module was named "Queue" in Python 2

QUEUE_SIZE = 1000
sentinel = object()  # unique marker that signals the end of the stream

def read_file(name, q):
    # Reader: push every line onto the queue, then the end marker.
    with open(name) as f:
        for line in f:
            q.put(line)
    q.put(sentinel)

def process(inqueue, outqueue):
    # Worker: iter(get, sentinel) keeps calling get() until the marker arrives.
    for line in iter(inqueue.get, sentinel):
        outqueue.put(do_something(line))
    outqueue.put(sentinel)

def write_file(name, q):
    # Writer: drain the output queue into the target file.
    with open(name, "w") as f:
        for line in iter(q.get, sentinel):
            f.write(line)

inq = queue.Queue(maxsize=QUEUE_SIZE)
outq = queue.Queue(maxsize=QUEUE_SIZE)

threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)  # the writer runs in the main thread

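It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.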
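
To illustrate what the maxsize limit does, here is a small standalone sketch (the limit of 2 is chosen only to keep the demonstration short). Once the queue is full, a plain `put()` blocks until a consumer takes an item, which is what throttles a fast reader; `put_nowait()` is used here so that the full condition shows up as an exception instead of a hang.

```python
import queue

q = queue.Queue(maxsize=2)  # tiny limit, just for demonstration
q.put("line 1")
q.put("line 2")

try:
    # A plain q.put() would block here until a consumer calls q.get();
    # put_nowait() raises queue.Full instead so the effect is visible.
    q.put_nowait("line 3")
    was_full = False
except queue.Full:
    was_full = True

print(was_full, q.qsize())  # prints: True 2
```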

Answer 2

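Does the processing stage take a relatively long time, i.e. is it CPU-intensive? If not, then no, you won't gain much by threading or multiprocessing it. If your processing is expensive, then yes. So you need to profile to know for sure.

If you spend relatively more time reading the file than processing it (i.e. the file is big), then you can't gain performance by using threads; the bottleneck is the I/O, which threads don't improve.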
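
As a rough way to check which side dominates, the sketch below times the two stages separately. The helper name `time_read_and_process` is hypothetical, and the numbers are only indicative: the read time includes buffering and text decoding, and the timer calls themselves add some per-line overhead.

```python
import time

def time_read_and_process(source_filename, do_something):
    """Return (seconds spent reading lines, seconds spent in do_something)."""
    read_time = 0.0
    process_time = 0.0
    with open(source_filename) as source:
        lines = iter(source)
        while True:
            start = time.perf_counter()
            line = next(lines, None)  # time spent fetching the next line
            read_time += time.perf_counter() - start
            if line is None:
                break
            start = time.perf_counter()
            do_something(line)  # time spent processing that line
            process_time += time.perf_counter() - start
    return read_time, process_time
```

If the first number is much larger than the second, reading is the bottleneck and the argument above applies. For a more detailed breakdown, the standard `cProfile` module is the usual next step.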

Answer 3

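This is exactly the sort of thing you should not try to analyse a priori; you should profile it instead.

Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory and process it there, which may well make threading unnecessary.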

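Whether you have one thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you will want to use a fixed number of worker threads.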
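
A fixed number of worker threads could be sketched with `concurrent.futures.ThreadPoolExecutor` from the standard library. The function name, the default of 4 workers and the batch size of 1000 are arbitrary assumptions for illustration. Lines are fed to the pool in batches because `Executor.map` collects its whole input up front, which would otherwise pull the entire file into memory.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def process_with_workers(source_filename, target_filename, do_something,
                         workers=4, batch_size=1000):
    """Process lines with a fixed pool of threads, writing results in input order."""
    with open(source_filename) as source, \
            open(target_filename, "w") as target, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # Take a bounded batch so the whole file is never held in memory.
            batch = list(islice(source, batch_size))
            if not batch:
                break
            # map() yields results in the order of the inputs,
            # regardless of which worker finishes first.
            target.writelines(pool.map(do_something, batch))
```

Because `map` preserves input order, the output lines come out in the same order as the input without any extra bookkeeping.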

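There is yet another alternative: spawn sub-processes, and have them do the reading and the processing. Given your description of the problem, I would expect this one to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any similar system out there, or even a relational database).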
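
One simplified way to try sub-processes is `multiprocessing.Pool.imap`. Note that this sketch deviates from the suggestion above: the parent process still does the reading and only the processing is farmed out. The function name, the `chunksize` default and the `pool_factory` parameter are assumptions for illustration; `do_something` has to be picklable, for example a module-level function.

```python
import multiprocessing

def process_in_subprocesses(source_filename, target_filename, do_something,
                            processes=None, chunksize=256,
                            pool_factory=multiprocessing.Pool):
    """Process lines in worker processes, writing results in input order."""
    with open(source_filename) as source, \
            open(target_filename, "w") as target, \
            pool_factory(processes) as pool:
        # imap() returns results in input order; chunksize groups several
        # lines per task to reduce inter-process communication overhead.
        for result in pool.imap(do_something, source, chunksize=chunksize):
            target.write(result)
```

When called from a script, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing `multiprocessing.pool.ThreadPool` as `pool_factory` runs the same code with threads, which makes it easy to compare the two.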

Answer 4

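In CPython, threading is limited by the global interpreter lock: only one thread at a time can actually be executing Python code. So threading only benefits you if either:

1. you are doing processing that doesn't require the global interpreter lock; or
2. you are spending time blocked on I/O.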

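Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.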
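
Case (2) can be demonstrated with a small sketch in which `time.sleep` stands in for a blocking I/O call (an assumption made to keep the example self-contained; sleeping releases the global interpreter lock, just as waiting on a socket or file does). The function names are hypothetical.

```python
import threading
import time

def blocking_wait(seconds):
    time.sleep(seconds)  # stand-in for blocking I/O; releases the GIL

def run_sequential(n, seconds):
    """Perform n blocking waits one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        blocking_wait(seconds)
    return time.perf_counter() - start

def run_threaded(n, seconds):
    """Perform n blocking waits in n threads and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=blocking_wait, args=(seconds,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(4, 0.2)
threaded = run_threaded(4, 0.2)
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The threaded run should take roughly as long as a single wait, because the waits overlap. Replacing the sleep with a pure-Python computation would not show that gain, since the threads would then compete for the lock.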

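So whether your code can be sped up using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python, then it is very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads, you will have a synchronization problem when writing the results to the target file. There is no guarantee that threads will complete in the same order in which they were started, so you will have to take care to ensure that the output comes out in the right order.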

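Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you whether this is faster or slower than the single-threaded version (or Janne's version with only three threads).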

from threading import Thread
from queue import Queue  # the module was named "Queue" in Python 2

def process_file(f, source_filename, target_filename):
    """
    Apply the function `f` to each line of `source_filename` and write
    the results to `target_filename`. Each call to `f` is evaluated in
    a separate thread.
    """
    worker_queue = Queue()
    finished = object()

    def process(queue, line):
        "Process `line` and put the result on `queue`."
        queue.put(f(line))

    def read():
        """
        Read `source_filename`, create an output queue and a worker
        thread for every line, and put that worker's output queue onto
        `worker_queue`.
        """
        with open(source_filename) as source:
            for line in source:
                queue = Queue()
                Thread(target = process, args=(queue, line)).start()
                worker_queue.put(queue)
        worker_queue.put(finished)

    Thread(target = read).start()
    with open(target_filename, 'w') as target:
        for output in iter(worker_queue.get, finished):
            target.write(output.get())

How can I use threads in Python to parse a big file?