Question

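I've created a class that loops through a file and, after checking whether a line is valid, writes that line to another file. Checking each line is a lengthy process, which makes the whole thing slow. I need to add threading/multiprocessing to the process_file function; I don't know which library is best suited to speeding it up, or how to implement it.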

class FileProcessor:
    def process_file(self):
        with open('file.txt', 'r') as f:
            with open('outfile.txt', 'w') as output:
                for line in f:
                    # There's some string manipulation code here...
                    valid = self.do_stuff(line)
                    # If valid, write the line to outfile.txt
                    if valid:
                        output.write(line)

    def do_stuff(self, line):
        # Does stuff...
        pass

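Extra info: the code goes through a list of proxies and checks whether each one is online. This is a long and time-consuming process.

Thank you for any insight or help!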

Answer 1

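"the code goes through a list of proxies and checks whether each one is online"

It sounds like what takes a long time is connecting to the internet, which means your task is I/O bound, so threads can help speed it up. Multiple processes would also work, but they can be harder to use.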
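
As a rough illustration of the thread-based approach, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor from the standard library. The names check_proxy and filter_lines are made up for this example, and the "check" just sleeps to imitate network latency; it is not the asker's real validation code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_proxy(line):
    """Hypothetical stand-in for the slow online check (no real network access)."""
    time.sleep(0.05)  # imitates waiting on a network connection
    return not line.startswith("#")  # placeholder validity rule, an assumption


def filter_lines(lines, max_workers=8):
    """Return the lines that pass check_proxy, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map runs the checks concurrently and yields results in input order
        verdicts = list(executor.map(check_proxy, lines))
    return [line for line, ok in zip(lines, verdicts) if ok]


if __name__ == "__main__":
    sample = ["10.0.0.1:8080\n", "# disabled\n", "10.0.0.2:3128\n"]
    print(filter_lines(sample))
```

Because the threads spend their time waiting rather than computing, the waits overlap, which is where the speed-up for an I/O-bound check comes from.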

Answer 2

This seems like a job for multiprocessing.Pool.map.

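import multiprocessing

def do_stuff(item):
    return "I did something with %s\n" % item

def process_file(filename):
    # Using the pool as a context manager cleans up the workers on exit.
    with multiprocessing.Pool(4) as pool:
        with open(filename) as infile, open("output.txt", "w") as outfile:
            # imap_unordered yields results as workers finish, in arbitrary order.
            for r in pool.imap_unordered(do_stuff, infile):
                outfile.write(r)

if __name__ == "__main__":
    # The guard stops worker processes from re-running this when they import the module.
    process_file(__file__)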

If you want to use threads (which may be better in this case, since you're I/O bound), you can also use multiprocessing.dummy.Pool.
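
A minimal sketch of that swap, assuming the same do_stuff placeholder as above: multiprocessing.dummy exposes the same Pool API but backs it with threads, so only the import changes. The process_lines helper is a name invented for this example.

```python
from multiprocessing.dummy import Pool  # same Pool API, but workers are threads


def do_stuff(item):
    # Placeholder work, as in the answer; a real check would go here.
    return "I did something with %s\n" % item.rstrip("\n")


def process_lines(lines, workers=4):
    """Run do_stuff over the lines in a thread pool; result order is arbitrary."""
    with Pool(workers) as pool:
        # Consume the iterator before the pool is torn down at the end of the block.
        return list(pool.imap_unordered(do_stuff, lines))


if __name__ == "__main__":
    for result in process_lines(["alpha\n", "beta\n", "gamma\n"]):
        print(result, end="")
```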

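Basically, you pass an iterable to imap_unordered (or imap if order matters), and it farms out pieces of it to other processes (or threads, if you use dummy). You can tune the map's chunksize for efficiency.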
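
To make the difference between imap and imap_unordered, and the chunksize argument, concrete, here is a small sketch. It uses the thread-backed pool so it runs without pickling concerns; the calls have the same signature on multiprocessing.Pool. The shout and run names are invented for illustration.

```python
from multiprocessing.dummy import Pool


def shout(line):
    # Placeholder for the per-line work.
    return line.upper()


def run(lines, ordered, chunksize=1):
    """Map shout over lines; ordered=True uses imap, otherwise imap_unordered."""
    with Pool(4) as pool:
        mapper = pool.imap if ordered else pool.imap_unordered
        # chunksize controls how many items are handed to a worker at a time;
        # larger chunks mean less dispatch overhead but coarser load balancing.
        return list(mapper(shout, lines, chunksize))


if __name__ == "__main__":
    data = ["a\n", "b\n", "c\n", "d\n"]
    print(run(data, ordered=True))                 # same order as the input
    print(run(data, ordered=False, chunksize=2))   # order not guaranteed
```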

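If you want to encapsulate this in a class, you'll need to use multiprocessing.dummy. (Otherwise it can't pickle the instance method. That is the case on Python 2; recent Python 3 versions can pickle bound methods as long as the instance itself is picklable.)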
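
One way the class from the question might look with a thread pool is sketched below. The constructor argument, the path parameters, and the "non-blank line is valid" rule are all assumptions made for the example, not part of the original code.

```python
from multiprocessing.dummy import Pool  # thread-backed, so no pickling of self is needed


class FileProcessor:
    def __init__(self, workers=4):
        self.workers = workers  # hypothetical tuning knob

    def do_stuff(self, line):
        """Placeholder check: treat any non-blank line as valid."""
        return bool(line.strip())

    def process_file(self, in_path, out_path):
        with open(in_path) as src:
            lines = src.readlines()
        with Pool(self.workers) as pool:
            # map blocks until every line is checked and keeps input order
            verdicts = pool.map(self.do_stuff, lines)
        with open(out_path, "w") as out:
            out.writelines(line for line, ok in zip(lines, verdicts) if ok)


if __name__ == "__main__":
    FileProcessor().process_file("file.txt", "outfile.txt")
```

Only the main thread writes to the output file here, so no locking is required.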

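With map, you have to wait until the whole map is finished before you can process the results (imap and imap_unordered yield results as they become available). Alternatively, you could write the results out inside do_stuff. Just make sure you open the file in append mode, and you'll probably want to lock the file.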
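
A sketch of the "write from inside the worker" variant, assuming threads: each worker appends to the output file while holding a threading.Lock so that writes do not interleave. The names check_and_write and write_valid_lines are hypothetical, and the validity rule is a placeholder. With real processes a threading.Lock would not be shared between workers, so a different locking mechanism would be needed.

```python
import threading
from multiprocessing.dummy import Pool

write_lock = threading.Lock()  # shared by all worker threads in this process


def check_and_write(args):
    """Check one line and, if valid, append it to the output file."""
    line, out_path = args
    if line.strip():  # placeholder validity rule, an assumption
        with write_lock:
            # Append mode, so earlier writes from other workers are preserved.
            with open(out_path, "a") as out:
                out.write(line)


def write_valid_lines(lines, out_path, workers=4):
    with Pool(workers) as pool:
        pool.map(check_and_write, [(line, out_path) for line in lines])


if __name__ == "__main__":
    write_valid_lines(["one\n", "\n", "two\n"], "outfile.txt")
```

The output order depends on which worker finishes first, so this suits cases where line order in the result does not matter.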
