Question

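I'm trying to read a file in Python (scan it and look for terms) and write out the results, say, a counter for each term. I need to do this for a large number of files (more than 3000). Is it possible to do this multithreaded? If so, how?

So, the scenario is this:

Read each file and scan its lines
Write the counters to a single output file shared by all the files I've read.

The second question is whether it improves read/write speed.

Hope that's clear enough. Thanks,

Ron.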

Answer 1

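I agree with @aix, multiprocessing is definitely the way to go. Either way you will be I/O bound: you can only read so fast, no matter how many parallel processes you have running. But some speedup is easy to get.

Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).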

import os
from multiprocessing import Pool
import time

def process_file(name):
    ''' Process one file: count number of lines and words '''
    linecount = 0
    wordcount = 0
    # errors='replace' keeps a stray non-UTF-8 byte from aborting the run
    with open(name, 'r', encoding='utf-8', errors='replace') as inp:
        for line in inp:
            linecount += 1
            wordcount += len(line.split())  # split on any run of whitespace

    return name, linecount, wordcount

def list_files(top):
    ''' Collect the path of every file below top '''
    return [os.path.join(dirname, name)
            for dirname, _, names in os.walk(top)
            for name in names]

def process_files_parallel(paths):
    ''' Process each file in parallel via Pool.map() '''
    with Pool() as pool:
        return pool.map(process_file, paths)

def process_files(paths):
    ''' Process each file sequentially via map() '''
    return list(map(process_file, paths))  # list() forces the lazy map to run

if __name__ == '__main__':
    paths = list_files('input/')

    start = time.time()
    process_files(paths)
    print("process_files()", time.time() - start)

    start = time.time()
    process_files_parallel(paths)
    print("process_files_parallel()", time.time() - start)

When I run this on a dual-core machine, there is a noticeable (but not 2x) speedup:

$ python process_files.py
process_files() 1.71218085289
process_files_parallel() 1.28905105591

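If the files are small enough to fit in memory, and you have a lot of processing to do that isn't I/O bound, then you should see an even better improvement.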
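
To tie this back to the question (a counter per term, one shared output file), here is a minimal sketch rather than a tuned implementation. The names count_terms, count_terms_in_files and write_counts, the term list, and the counts.tsv output name are all made up for illustration. Matching is a naive case-insensitive whitespace split, so "whale," with trailing punctuation would not count as "whale". The workers only return counters. The parent process merges them and is the only one that touches the output file, so no locking is needed.

```python
import os
from collections import Counter
from functools import partial
from multiprocessing import Pool

def count_terms(path, terms):
    """Count how often each term occurs in one file (naive whitespace split, case-insensitive)."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    with open(path, 'r', encoding='utf-8', errors='replace') as inp:
        for line in inp:
            counts.update(w for w in line.lower().split() if w in wanted)
    return counts

def count_terms_in_files(paths, terms, processes=None):
    """Count terms in many files in parallel and merge the per-file counters."""
    total = Counter()
    with Pool(processes) as pool:
        worker = partial(count_terms, terms=terms)
        # imap_unordered hands back each result as soon as a worker finishes;
        # chunksize batches file names to cut down on inter-process traffic.
        for counts in pool.imap_unordered(worker, paths, chunksize=16):
            total.update(counts)
    return total

def write_counts(total, out_path):
    """Write one 'term<TAB>count' line per term; called from the parent process only."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for term, n in sorted(total.items()):
            out.write(f"{term}\t{n}\n")

if __name__ == '__main__':
    # Hypothetical inputs: adjust the directory, the terms and the output name.
    paths = [os.path.join(d, n) for d, _, names in os.walk('input/') for n in names]
    total = count_terms_in_files(paths, ['whale', 'ship', 'sea'])
    write_counts(total, 'counts.tsv')
```

With 3000+ small files, the chunksize argument is worth experimenting with. The value 16 is a guess, not a measured optimum.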

Answer 2

Yes, it should be possible to do this in parallel.

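However, in Python it's hard to achieve parallelism with multiple threads: in CPython the global interpreter lock lets only one thread execute Python bytecode at a time. For this reason, multiprocessing is the better default choice for doing things in parallel.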
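
As a rough illustration, the standard-library concurrent.futures module puts thread pools and process pools behind the same interface, so you can try both on your own files and measure. count_lines and count_all are hypothetical helper names, not anything from the question. Threads can still help when the time is dominated by blocking reads, because CPython releases the global interpreter lock during file I/O. The scanning itself is CPU work, which is where processes should pay off.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def count_lines(path):
    """Stand-in for the real per-file work: count the lines in one file."""
    with open(path, 'rb') as inp:
        return sum(1 for _ in inp)

def count_all(paths, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run count_lines over paths with the given executor class, keeping input order."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(count_lines, paths))
```

Calling count_all(paths, ThreadPoolExecutor) versus count_all(paths, ProcessPoolExecutor) switches between the two. When using the process pool from a script, make the call under an if __name__ == '__main__': guard.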

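It's hard to say what kind of speedup you can expect to achieve. It depends on what fraction of the workload can be done in parallel (the more the better), and what fraction has to be done serially (the less the better).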
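
This trade-off is usually written down as Amdahl's law (the name is my addition, the point above doesn't depend on it): if a fraction p of the run time can be spread over n workers and the rest stays serial, the best possible speedup is 1 / ((1 - p) + p / n). A small sketch, with amdahl_speedup as a hypothetical helper name, to get a feel for the numbers:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only parallel_fraction of the work scales with workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

if __name__ == '__main__':
    # Print the bound for a few illustrative fractions and worker counts.
    for p in (0.5, 0.9, 0.99):
        print(p, [round(amdahl_speedup(p, n), 2) for n in (2, 4, 8)])
```

Under this model, a measured speedup of about 1.33x on two cores would mean that roughly half of the run time was parallelizable. It is an idealized bound that ignores process start-up and the cost of passing data between processes.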
