I have a folder containing 100,000 files, 50 GB in total. The goal is to read every file and run some regular expressions over it to store the data. I am running tests to see which approach, multithreading or multiprocessing, would be ideal.

The server I am using has 4 cores and 8 GB of RAM. Without any multithreading, the task takes about 5 minutes.
import glob
from concurrent.futures import ThreadPoolExecutor

threads = []

def read_files(filename):
    with open(filename, 'r') as f:
        text = f.read()

with ThreadPoolExecutor(max_workers=50) as executor:
    for filename in glob.iglob('/root/my_app/my_app_venv/raw_files/*.txt', recursive=True):
        threads.append(executor.submit(read_files, filename))
The multithreaded version averages 1 minute 30 seconds. Now I am trying to set up an equivalent test for multiprocessing, using the server's 4 cores, but I am getting nowhere.
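The futures collected above are never inspected, and the question says the goal is to run regexes and store the data. A minimal sketch of doing that with the same executor, using a placeholder pattern since the real regexes are not shown, might look like:

```python
import glob
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder pattern: substitute whatever your real regexes extract.
PATTERN = re.compile(r'\d+')

def extract(filename):
    """Read one file and return all regex matches found in it."""
    with open(filename, 'r') as f:
        return PATTERN.findall(f.read())

def run(pattern_glob, max_workers=50):
    """Map each matching filename to the list of matches in that file."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which file.
        futures = {executor.submit(extract, fn): fn
                   for fn in glob.iglob(pattern_glob)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

This keeps the submit-per-file shape of the original, but actually returns the extracted data instead of discarding it.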
import glob
import queue
from multiprocessing import Process, Queue

def read_files(tasks_to_accomplish):
    while True:
        try:
            filename = tasks_to_accomplish.get_nowait()
        except queue.Empty:
            break
        with open(filename, 'r') as f:
            text = f.read()

def main():
    number_of_processes = 4
    tasks_to_accomplish = Queue()
    processes = []

    for filename in glob.iglob('/root/my_app/my_app_venv/raw_files/*.txt', recursive=True):
        tasks_to_accomplish.put(filename)

    # creating processes
    for w in range(number_of_processes):
        p = Process(target=read_files, args=(tasks_to_accomplish,))
        processes.append(p)
        p.start()

    # completing processes
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
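One fragile spot in the code above: `get_nowait()` can raise `queue.Empty` while items are still in flight through the queue's feeder thread, so a worker may exit before the work is done. A common, more robust shape is a blocking `get()` with one sentinel per worker. The sketch below also sends results back through a second queue (the `len(text)` stand-in is an assumption; the question's real regex work would go there):

```python
import glob
from multiprocessing import Process, Queue

SENTINEL = None  # poison pill telling a worker to stop

def read_files(tasks, results):
    while True:
        filename = tasks.get()          # blocks until an item is available
        if filename is SENTINEL:
            break
        with open(filename, 'r') as f:
            text = f.read()
        # Stand-in for the real regex work: report the length read.
        results.put((filename, len(text)))

def run(pattern_glob, number_of_processes=4):
    tasks, results = Queue(), Queue()
    processes = [Process(target=read_files, args=(tasks, results))
                 for _ in range(number_of_processes)]
    for p in processes:
        p.start()
    n = 0
    for filename in glob.iglob(pattern_glob):
        tasks.put(filename)
        n += 1
    for _ in processes:
        tasks.put(SENTINEL)             # one sentinel per worker
    # Drain results before joining, so workers never block on a full pipe.
    collected = dict(results.get() for _ in range(n))
    for p in processes:
        p.join()
    return collected
```

Because the sentinels are enqueued after all filenames, every worker drains real work before it sees its stop signal.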
Please help!
Answer 0 (score: 1)
Since you're already using concurrent.futures, I'd suggest switching to ProcessPoolExecutor, which sits on top of multiprocessing in the same way that ThreadPoolExecutor sits on top of threading. The two classes have almost exactly the same API:
https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor
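A minimal sketch of the suggested switch, keeping the question's submit-per-file pattern but with `ProcessPoolExecutor` and `executor.map`. The regex is a placeholder since the real patterns aren't shown, and `chunksize` is an assumption worth tuning (it batches filenames per task, cutting inter-process overhead when there are 100k small work items):

```python
import glob
import re
from concurrent.futures import ProcessPoolExecutor

# Placeholder pattern: substitute your real regexes.
PATTERN = re.compile(r'\d+')

def extract(filename):
    """Worker: must be module-level so it can be pickled to child processes."""
    with open(filename, 'r') as f:
        return filename, PATTERN.findall(f.read())

def run(pattern_glob, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # chunksize batches filenames per task to reduce IPC overhead.
        return dict(executor.map(extract, glob.iglob(pattern_glob),
                                 chunksize=64))
```

With 4 cores, `max_workers=4` matches the server; the rest of the code is unchanged relative to the thread version, which is the point of the shared API.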