I have a folder containing 100,000 files, 50 GB in total. The goal is to read every file and run some regular expressions over it to store the data. I am running tests to see which approach, multithreading or multiprocessing, would be ideal.

The server I am using has 4 cores and 8 GB of RAM. Without any multithreading, the task takes about 5 minutes.
import glob
from concurrent.futures import ThreadPoolExecutor

threads = []

def read_files(filename):
    with open(filename, 'r') as f:
        text = f.read()

with ThreadPoolExecutor(max_workers=50) as executor:
    for filename in glob.iglob('/root/my_app/my_app_venv/raw_files/*.txt', recursive=True):
        threads.append(executor.submit(read_files, filename))
The multithreaded version averages 1 minute 30 seconds. Now I am trying to set up an equivalent test for multiprocessing, using the server's 4 cores, but I am getting nowhere.
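The futures collected above are never inspected, and the question says the goal is to run regexes and store the data. A minimal sketch of doing that with the same executor, using a placeholder pattern since the real regexes are not shown, might look like:

```python
import glob
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder pattern: substitute whatever your real regexes extract.
PATTERN = re.compile(r'\d+')

def extract(filename):
    """Read one file and return all regex matches found in it."""
    with open(filename, 'r') as f:
        return PATTERN.findall(f.read())

def run(pattern_glob, max_workers=50):
    """Map each matching filename to the list of matches in that file."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which file.
        futures = {executor.submit(extract, fn): fn
                   for fn in glob.iglob(pattern_glob)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

This keeps the submit-per-file shape of the original, but actually returns the extracted data instead of discarding it.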
import glob
import queue
from multiprocessing import Process, Queue

def read_files(tasks_to_accomplish):
    while True:
        try:
            filename = tasks_to_accomplish.get_nowait()
        except queue.Empty:
            break
        with open(filename, 'r') as f:
            text = f.read()

def main():
    number_of_processes = 4
    tasks_to_accomplish = Queue()
    processes = []

    for filename in glob.iglob('/root/my_app/my_app_venv/raw_files/*.txt', recursive=True):
        tasks_to_accomplish.put(filename)

    # creating processes
    for w in range(number_of_processes):
        p = Process(target=read_files, args=(tasks_to_accomplish,))
        processes.append(p)
        p.start()

    # completing processes
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
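One fragile spot in the code above: `get_nowait()` can raise `queue.Empty` while items are still in flight through the queue's feeder thread, so a worker may exit before the work is done. A common, more robust shape is a blocking `get()` with one sentinel per worker. The sketch below also sends results back through a second queue (the `len(text)` stand-in is an assumption; the question's real regex work would go there):

```python
import glob
from multiprocessing import Process, Queue

SENTINEL = None  # poison pill telling a worker to stop

def read_files(tasks, results):
    while True:
        filename = tasks.get()          # blocks until an item is available
        if filename is SENTINEL:
            break
        with open(filename, 'r') as f:
            text = f.read()
        # Stand-in for the real regex work: report the length read.
        results.put((filename, len(text)))

def run(pattern_glob, number_of_processes=4):
    tasks, results = Queue(), Queue()
    processes = [Process(target=read_files, args=(tasks, results))
                 for _ in range(number_of_processes)]
    for p in processes:
        p.start()
    n = 0
    for filename in glob.iglob(pattern_glob):
        tasks.put(filename)
        n += 1
    for _ in processes:
        tasks.put(SENTINEL)             # one sentinel per worker
    # Drain results before joining, so workers never block on a full pipe.
    collected = dict(results.get() for _ in range(n))
    for p in processes:
        p.join()
    return collected
```

Because the sentinels are enqueued after all filenames, every worker drains real work before it sees its stop signal.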
Please help!
Answer 0 (score: 1)
Since you're already using concurrent.futures, I'd suggest switching to ProcessPoolExecutor, which sits on top of multiprocessing in the same way that ThreadPoolExecutor sits on top of threading. The two classes have almost exactly the same API:
https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor
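A minimal sketch of the suggested switch, keeping the question's submit-per-file pattern but with `ProcessPoolExecutor` and `executor.map`. The regex is a placeholder since the real patterns aren't shown, and `chunksize` is an assumption worth tuning (it batches filenames per task, cutting inter-process overhead when there are 100k small work items):

```python
import glob
import re
from concurrent.futures import ProcessPoolExecutor

# Placeholder pattern: substitute your real regexes.
PATTERN = re.compile(r'\d+')

def extract(filename):
    """Worker: must be module-level so it can be pickled to child processes."""
    with open(filename, 'r') as f:
        return filename, PATTERN.findall(f.read())

def run(pattern_glob, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # chunksize batches filenames per task to reduce IPC overhead.
        return dict(executor.map(extract, glob.iglob(pattern_glob),
                                 chunksize=64))
```

With 4 cores, `max_workers=4` matches the server; the rest of the code is unchanged relative to the thread version, which is the point of the shared API.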