I am trying to get a grasp on multithreading/multiprocessing using concurrent futures.
I have tried the following sets of code. I understand that I will always have the disk IO problem, but I want to maximize my RAM and CPU usage as far as possible.
What is the most used/best method for large scale processing?
How do you use concurrent futures for processing large datasets?
Is there a more preferred method than the ones below?
Approach 1:

    import os
    import multiprocessing

    jobs = []
    for folder in os.listdir(path):
        # one worker process per folder
        p = multiprocessing.Process(target=process_largeFiles, args=(folder,))
        jobs.append(p)
        p.start()
Approach 2:

    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        for folder in os.listdir(path):
            # pass the callable and its argument; do not call it here
            executor.submit(process_largeFiles, folder)
Approach 3:

    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        for folder in os.listdir(path):
            executor.submit(process_largeFiles, folder)
Should I try to use a process pool and a thread pool together?
Approach (thinking):

    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as process:
        with concurrent.futures.ThreadPoolExecutor(max_workers=100) as thread:
            for folder in os.listdir(path):
                process.submit(thread.submit(process_largeFiles, folder))
What is the most efficient approach to maximize my RAM and CPU across the broadest set of use cases?
I know that spinning up processes takes time, but would that cost be outweighed by the size of the files being processed?
Answer 0 (score: 0)
Use a ThreadPoolExecutor to open and read the files, then a ProcessPoolExecutor to process the data.
    import concurrent.futures
    from collections import deque

    TPExecutor = concurrent.futures.ThreadPoolExecutor
    PPExecutor = concurrent.futures.ProcessPoolExecutor

    def get_file(path):
        # threads are fine for blocking disk IO
        with open(path) as f:
            data = f.read()
        return data

    def process_large_file(s):
        # stand-in for the real CPU-bound work
        return sum(ord(c) for c in s)

    # placeholder names for the actual paths
    files = [filename1, filename2, filename3, filename4, filename5,
             filename6, filename7, filename8, filename9, filename0]

    results = []
    completed_futures = deque()

    def callback(future, completed=completed_futures):
        completed.append(future)

    with TPExecutor(max_workers=4) as thread_pool_executor:
        data_futures = [thread_pool_executor.submit(get_file, path) for path in files]
        with PPExecutor() as process_pool_executor:
            for data_future in concurrent.futures.as_completed(data_futures):
                future = process_pool_executor.submit(process_large_file,
                                                      data_future.result())
                future.add_done_callback(callback)
                # collect any that have finished
                while completed_futures:
                    results.append(completed_futures.pop().result())

    # drain anything that finished after the submission loop ended
    while completed_futures:
        results.append(completed_futures.pop().result())
A completion callback is used so that there is no need to wait on the finished processing futures. I don't know how that affects efficiency - it was used mainly to simplify the logic/code in the as_completed loop.
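For comparison, here is a minimal sketch of the same pipeline without the callback and deque, collecting the processing results with a second as_completed pass (it reuses get_file, process_large_file, and files from above):

    with TPExecutor(max_workers=4) as thread_pool_executor:
        data_futures = [thread_pool_executor.submit(get_file, path) for path in files]
        with PPExecutor() as process_pool_executor:
            # submit each file's data for processing as soon as its read finishes
            work_futures = [process_pool_executor.submit(process_large_file, f.result())
                            for f in concurrent.futures.as_completed(data_futures)]
            results = [f.result() for f in concurrent.futures.as_completed(work_futures)]

The trade-off is that nothing is collected until every processing job has been submitted, whereas the callback version drains results while it is still submitting.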
If the file or data submissions need to be throttled because of memory constraints, this would need a refactor. Depending on how file read time compares to processing time, it is hard to say how much data will be in memory at any given moment. I think collecting results inside the as_completed loop should help mitigate that. The data_futures may start completing while the ProcessPoolExecutor is still being set up - that ordering may need optimizing.
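If that throttling does become necessary, one possible shape for the refactor is to keep only a bounded window of reads in flight. This is a sketch under assumptions: MAX_IN_FLIGHT is a hypothetical cap, and it reuses the get_file/process_large_file helpers and TPExecutor/PPExecutor aliases defined above:

    import concurrent.futures
    from collections import deque

    MAX_IN_FLIGHT = 4  # hypothetical cap on files being read at once

    def process_throttled(files):
        pending = deque(files)
        results = []
        with TPExecutor(max_workers=MAX_IN_FLIGHT) as readers, PPExecutor() as workers:
            # prime a bounded window of reads
            in_flight = {readers.submit(get_file, pending.popleft())
                         for _ in range(min(MAX_IN_FLIGHT, len(pending)))}
            work_futures = []
            while in_flight:
                done, in_flight = concurrent.futures.wait(
                    in_flight, return_when=concurrent.futures.FIRST_COMPLETED)
                for read_future in done:
                    # hand the data to the process pool, then top the window back up
                    work_futures.append(
                        workers.submit(process_large_file, read_future.result()))
                    if pending:
                        in_flight.add(readers.submit(get_file, pending.popleft()))
            results = [f.result() for f in work_futures]
        return results

Note that this only caps how many files are being read concurrently; data already handed to the process pool still queues there, so the read window and worker count may both need tuning.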