I wrote code for a program that processes very large files of "byte" data (e.g., 4 GB for x = 2048, y = 2048, time = 1000 in the code below). In some cases it can be up to a 16 GB file. I think absolute_bytearray(data) could be sped up at least four times with multiprocessing (because only about 28% of the CPU is loaded when I run the program):
How do I apply multiprocessing to my code in the right way?
from time import perf_counter
from random import getrandbits
x = 512
y = 512
time = 200
xyt = x*y*time
my_by = bytearray(getrandbits(8) for x in range(xyt))
def absolute_bytearray(data):
    for i in range(len(data)):
        if data[i] > 127:
            data[i] = 255 - data[i]
    return data
start = perf_counter()
absolute_bytearray(my_by)
end = perf_counter()
print('time abs my_by = %.2f' % (end - start))  # around 6.70 s for 512*512*200
Or maybe you know a faster solution?
Answer 0 (score: 1)
Since you are processing such a large amount of data here, using shared memory is a good option to keep the memory footprint low while parallelizing the job. The multiprocessing module provides, among other things, Array for this case:
multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)

Return a ctypes array allocated from shared memory. By default the return value is actually a synchronized wrapper for the array. docs
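Just to illustrate the mechanics, here is a minimal, self-contained sketch of passing such a shared array to a child process (the function name negate_high_bytes and the toy data are mine, purely for illustration):

import ctypes
from multiprocessing import Process, Array

def negate_high_bytes(shared):
    """Runs in a child process; mutates the shared array in place."""
    for i in range(len(shared)):
        if shared[i] > 127:
            shared[i] = 255 - shared[i]

if __name__ == '__main__':
    # lock=False is safe here because only one process writes to the array
    shared = Array(ctypes.c_ubyte, [10, 200, 30, 250], lock=False)
    p = Process(target=negate_high_bytes, args=(shared,))
    p.start()
    p.join()
    print(shared[:])  # [10, 55, 30, 5]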
The code below also uses multiple processes to create the data. Please take the code for the mp_utils module from my answer here. Its two functions are used to create "fair" ranges over the indices of the shared array. These batch_ranges are sent to the worker processes, and each process works on the part of the shared array covered by the indices in its range.
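In case that link is not at hand, here is a simplified sketch of what those two helpers do (my own approximation, not necessarily identical to the mp_utils code in the linked answer): they split the index space into nearly equal, contiguous ranges, one per worker.

def calc_batch_sizes(n_items, n_workers):
    """Split n_items into n_workers batch sizes differing by at most 1."""
    quotient, remainder = divmod(n_items, n_workers)
    return [quotient + 1 if i < remainder else quotient
            for i in range(n_workers)]

def build_batch_ranges(batch_sizes):
    """Turn consecutive batch sizes into range objects over the indices."""
    ranges, start = [], 0
    for size in batch_sizes:
        ranges.append(range(start, start + size))
        start += size
    return ranges

print(build_batch_ranges(calc_batch_sizes(10, 3)))
# [range(0, 4), range(4, 7), range(7, 10)]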
import random
import ctypes
from time import perf_counter
from multiprocessing import Process, Array
from mp_utils import calc_batch_sizes, build_batch_ranges

def f(data, batch_range):
    """Target processing function."""
    for i in batch_range:
        if data[i] > 127:
            data[i] = 255 - data[i]


def create_data(array, batch_range):
    """Fill specified range of array with random bytes."""
    rd = random.Random(42)  # arbitrary seed 42
    getrandbits = rd.getrandbits  # for speed
    for i in batch_range:
        array[i] = getrandbits(8)


def process_tasks(target, tasks):
    """Process tasks by starting a new process per task."""
    pool = [Process(target=target, args=task) for task in tasks]

    for p in pool:
        p.start()
    for p in pool:
        p.join()


def main(x, y, time, n_workers):

    xyt = x * y * time

    # creating data
    creation_start = perf_counter()  # ----------------------------------------
    # We don't need a lock here, because our processes operate on different
    # subsets of the array.
    sha = Array(ctypes.c_ubyte, xyt, lock=False)  # initialize zeroed array
    batch_ranges = build_batch_ranges(calc_batch_sizes(len(sha), n_workers))
    tasks = [*zip([sha] * n_workers, batch_ranges)]

    process_tasks(target=create_data, tasks=tasks)
    print(f'elapsed for creation: {perf_counter() - creation_start:.2f} s')  # -
    print(sha[:30])

    # process data
    start = perf_counter()  # -------------------------------------------------
    process_tasks(target=f, tasks=tasks)
    print(f'elapsed for processing: {perf_counter() - start:.2f} s')  # -------
    print(sha[:30])


if __name__ == '__main__':

    N_WORKERS = 8
    X = Y = 512
    TIME = 200

    main(X, Y, TIME, N_WORKERS)
Example output:
elapsed for creation: 5.31 s
[163, 28, 6, 189, 70, 62, 57, 35, 188, 26, 173, 189, 228, 139, 22, 151, 108, 8, 7, 23, 55, 59, 129, 154, 6, 143, 50, 183, 166, 179]
elapsed for processing: 4.36 s
[92, 28, 6, 66, 70, 62, 57, 35, 67, 26, 82, 66, 27, 116, 22, 104, 108, 8, 7, 23, 55, 59, 126, 101, 6, 112, 50, 72, 89, 76]
Process finished with exit code 0
I'm running this on a SandyBridge (2012) machine with 8 logical cores (4 physical, hyper-threaded), on Ubuntu 18.04.
Your original serial code gives:
elapsed for creation: 22.14 s
elapsed for processing: 16.78 s
So my code gets roughly a 4x speed-up (about as many times as my machine has real cores).
These numbers are for 50 MiB (512x512x200) of data. I also tested with 4 GiB (2048x2048x1000); the timing improved from 1500 s (serial) to 366 s (parallel).