How to parallelize iteration over a bytearray?

Date: 2018-12-10 10:54:03

Tags: python arrays multiprocessing byte python-multiprocessing

I wrote code for a program that works with very large files of "byte" data (e.g. 4 GB for x = 2048, y = 2048, time = 1000 in the code below). In some cases the files can be up to 16 GB. I think absolute_bytearray(data) could be sped up at least fourfold with multiprocessing (because only about 28% of the CPU is loaded when I run the program):

How to Multi-thread an Operation Within a Loop in Python

How do I apply multiprocessing to my code the right way?

from time import perf_counter
from random import getrandbits

x = 512
y = 512
time = 200

xyt = x*y*time

my_by = bytearray(getrandbits(8) for _ in range(xyt))  # random test data

def absolute_bytearray(data):
    """Fold every value above 127 back into the 0..127 range, in place."""
    for i in range(len(data)):
        if data[i] > 127:
            data[i] = 255 - data[i]
    return data

start = perf_counter()
absolute_bytearray(my_by)
end = perf_counter()
print('time abs my_by = %.2f' % (end - start))  # around 6.70 s for 512*512*200

Or maybe you know a faster solution?

1 Answer:

Answer 0 (score: 1)

Since you are processing a large amount of data here, using shared memory would be a good option for keeping the memory footprint low while parallelizing the job. The multiprocessing module offers Array for this case, among others:

multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)

    Return a ctypes array allocated from shared memory. By default the return value is actually a synchronized wrapper for the array. (docs)
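For illustration, here is a minimal sketch of both variants (not part of the answer's original code): the default lock=True returns a synchronized wrapper, while lock=False, as used further below, returns the raw shared ctypes array and leaves synchronization to the caller.

import ctypes
from multiprocessing import Array

# Default (lock=True): synchronized wrapper, safe for concurrent access.
locked = Array(ctypes.c_ubyte, 10)
with locked.get_lock():
    locked[0] = 255

# lock=False: raw shared ctypes array, no locking overhead; the caller must
# ensure processes never write to the same index concurrently.
raw = Array(ctypes.c_ubyte, 10, lock=False)
raw[0] = 255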

  

The code below also uses multiple processes for creating the data. Please get the code for the mp_utils module from my answer here. The two functions from it are used to create "fair" ranges over the indexes of the shared array (a sketch of what they might look like follows below). These batch_ranges are sent to the worker processes, and each process works on the shared array at the indexes contained in its range.
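Since the linked mp_utils code is not reproduced in this answer, the following is only a sketch of what those two helpers might look like, assuming calc_batch_sizes splits the number of indexes into near-equal batch sizes and build_batch_ranges turns them into consecutive range objects:

# mp_utils.py (sketch, not necessarily the original module)
import itertools

def calc_batch_sizes(n_tasks, n_workers):
    """Split n_tasks into n_workers batch sizes that differ by at most 1."""
    quotient, remainder = divmod(n_tasks, n_workers)
    return [quotient + 1] * remainder + [quotient] * (n_workers - remainder)

def build_batch_ranges(batch_sizes):
    """Turn consecutive batch sizes into range objects over the indexes."""
    upper_bounds = list(itertools.accumulate(batch_sizes))
    lower_bounds = [0] + upper_bounds[:-1]
    return [range(lo, hi) for lo, hi in zip(lower_bounds, upper_bounds)]

For example, calc_batch_sizes(10, 3) would give [4, 3, 3], and build_batch_ranges([4, 3, 3]) would give [range(0, 4), range(4, 7), range(7, 10)].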

import random
import ctypes
from time import perf_counter
from multiprocessing import Process, Array

from mp_utils import calc_batch_sizes, build_batch_ranges


def f(data, batch_range):
    """Target processing function."""
    for i in batch_range:
        if data[i] > 127:
            data[i] = 255 - data[i]


def create_data(array, batch_range):
    """Fill specified range of array with random bytes."""
    rd = random.Random(42)  # arbitrary seed 42
    getrandbits = rd.getrandbits  # for speed
    for i in batch_range:
        array[i] = getrandbits(8)


def process_tasks(target, tasks):
    """Process tasks by starting a new process per task."""
    pool = [Process(target=target, args=task) for task in tasks]

    for p in pool:
        p.start()
    for p in pool:
        p.join()


def main(x, y, time, n_workers):

    xyt = x * y * time

    # creating data
    creation_start = perf_counter()  # ----------------------------------------
    # We don't need a lock here, because our processes operate on different
    # subsets of the array.
    sha = Array(ctypes.c_ubyte, xyt, lock=False)  # initialize zeroed array
    batch_ranges = build_batch_ranges(calc_batch_sizes(len(sha), n_workers))
    tasks = [*zip([sha] * n_workers, batch_ranges)]

    process_tasks(target=create_data, tasks=tasks)
    print(f'elapsed for creation: {perf_counter() - creation_start:.2f} s')  #-
    print(sha[:30])

    # process data
    start = perf_counter()  # -------------------------------------------------
    process_tasks(target=f, tasks=tasks)
    print(f'elapsed for processing: {perf_counter() - start:.2f} s')  # -------
    print(sha[:30])


if __name__ == '__main__':

    N_WORKERS = 8
    X = Y = 512
    TIME = 200

    main(X, Y, TIME, N_WORKERS)

Example output:

elapsed for creation: 5.31 s
[163, 28, 6, 189, 70, 62, 57, 35, 188, 26, 173, 189, 228, 139, 22, 151, 108, 8, 7, 23, 55, 59, 129, 154, 6, 143, 50, 183, 166, 179]
elapsed for processing: 4.36 s
[92, 28, 6, 66, 70, 62, 57, 35, 67, 26, 82, 66, 27, 116, 22, 104, 108, 8, 7, 23, 55, 59, 126, 101, 6, 112, 50, 72, 89, 76]

Process finished with exit code 0

I'm running this on a SandyBridge (2012) machine, 8 cores (4 hyper-threading), Ubuntu 18.04.

Your original serial code gives:

elapsed for creation: 22.14 s
elapsed for processing: 16.78 s

So I get about a 4x speed-up with my code (roughly as many as my machine has real cores).

These numbers are for 50 MiB (512x512x200) of data. I also tested with 4 GiB (2048x2048x1000), where the timings improved from 1500 s (serial) to 366 s (parallel).
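For reference, the serial numbers above can be reproduced with a single-process baseline along these lines (a sketch mirroring the question's original approach, not necessarily the exact benchmark code used for the timings above):

import random
from time import perf_counter

def serial_baseline(xyt):
    """Create and process the data entirely in the main process."""
    rd = random.Random(42)
    start = perf_counter()
    data = bytearray(rd.getrandbits(8) for _ in range(xyt))
    print(f'elapsed for creation: {perf_counter() - start:.2f} s')

    start = perf_counter()
    for i in range(len(data)):
        if data[i] > 127:
            data[i] = 255 - data[i]
    print(f'elapsed for processing: {perf_counter() - start:.2f} s')

if __name__ == '__main__':
    serial_baseline(512 * 512 * 200)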