How to parallelize iteration over a large byte-type file?

Date: 2019-01-28 15:31:50

Tags: python multiprocessing byte python-multiprocessing

What I have: a byte file of up to 16 GB, with an offset at the start (e.g. 100 bytes). What I need: the fastest possible way to run the processing function "f" from the code below, for example with multiprocessing.

I tried to implement the approach from http://effbot.org/zone/wide-finder.htm. The multithreaded Python solution from that article turned out to be two times slower than the original code. I could not implement the multiprocessing Python solution because my Python level is not good enough yet. I have read the multiprocessing module documentation, but it did not help me, and I ran into some problems with the code...

from time import perf_counter
from random import getrandbits


def create_byte_data(size):
    creation_start = perf_counter()
    my_by = bytes(getrandbits(8) for i in range(size))  # creates 'size' random bytes (50 MB with the settings below)
    print('creation my_by time = %.1f' % (perf_counter() - creation_start))
    return my_by


def write_to_file(file, data, offset, b):
    writing_start = perf_counter()
    with open(file, "wb") as f:  # binary file creation
        for a in range(offset):  # write 'offset' filler bytes before the data
            f.write(b'0')
        # for n in range(b):  # for creating bigger files
        #     f.write(data)
        f.write(data)
    print('writing time = %.1f' % (perf_counter() - writing_start))


def abs_pixel(pixel: bytes) -> int:  # converts signed bytes to their absolute values (0..128) and sums them into "result"
    result = 0
    for a in pixel:
        if a > 127:
            result += 256 - a
        else:
            result += a
    return result    


def f(file, offset, time):  # this function must be accelerated
    sum_list = list()
    with open(file, "rb") as fh:  # 'fh' so the handle does not shadow the function name
        fh.seek(offset)
        while True:
            chunk = fh.read(time)  # one pixel = 'time' bytes
            if not chunk:
                break
            sum_list.append(abs_pixel(chunk))
    return sum_list


if __name__ == '__main__':
    filename = 'bytes.file'
    offset = 100
    x = 512
    y = 512
    time = 200
    fs = 2  # file size in GBytes  # for creating bigger files
    xyt = x * y * time
    b = fs*1024*1024*1024//xyt  # parameter for writing data file of size 'fs'
    my_data = create_byte_data(xyt)  # not needed once the file has been created
    write_to_file(filename, my_data, offset, b)  # not needed once the file has been created
    start = perf_counter()
    result = f(filename, offset, time)  # this function must be accelerated
    print('function time = %.1f' % (perf_counter() - start))
    print(result[:10])

Task: perform some math on chunks (of length "time") and collect the results into a list. The file may be huge, so RAM must not be overloaded. The code above can create a random byte file (50 MB to start with, or larger for further testing). Compared with the code above, I expect at least a 4x speedup of the function "f". Currently it takes about 6 seconds on my PC for a 50 MB file and about 240 seconds for a 2 GB file.
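For a sense of scale, a rough back-of-the-envelope calculation with time = 200 as in the code above:

# rough number of 'time'-byte pixels (chunks) in a 2 GB file
pixels = 2 * 1024**3 // 200
print(pixels)  # 10737418, i.e. over ten million calls to abs_pixel per run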

1 answer:

Answer 0 (score: 0)

I found out how to parallelize some parts of the code with multiprocessing and how to make abs_pixel run faster. The code now runs about 2x faster (for example, on my PC 6.1 s vs. 11.9 s per 100 MB, and 119 s vs. 246 s for 2 GB).

from multiprocessing import Pool
from struct import unpack_from


def abs_pixel_2(pixel: bytes) -> int:  # unpacks signed bytes and sums their absolute values
    a = unpack_from('<%ib' % len(pixel), pixel)
    return sum(map(abs, a))


def f_mp(file, offset, time):  # reads the file line by line; x and y are the globals defined in the question's main block
    sum_list = list()
    p = Pool()
    with open(file, "rb") as f:
        f.seek(offset)
        for z in range(y*2):  # (y*2) lines for the 100 MByte file (2 times the original 50 MBytes of created data)
            line = list()
            for i in range(x):  # read a line
                pixel = f.read(time)
                line.append(pixel)
            sums_line = p.map(abs_pixel_2, line)  # line with sums
            # sums_line = list(map(abs_pixel, line))  # line with sums, without using the Pool
            sum_list.append(sums_line)  # -> list of lines (lists)
        sum_list = [item for sublist in sum_list for item in sublist]  # flatten the list of lists
        p.close()
        p.join()
    return sum_list

But I am still hoping to find further speedups.
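One possible direction for further speedups is to read many pixels per block and reduce them with numpy instead of per-pixel Python loops or struct calls. The following is only a rough, unbenchmarked sketch: it assumes numpy is available (it is not used in the code above), the names f_np, abs_sums_block and pixels_per_block are illustrative, and it assumes the data length after the offset is a multiple of 'time', as in the file created by the question's code.

import numpy as np
from functools import partial
from multiprocessing import Pool


def abs_sums_block(block: bytes, time: int) -> list:
    # interpret the block as signed 8-bit samples, take absolute values,
    # and sum every 'time' samples (= one pixel) in one vectorized pass;
    # int16 avoids overflow for abs(-128)
    a = np.abs(np.frombuffer(block, dtype=np.int8).astype(np.int16))
    return a.reshape(-1, time).sum(axis=1).tolist()


def f_np(file, offset, time, pixels_per_block=4096):
    block_size = pixels_per_block * time  # read many pixels per system call
    sum_list = []
    with open(file, "rb") as fh:
        fh.seek(offset)
        with Pool() as p:
            blocks = iter(partial(fh.read, block_size), b'')  # lazy block reader, keeps RAM bounded
            for part in p.imap(partial(abs_sums_block, time=time), blocks):
                sum_list.extend(part)
    return sum_list

Calling result = f_np(filename, offset, time) should produce the same flat list of per-pixel sums as f, so the two can be compared on the 50 MB file first. Since the numpy reduction is already vectorized, it may also be worth benchmarking f_np with and without the Pool: for an I/O-bound file the single-process version can be close in speed while avoiding the cost of sending blocks between processes.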