Fastest way to read a large binary file (over 500 MB)?

Time: 2019-05-15 14:20:01

Tags: python python-3.x binary

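I want to read a large binary file and split it into 6-byte chunks. Right now I can read a 1 GB binary file in 82 seconds, which is too slow. What is the best way to get maximum speed?

Note that I cannot use struct, because my chunk size (6 bytes) is not a power of 2.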

from time import time

with open(file, "rb") as infile:  # file: path to the binary file
    data_arr = []
    start = time()
    while True:
        data = infile.read(6)
        if not data:
            break
        data_arr.append(data)

2 answers:

Answer 0 (score: 1)

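You have a few different options. Your main problem is that, with such a small chunk size (6 bytes), a lot of overhead goes into fetching each chunk and into garbage collection.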
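To put a rough number on that overhead, the sketch below counts how many read() calls are needed for a given file size. It assumes "1 GB" means 1024**3 bytes, and `count_reads` is a hypothetical helper name used only for illustration.

```python
def count_reads(file_size, chunk_size):
    """Return how many non-empty read(chunk_size) calls consume file_size bytes."""
    # Ceiling division: a trailing partial chunk still costs one read.
    return -(-file_size // chunk_size)


ONE_GB = 1024 ** 3  # assumption: 1 GB taken as 1024**3 bytes

print(count_reads(ONE_GB, 6))        # 178956971 reads of 6 bytes
print(count_reads(ONE_GB, 180000))   # 5966 reads of 180,000 bytes
```

Each 6-byte read also creates a separate bytes object, so the per-call cost is paid roughly 179 million times for a file of that size.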

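There are two main ways around this, plus a hybrid of the two:

  1. Load the whole file into memory, then split it into chunks. This is the fastest approach, but the larger the file, the more likely you are to run into a MemoryError.

  2. Load one chunk into memory at a time, process it, then move on to the next. This is not faster overall, but it saves time up front, because you can start processing without waiting for the whole file to load.

  3. Experiment with a combination of 1 and 2 (buffer the file in large blocks and split those into smaller chunks, load the file in multiples of the chunk size, and so on). This is left as an exercise for the reader, since it takes a lot of experimentation to find code that is both fast and correct.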
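One possible shape for the hybrid in option 3 is sketched below. It is not a tuned implementation: `iter_chunks` and its `chunks_per_read` parameter are hypothetical names, the value 30000 is an arbitrary starting point, and the sketch assumes a regular file on disk, where a buffered read returns the full requested size until end of file.

```python
def iter_chunks(filename, chunk_size=6, chunks_per_read=30000):
    """Yield chunk_size-byte pieces while reading the file in large blocks."""
    # The block size is a multiple of chunk_size, so no chunk is split
    # across two reads.
    block_size = chunk_size * chunks_per_read
    with open(filename, "rb") as infile:
        while True:
            block = infile.read(block_size)
            if not block:
                break
            # Slice the in-memory block; if the file size is not a multiple
            # of chunk_size, the last piece yielded is shorter.
            for i in range(0, len(block), chunk_size):
                yield block[i:i + chunk_size]
```

Because it is a generator, only one block is held in memory at a time, and the block size can be varied to see where the timings stop improving.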

Some code, with timing comparisons:

import timeit


def read_original(filename):
    with open(filename, "rb") as infile:
        data_arr = []
        while True:
            data = infile.read(6)
            if not data:
                break
            data_arr.append(data)
    return data_arr


# the bigger the file, the more likely this is to cause python to crash
def read_better(filename):
    with open(filename, "rb") as infile:
        # read everything into memory at once
        data = infile.read()
        # separate string into 6-byte chunks
        data_arr = [data[i:i+6] for i in range(0, len(data), 6)]
    return data_arr

# no faster than the original, but allows you to work on each piece without loading the whole into memory
def read_iter(filename):
    with open(filename, "rb") as infile:
        data = infile.read(6)
        while data:
            yield data
            data = infile.read(6)


def main():
    # timeit returns the total for all 10 runs, so each figure below is a 10-run total
    # 93.8688215 s
    tm = timeit.timeit(stmt="read_original('test/oraociei12.dll')", setup="from __main__ import read_original", number=10)
    print(tm)
    # 85.69337399999999 s
    tm = timeit.timeit(stmt="read_better('test/oraociei12.dll')", setup="from __main__ import read_better", number=10)
    print(tm)
    # 103.0508528 s
    tm = timeit.timeit(stmt="[x for x in read_iter('test/oraociei12.dll')]", setup="from __main__ import read_iter", number=10)
    print(tm)

if __name__ == '__main__':
    main()

Answer 1 (score: 0)

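This approach is much faster. It reads the file in blocks of 180,000 bytes (30,000 × 6) and slices each block into 6-byte chunks in memory.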

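import sys
from functools import partial

SIX = 6
MULTIPLIER = 30000
# bytes per read: a multiple of 6, so no chunk straddles two reads
SIX_COUNT = SIX * MULTIPLIER

def do(data):
    # iter() keeps calling data.read(SIX_COUNT) until it returns b"" at end of file
    for chunk in iter(partial(data.read, SIX_COUNT), b""):
        # split the block into 6-byte pieces; six_list is rebuilt for every
        # block, so process or store it here
        six_list = [chunk[i:i+SIX] for i in range(0, len(chunk), SIX)]

if __name__ == "__main__":
    args = sys.argv[1:]
    for arg in args:
        with open(arg, 'rb') as data:
            do(data)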
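On the question's remark about struct: format sizes in the struct module are not restricted to powers of two, so a 6-byte record can be described with the format `6s`. The following is a minimal sketch, assuming the whole file fits in memory; `read_with_struct` is a hypothetical name, and this variant has not been timed against the code above, so whether it is faster would need measuring.

```python
import struct


def read_with_struct(filename, chunk_size=6):
    """Return the file contents as a list of chunk_size-byte bytes objects."""
    fmt = f"{chunk_size}s"  # e.g. "6s": one 6-byte string per record
    with open(filename, "rb") as infile:
        data = infile.read()
    # iter_unpack requires a buffer whose length is a multiple of the
    # record size, so any trailing partial chunk is handled separately.
    usable = len(data) - len(data) % chunk_size
    view = memoryview(data)[:usable]  # slice without copying the data
    chunks = [record[0] for record in struct.iter_unpack(fmt, view)]
    if usable < len(data):
        chunks.append(data[usable:])
    return chunks
```

Each item produced by `struct.iter_unpack` is a one-element tuple, which is why the list comprehension takes `record[0]`.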