Question

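What is the fastest way to check a large number of files for blocks of zeros? A block should be a run of at least 32000 zero bytes. The following code is too slow: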

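empty_blocks = []
min_length = 32000
block = False
begin = -1
with open(file_name, 'rb') as f:
    data = f.read()
# Iterating over bytes yields ints in Python 3, so ord() is not needed
for i, byte in enumerate(data):
    if byte == 0x00 and not block:
        block = True
        begin = i
    elif byte != 0x00 and block:
        block = False
        length = i - begin  # length of the zero run that just ended
        if length >= min_length:
            empty_blocks.append((begin, length))
        begin = -1
# Record a zero run that extends to the end of the file
if block and len(data) - begin >= min_length:
    empty_blocks.append((begin, len(data) - begin))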

Answer 1

Assuming a block size of 32768 bytes, I came up with this:

from functools import partial

BLOCKSIZE = 32 * 1024

with open('testfile.bin', 'rb') as f:
    for block_number, data in enumerate(iter(partial(f.read, BLOCKSIZE), b'')):
        if not any(data):
            print('Block #{0} is empty!'.format(block_number))

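~~sum() is the fastest way to determine whether every byte in the sequence is zero. I don't think it can get faster than O(n).~~
VPfB suggested using any(), which is very fast because it stops at the first non-zero element instead of walking the whole sequence.

Sample output: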

Block #0 is empty!
Block #100 is empty!
Block #200 is empty!

On my machine this processes about ~~~100 MB/s~~ 2 GB/s, which is the speed I was hoping for.
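
The short-circuit behaviour of any() mentioned above can be checked with a small timing sketch. The helper names `is_zero_any` and `is_zero_sum` are made up for this illustration, and the timings will vary by machine, so no numbers are claimed here:

```python
import timeit

BLOCKSIZE = 32 * 1024


def is_zero_any(data):
    """True if every byte is zero; any() stops at the first non-zero byte."""
    return not any(data)


def is_zero_sum(data):
    """Same result, but sum() always walks the whole buffer."""
    return sum(data) == 0


if __name__ == '__main__':
    zero_block = bytes(BLOCKSIZE)
    dirty_block = b'\x01' + bytes(BLOCKSIZE - 1)
    cases = (('all zero', zero_block), ('non-zero first byte', dirty_block))
    for name, block in cases:
        t_any = timeit.timeit(lambda: is_zero_any(block), number=200)
        t_sum = timeit.timeit(lambda: is_zero_sum(block), number=200)
        print('{0}: any() {1:.4f}s, sum() {2:.4f}s'.format(name, t_any, t_sum))
```

For a block that is entirely zero, both functions have to inspect every byte; the difference should only show up on blocks that contain data, which is presumably the common case in most files.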

Answer 2

First, mmap the file:

import mmap, os, re
f = open(filename, 'rb')  # binary mode, so the mapping is treated as bytes
m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size, prot=mmap.PROT_READ)

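Using a regular expression is convenient:

# finditer() yields match objects; findall() would return plain bytes
for match in re.finditer(b'\0{32768}', m):
    print(match.start())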

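But a plain substring search is faster:

z32k = b'\0' * 32768
start = 0
while True:
    start = m.find(z32k, start)
    if start < 0:
        break
    print(start)
    # Resume after this match; otherwise find() returns the same offset forever
    start += len(z32k)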

Only 32k-aligned blocks:

for match in re.finditer(b'.{32768}', m, re.DOTALL):
    # max() over bytes yields an int, so compare with 0
    if max(match.group()) == 0:
        print(match.start())
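
The snippets above report fixed-size 32768-byte matches. To get `(offset, length)` pairs for whole zero runs, as the question's `empty_blocks` list does, one option is an open-ended quantifier. This is a minimal sketch and not part of the original answer: the function names are hypothetical, and `access=mmap.ACCESS_READ` is used here in place of `prot` as an assumption, so that the same call also works on Windows.

```python
import mmap
import os
import re

MIN_LENGTH = 32000


def find_zero_runs(buf, min_length=MIN_LENGTH):
    """Return (offset, length) for every maximal zero run >= min_length bytes."""
    # The greedy {n,} quantifier consumes the whole run, so each run is one match
    pattern = re.compile(b'\\x00{%d,}' % min_length)
    return [(match.start(), match.end() - match.start())
            for match in pattern.finditer(buf)]


def find_zero_runs_in_file(file_name, min_length=MIN_LENGTH):
    """Memory-map file_name read-only and scan it for long zero runs."""
    with open(file_name, 'rb') as f:
        # An empty file cannot be memory-mapped
        if os.fstat(f.fileno()).st_size == 0:
            return []
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return find_zero_runs(m, min_length)
```

Calling `find_zero_runs_in_file(file_name)` in a loop over the files to check should give the same kind of result list as the code in the question, without a Python-level loop over every byte.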

Finding blocks of zeros larger than 32 KB in files
