Question

Hello, I have a file with the following data structure.

for each 3073 bytes:
<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>
the label is between 0 and 9
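To make the layout concrete, here is a minimal sketch of walking such a file record by record. The helper name `iter_records` is hypothetical, and the sketch assumes the label is stored as a raw byte value (0 to 9) rather than as an ASCII digit; the demo uses 3 pixel bytes per record instead of 3072 to stay readable.

```python
import io

RECORD_SIZE = 1 + 3072  # one label byte followed by 3072 pixel bytes


def iter_records(stream, record_size=RECORD_SIZE):
    """Yield (label, pixels) pairs from a binary stream of fixed-size records."""
    while True:
        record = stream.read(record_size)
        if not record:
            break  # clean end of stream
        if len(record) != record_size:
            raise ValueError("truncated record: got %d bytes" % len(record))
        # Indexing a bytes object yields an int in Python 3.
        yield record[0], record[1:]


# In-memory demo: two records with 3 pixel bytes each (assumed sizes).
demo = io.BytesIO(bytes([2, 10, 11, 12, 1, 20, 21, 22]))
labels = [label for label, _ in iter_records(demo, record_size=4)]
```

Raising on a short final record is a design choice here: it surfaces a file whose size is not a multiple of the record size instead of silently passing the fragment along.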

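Now I need to write a Python script that reads the file and checks each 3073-byte record. If the label is "1", the whole 3073-byte record (label and pixels) should be deleted.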

ex: 2 <1st 3072 bytes> 1 <2nd 3072 bytes> 9 <3rd 3072 bytes>....
after running the script:
    output:  2 <1st 3072 bytes> 9 <3rd 3072 bytes>....

My current solution is:

1. loop over the file, checking every 3073 bytes
   if the label is 1:
       then add the record's index to a buffer
2. create a new file
   loop over every 3073 bytes again
   if this record's index is in the buffer:
       then skip it
   else:
       write it to the new file

But I find this very inefficient. Is there a smarter solution?

Answer 1

This should be fairly fast (a few seconds at most for a 150 MB file) and never holds much data in memory:

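chunk_size = 3072  # pixel bytes per record, not counting the label byte
# Label to drop. The example below uses the ASCII character '1'; if the
# labels are stored as raw byte values, use b'\x01' instead.
label_to_drop = b'1'

with open('newpixels.bin', 'wb') as new_file:
    with open('pixels.bin', 'rb') as data:
        while True:
            # Read one record: 1 label byte followed by chunk_size pixel bytes.
            label_and_pixels = data.read(1 + chunk_size)
            if not label_and_pixels:
                break  # end of file
            # Compare a one-byte slice: in Python 3, indexing bytes gives an int.
            if label_and_pixels[:1] != label_to_drop:
                new_file.write(label_and_pixels)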

With pixels.bin containing:

1XXX2YYY2ZZZ3AAA1BBB2CCC

and chunk_size set to 3, the output is:

2YYY2ZZZ3AAA2CCC
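The small example above can be reproduced in memory without touching any files. This is only a sketch: the function name `drop_label` is hypothetical, and it assumes the label is the ASCII character '1', as in the sample input.

```python
import io


def drop_label(src, dst, label, chunk_size):
    """Copy records from src to dst, skipping those whose first byte equals label."""
    while True:
        record = src.read(1 + chunk_size)
        if not record:
            break  # end of input
        # Slicing keeps the comparison bytes-to-bytes on Python 3.
        if record[:1] != label:
            dst.write(record)


src = io.BytesIO(b"1XXX2YYY2ZZZ3AAA1BBB2CCC")
dst = io.BytesIO()
drop_label(src, dst, label=b"1", chunk_size=3)
result = dst.getvalue()  # b"2YYY2ZZZ3AAA2CCC"
```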

Once you are sure the algorithm is correct and the output data is fine, you can delete 'pixels.bin' and rename 'newpixels.bin' to 'pixels.bin' at the end of the script.
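One possible way to do that final step is `os.replace`, which renames the new file over the old one in a single call, so no separate delete is needed. The sketch below runs in a temporary directory with tiny stand-in files; the file contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "pixels.bin")
    filtered = os.path.join(tmp, "newpixels.bin")

    # Stand-in data: the original file and its already filtered copy.
    with open(original, "wb") as f:
        f.write(b"1XXX2YYY")
    with open(filtered, "wb") as f:
        f.write(b"2YYY")

    # Overwrite the original with the filtered file.
    os.replace(filtered, original)

    with open(original, "rb") as f:
        final = f.read()
    leftover = os.path.exists(filtered)
```

Keeping a backup of the original until the output has been verified is probably wiser than replacing it immediately.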

Answer 2

The following algorithm might be a bit better:

1. loop over the file, checking every 3073 bytes
   if the label is 1:
       continue
   else:
       write these 3073 bytes to the new file
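The single pass described above could be sketched as follows. The function name `filter_file` and its parameters are hypothetical, and the sketch assumes the label is a raw byte value (so label 1 is the byte 0x01, not the character '1'). The demo uses 4-byte records in a temporary directory.

```python
import os
import tempfile
from functools import partial


def filter_file(src_path, dst_path, drop_label=1, record_size=3073):
    """Copy src to dst in one pass, skipping records whose label equals drop_label.

    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for record in iter(partial(src.read, record_size), b""):
            if record[0] == drop_label:  # record[0] is an int in Python 3
                dropped += 1
                continue
            dst.write(record)
            kept += 1
    return kept, dropped


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "pixels.bin")
    dst_path = os.path.join(tmp, "newpixels.bin")
    with open(src_path, "wb") as f:
        f.write(bytes([2, 0, 0, 0, 1, 9, 9, 9, 9, 5, 5, 5]))
    counts = filter_file(src_path, dst_path, drop_label=1, record_size=4)
    with open(dst_path, "rb") as f:
        output = f.read()
```

Only one record is held in memory at a time, and no index buffer is needed because the keep-or-skip decision is made as each record is read.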
