Hello, I have a file with the following data structure.
for each 3073 bytes:
<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>
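The label is between 0 and 9.
Now I need to write a Python script that reads the file and checks every 3073-byte record. If the label is "1", the whole 3073-byte record (label and pixels) should be deleted.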
Example input: 2 <1st 3072 bytes> 1 <2nd 3072 bytes> 9 <3rd 3072 bytes>....
After running the script:
Output: 2 <1st 3072 bytes> 9 <3rd 3072 bytes>....
My current solution is:
1. Loop over the file and check every 3073 bytes:
       if the label is 1:
           put the index into a buffer
2. Create a new file:
       loop over each 3073 bytes:
           if the index of this 3073-byte record is in the buffer:
               skip it
But I find this very inefficient. Is there a smarter solution?
Answer 0 (score: 1)
This should be fairly fast (a few seconds at most for a 150 MB file) and never holds much data in memory:
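    chunk_size = 3072  # pixel bytes per record; each record also has 1 label byte

    with open('newpixels.bin', 'wb') as new_file:
        with open('pixels.bin', 'rb') as data:
            while True:
                # Slice (label_and_pixels[:1]) rather than index: on Python 3,
                # label_and_pixels[0] is an int and would never equal '1'.
                # This assumes an ASCII '1' label; use b'\x01' instead if the
                # label is stored as a raw byte value.
                label_and_pixels = data.read(1 + chunk_size)
                if not label_and_pixels:
                    break  # end of file
                elif label_and_pixels[:1] != b'1':
                    new_file.write(label_and_pixels)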
With pixels.bin containing:
    1XXX2YYY2ZZZ3AAA1BBB2CCC
and chunk_size set to 3, the output is:
    2YYY2ZZZ3AAA2CCC
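The toy example above can be checked entirely in memory. The following is only a minimal sketch, not part of the original answer: the helper name `filter_records`, its `skip_label` parameter, and the use of `io.BytesIO` in place of real files are assumptions made for illustration, and it assumes the labels are ASCII digits as in the sample input.

```python
import io


def filter_records(src, dst, chunk_size, skip_label=b'1'):
    """Copy (1 + chunk_size)-byte records from src to dst,
    dropping every record whose label byte equals skip_label."""
    record_size = 1 + chunk_size
    while True:
        record = src.read(record_size)
        if not record:
            break  # end of input
        # record[:1] is a bytes object of length 1, so it can be
        # compared directly with skip_label.
        if record[:1] != skip_label:
            dst.write(record)


# Same input as the toy example, with chunk_size set to 3.
src = io.BytesIO(b'1XXX2YYY2ZZZ3AAA1BBB2CCC')
dst = io.BytesIO()
filter_records(src, dst, chunk_size=3)
result = dst.getvalue()
print(result)  # b'2YYY2ZZZ3AAA2CCC'
```

For the real file, the same function could be called with two open file objects and chunk_size=3072.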
Once you are sure the algorithm is correct and the output data looks fine, you can delete 'pixels.bin' and rename 'newpixels.bin' to 'pixels.bin' at the end of the script.
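The final swap could look roughly like the sketch below. It is an illustration only: the helper name `replace_original` is hypothetical, and the demonstration runs in a temporary directory with made-up contents so that no real data is touched. It assumes Python 3.3 or later, where `os.replace` overwrites an existing destination.

```python
import os
import tempfile


def replace_original(new_path, original_path):
    """Replace original_path with new_path in a single step."""
    # os.replace overwrites the destination if it already exists,
    # so a separate os.remove(original_path) is not needed.
    os.replace(new_path, original_path)


# Demonstration with throwaway files (contents are invented).
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, 'pixels.bin')
filtered = os.path.join(workdir, 'newpixels.bin')
with open(original, 'wb') as f:
    f.write(b'1XXX2YYY')
with open(filtered, 'wb') as f:
    f.write(b'2YYY')

replace_original(filtered, original)

with open(original, 'rb') as f:
    final_contents = f.read()
leftover_exists = os.path.exists(filtered)
```

Doing the swap only after the filtered file has been fully written and closed means the original stays intact if the script fails partway through.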
Answer 1 (score: 0)
The following algorithm might be a little better:
1. Loop over the file and check every 3073 bytes:
       if the label is 1:
           continue
       else:
           write the bytes to a new file (?)
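One possible reading of this pseudocode in Python 3 is sketched below. It is not the answerer's code: the function name `copy_without_label`, the returned count, and the synthetic test data are assumptions for illustration. Unlike the ASCII toy example in the other answer, this sketch assumes the label is stored as a raw byte value (0 to 9), so the record to drop starts with b'\x01'; whether that matches the actual file is not stated in the question.

```python
import io
from functools import partial

RECORD_SIZE = 1 + 3072  # one label byte followed by 3072 pixel bytes


def copy_without_label(src, dst, label, record_size=RECORD_SIZE):
    """Stream records from src to dst, skipping those whose first
    byte equals label. Returns the number of records written."""
    kept = 0
    # iter(callable, sentinel) keeps calling src.read(record_size)
    # until it returns b'' (end of input).
    for record in iter(partial(src.read, record_size), b''):
        if record[:1] == label:
            continue  # drop the label byte and its pixels together
        dst.write(record)
        kept += 1
    return kept


# Three synthetic records with raw byte labels 2, 1 and 9 and zeroed pixels.
data = b''.join(bytes([lab]) + bytes(3072) for lab in (2, 1, 9))
src = io.BytesIO(data)
dst = io.BytesIO()
kept = copy_without_label(src, dst, label=b'\x01')
out = dst.getvalue()
```

Because each record is written as soon as it is read, only one 3073-byte record is held in memory at a time and no index buffer is needed.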