Question

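I wrote 3 separate test cases for a program I am writing. Unfortunately, I filled up my hard drive with them, probably around 300+ GB. I would like to take a sample from each test case file and delete the rest of the file.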

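I know how to use 'readline' to read lines in place without consuming memory, so I could take a line from each file and put it into a new file, then point the start of the file at the next line instead of the first one, thereby freeing storage.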

Is this possible with the Python libraries?

Edit: removed sed, since it creates a temporary file.

Answer 1

I would like to take a sample from each test case file and delete the rest of the file.

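Read the file line by line from the top. Write the pieces you want to keep at the beginning of the file. Keep track of two offsets: the current offset, where the sample written so far ends, and the unread offset, where the part you have not read yet begins.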

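If a copied piece may overlap its destination, use an algorithm similar to memmove(fp+current_offset, fp+unread_offset, count): "copy bytes forward from the beginning of the buffer". After the copy: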

current_offset += count
unread_offset += count

Continue until you have enough samples, then call file.truncate(current_offset) to remove everything after the samples in the file.
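The memmove-style copy mentioned above can be done in fixed-size chunks so that memory use stays bounded. The following is a minimal sketch, not part of the original answer: `move_forward` and its `bufsize` parameter are hypothetical names, and it assumes the destination lies at or before the source, which is always the case when compacting kept data toward the start of a file.

```python
import os
import tempfile


def move_forward(f, dst, src, count, bufsize=64 * 1024):
    """Copy count bytes from offset src to offset dst (dst <= src) within f.

    Hypothetical helper: copies front to back in chunks, so an overlapping
    source and destination are handled the same way memmove would.
    """
    if dst > src:
        raise ValueError("this sketch only handles dst <= src")
    copied = 0
    while copied < count:
        f.seek(src + copied)
        chunk = f.read(min(bufsize, count - copied))
        if not chunk:  # reached EOF before count bytes were available
            break
        f.seek(dst + copied)
        f.write(chunk)
        copied += len(chunk)
    return copied


# Demo on a small temporary file: drop the 6 bytes "DROPME" and close the gap.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"keepDROPMEtail")
    path = tmp.name

with open(path, "r+b") as f:
    current_offset, unread_offset = 4, 10
    n = move_forward(f, current_offset, unread_offset, 4)
    current_offset += n
    unread_offset += n
    f.truncate(current_offset)  # discard everything after the kept data

with open(path, "rb") as f:
    result = f.read()
os.remove(path)
```

Copying front to back is safe here because every write ends at or before the next byte that still has to be read.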

For example, if you want to keep a random half of the lines:

#!/usr/bin/env python
import random

with open('big-big.file', 'r+b') as file:
    current_offset = file.tell()
    while True:
        line = file.readline() # b'\n'-separated lines
        if not line: # EOF
            break
        if random.random() < 0.5: # keep the line
            unread_offset = file.tell()
            file.seek(current_offset)
            file.write(line)
            current_offset = file.tell()
            file.seek(unread_offset)
    file.truncate(current_offset)
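The same loop can stop early once enough samples have been collected, so the rest of a very large file is never read. This is a hedged variation rather than the answer's own code: `sample_in_place`, the `keep` predicate and `max_lines` are hypothetical names introduced for illustration, and the predicate is deterministic here only so the result is easy to check.

```python
import os
import tempfile


def sample_in_place(path, keep, max_lines):
    """Keep up to max_lines lines for which keep(index, line) is true.

    Kept lines are compacted to the start of the file; everything else,
    including the unread tail, is removed by the final truncate().
    """
    kept = 0
    with open(path, "r+b") as f:
        current_offset = 0  # end of the sample written so far
        index = 0
        while kept < max_lines:
            line = f.readline()  # b'\n'-separated lines
            if not line:  # EOF
                break
            if keep(index, line):
                unread_offset = f.tell()
                f.seek(current_offset)
                f.write(line)
                current_offset = f.tell()
                f.seek(unread_offset)
                kept += 1
            index += 1
        f.truncate(current_offset)
    return kept


# Demo: ten lines, keep every third one, stop after three samples.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(b"line %d\n" % i for i in range(10)))
    path = tmp.name

kept = sample_in_place(path, lambda i, line: i % 3 == 0, max_lines=3)
with open(path, "rb") as f:
    sample = f.read()
os.remove(path)
```

Passing `lambda i, line: random.random() < 0.5` as `keep` would reproduce the random selection used above, with the added cap on the number of lines.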

Answer 2

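Try rewriting the file with mmap. I didn't want to fill up my own hard drive, so I have not tested this on anything large. It is a runnable example, but you will want to strip out the setup and validation code at the start and end that I used for testing.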

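This writes the first 50% of each file (extended to the next line ending) to a new file, then trims that part off the original. I'm not sure this is the order you wanted!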

import mmap
import os
from glob import glob

files_to_trim = 'deleteme*'
fraction_to_keep = .5
blocksize = 128*1024

# make test files (closed explicitly so the data is on disk before mapping)
with open('deleteme1', 'w') as f:
    f.writelines('all work and no play {}\n'.format(i) for i in range(6))
with open('deleteme2', 'w') as f:
    f.writelines('all work and no play {}\n'.format(i) for i in range(10, 18))


with open('output', 'wb') as out:
    for filename in sorted(glob(files_to_trim)):
        st_size = os.stat(filename).st_size
        sample_size = int(st_size * fraction_to_keep)
        with open(filename, 'r+b') as infile:
            memfile = mmap.mmap(infile.fileno(), 0)
            # find next line ending
            need_newline = False
            count = memfile.find(b'\n', sample_size)
            if count >= 0:
                count += 1 # account for \n
            else:
                count = st_size
                # add a newline only if the file does not already end with one
                need_newline = memfile[-1:] != b'\n'
            # copy blocks to outfile
            for rpos in range(0, count, blocksize):
                out.write(memfile[rpos:min(rpos+blocksize, count)])
            if need_newline:
                out.write(b'\n')  # 'out' is opened in binary mode
            # trim infile
            remaining = st_size - count
            # shift the tail to the front without building a copy in memory
            memfile.move(0, count, remaining)
            memfile.flush()
            memfile.close()
            infile.truncate(remaining)
            infile.flush()

# validate test file
print('deleteme1:')
print(open('deleteme1').read())
print('deleteme2:')
print(open('deleteme2').read())
print('output:')
print(open('output').read())
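With the test scaffolding stripped away, the trimming step on its own might look like the sketch below. It is an illustration under assumptions, not the answer's code: `drop_prefix` is a hypothetical helper name, and it relies on `mmap.move`, which shifts bytes inside the mapping and tolerates overlapping ranges.

```python
import mmap
import os
import tempfile


def drop_prefix(path, count):
    """Remove the first count bytes of the file at path, in place.

    Returns the number of bytes left in the file.
    """
    size = os.path.getsize(path)
    count = min(count, size)
    remaining = size - count
    with open(path, "r+b") as f:
        # An empty file cannot be mapped, and there is nothing to move
        # when either side is empty, so skip the mapping in those cases.
        if remaining and count:
            with mmap.mmap(f.fileno(), 0) as m:
                m.move(0, count, remaining)  # memmove-style shift to the front
                m.flush()
        # The mapping is closed before truncating the underlying file.
        f.truncate(remaining)
    return remaining


# Demo: drop the first line of a two-line temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first half\nsecond half\n")
    path = tmp.name

left = drop_prefix(path, len(b"first half\n"))
with open(path, "rb") as f:
    content = f.read()
os.remove(path)
```

Closing the mapping before `truncate` matters on platforms that refuse to resize a file while it is still mapped.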

How do I delete lines from a file 'in place' without creating a temporary file?
