Split a file at a blank line after X lines

Time: 2017-02-28 21:33:17

Tags: python

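I need to split a large text file into smaller pieces, and the file contains data that has to stay together. Each block of related data is separated from the next one by a blank line, like this: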

Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1

More Data, belonging to chunk 2
More Data, belonging to chunk 2
More Data, belonging to chunk 2

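How can I specify a number of lines and then have the file split at the next blank line, so that the data blocks stay intact? I would like to use Python, but I don't know how to apply a split function after X lines.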

2 answers:

Answer 0: (score: 2)

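from itertools import groupby

myfile = "test.txt"  # placeholder path to the input file

# Group consecutive lines by whether they are blank; keep only the non-blank runs.
with open(myfile, 'r') as f:
    chunks = [[x.strip() for x in v] for k, v in
              groupby(f, lambda x: bool(x.strip())) if k]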
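
To see what kind of list this grouping produces, here is a small sketch that runs it over an in-memory sample instead of a file. The helper name split_on_blank_lines and the sample text are made up for illustration. The key function only reports whether a line has content, so consecutive non-blank lines land in the same group. Note that this splits at every blank line and does not take a minimum number of lines into account.

```python
from io import StringIO
from itertools import groupby


def split_on_blank_lines(lines):
    """Return a list of chunks; each chunk is a list of stripped, non-blank lines."""
    # The key is True for lines with content and False for blank lines,
    # so each run of consecutive non-blank lines forms one group.
    return [[line.strip() for line in group]
            for has_text, group in groupby(lines, key=lambda line: bool(line.strip()))
            if has_text]


# Hypothetical sample data standing in for the real file.
sample = StringIO("a1\na2\n\nb1\nb2\nb3\n")
chunks = split_on_blank_lines(sample)
print(chunks)  # [['a1', 'a2'], ['b1', 'b2', 'b3']]
```

Any iterable of lines works here, so an open file object can be passed in place of the StringIO.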

Answer 1: (score: 0)

If you want to write each chunk to a new file chunk1.txt ... chunkN.txt, you could do it like this:

def chunk_file(name, lines_per_chunk, chunks_per_file):

    def write_chunk(chunk_no, chunk):
        # Write the collected lines to chunk<chunk_no>.txt
        with open("chunk{}.txt".format(chunk_no), "w") as outfile:
            outfile.write("".join(chunk))

    # count is the 1-based position of the current line within its chunk
    count, chunk_no, chunk_count, chunk = 1, 1, 0, []
    with open(name, "r") as f:
        for row in f:
            # A chunk ends at the first blank line after lines_per_chunk lines
            if count > lines_per_chunk and row == "\n":
                chunk_count += 1
                count = 1
                chunk.append("\n")
                if chunk_count == chunks_per_file:
                    write_chunk(chunk_no, chunk)
                    chunk = []
                    chunk_count = 0
                    chunk_no += 1
            else:
                count += 1
                chunk.append(row)
    # Write whatever is left after the last blank line
    if chunk:
        write_chunk(chunk_no, chunk)

chunk_file("test.txt", 3, 1)

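You have to specify the number of lines that belong to a chunk; the chunk then ends at the next blank line after that many lines.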

Suppose you want to chunk this file:

Some Data belonging to chunk 1

Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1

More Data, belonging to chunk 2
More Data, belonging to chunk 2
More Data, belonging to chunk 2

The first chunk has a very different number of lines from the second (7 lines versus 3).

The output for this example is chunk1.txt:

Some Data belonging to chunk 1

Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1

and chunk2.txt:

More Data, belonging to chunk 2
More Data, belonging to chunk 2
More Data, belonging to chunk 2

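This approach treats lines_per_chunk as the minimum chunk size, so it works even when the chunks have different numbers of lines. Once the minimum chunk size has been reached, we simply look for the next blank line to end the chunk. In the example above, the blank line on line 2 causes no problem because the minimum chunk size has not been reached yet. A blank line on line 4, followed by more data from the same chunk, would be a problem, because the given criteria (line count and blank line) are not enough on their own to identify the chunk boundaries.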
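
The minimum-size rule can be tried out without writing any files. The following is a minimal sketch, not the code from the answer: iter_chunks is a hypothetical helper that applies the same rule (close a chunk at the first blank line once at least min_lines lines have been read) and yields each chunk as a list of lines. The two sample inputs are invented to mirror the good case and the problem case described here.

```python
def iter_chunks(lines, min_lines):
    """Yield lists of lines; a chunk ends at the first blank line seen
    after at least min_lines lines have been collected before it."""
    chunk = []
    for line in lines:
        chunk.append(line)
        # len(chunk) counts the blank line itself, hence the strict ">".
        if len(chunk) > min_lines and line.strip() == "":
            yield chunk
            chunk = []
    if chunk:
        # Remaining lines after the last chunk-ending blank line.
        yield chunk


# Good case: an early blank line (line 2) comes before the minimum is reached.
good = ["a\n", "\n", "a\n", "a\n", "a\n", "a\n", "a\n", "\n", "b\n", "b\n", "b\n"]
print([len(c) for c in iter_chunks(good, 3)])  # [8, 3]

# Problem case: a blank line on line 4 inside chunk "a" ends the chunk too early,
# and the rest of "a" is merged with the start of "b".
bad = ["a\n", "a\n", "a\n", "\n", "a\n", "a\n", "\n", "b\n"]
print(list(iter_chunks(bad, 3)))
```

In the problem case the first chunk is cut off after the blank line on line 4, which is the ambiguity the line-count-plus-blank-line criteria cannot resolve.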