Question

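I need to split a large text file into smaller chunks, and the file contains data that has to stay together. Each block of related data is separated from the next by a blank line, like this: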

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2

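How can I specify a number of lines after which the file is split at the next blank line, so that each data block stays intact? I would like to use Python, but I don't know how to apply a split function after X lines.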

Answer 1

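from itertools import groupby

myfile = "test.txt"  # placeholder path; replace with your own file

# Group consecutive lines by whether they are blank, and keep only the
# non-blank runs. Each run becomes one chunk (a list of stripped lines).
with open(myfile, 'r') as f:
    chunks = [[x.strip() for x in v] for k, v in
              groupby(f, key=lambda x: bool(x.strip())) if k]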
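
To see what blank-line grouping produces, here is a minimal sketch that runs the same idea on an in-memory sample instead of a real file. The helper name split_paragraphs and the sample text are illustrative assumptions, not part of the answer. Note that this approach only splits at blank lines; it does not take a minimum number of lines into account.

```python
from io import StringIO
from itertools import groupby


def split_paragraphs(lines):
    """Return a list of chunks; each chunk is a list of stripped, non-blank lines."""
    return [
        [line.strip() for line in group]
        for is_data, group in groupby(lines, key=lambda line: bool(line.strip()))
        if is_data  # skip the runs of blank lines between chunks
    ]


# Hypothetical sample: two blocks separated by one blank line.
sample = StringIO("a1\na2\n\nb1\nb2\nb3\n")
print(split_paragraphs(sample))  # [['a1', 'a2'], ['b1', 'b2', 'b3']]
```

Because the grouping key is only "blank or not blank", several consecutive blank lines are treated as a single separator.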

Answer 2

If you want to write each chunk to its own new file, chunk1.txt ... chunkN.txt, you could do it like this:

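def chunk_file(name, lines_per_chunk, chunks_per_file):
    """Split the file `name` into chunkN.txt files, cutting only at blank lines."""

    def write_chunk(chunk_no, chunk):
        with open("chunk{}.txt".format(chunk_no), "w") as outfile:
            outfile.write("".join(chunk))

    # count: 1-based position of the next line within the current chunk
    # chunk_count: completed chunks buffered for the current output file
    count, chunk_no, chunk_count, chunk = 1, 1, 0, []
    with open(name, "r") as f:
        for row in f:
            # A blank line ends the chunk only once the minimum size is reached.
            if count > lines_per_chunk and row == "\n":
                chunk_count += 1
                count = 1
                chunk.append("\n")
                if chunk_count == chunks_per_file:
                    write_chunk(chunk_no, chunk)
                    chunk = []
                    chunk_count = 0
                    chunk_no += 1
            else:
                count += 1
                chunk.append(row)
    # Write whatever is left if the file does not end with a blank line.
    if chunk:
        write_chunk(chunk_no, chunk)

chunk_file("test.txt", 3, 1)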

You have to specify the number of lines that belong to a chunk; the chunk ends at the next blank line after that.

Suppose you want to chunk this file:

Some Data belonnging to chunk 1

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2

The first chunk has a very different number of lines from the second (7 lines versus 3).

The output for this example is chunk1.txt:

Some Data belonnging to chunk 1

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

chunk2.txt:

More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2

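This approach treats lines_per_chunk as the minimum chunk size, so it works even when chunks have different numbers of lines. Once the minimum chunk size has been reached, we simply look for the next blank line to end the chunk. In the example above, the blank line on line 2 causes no problem, because the minimum chunk size has not been reached yet. If a blank line appeared on line 4 and the chunk's data continued after it, there would be a problem, because the given criteria (line count and blank line) cannot identify the chunk on their own.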
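
The minimum-size rule can be sketched as a generator that yields chunks instead of writing files, which makes the behaviour easy to check. This is not the answer's own code: iter_chunks and min_lines are hypothetical names, and the sketch assumes that any whitespace-only line counts as blank and that the separating blank line is dropped rather than kept in the chunk.

```python
def iter_chunks(lines, min_lines):
    """Yield lists of lines, cutting at the first blank line seen
    once the current chunk already holds at least min_lines lines."""
    chunk = []
    for line in lines:
        if len(chunk) >= min_lines and not line.strip():
            # Minimum size reached and the line is blank: close the chunk.
            yield chunk
            chunk = []
        else:
            # Early blank lines are kept inside the chunk, as in chunk_file.
            chunk.append(line)
    if chunk:
        # Trailing data without a final blank line still forms a chunk.
        yield chunk


# Hypothetical data shaped like the example: an early blank line on line 2.
ok = ["A\n", "\n", "A\n", "A\n", "A\n", "\n", "B\n", "B\n", "B\n"]
print([len(c) for c in iter_chunks(ok, 3)])   # [5, 3]

# The problem case: a blank line on line 4 while the data continues.
bad = ["A\n", "A\n", "A\n", "\n", "A\n"]
print([len(c) for c in iter_chunks(bad, 3)])  # [3, 1]
```

In the second call the block is cut in two, which is the ambiguity described above: a line count plus a blank line is not enough to tell an internal blank line from a separator.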

Split a file after X lines at a blank line
