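I need to split a large text file into smaller pieces, but the file contains data that has to stay together. Each related block of data is separated from the next by a blank line, like this: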
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
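How can I specify a number of lines and then have the file split at the next blank line, so that each block of data stays intact? I would like to use Python, but I don't know how to apply a split function after X lines.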
Answer 0: (score: 2)
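from itertools import groupby

# myfile is the path to the input file
with open(myfile, 'r') as f:
    # Key on "line has text" so each run of non-blank lines forms one chunk
    chunks = [[x.strip() for x in v] for k, v in
              groupby(f, lambda x: bool(x.strip())) if k]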
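To also honor the "X lines" part of the question, the blank-line-separated blocks could be packed into pieces once they are grouped. The following is only a sketch under assumptions: `read_blocks`, `pack_blocks`, and `min_lines` are hypothetical names that do not come from the question, and the input is an in-memory list of lines rather than a file.

```python
from itertools import groupby


def read_blocks(lines):
    """Yield each run of consecutive non-blank lines as a list of stripped strings."""
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield [line.strip() for line in group]


def pack_blocks(blocks, min_lines):
    """Combine whole blocks into pieces holding at least min_lines lines each."""
    piece, size = [], 0
    for block in blocks:
        piece.append(block)
        size += len(block)
        if size >= min_lines:
            yield piece
            piece, size = [], 0
    if piece:  # leftover blocks that never reached min_lines
        yield piece


# Hypothetical sample data: three blocks separated by blank lines
sample = ["a1\n", "a2\n", "\n", "b1\n", "\n", "c1\n", "c2\n", "c3\n"]
pieces = list(pack_blocks(read_blocks(sample), min_lines=3))
```

Because a block is never cut in the middle, a piece can end up longer than `min_lines`, and the last piece can be shorter.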
Answer 1: (score: 0)
If you want to write each chunk to a new file, chunk1.txt ... chunkN.txt, you can do it like this:
def chunk_file(name, lines_per_chunk, chunks_per_file):

    def write_chunk(chunk_no, chunk):
        # Write the collected rows to chunk<chunk_no>.txt
        with open("chunk{}.txt".format(chunk_no), "w") as outfile:
            outfile.write("".join(chunk))

    count, chunk_no, chunk_count, chunk = 1, 1, 0, []
    with open(name, "r") as f:
        for row in f:
            # A blank line ends the chunk only once the minimum size is reached
            if count > lines_per_chunk and row == "\n":
                chunk_count += 1
                count = 1
                chunk.append("\n")
                if chunk_count == chunks_per_file:
                    write_chunk(chunk_no, chunk)
                    chunk = []
                    chunk_count = 0
                    chunk_no += 1
            else:
                count += 1
                chunk.append(row)
    # Write whatever is left after the last blank line
    if chunk:
        write_chunk(chunk_no, chunk)

chunk_file("test.txt", 3, 1)
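You have to specify the number of lines that belong to a chunk, after which a blank line is expected.
Suppose you want to chunk this file: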
Some Data belonnging to chunk 1

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
The first chunk has far more lines than the second (7 lines vs. 3 lines).
The output for this example is chunk1.txt:
Some Data belonnging to chunk 1

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
chunk2.txt:
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
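This approach treats lines_per_chunk as the minimum chunk size, so it works even when chunks have different numbers of lines. Once the minimum chunk size has been reached, we simply look for the next blank line to end the chunk. In the example above, the blank line at line 2 is not a problem, because the minimum chunk size has not been reached yet. If a blank line appears at line 4 and the chunk's data continues after it, there is a problem, because the given criteria (line count and blank line) cannot identify the chunks on their own.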
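The behavior described above can be reproduced in memory. The sketch below is not the function from this answer but a simplified stand-in: `split_by_min_lines` is a hypothetical name, it works on a list of lines instead of a file, it drops the separating blank line instead of writing it out, and it treats any whitespace-only line as blank.

```python
def split_by_min_lines(lines, min_lines):
    """End a chunk at the first blank line seen once it holds min_lines rows."""
    chunks, current = [], []
    for line in lines:
        if len(current) >= min_lines and line.strip() == "":
            # Minimum size reached and a blank line found: close the chunk
            chunks.append(current)
            current = []
        else:
            # Blank lines seen before the minimum size stay inside the chunk
            current.append(line)
    if current:
        chunks.append(current)
    return chunks


# Blank line at line 2: kept inside chunk 1, since the minimum is not reached
ok = ["a", "", "a", "a", "a", "a", "a", "", "b", "b", "b"]
ok_sizes = [len(c) for c in split_by_min_lines(ok, 3)]

# Blank line at line 4 while chunk 1 continues: chunk 1 is cut too early,
# and its remaining rows are merged with chunk 2
bad = ["a", "a", "a", "", "a", "a", "", "b", "b", "b"]
bad_chunks = split_by_min_lines(bad, 3)
```

In the second case the blank line and the line count alone cannot tell whether the block has ended, which is the limitation noted above.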