Question

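I regularly use PowerShell to split larger text or csv files into smaller files for quicker processing. However, I have a few files that come in an unusual format. These are basically print files written to a text file. Each record starts with a line that contains a 1 and nothing else.

What I need to do is split a file based on the number of statements. So if I want to split the file into chunks of 3000 statements, I would scan down until I see the 3001st occurrence of 1 in position 1 and copy everything before it to a new file. I can run this from Windows, Linux or OS X, so pretty much any tool is fine for the split.

Any ideas would be greatly appreciated.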

Answer 1

Maybe try to recognize it by the fact that it is a '1' followed by a newline?

with open(input_file, 'r') as f:
    my_string = f.read()

my_list = my_string.split('\n1\n')

This splits the records into a list, assuming the file has the following format:

1
....
....
1
....
....
....

Then you can write each element of the list to a separate file.

for x, record in enumerate(my_list):
    # write each record to its own file: 0.txt, 1.txt, ...
    with open(str(x) + '.txt', 'w') as out:
        out.write(record)
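The loop above writes one file per record, whereas the question asks for 3000 records per file. A minimal sketch of how the same split-based idea could be batched follows. The helper name chunk_records is hypothetical, and it assumes the text starts with a '1' line and contains no empty records (two marker lines in a row would share a newline and be missed by split()).

```python
def chunk_records(text, records_per_file=3000):
    """Yield strings that each hold up to records_per_file records."""
    sep = '\n1\n'
    records = text.split(sep)
    for i in range(0, len(records), records_per_file):
        chunk = sep.join(records[i:i + records_per_file])
        if i > 0:
            chunk = '1\n' + chunk   # restore the marker line removed by split()
        if i + records_per_file < len(records):
            chunk += '\n'           # restore the newline removed by split()
        yield chunk
```

Each yielded string can then be written to its own file, and concatenating all of them gives back the original text.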

Answer 2

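To avoid loading the whole file into memory, you could define a function that generates records incrementally and then use the grouper recipe from itertools to write every 3000 records to a new file: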

#!/usr/bin/env python3
from itertools import zip_longest

with open('input.txt') as input_file:
    # grouper recipe: collect the record iterators 3000 at a time
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(''.join(lines)
                                   for r in records for lines in r)

where generate_records() yields one record at a time, and each record is itself an iterator over lines in the input file:

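from itertools import chain

def generate_records(input_file, start='1\n'):
    pending_start = False  # a start line was read but not yet yielded
    eof = False

    def record():
        nonlocal pending_start, eof
        if pending_start:  # emit the start line consumed by the previous record
            pending_start = False
            yield start
        for line in input_file:
            if line == start:  # start of the next record
                pending_start = True
                break
            yield line
        else:  # EOF
            eof = True

    # the first record may include lines before the first 1\n
    yield chain(record(), record())
    # records share input_file, so they must be consumed in order
    while not eof:
        yield record()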

generate_records() is a generator that yields generators, like itertools.groupby() does.
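The comparison with itertools.groupby() matters in practice: the inner iterators share one underlying source, so each should be consumed before the outer generator is advanced. The following small illustration uses groupby() itself. The function names are made up for this example.

```python
from itertools import groupby

def consume_in_order(items):
    """Materialise each group before asking groupby() for the next one."""
    return [(key, list(group)) for key, group in groupby(items)]

def consume_late(items):
    """Advance groupby() first, then try to read the earlier group."""
    groups = groupby(items)
    _, first = next(groups)
    next(groups)          # advancing makes `first` no longer usable
    return list(first)
```

Reading each group as it arrives gives the expected contents, while reading a group after advancing yields nothing. The record iterators in this answer carry the same constraint: read them in order.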

For performance reasons, you could read/write chunks of multiple lines at a time.
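One possible reading of that remark, sketched under assumptions: collect all lines of an output file in a list and pass them to writelines() in a single call. This is not the answer's own code. The names split_by_records and write_chunks and the output file pattern are hypothetical. It keeps up to records_per_file records in memory at once, trading memory for fewer write calls.

```python
def split_by_records(lines, records_per_file=3000, start='1\n'):
    """Yield lists of lines, each holding up to records_per_file records."""
    chunk, count = [], 0
    for line in lines:
        if line == start:
            if count == records_per_file:  # chunk is full, emit it
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:  # leftover records at the end of the input
        yield chunk

def write_chunks(input_path, pattern='output{n}.txt', records_per_file=3000):
    """Write each chunk of records to its own numbered file."""
    with open(input_path) as input_file:
        chunks = split_by_records(input_file, records_per_file)
        for n, chunk in enumerate(chunks):
            with open(pattern.format(n=n), 'w') as output_file:
                output_file.writelines(chunk)
```

Lines that appear before the first '1' line end up in the first chunk.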

Split a file based on the number of occurrences of 1 in position 1 of a line
