Question

I have a string in Python that holds the contents of a large text file (over 1 MiB). I need to split it into chunks.

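Constraints:

Chunks may only be split at newline characters, and
len(chunk) must be as large as possible but less than LIMIT (i.e. 100 KiB).

Lines longer than LIMIT may be omitted.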

Any idea how to implement this nicely in Python?

Thanks in advance.

Answer 1

Here is my not-so-Pythonic solution:

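def line_chunks(lines, chunk_limit):
    """Group lines into chunks whose joined length stays below chunk_limit."""
    chunks = []
    chunk = []
    chunk_len = 0
    for line in lines:
        if len(line) >= chunk_limit:
            # A line that cannot fit into any chunk is skipped.
            continue
        # Joining with '\n' adds one separator for every line after the first.
        added = len(line) + (1 if chunk else 0)
        if chunk_len + added < chunk_limit:
            chunk.append(line)
            chunk_len += added
        else:
            chunks.append(chunk)
            chunk = [line]
            chunk_len = len(line)
    if chunk:
        chunks.append(chunk)
    return chunks

# data is the large input string from the question
chunks = line_chunks(data.split('\n'), 150)
print('\n---new-chunk---\n'.join('\n'.join(chunk) for chunk in chunks))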

Answer 2

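As Linuxios suggested, you can use rfind to locate the last newline within the limit and split there. If no newline is found, the line is too long for a chunk and can be discarded.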

chunks = []

not_chunked_text = input_text

while not_chunked_text:
    if len(not_chunked_text) <= LIMIT:
        chunks.append(not_chunked_text)
        break
    split_index = not_chunked_text.rfind("\n", 0, LIMIT)
    if split_index == -1:
        # The chunk is too big, so everything until the next newline is deleted
        try:
            not_chunked_text = not_chunked_text.split("\n", 1)[1]
        except IndexError:
            # No "\n" in not_chunked_text, i.e. the end of the input text was reached
            break
    else:
        chunks.append(not_chunked_text[:split_index+1])
        not_chunked_text = not_chunked_text[split_index+1:]

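rfind("\n", 0, LIMIT) returns the highest index at which a newline occurs within the first LIMIT characters. The slice not_chunked_text[:split_index+1] is needed so that the newline character is included in the chunk.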
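To make the rfind bounds concrete, here is a tiny illustration. The sample string and the LIMIT value are made up for the example and are not from the question:

```python
text = "ab\ncd\nefgh"
LIMIT = 7  # illustrative value only

# str.rfind(sub, start, end) searches text[start:end], so any newline
# it finds sits at an index below LIMIT.
split_index = text.rfind("\n", 0, LIMIT)

head = text[:split_index + 1]   # chunk, including its trailing newline
rest = text[split_index + 1:]   # text that still has to be chunked
print(repr(head), repr(rest))
```

Because the newline index is at most LIMIT-1, the slice up to split_index+1 is never longer than LIMIT characters.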

I interpreted LIMIT as the maximum allowed chunk length. If chunks of length exactly LIMIT are not allowed, replace LIMIT with LIMIT-1 in this code.
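One possible way to cover both interpretations is to wrap the loop in a function. This is only a sketch: the name chunk_text and the inclusive flag are my own additions, not part of the answer above.

```python
def chunk_text(text, limit, inclusive=True):
    """Split text at newlines into chunks of at most `limit` characters.

    With inclusive=False every chunk must be strictly shorter than `limit`.
    Lines that cannot fit into any chunk are dropped.
    """
    max_len = limit if inclusive else limit - 1
    chunks = []
    rest = text
    while rest:
        if len(rest) <= max_len:
            chunks.append(rest)
            break
        split_index = rest.rfind("\n", 0, max_len)
        if split_index == -1:
            # No newline within reach: drop everything up to the next newline.
            next_newline = rest.find("\n")
            if next_newline == -1:
                break  # the over-long line was the last one
            rest = rest[next_newline + 1:]
        else:
            chunks.append(rest[:split_index + 1])
            rest = rest[split_index + 1:]
    return chunks


print(chunk_text("aa\nbb\ncc\n", 6))
print(chunk_text("aa\nbb\ncc\n", 6, inclusive=False))
```

For the question as asked (chunks shorter than 100 KiB), the call would be something like chunk_text(input_text, 100 * 1024, inclusive=False).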

Splitting text into size-limited chunks at newlines
