Question

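I have a very large file that I want to read and process. In my code I read 1024 bytes at a time in a loop until everything has been read. But sometimes a word gets cut off at the end of a chunk.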

Even though I specify a size to read, I want to make sure that only complete words are read. All my words are separated by spaces.

with open('test.txt', mode='r', encoding="utf-8") as f:
    chunk_size = 1024
    f_chunk = f.read(chunk_size)
    while len(f_chunk) > 0:
        for word in f_chunk.split():
            # do something
            print(word)
        f_chunk = f.read(chunk_size)

Answer 1

I don't know whether there is a built-in way, but you could try something like this:

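with open('test.txt', mode='r', encoding='utf-8') as f:
    chunk_size = 1024
    data = ''   # text left over from the previous chunk
    while True:
        data += f.read(chunk_size)
        if not data:                 # nothing read and nothing left over
            break
        last_sp = data.rfind(' ')
        if last_sp == -1:            # no space anywhere in data: use all of it
            last_sp = len(data)
        block = data[:last_sp]       # everything up to the last space
        data = data[last_sp + 1:]    # possibly incomplete last word, kept for the next pass

        for word in block.split():
            print(word)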

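Basically, you carry the end of the previous chunk over into the next one. This will not work if a word is longer than your chunk size, and it may not work if you have separators other than a single space (e.g. ' ').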
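The two limitations mentioned above can be avoided with a small variation. The following is only a sketch, not a built-in facility: `iter_words` and its `chunk_size` parameter are names made up for illustration. It splits on any whitespace and keeps accumulating the trailing word until whitespace is seen, so a word longer than the chunk size is still returned whole.

```python
import io


def iter_words(f, chunk_size=1024):
    """Yield whitespace-separated words from f without cutting any word in half."""
    tail = ''  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # end of file
            break
        data = tail + chunk
        words = data.split()
        # If data does not end in whitespace, the last word may continue
        # in the next chunk, so hold it back instead of yielding it now.
        if words and not data[-1].isspace():
            tail = words.pop()
        else:
            tail = ''
        yield from words
    if tail:  # the last word of the file
        yield tail


# Small in-memory check with a deliberately tiny chunk size
sample = io.StringIO("alpha beta\ngamma   averyveryverylongword delta")
print(list(iter_words(sample, chunk_size=4)))
```

With a real file it would be used as `for word in iter_words(f): ...` inside the `with open(...)` block from the question.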

Answer 2

As an alternative, you could create a word generator as follows:

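def read_word(f):
    word = []   # characters of the word currently being built
    c = '.'     # dummy non-empty value so that the loop starts

    while c:
        c = f.read(1)   # returns '' at end of file

        if c.isalnum():
            word.append(c)
        elif word:
            # A non-alphanumeric character (or end of file) ends the word
            yield ''.join(word)
            word = []

with open('input.txt') as f_input:
    for word in read_word(f_input):
        print(word)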

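This returns whole words, using isalnum() to decide which characters belong to a word. Only alphanumeric characters are kept, so read_word() also strips all whitespace and punctuation.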

For example, if input.txt contains:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hoc loco tenere se Triarius non potuit.

The output would be:

Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
Hoc
loco
tenere
se
Triarius
non
potuit
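
Calling f.read(1) once per character can be slow on a very large file. As a rough sketch only, the same isalnum() rule can be applied to larger chunks. The name `read_words_chunked` and its `chunk_size` parameter are hypothetical, and `itertools.groupby` is used here to group consecutive characters into alphanumeric and non-alphanumeric runs.

```python
import io
from itertools import groupby


def read_words_chunked(f, chunk_size=1024):
    """Yield runs of alphanumeric characters, reading f in chunks."""
    tail = ''  # alphanumeric run at the end of the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # end of file
            break
        data = tail + chunk
        tail = ''
        # Split data into alternating runs: (True, 'word') / (False, ', ')
        runs = [(is_word, ''.join(chars))
                for is_word, chars in groupby(data, key=str.isalnum)]
        # A trailing alphanumeric run may continue in the next chunk
        if runs and runs[-1][0]:
            tail = runs.pop()[1]
        for is_word, text in runs:
            if is_word:
                yield text
    if tail:  # the last word of the file
        yield tail


sample = io.StringIO("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
for word in read_words_chunked(sample, chunk_size=5):
    print(word)
```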

Reading a file without truncating words
