Question

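I have multiple files, each containing a single line of roughly 10M numbers. I want to check each file and print 0 for every file that contains a repeated number and 1 for every file that does not.

I am using a list to count frequencies. Because each line holds so many numbers, I want to update the frequency after accepting each number and break as soon as I find a repeated one. This is simple in C, but I have no idea how to do it in Python.

How do I input a line word by word, without storing (or taking as input) the whole line?

EDIT: I also need a way to do this from live input rather than from a file.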

Answer 1

Read the line, split it, and copy the resulting array into a set. If the set is smaller than the array, the file contains duplicate elements.

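with open('filename', 'r') as f:
    for line in f:
        numbers = line.split()
        # Fewer unique values than tokens means a duplicate: print 0, otherwise 1
        print(0 if len(set(numbers)) < len(numbers) else 1)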

To read the file word by word, try this:

import itertools

def readWords(file_object):
    word = ""
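    # map() is lazy in Python 3 (itertools.imap exists only in Python 2);
    # read(1) returns "" at EOF, which is falsy and ends takewhile
    for ch in itertools.takewhile(bool, map(file_object.read, itertools.repeat(1))):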
        if ch.isspace():
            if word: # In case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word # Handles last word before EOF

Then you can do this:

with open('filename', 'r') as f:
    seen = set()
    for num in map(int, readWords(f)):
        if num in seen:  # the set already holds this number: duplicate found
            break
        seen.add(num)
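
The loop above stops at the first duplicate but does not yet print the 0/1 result the question asks for. The following is a minimal sketch of one way to do that, not part of the original answer. `has_duplicate` and `read_words` are hypothetical names (`read_words` is a compact stand-in for `readWords` so the block runs on its own), and `io.StringIO` is used in place of real files.

```python
import io


def read_words(file_object):
    """Yield whitespace-separated words, reading one character at a time."""
    word = ""
    # iter(callable, sentinel) keeps calling read(1) until it returns "" at EOF
    for ch in iter(lambda: file_object.read(1), ""):
        if ch.isspace():
            if word:  # skip runs of consecutive whitespace
                yield word
                word = ""
        else:
            word += ch
    if word:
        yield word  # last word before EOF


def has_duplicate(file_object):
    """Return True as soon as a number repeats, without reading any further."""
    seen = set()
    for num in map(int, read_words(file_object)):
        if num in seen:
            return True
        seen.add(num)
    return False


# io.StringIO stands in for an open file here (an assumption for the demo)
for text in ("1 2 3 2 5", "1 2 3 4 5"):
    print(0 if has_duplicate(io.StringIO(text)) else 1)
```

This prints 0 for the first input and 1 for the second. Because `read_words` is a generator, returning from `has_duplicate` early leaves the rest of the input unread.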

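This method also works for streams, because it reads only one character at a time and yields individual whitespace-separated strings from the input stream.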
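
Calling `read(1)` once per character is simple, but it may be slow for lines of around 10M numbers. The following is a hedged sketch of a variant that reads fixed-size chunks instead. It is not from the original answer, and `read_words_chunked` and `chunk_size` are hypothetical names. On a live stream a larger read may wait for more input before returning, so the one-character version can react sooner there.

```python
import io


def read_words_chunked(file_object, chunk_size=8192):
    """Yield whitespace-separated words, reading chunk_size characters per call."""
    tail = ""  # partial word carried over from the previous chunk
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:  # "" signals EOF
            break
        chunk = tail + chunk
        words = chunk.split()
        # If the chunk does not end in whitespace, its last word may be cut off,
        # so hold it back until the next chunk (or EOF) completes it
        if words and not chunk[-1].isspace():
            tail = words.pop()
        else:
            tail = ""
        yield from words
    if tail:
        yield tail  # last word before EOF


# A tiny chunk size forces words to be split across chunk boundaries
print(list(read_words_chunked(io.StringIO("12 34  5\n"), chunk_size=4)))
```

Only one chunk plus any carried-over partial word is held in memory at a time, so the whole line is never stored.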

I have updated this method since giving this answer. Take a look:

https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc

Answer 2

I guess it's not possible the way you describe. You can't read word by word in Python. Something like this can be done:

with open('words.txt') as f:
    # read() loads the whole file into memory before splitting it into words
    for word in f.read().split():
        print(word)

How do I input a line word by word in Python?
