Question

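I'm looking for a function that takes a list of words (wordlist), opens a txt file, and returns a list of the words that do not appear in the txt file. This is what I have so far...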

def check_words_in_file(wordlist):
    """Return a list of words that don't appear in words.txt"""
    words = set()
    words = open("words.txt").read().splitlines()

    return [x for x in wordlist if x not in words]

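My problem with this function is that it is too slow. If I use a word list of 10,000 words, it takes about 15 seconds to finish. If I use 300,000, it takes far longer than it should. Is there any way I can make this function faster?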

Answer 1

The problem lies in how you understand variables and their association with objects, which is evident when you write

words = set()
words = open("words.txt").read().splitlines()

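In the first line you create an empty set object and bind its reference to the variable words. Later you open the file and split its content into lines, which returns a list, and you rebind the variable words to that list. The set is discarded, so every `x not in words` test scans the list from the start, and the total cost grows with the size of wordlist times the number of lines in the file.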
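The rebinding can be observed directly. This is only a minimal sketch: a small in-memory string with made-up words stands in for the contents of words.txt.

```python
words = set()
print(type(words).__name__)   # set

# Stand-in for open("words.txt").read().splitlines()
words = "apple\nbanana\ncherry".splitlines()
print(type(words).__name__)   # list: the empty set created above is gone

# A list is searched element by element; a set is looked up by hash
as_list = words
as_set = set(words)
print("banana" in as_list, "banana" in as_set)
```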

You probably intended to write

words = set(open("words.txt").read().splitlines())

Further improvement

You can actually do better if you build a set from the parameter wordlist and take the asymmetric set difference

words = set(wordlist).difference(open("words.txt").read().splitlines())
return list(words)
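To illustrate what the asymmetric difference returns, here is a small sketch with made-up data; `file_lines` is a hypothetical stand-in for the lines read from words.txt.

```python
wordlist = ["apple", "kiwi", "apple", "mango"]
file_lines = ["apple", "banana", "cherry"]   # hypothetical file contents

# Words in wordlist that are not in the file
missing = set(wordlist).difference(file_lines)
print(sorted(missing))   # ['kiwi', 'mango']

# Swapping the operands answers a different question:
# words in the file that are not in wordlist
extra = set(file_lines).difference(wordlist)
print(sorted(extra))     # ['banana', 'cherry']
```

Note that converting wordlist to a set drops duplicates and does not preserve the original order, which may or may not matter for your use case.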

Nitpick

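It is generally not recommended to open a file and leave the file handle to be garbage collected. Either close the file explicitly or use a context manager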

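with open("words.txt") as fin:
    # Strip the trailing newline from each line lazily. A generator
    # expression works on Python 2 and 3 (itertools.imap is Python 2 only).
    words = set(wordlist).difference(line.strip() for line in fin)
    return list(words)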
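Putting the pieces together, a complete function might look like the sketch below. It is one possible variant rather than the only way to do it: the `path` parameter and the temporary demo file are assumptions added for illustration, and it keeps the order and duplicates of wordlist by building the set from the file instead.

```python
import os
import tempfile

def check_words_in_file(wordlist, path="words.txt"):
    """Return the words from wordlist that do not appear in the file at path."""
    with open(path) as fin:
        # One stripped entry per line; the file is closed when the block exits
        known = {line.strip() for line in fin}
    # Set membership keeps each lookup cheap, so this is a single pass
    return [w for w in wordlist if w not in known]

# Usage with a throwaway file (hypothetical contents)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("apple\nbanana\ncherry\n")
demo_path = tmp.name

result = check_words_in_file(["kiwi", "apple", "mango", "kiwi"], demo_path)
print(result)   # ['kiwi', 'mango', 'kiwi']
os.remove(demo_path)
```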
