Question

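If I have a list of 10,000 words, what is an optimized way to check whether a given word is in that list without slowing my application to a crawl?

Should I load the words from a file and check against that?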

def check_for_word(word):
    HUGE_LIST = [...]  # 10,000 Words
    if word in HUGE_LIST:
        return True
    else:
        return False

Answer 1

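Convert the list to a set. Strings are hashable, so building a set from them is straightforward.

Lookup in a set is O(1) on average, whereas lookup in a list is O(n), where n is the length of the list.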

HUGE_SET = set(HUGE_LIST)   # or frozenset, if it's constant and words won't be added to it
return word in HUGE_SET

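Also, consider moving the creation of the huge list and the huge set outside the function body. As written, the list is recreated every time the function is called.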
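A minimal sketch of that restructuring, assuming the words are known when the module is imported. The three words here are a hypothetical stand-in for the real 10,000-word list:

```python
# Built once, when the module is imported, not on every call.
# The contents are placeholders for the real word list (an assumption).
HUGE_SET = frozenset(["apple", "banana", "cherry"])


def check_for_word(word):
    """Return True if word is in the precomputed set."""
    return word in HUGE_SET
```

Each call then costs only a single hash lookup, because the set is never rebuilt.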

List timing:

$ python -m timeit -s "words = list(map(str, xrange(10000)))" -n 10000 "'5000' in words"
10000 loops, best of 3: 58.2 usec per loop

frozenset timing:

$ python -m timeit -s "words = frozenset(map(str, xrange(10000)))" -n 10000 "'5000' in words"
10000 loops, best of 3: 0.0504 usec per loop
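The commands above use xrange, which exists only in Python 2. A rough Python 3 equivalent using the timeit module might look like the sketch below; the variable names are my own, and the absolute numbers will differ from machine to machine:

```python
import timeit

# Same data as in the commands above: the strings "0" through "9999".
words_list = [str(i) for i in range(10000)]
words_set = frozenset(words_list)

runs = 1000
list_time = timeit.timeit(lambda: "5000" in words_list, number=runs)
set_time = timeit.timeit(lambda: "5000" in words_set, number=runs)

# Report the average cost of one lookup in microseconds.
print(f"list:      {list_time / runs * 1e6:.3f} usec per lookup")
print(f"frozenset: {set_time / runs * 1e6:.3f} usec per lookup")
```

The lambda adds a small constant overhead to both measurements, so the frozenset figure is a slight overestimate.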

Answer 2

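If you never modify the list, use a tuple instead of a list.
If the items are unique, a set is the better choice.

Set lookup is much faster than list lookup. Tuple lookup, however, is still a linear scan like a list, so a tuple will not speed up membership checks.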

Answer 3

Read the words from a file and turn them into a set of words. Membership checks on a set are very fast (10,000 is not "very large" :-)).

with open('words.txt') as words:
    wordset = {word.strip() for word in words}

return word in wordset

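(It helps if you don't have to read the file every time, though; keep the set in a variable. Building the set on each call takes longer than checking for a single word the naive way.)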
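One way to keep the set around between calls is to cache the loader. This is only a sketch: load_wordset and the default path words.txt are hypothetical names, and it assumes one word per line:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_wordset(path):
    """Read one word per line from path; the result is cached per path."""
    with open(path, encoding="utf-8") as words:
        # Skip blank lines so the empty string never counts as a word.
        return frozenset(line.strip() for line in words if line.strip())


def check_for_word(word, path="words.txt"):
    """Return True if word appears in the file at path."""
    return word in load_wordset(path)
```

The file is read on the first call only; later calls with the same path reuse the cached frozenset.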

Answer 4

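You may want to take a slightly indirect route and write a function that, given a file containing all your words, returns a function that checks membership: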

def make_checker(fname):
    with open(fname) as f:
        # Assumption: one word per line; adjust the code for a different format.
        # This line builds the set with a set comprehension.
        words = {word.strip() for word in f}
    def the_checker(word):
        return word in words
    return the_checker

You can use it like this:

check_4 = make_checker('corpus_of_4_letter_words.txt')
...
if check_4(answer.strip()):
    print('Please, please do not use these words.')

Checking for a word in a very large list

4 answers: