Question

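I have a list of strings (words, like these), and as I parse a text I need to check whether each word belongs to the group of words in my current list.

However, my input is very large (about 600 million lines), and according to the Python documentation, checking whether an element is in a list is an O(n) operation.

My code is something like: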

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

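Since this takes far too long (days, in fact), I want to improve the part where most of the time is spent. I looked at Python's collections module, and more precisely at deque. However, it only gives O(1) access to the head and the tail of the sequence, not to the middle.

Does anyone know how to do this in a better way?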

Answer 1

You could consider a trie, a DAWG, or a database. There are several Python implementations of each.
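As a rough illustration of the database option, here is a minimal sketch using the standard-library sqlite3 module. The table name, the sample words, and the in_database helper are assumptions made up for this example. A per-word SQL query is likely to be slower than an in-memory set, so this mainly makes sense if the word list is too large to keep in memory.

```python
import sqlite3

# Hypothetical word list; in practice this would be loaded from a file.
known_words = ["apple", "banana", "cherry"]

conn = sqlite3.connect(":memory:")  # pass a file path for an on-disk database
# PRIMARY KEY gives the column an index, so lookups avoid a full table scan.
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT OR IGNORE INTO words (word) VALUES (?)",
    ((w,) for w in known_words),
)
conn.commit()


def in_database(word):
    """Return True if `word` is stored in the words table."""
    row = conn.execute(
        "SELECT 1 FROM words WHERE word = ?", (word,)
    ).fetchone()
    return row is not None


print(in_database("banana"))  # True
print(in_database("durian"))  # False
```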

Here are some relative timings of a set versus a list for you to consider:

import timeit
import random

with open('/usr/share/dict/words','r') as di:  # UNIX 250k unique word list 
    all_words_set={line.strip() for line in di}

all_words_list=list(all_words_set)    # slightly faster if this list is sorted...      

test_list=[random.choice(all_words_list) for i in range(10000)] 
test_set=set(test_list)

def set_f():
    # set of test words checked against a set
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    # list of test words checked against a list
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print("list:", timeit.Timer(list_f).timeit(1), "secs")
print("set:", timeit.Timer(set_f).timeit(1), "secs")
print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")

Prints:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

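That is, matching a set of 10,000 words against a set of 250,000 words is 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 times faster than using an unsorted list alone.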

For membership testing, a list is O(n), while sets and dicts are O(1) on average for lookups.
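Applied to the loop from the question, the change could look like the sketch below. The contents of my_list and line are placeholders invented for this example, and line.split() assumes words are separated by whitespace.

```python
# Hypothetical stand-ins for the asker's data.
my_list = ["apple", "banana", "cherry"]
line = "an apple and a banana walk into a bar"

# Build the set once, outside the loop over the input lines.
my_set = set(my_list)

words_in_line = []
for word in line.split():
    if word in my_set:  # hash lookup, O(1) on average
        words_in_line.append(word)

print(words_in_line)  # ['apple', 'banana']
```

The important part is that the set is built a single time up front; rebuilding it for every line would throw away most of the gain.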

Conclusion: use a better data structure for 600 million lines of text!

Answer 2

This uses a list comprehension:

words_in_line = [word for word in line if word in my_list]

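This should be more efficient than the code you posted, but how much more for your huge dataset is hard to know. Note that the membership test against my_list is still O(n) for each word.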

Answer 3

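I'm not clear on why you chose a list in the first place, but here are some alternatives:

Using a set() is probably a good idea. It is very fast, though unordered, and sometimes that is exactly what's needed.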

If you need things ordered and also need arbitrary lookups, you could use a tree of some kind: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
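If the word list does not change while you parse, a sorted list searched with the standard-library bisect module is a simpler stand-in for a tree. To be clear, this is a swapped-in technique and not one of the tree implementations from the link above; the helper names contains and words_between are made up for this sketch. Lookups are O(log n), but inserting into the list afterwards is O(n).

```python
from bisect import bisect_left

# Hypothetical word list, sorted once up front.
sorted_words = sorted(["cherry", "apple", "banana", "date"])


def contains(sorted_seq, item):
    """Binary-search membership test on a sorted sequence, O(log n)."""
    i = bisect_left(sorted_seq, item)
    return i < len(sorted_seq) and sorted_seq[i] == item


def words_between(sorted_seq, low, high):
    """Return the items x with low <= x < high, in sorted order."""
    return sorted_seq[bisect_left(sorted_seq, low):bisect_left(sorted_seq, high)]


print(contains(sorted_words, "banana"))       # True
print(contains(sorted_words, "blueberry"))    # False
print(words_between(sorted_words, "b", "d"))  # ['banana', 'cherry']
```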

If set membership testing with a small number of false positives here and there is acceptable, you might look into a Bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
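To show the idea only, here is a toy Bloom filter built on hashlib. It is not the implementation from the link above; the class, its default sizes, and the way bit positions are derived (salting SHA-256 with a counter) are all assumptions of this sketch. A real deployment would size num_bits and num_hashes from the expected number of words and the false-positive rate you can tolerate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter()
for w in ["apple", "banana", "cherry"]:  # hypothetical word list
    bloom.add(w)

print("apple" in bloom)  # True: added items are always reported as present
```

A word that was never added is usually reported as absent, but it can occasionally be reported as present; that is the trade-off for the small memory footprint.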

Depending on what you're doing, a trie might also be a very good fit.
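For reference, a minimal trie can be sketched with nested dicts, as below. The Trie class and its method names are invented for this example. For plain exact-match membership a set is likely to be faster in CPython; a trie starts to pay off when you also need prefix queries.

```python
class Trie:
    """Minimal trie built from nested dicts, one level per character."""

    _END = object()  # sentinel key marking the end of a complete word

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def _walk(self, text):
        """Follow `text` from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and self._END in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie(["car", "cat", "dog"])  # hypothetical word list
print("car" in trie)          # True
print("ca" in trie)           # False: only a prefix, not a stored word
print(trie.has_prefix("ca"))  # True
```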

Answer 4

There are two improvements you can make here.

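Back your word list with a hash table. This gives you O(1) performance when you check whether a word is present in your word list. There are a number of ways to do this; the most fitting one in this scenario is to convert your list to a set.
Use a more appropriate structure for your collection of matching words.
- If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to that of a list.
- If you don't need all of the matches in memory at once, consider using a generator. A generator iterates over the matched values according to the logic you specify, but it only holds part of the resulting list in memory at any time. It may offer better performance if you are hitting I/O bottlenecks.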

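Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all of those words in memory at once).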

from itertools import chain

d = set(['a', 'b', 'c'])  # Load our dictionary
# The generators read the file lazily, so it must stay open while we iterate
with open('c:\\input.txt', 'r') as f:
    # Build a generator to get the words in the file
    all_words_generator = chain.from_iterable(line.split() for line in f)
    # Build a generator to filter out the non-dictionary words
    matching_words_generator = (word for word in all_words_generator if word in d)
    for matched_word in matching_words_generator:
        # Do something with matched_word
        print(matched_word)

input.txt

a b dog cat
c dog poop
maybe b cat
dog

Output

a
b
c
b
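
For the other option mentioned above, keeping every match in memory, a sketch with collections.deque might look like this. The dictionary and the lines list are placeholders that mirror the sample input; in practice the lines would come from the file. Whether deque actually beats a plain list for append-only use is worth measuring on your own data.

```python
from collections import deque

dictionary = {"a", "b", "c"}  # set for fast membership tests
# Placeholder for the lines of the input file.
lines = ["a b dog cat", "c dog poop", "maybe b cat", "dog"]

matches = deque()
for line in lines:
    for word in line.split():
        if word in dictionary:
            matches.append(word)  # O(1) append at the right end

print(list(matches))  # ['a', 'b', 'c', 'b']
```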

Python: How can I efficiently check whether an item is in a list?
