Question

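I have two very large lists: one is 331991 elements long, let's call it a, and the other is 99171 elements long, call it b. I want to compare a to b and return a list of the elements of a that are not in b. This also needs to be as efficient as possible, and the result should keep the elements in the order in which they appear; that is probably a given, but I thought I might as well mention it.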

Answer 1

This can be done in O(m + n) time, where m and n correspond to the lengths of the two lists:

exclude = set(b)  # O(m)

new_list = [x for x in a if x not in exclude]  # O(n)

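The key here is that sets have constant-time (on average) containment tests. Perhaps you could consider making b a set to begin with.

See also: List Comprehension

Using your example: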

>>> a = ['a','b','c','d','e']
>>> b = ['a','b','c','f','g']
>>> 
>>> exclude = set(b)
>>> new_list = [x for x in a if x not in exclude]
>>> 
>>> new_list
['d', 'e']
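
The same idea can be wrapped in a small helper. This is only a sketch: the function name `items_not_in` is made up for illustration, and it assumes every element is hashable. It also shows that order and duplicates in a are preserved, which `set(a) - set(b)` would not guarantee.

```python
def items_not_in(a, b):
    """Return the elements of a that do not occur in b, in their original order."""
    exclude = set(b)  # built once; requires hashable elements
    return [x for x in a if x not in exclude]


a = ['d', 'a', 'e', 'b', 'd']
b = ['a', 'b', 'c', 'f', 'g']
result = items_not_in(a, b)  # duplicates of 'd' are kept, order follows a
```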

Answer 2

Let's assume:

book = ["once", "upon", "time", ...., "end", "of", "very", "long", "story"]
dct = ["alfa", "anaconda", .., "zeta-jones"]

and you want to remove from the book list all items that are present in dct.

Quick solution:

short_story = [word for word in book if word not in dct]

To speed up searching in dct, convert dct to a set, which gives much faster lookups:

dct = set(dct)
short_story = [word for word in book if word not in dct]

If the book is very long and does not fit into memory, you can process it word by word. For this, we can use a generator:

def story_words(fname):
  """fname is name of text file with a story"""
  with open(fname) as f:
    for line in f:
      for word in line.split():
        yield word

# print out shortened story
for word in story_words("alibaba.txt"):
  if word not in dct:
    print(word)

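If your dictionary were too large to hold in memory, you would have to give up speed and iterate over the contents of the dictionary instead. But I will skip that for now.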
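
For what it is worth, a rough sketch of that slower fallback could look like the following. This is not part of the answer above: the names `in_dictionary`, `shortened`, and `sample_dictionary` are hypothetical, and `open_dictionary` is assumed to be a zero-argument callable that returns a fresh iterator over the dictionary entries (for example, one that re-opens a file). Each lookup is a linear scan, so the total cost grows with the product of the two sizes.

```python
def in_dictionary(word, open_dictionary):
    """Linear scan: True if word equals any entry yielded by open_dictionary()."""
    return any(word == entry for entry in open_dictionary())


def shortened(words, open_dictionary):
    """Yield the words that are not found in the dictionary, in order."""
    for word in words:
        if not in_dictionary(word, open_dictionary):
            yield word


def sample_dictionary():
    # Stand-in for a dictionary too large to load at once (hypothetical data).
    return iter(["alfa", "anaconda", "zeta-jones"])


kept = list(shortened(["once", "alfa", "upon", "anaconda", "time"], sample_dictionary))
```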

Answer 3

Here is one way that converts b to a set and then keeps only the elements of a that are not present in it:

from itertools import ifilterfalse

a = ['a','b','c','d','e']
b = ['a','b','c']
c = list(ifilterfalse(set(b).__contains__, a))
# ['d', 'e']
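
Note that `ifilterfalse` exists only in Python 2. In Python 3 the same function is named `itertools.filterfalse`, so an equivalent sketch, reusing the same example lists, would be:

```python
from itertools import filterfalse

a = ['a', 'b', 'c', 'd', 'e']
b = ['a', 'b', 'c']

# filterfalse keeps the items of a for which the membership test returns False
c = list(filterfalse(set(b).__contains__, a))
```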

Finding the difference between two very large lists
