Question

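What is the most efficient way to compare two lists and keep only the elements that are in list A but not in list B, for very large datasets?

Example: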

words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']

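There are already somewhat similar questions, but I am dealing with 1M+ words and filters, which overloads the regex approach. I used to do a simple 'filters[i] in words[j]' test with while-loops, but this seems very inefficient.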

Answer 1

You can make filters a set:

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if all(i not in filters for i in w.split())]
['smoothies for everyone']

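This is better than your filters[i] in words[j] test because it matches whole words only: it will not filter out "smoothies" if "smooth" is in the filter list. Each lookup in a set also takes roughly constant time on average, so the cost no longer grows with the number of filters.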
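To make the difference concrete, here is a minimal sketch that runs both tests on the sample data. The filter 'smooth' is a hypothetical value picked only to show the contrast, and the variable names substring_kept and token_kept are mine, not from the question.

```python
words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = {'smooth'}  # hypothetical filter, chosen to show the difference

# Substring test (the question's approach): 'smooth' occurs inside
# 'smoothies', so that phrase is dropped.
substring_kept = [w for w in words if not any(f in w for f in filters)]

# Whole-word test (this answer's approach): 'smooth' is not one of the
# whitespace-separated tokens, so the phrase is kept.
token_kept = [w for w in words if all(t not in filters for t in w.split())]

print(substring_kept)
print(token_kept)
```

If you actually want substring matches, the set-based test is not a drop-in replacement, so check which behaviour your data needs.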

Answer 2

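I tried a slightly modified version of @gnibbler's answer: it uses the set intersection operation instead of the all(...) test inside the list comprehension. I believe this version is a bit faster, although I have not measured it.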

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
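Whether the intersection version is really faster probably depends on the data, so it is worth measuring. Below is a rough timing sketch, not a rigorous benchmark. The function names and the synthetic word list (the sample repeated 10,000 times) are assumptions of mine, and I added a third variant using set.isdisjoint, which is not in either answer but avoids building a temporary set per phrase.

```python
import timeit

def keep_all(words, filters):
    # First answer: per-token membership test, stops at the first hit.
    return [w for w in words if all(t not in filters for t in w.split())]

def keep_intersection(words, filters):
    # This answer: build a set of tokens and intersect it with the filters.
    return [w for w in words if not set(w.split()).intersection(filters)]

def keep_isdisjoint(words, filters):
    # Extra variant (assumption, not from the answers): isdisjoint accepts
    # any iterable, so no temporary set is needed.
    return [w for w in words if filters.isdisjoint(w.split())]

base = ['shoe brand', 'car brand', 'smoothies for everyone']
words = base * 10_000  # synthetic data, only for timing
filters = {'brand'}

for fn in (keep_all, keep_intersection, keep_isdisjoint):
    seconds = timeit.timeit(lambda: fn(words, filters), number=5)
    print(f'{fn.__name__}: {seconds:.3f}s for 5 runs')
```

All three return the same phrases, so you can pick whichever is fastest on your own words and filters.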

Performance: comparing two lists in Python for string matching
