Performance: comparing two lists in Python for string matching

Time: 2014-02-05 07:59:00

Tags: python regex performance

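What is the most efficient way to compare two lists and keep only the elements of list A that do not match anything in list B, when both datasets are very large?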

Example:

words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']

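There are already somewhat similar questions, but I am dealing with 1M+ words and filters, which overloads the regex. I used to run a simple 'filters[i] in words[j]' test inside while-loops, but that seems very inefficient.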

2 answers:

Answer 0: (score: 2)

You can make filters a set, so that each membership test is a hash lookup:

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if all(i not in filters for i in w.split())]
['smoothies for everyone']

This is better than your filters[i] in words[j] because it matches whole words only: if "smooth" is in the filter list, it won't filter out "smoothies".
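To make the difference concrete, here is a small sketch comparing the substring test from the question with the whole-word test above. The names by_substring and by_word are only illustrative, and the filter 'smooth' is chosen to trigger the case described.

```python
words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = {'smooth'}

# Substring test (the approach in the question): 'smooth' occurs inside
# 'smoothies', so that entry is dropped.
by_substring = [w for w in words if not any(f in w for f in filters)]

# Whole-word test (the approach in this answer): no word equals 'smooth',
# so every entry is kept.
by_word = [w for w in words if all(t not in filters for t in w.split())]

print(by_substring)  # ['shoe brand', 'car brand']
print(by_word)       # ['shoe brand', 'car brand', 'smoothies for everyone']
```

Which behaviour you want depends on your data. If the filters are meant to match fragments of words, the whole-word version will not do that.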

Answer 1: (score: 2)

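I tried a slightly modified version of @gnibbler's answer: it uses the set intersection operation instead of the inner all(...) check in the list comprehension. I believe this version is a bit faster, though I have not measured it on data of your size.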

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
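A possible further variation, not from the answers above: set.isdisjoint takes any iterable, so it avoids building a set for every entry and can stop at the first shared word. The helper name keep_unfiltered and the synthetic benchmark data are assumptions for illustration. The timings are only printed, since they depend on the machine and the data.

```python
import timeit

def keep_unfiltered(words, filters):
    """Return the entries of words that share no whole word with filters."""
    filter_set = set(filters)  # hash lookups, O(1) on average
    # isdisjoint accepts any iterable and can stop at the first shared word
    return [w for w in words if filter_set.isdisjoint(w.split())]

words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = {'brand'}
print(keep_unfiltered(words, filters))  # ['smoothies for everyone']

# Hypothetical micro-benchmark on synthetic data; results vary by machine.
big = words * 10000
t_intersection = timeit.timeit(
    lambda: [w for w in big if not set(w.split()).intersection(filters)],
    number=5)
t_isdisjoint = timeit.timeit(lambda: keep_unfiltered(big, filters), number=5)
print(t_intersection, t_isdisjoint)
```

Both variants give the same result, so it is worth timing them on a realistic sample of your 1M+ entries before choosing one.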