Question

I have a set of strings, for example {'Type A', 'Type B', 'Type C'}, which I'll call x. The set can contain at most 10 strings.

I also have a large list of sets, for example [{'Type A', 'Type B', 'Type C'}, {'Type A', 'Type B', 'Type C'}, {'Type B', 'Type C', 'Type D'}, {'Type E', 'Type F', 'Type G'}] and so on.

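My goal is to return every set in the large list that contains 60% or more of the elements of x. In this example, that means returning the first 3 sets but not the 4th.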

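I know I could iterate over every set, compare the elements, and then use the number of matches for my purposes, but that is very time-consuming, and my large list may contain a lot of sets. Is there a better way to solve this? I thought about using frozenset() and hashing the sets, but I'm not sure what hash function I would use or how I would compare the hashes.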

Any help would be greatly appreciated. Many thanks!

Answer 1

l = [{'Type A', 'Type B', 'Type C'}, {'Type A', 'Type B', 'Type C'}, {'Type B', 'Type C', 'Type D'}, {'Type E', 'Type F', 'Type G'}]

x = {'Type A', 'Type B', 'Type C'}

for s in l:
    print(len(x.intersection(s)))

Output:

3
3
2
0

As a function that returns a list of tuples:

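def more_than(l, n):
    # Keep each set that shares at least a fraction n of x's elements,
    # paired with that fraction rounded to 2 decimal places.
    return [(s, round(len(x.intersection(s)) / len(x), 2)) for s in l if len(x.intersection(s)) / len(x) >= n]

print(more_than(l, 0.6))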

Output:

[({'Type B', 'Type A', 'Type C'}, 1.0), ({'Type B', 'Type A', 'Type C'}, 1.0), ({'Type B', 'Type C', 'Type D'}, 0.67)]

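Here, for readability, I wrapped the ratio in round(): round(len(x.intersection(s))/len(x), 2) has the form round(value, ndigits), and round() rounds the ratio to the number of decimal places given by the second argument.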
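A self-contained variant of the same idea may be easier to reuse. The sketch below passes x in as an argument instead of reading a global, and computes the ratio once per set. The function name sets_sharing_at_least and the threshold parameter are my own choices for illustration, and it assumes x is not empty.

```python
def sets_sharing_at_least(x, sets, threshold=0.6):
    """Return (set, ratio) pairs for sets holding at least `threshold` of x's elements."""
    result = []
    for s in sets:
        # Fraction of x's elements that also appear in s (assumes x is non-empty).
        ratio = len(x & s) / len(x)
        if ratio >= threshold:
            result.append((s, round(ratio, 2)))
    return result


x = {'Type A', 'Type B', 'Type C'}
l = [{'Type A', 'Type B', 'Type C'},
     {'Type A', 'Type B', 'Type C'},
     {'Type B', 'Type C', 'Type D'},
     {'Type E', 'Type F', 'Type G'}]

matches = sets_sharing_at_least(x, l)
print(matches)
```

This keeps the first three sets, with ratios 1.0, 1.0 and 0.67.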

Answer 2

How about this?

x = {'Type A', 'Type B', 'Type C'}
lst = [{'Type A', 'Type B', 'Type C'}, 
       {'Type A', 'Type B', 'Type C'}, 
       {'Type B', 'Type C', 'Type D'},
       {'Type E', 'Type F', 'Type G'}]    
# Keep the sets that share at least 60% of x's elements.
[s for s in lst if len(s.intersection(x)) >= len(x) * 0.6]
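If the same large list is going to be queried many times, one possible alternative to scanning every set is to build an element-to-positions index once and then count hits per set for each query. This is only a sketch and not something the answers above propose: build_index and query are hypothetical names, and it assumes the threshold is greater than 0 (sets with no overlap never show up in the counts).

```python
from collections import Counter, defaultdict


def build_index(sets):
    """Map each element to the positions of the sets that contain it."""
    index = defaultdict(list)
    for pos, s in enumerate(sets):
        for item in s:
            index[item].append(pos)
    return index


def query(x, sets, index, threshold=0.6):
    """Return the sets sharing at least `threshold` of x's elements (threshold > 0)."""
    counts = Counter()
    for item in x:
        # Only sets that contain this element are touched.
        for pos in index.get(item, ()):
            counts[pos] += 1
    needed = threshold * len(x)
    return [sets[pos] for pos in sorted(counts) if counts[pos] >= needed]


x = {'Type A', 'Type B', 'Type C'}
lst = [{'Type A', 'Type B', 'Type C'},
       {'Type A', 'Type B', 'Type C'},
       {'Type B', 'Type C', 'Type D'},
       {'Type E', 'Type F', 'Type G'}]

index = build_index(lst)
result = query(x, lst, index)
print(result)
```

The index costs one pass over the list up front. After that, each query only visits the sets that share at least one element with x, which may help when most sets have no overlap at all.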

Python - Compare sets and return the sets with the most matching elements
