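Given a list of sets (sets of strings, such as setlist = [{'this','is'},{'is','a'},{'test'}]), the idea is to join the sets that share strings into their pairwise union. The snippet below takes the literal approach: test each pair for overlap, join them, and start over by breaking out of the inner loop.
I know this is the pedestrian approach, and it does take forever for lists of a usable size (200K sets of between 2 and 10 strings each).
Any suggestions on how to make this more efficient? Thanks.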
while True:
    for i in range(len(setlist) - 1):
        for j in range(i + 1, len(setlist)):
            a = setlist[i]
            b = setlist[j]
            if not a.isdisjoint(b):          # ... then join them
                newset = a | b               # ... new set
                del setlist[j]               # ... drop highest index
                del setlist[i]               # ... drop lowest index
                setlist.insert(0, newset)    # ... introduce consolidated set, which messes up i,j
                break                        # ... back to the top for fresh i,j
        else:
            continue                         # no overlap for this i, try the next i
        break                                # a merge happened, restart the scan
    else:
        break                                # full pass without a merge: done
Answer 0 (score: 2)
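As @user2357112 mentioned in the comments, this can be viewed as a graph problem. Each set is a vertex, and every word shared between two sets is an edge. You can then iterate over the vertices and run a BFS (or DFS) from each unvisited vertex to generate a connected component.
Another option is to use Union-Find. The advantage of union-find is that you don't need to construct the graph, and there is no degenerate case when all the sets have the same contents (an explicit graph would then need an edge for every pair of sets). Here it is in action: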
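The graph approach can be sketched roughly as follows. This is only an illustration under some assumptions: the names `merge_bfs`, `containing`, `seen_sets` and `seen_words` are made up for this example, and the edges are never materialized. Instead, a word-to-sets index is used to find neighbours, and each word is expanded only once, so many sets sharing one word do not produce a quadratic number of edges.

```python
from collections import defaultdict, deque

def merge_bfs(setlist):
    # Index: word -> indices of the sets that contain it
    containing = defaultdict(list)
    for i, s in enumerate(setlist):
        for w in s:
            containing[w].append(i)

    seen_sets = [False] * len(setlist)
    seen_words = set()
    result = []
    for start in range(len(setlist)):
        if seen_sets[start]:
            continue
        # BFS over the connected component that contains `start`
        seen_sets[start] = True
        queue = deque([start])
        component = set()
        while queue:
            i = queue.popleft()
            component |= setlist[i]
            for w in setlist[i]:
                if w in seen_words:
                    continue  # this word's sets were already enqueued
                seen_words.add(w)
                for j in containing[w]:
                    if not seen_sets[j]:
                        seen_sets[j] = True
                        queue.append(j)
        result.append(component)
    return result
```

Calling `merge_bfs(setlist)` returns one merged set per connected component, in the order in which each component's first set appears in the input.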
from collections import defaultdict

# Return the root (representative) of the component containing node
def ancestor(parent, node):
    if parent[node] != node:
        # Do path compression
        parent[node] = ancestor(parent, parent[node])
    return parent[node]

def merge(parent, rank, x, y):
    # Merge the components that x and y belong to
    x = ancestor(parent, x)
    y = ancestor(parent, y)
    if x == y:
        return
    # Union by size (stored in rank): attach the smaller component to the larger one
    if rank[y] > rank[x]:
        x, y = y, x
    parent[y] = x
    rank[x] += rank[y]

def merge_union(setlist):
    # For every word, record the indices of the sets that contain it
    words = defaultdict(list)
    for i, s in enumerate(setlist):
        for w in s:
            words[w].append(i)

    # Merge sets that share a word
    parent = list(range(len(setlist)))
    rank = [1] * len(setlist)
    for sets in words.values():
        it = iter(sets)
        merge_to = next(it)
        for x in it:
            merge(parent, rank, merge_to, x)

    # Construct the result by taking the union of the sets within each component.
    # Look up the root with ancestor(): parent[i] is not guaranteed to be the root.
    result = defaultdict(set)
    for merge_from, s in enumerate(setlist):
        result[ancestor(parent, merge_from)] |= s

    return list(result.values())

setlist = [
    {'this', 'is'},
    {'is', 'a'},
    {'test'},
    {'foo'},
    {'foobar', 'foo'},
    {'foobar', 'bar'},
    {'alone'}
]
print(merge_union(setlist))
Output (the order of the elements within each set may vary):
[{'this', 'is', 'a'}, {'test'}, {'bar', 'foobar', 'foo'}, {'alone'}]