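Suppose n sets are given and we want to construct all minimal sets that have at least one element in common with each of the input sets. A set S is called minimal if no admissible set S' exists that is a proper subset of S.
An example: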
In: s1 = {1, 2, 3}; s2 = {3, 4, 5}; s3 = {5, 6}
Out: [{1, 4, 6}, {1, 5}, {2, 4, 6}, {2, 5}, {3, 5}, {3, 6}]
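To make the definition concrete, here is a minimal brute-force sketch that checks it directly on the example. The function name `minimal_hitting_sets_bruteforce` is an illustrative assumption, and the approach is only meant as a reference for small inputs, not as the solution being asked for.

```python
from itertools import combinations


def minimal_hitting_sets_bruteforce(sets):
    """Enumerate candidates by increasing size and keep the minimal ones."""
    universe = sorted(set().union(*sets))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            cand = frozenset(combo)
            hits_all = all(cand & s for s in sets)
            # Smaller admissible sets were found earlier, so a subset check
            # against `found` is enough to decide minimality.
            if hits_all and not any(f <= cand for f in found):
                found.append(cand)
    return found


s1, s2, s3 = {1, 2, 3}, {3, 4, 5}, {5, 6}
for hs in minimal_hitting_sets_bruteforce([s1, s2, s3]):
    print(sorted(hs))
```

Because candidates are visited in order of size, every non-minimal candidate has a proper subset that was already accepted.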
My idea is to add the sets iteratively, one after another:
result = f(s1, f(s2, f(s3, ...)))
where f is a merge function that could look as follows:
function f(newSet, setOfSets):
    Step 1:
        return all elements of setOfSets that share an element with newSet
    Step 2:
        for each remaining element setE of setOfSets:
            for each element e of newSet:
                return union(setE, {e})
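The problem with the approach above is that the Cartesian product computed in step 2 may contain supersets of sets returned in step 1. I was thinking of going through all sets already returned (see Find minimal set of subsets that covers a given set), but that seems too complicated and inefficient, and I hope there is a better solution for my special case.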
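A small Python sketch of f, written literally as in the pseudocode, makes the problem visible. The name `merge_naive` and the use of an empty set as the starting point are illustrative assumptions, not part of the original description.

```python
def merge_naive(new_set, set_of_sets):
    """Literal version of f: step 1 keeps, step 2 extends, nothing is filtered."""
    # Step 1: sets that already share an element with new_set
    kept = [s for s in set_of_sets if s & new_set]
    # Step 2: every remaining set is extended by every element of new_set
    rest = [s for s in set_of_sets if not (s & new_set)]
    extended = [s | {e} for s in rest for e in sorted(new_set)]
    return kept + extended


s1, s2 = {1, 2, 3}, {3, 4, 5}
after_s1 = merge_naive(s1, [frozenset()])   # the three singletons of s1
after_s2 = merge_naive(s2, after_s1)

# {3} is kept in step 1, yet step 2 also produces its superset {1, 3}
print(frozenset({3}) in after_s2, frozenset({1, 3}) in after_s2)
```

Both membership tests succeed, so the result contains a set together with one of its supersets.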
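How can the goal be achieved without computing the complete Cartesian product in step 2?
Note that this question is related to the question of finding the smallest set only, but I need to find all sets that are minimal in the sense specified above. I am aware that the number of solutions will not be polynomial.
The number n of input sets will be several hundred, but the sets contain only elements from a limited range (e.g. about 20 different values), which also limits the sizes of the sets. It would be acceptable if the algorithm ran in O(n^2), but it should be essentially linear in the number of output sets (possibly with a logarithmic factor).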
Answer 0 (score: 1)
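Since your space is so limited (only 20 values to choose from), beat this thing to death with a blunt instrument: encode each target set as a 20-bit integer, then take every candidate integer from 0 to 2^20 - 1 in order and use bit operations to check whether it has a 1 bit in common with each target bitmap. If all targets satisfy this basic criterion, the candidate is validated, and all of its supersets are removed from the remaining candidates. Code: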
from time import time

start = time()
s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}
# Convert each set to its bit-map (element i sets bit i-1)
point_set = [7, 28, 48]
# make list of all possible covering bitmaps
cover = list(range(2**20))
while cover:
    # Pop the first (smallest) item from the remaining covering sets
    candidate = cover.pop(0)
    # Does this bitmap have a bit in common with each target set?
    if all((candidate & point) for point in point_set):
        print(candidate)
        # Remove all candidates that are supersets of the successful covering one.
        superset = set([other for other in cover if (candidate & ~other) == 0])
        cover = [item for item in cover if item not in superset]
        print(time() - start, "lag time")
print(time() - start, "seconds")
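Output is listed below. I have not converted the candidate integers back into their constituent elements; that is a straightforward task.
Note that most of the time in this example is spent exhausting the list of integers that are not supersets of a validated cover set, such as all multiples of 64 (their lower 6 bits are all zero, so they share no element with any of the target sets).
The 33 seconds were measured on my aging desktop computer; your laptop or other platform is almost certainly faster. I trust that any improvement from a more efficient algorithm is easily offset by this algorithm being quick to implement and easy to understand.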
17
0.4029195308685303 lag time
18
0.6517734527587891 lag time
20
0.8456630706787109 lag time
36
1.0555419921875 lag time
41
1.2604553699493408 lag time
42
1.381387710571289 lag time
33.005757570266724 seconds
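The conversion between sets and bitmaps that the answer leaves out could look like the sketch below. The helper names `to_bitmap` and `from_bitmap` are illustrative assumptions; the encoding (element i sets bit i-1) is the one implied by `point_set = [7, 28, 48]`.

```python
def to_bitmap(elements):
    """Encode a set of positive integers: element i sets bit i-1."""
    mask = 0
    for e in elements:
        mask |= 1 << (e - 1)
    return mask


def from_bitmap(mask):
    """Decode a bitmap back into the set of elements it represents."""
    elements = set()
    position = 1
    while mask:
        if mask & 1:
            elements.add(position)
        mask >>= 1
        position += 1
    return elements


print([to_bitmap(s) for s in ({1, 2, 3}, {3, 4, 5}, {5, 6})])  # [7, 28, 48]
print([sorted(from_bitmap(m)) for m in (17, 18, 20, 36, 41, 42)])
```

Decoding the six printed integers gives {1, 5}, {2, 5}, {3, 5}, {3, 6}, {1, 4, 6} and {2, 4, 6}, which matches the expected output in the question.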
Answer 1 (score: 1)
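I came up with a solution based on the trie data structure as described here. Tries make it relatively fast to determine whether one of the stored sets is a subset of another given set (Savnik, 2013).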
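The solution looks as follows: the sets found so far are stored in a trie; for each new input set, the stored sets that are disjoint from it are removed and re-inserted extended by each element of the new set, unless the trie already holds a subset of the extended set.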
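The worst-case running time is O(n m c), where m is the maximum number of solutions when only n' <= n of the input sets are considered, and c is the time factor of the subset lookup.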
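For readers who want to follow the idea without Cython or datrie, here is a pure-Python sketch under stated assumptions: plain frozensets replace the trie, and a linear scan stands in for the trie's subset lookup (so the factor c is much worse here). The names `update` and `cover_sets_py` are illustrative and not part of the implementation below.

```python
def update(current, new_set):
    """Fold one input set into the current collection of minimal sets."""
    kept = [h for h in current if h & new_set]
    disjoint = [h for h in current if not (h & new_set)]
    result = set(kept)
    for h in disjoint:
        for e in new_set:
            cand = h | {e}
            # Linear scan in place of the trie's subset lookup
            if not any(k <= cand for k in result):
                result.add(cand)
    return result


def cover_sets_py(sets):
    current = {frozenset()}  # the empty set hits "no sets at all"
    for s in sets:
        current = update(current, s)
    return current


for hs in sorted(cover_sets_py([{1, 2, 3}, {3, 4, 5}, {5, 6}]), key=sorted):
    print(sorted(hs))
```

On the example from the question this yields the six expected sets; for several hundred input sets the linear scan is the part that a trie is meant to speed up.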
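The code is below. I implemented the algorithm on top of the Python package datrie, which is a wrapper around an efficient C implementation of a trie. The code below is in Cython, but it can easily be converted to pure Python by removing or exchanging the Cython-specific commands.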
Extended trie implementation:
from math import inf
from datrie cimport BaseTrie, BaseState, BaseIterator


cdef bint has_subset_c(BaseTrie trie, BaseState trieState, str setarr,
                       int index, int size):
    # True if a key stored below trieState is a subsequence of setarr[index:]
    cdef BaseState trieState2 = BaseState(trie)
    cdef int i
    trieState.copy_to(trieState2)
    for i in range(index, size):
        if trieState2.walk(setarr[i]):
            if trieState2.is_terminal() or has_subset_c(trie, trieState2, setarr,
                                                        i, size):
                return True
            trieState.copy_to(trieState2)
    return False


cdef class SetTrie():
    # Attribute declarations are required for a cdef class
    # (assumed here; they are not shown in the original post)
    cdef BaseTrie trie
    cdef bint touched
    cdef int maxSizeBound

    def __init__(self, alphabet, initSet=()):
        if not hasattr(alphabet, "__iter__"):
            alphabet = range(alphabet)
        self.trie = BaseTrie("".join(chr(i) for i in alphabet))
        self.touched = False
        # Each element of initSet is stored as a singleton set
        for i in initSet:
            self.trie[chr(i)] = 0
            if not self.touched:
                self.touched = True

    def has_subset(self, superset):
        cdef BaseState trieState = BaseState(self.trie)
        # Keys are kept sorted so that the subsequence walk recognises subsets
        setarr = "".join(chr(i) for i in sorted(superset))
        return bool(has_subset_c(self.trie, trieState, setarr, 0, len(setarr)))

    def extend(self, sets):
        for s in sets:
            self.trie["".join(chr(i) for i in sorted(s))] = 0
            if not self.touched:
                self.touched = True

    def delete_supersets(self):
        cdef str elem
        cdef BaseState trieState = BaseState(self.trie)
        cdef BaseIterator trieIter = BaseIterator(BaseState(self.trie))
        if trieIter.next():
            elem = trieIter.key()
            while trieIter.next():
                # Remove the key, then re-insert it only if no subset remains
                self.trie._delitem(elem)
                if not has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                    self.trie._setitem(elem, 0)
                elem = trieIter.key()
            if has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                val = self.trie.pop(elem)
                if not has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                    self.trie._setitem(elem, val)

    def update_by_settrie(self, SetTrie setTrie, maxSize=inf, initialize=True):
        cdef BaseIterator trieIter = BaseIterator(BaseState(setTrie.trie))
        cdef str s
        if initialize and not self.touched and trieIter.next():
            for s in trieIter.key():
                self.trie._setitem(s, 0)
            self.touched = True
        while trieIter.next():
            self.update(set(trieIter.key()), maxSize, True)

    def update(self, otherSet, maxSize=inf, isStrSet=False):
        if not isStrSet:
            otherSet = set(chr(i) for i in otherSet)
        cdef str subset, newSubset, elem
        cdef list disjointList = []
        cdef BaseTrie trie = self.trie
        cdef int l
        cdef BaseIterator trieIter = BaseIterator(BaseState(self.trie))
        # Collect and remove all stored sets that do not intersect otherSet
        if trieIter.next():
            subset = trieIter.key()
            while trieIter.next():
                if otherSet.isdisjoint(subset):
                    disjointList.append(subset)
                    trie._delitem(subset)
                subset = trieIter.key()
            if otherSet.isdisjoint(subset):
                disjointList.append(subset)
                trie._delitem(subset)
        cdef BaseState trieState = BaseState(self.trie)
        # Re-insert each removed set extended by one element of otherSet,
        # unless the trie already contains a subset of the extended set
        for subset in disjointList:
            l = len(subset)
            if l < maxSize:
                if l+1 > self.maxSizeBound:
                    self.maxSizeBound = l+1
                for elem in otherSet:
                    # Sorted, so that stored keys stay in a consistent order
                    newSubset = "".join(sorted(subset + elem))
                    trieState.rewind()
                    if not has_subset_c(self.trie, trieState, newSubset, 0,
                                        len(newSubset)):
                        trie[newSubset] = 0

    def get_frozensets(self):
        return (frozenset(ord(t) for t in subset) for subset in self.trie)

    def clear(self):
        self.touched = False
        self.trie.clear()

    def prune(self, maxSize):
        # Drop every stored set with more than maxSize elements
        cdef bint changed = False
        cdef BaseIterator trieIter
        cdef str k
        if self.maxSizeBound > maxSize:
            self.maxSizeBound = maxSize
            trieIter = BaseIterator(BaseState(self.trie))
            k = ''
            while trieIter.next():
                if len(k) > maxSize:
                    self.trie._delitem(k)
                    changed = True
                k = trieIter.key()
            if len(k) > maxSize:
                self.trie._delitem(k)
                changed = True
        return changed

    def __nonzero__(self):
        return self.touched

    def __repr__(self):
        return str([set(ord(t) for t in subset) for subset in self.trie])
It can be used as follows:
def cover_sets(sets):
    # Seed the trie with one singleton per element of the first set
    strie = SetTrie(range(10), sets[0])
    for s in sets[1:]:
        strie.update(s)
    return strie.get_frozensets()
Timing:
from timeit import timeit
s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}
%timeit cover_sets([s1, s2, s3])
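Result:
37.8 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note that the trie implementation above only works with keys larger than (but not equal to) 0. Otherwise, the integer-to-character mapping does not work properly. This problem can be solved with an index shift.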
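One way such an index shift could look is sketched below. The constant `OFFSET` and the helpers `encode_key` and `decode_key` are hypothetical names introduced for illustration; they would replace the plain `chr(i)` and `ord(t)` conversions used in the implementation above.

```python
OFFSET = 1  # hypothetical shift so that the value 0 never maps to chr(0)


def encode_key(values, offset=OFFSET):
    """Map a set of non-negative integers to a sorted string key."""
    return "".join(chr(v + offset) for v in sorted(values))


def decode_key(key, offset=OFFSET):
    """Invert encode_key and return the original integers."""
    return frozenset(ord(ch) - offset for ch in key)


key = encode_key({0, 3, 5})
print([ord(ch) for ch in key])   # [1, 4, 6]
print(sorted(decode_key(key)))   # [0, 3, 5]
```

The trie's alphabet would have to be shifted by the same offset when the trie is created.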