Question

My goal is to find the fewest possible subsets [a-f] that together make up the full set A.

A = set([1,2,3,4,5,6,7,8,9,10]) # full set


#--- below are sub sets of A ---

a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])

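In reality, the parent set A that I am dealing with contains 15k unique elements, with 30k subsets, and these subsets range in length from a single unique element to 1.5k unique elements.

As of now, the code I'm using looks more or less like the following and is very slow: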

import random


B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = list(B.keys())  # a real list, so random.shuffle can reorder it in place
random.shuffle(Bx)

Dict = {}

for i in Bx: # iterate through shuffled keys.
    z = [i]
    x = B[i]
    L = len(x)

    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L:
                L = Lx
                z.append(ii)

    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]

print(Dict[min(Dict.keys())])

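This is just a guide to the approach I've taken. For clarity I have left out some logic that minimises iterations over sets that are already too large, and other things like that.

I imagine Numpy would be really good at this type of problem, but I can't think of a way to use it.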

Answer 1

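The question is asking for an implementation of the Set Cover Problem, for which no fast algorithm is known that finds the optimal solution. However, the greedy solution to the problem (repeatedly choose the subset that contains the most elements not yet covered) does a good job in reasonable time.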

You can find an implementation of that algorithm in Python at this previous question.
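To make the greedy rule concrete, here is a minimal sketch applied to the toy sets from the question. The function name `greedy_cover` and the dict of named subsets are illustrative choices of mine, not taken from the linked implementation, and the sketch is written for Python 3.

```python
def greedy_cover(universe, named_subsets):
    """Greedy set cover: repeatedly pick the subset covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic: the first name with the top score wins
        name = max(sorted(named_subsets),
                   key=lambda k: len(named_subsets[k] & uncovered))
        if not named_subsets[name] & uncovered:
            raise ValueError('the subsets do not cover the universe')
        chosen.append(name)
        uncovered -= named_subsets[name]
    return chosen


A = set(range(1, 11))
B = {'a': {1, 2}, 'b': {1, 2, 3}, 'c': {1, 2, 3, 4},
     'd': {4, 5, 6, 7}, 'e': {7, 8, 9}, 'f': {5, 8, 9, 10}}
print(greedy_cover(A, B))
```

On this input the sketch picks c, then f, then d, which happens to be an optimal cover; in general the greedy result is only an approximation.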

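Edited to add: @Aaron Hall's answer can be improved by using the drop-in replacement below for his greedy_set_cover routine. In Aaron's code, we calculate a score len(s-result_set) for every remaining subset each time we want to add a subset to the cover. However, this score can only decrease as we add to result_set; so if in the current iteration we have already found a best-so-far set whose score is higher than the scores the remaining subsets achieved in previous iterations, we know their scores cannot improve and we can simply ignore them. That suggests using a priority queue to store the subsets for processing; in Python we can implement the idea with heapq: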

# at top of file
import heapq
#... etc

# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max-len(s) as a score, not len(s).
        # len(heap) is just providing a unique number to each subset,
        # used to tiebreak equal scores.
        heapq.heappush(heap, [max-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the resultset
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results

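For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets, because when we use the heap, dominated subsets never get reprocessed anyway: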

INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46

And here are the timings for the code above; a bit over 3 times faster.

INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46

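Note: these two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; this is just down to "lucky" choices of subsets when scores are tied.

The priority queue/heap is a well-known optimisation of the greedy algorithm, though I couldn't find a decent discussion of it to link to.

While the greedy algorithm is a quick way to get an approximate answer, you can improve on that answer by spending more time afterwards, since we now have an upper bound on the size of the minimal set cover. Techniques for doing so include simulated annealing, or branch-and-bound algorithms as illustrated in this article by Peter Norvig.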
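As an illustration of how a known cover can serve as an upper bound, here is a small branch-and-bound sketch. It is my own minimal version, not the code from the article mentioned above; the name `branch_and_bound` and the `incumbent` parameter are assumptions made for the example, it targets Python 3 (it uses `nonlocal`), and it is only practical for small inputs.

```python
def branch_and_bound(universe, subsets, incumbent):
    """Return a smallest cover, starting from `incumbent`, any cover already known
    (for example the greedy result), whose size is used as the upper bound."""
    subsets = [frozenset(s) for s in subsets]
    best = list(incumbent)

    def search(uncovered, chosen):
        nonlocal best
        if not uncovered:
            if len(chosen) < len(best):
                best = list(chosen)
            return
        # Bound: at least one more subset is needed, so this branch
        # cannot beat the best cover found so far.
        if len(chosen) + 1 >= len(best):
            return
        # Branch on the uncovered element that the fewest subsets contain.
        element = min(uncovered,
                      key=lambda e: sum(1 for s in subsets if e in s))
        for s in subsets:
            if element in s:
                search(uncovered - s, chosen + [s])

    search(frozenset(universe), [])
    return best


A = set(range(1, 11))
subsets = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {4, 5, 6, 7}, {7, 8, 9}, {5, 8, 9, 10}]
cover = branch_and_bound(A, subsets, incumbent=subsets)  # trivial incumbent: all subsets
print(len(cover))
```

The tighter the incumbent, the more branches the bound cuts off, which is why running the greedy algorithm first is a sensible starting point.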

Answer 2

Here is a solution that uses itertools.combinations to iterate over the various combinations of subsets, and union(*x) to combine them.

import itertools
subsets = [a,b,c,d,e,f]
def foo(A, subsets):
    found = []
    # try every combination size, from a single subset up to all of them
    for n in range(1, len(subsets) + 1):
        for x in itertools.combinations(subsets, n):
            u =  set().union(*x)
            if A==u:
                found.append(x)
        if found:
            break
    return found
print(foo(A, subsets))

produces

[(set([1, 2, 3]), set([4, 5, 6, 7]), set([8, 9, 10, 5])), 
 (set([1, 2, 3, 4]), set([4, 5, 6, 7]), set([8, 9, 10, 5]))]

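For this example it runs a bit faster than your code, though if I expand it to keep track of the subset names it runs a bit slower. But this is just a small example, so the timings don't mean much. (Edit: as demonstrated in the other answer, this approach slows down considerably on larger problems.)

numpy doesn't help here, since we aren't dealing with arrays or parallel operations. As others have written, it is basically a search problem. You can speed up the inner steps and try to prune away dead ends, but you can't get away from trying many alternatives.

The usual way to do searches in numpy is to construct a matrix of all combinations, and then pull out the desired ones with something like sum, min or max. It's a brute-force approach that takes advantage of fast compiled operations on arrays.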
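One way to speed up the inner step, sketched here as an assumption of mine rather than something from the code above, is to encode each subset as an integer bitmask so that a union becomes a single `|`. The helper names `to_mask` and `smallest_covers_bitmask` are hypothetical. The search is still exhaustive, so this only shrinks the constant factor.

```python
import itertools


def to_mask(subset, index):
    """Encode a subset as an int with one bit per element of the universe."""
    mask = 0
    for element in subset:
        mask |= 1 << index[element]
    return mask


def smallest_covers_bitmask(universe, subsets):
    """Return all covers of minimal size, as tuples of subset positions."""
    index = {e: i for i, e in enumerate(sorted(universe))}
    full = (1 << len(index)) - 1          # every bit set: the whole universe
    masks = [to_mask(s, index) for s in subsets]
    for n in range(1, len(masks) + 1):
        found = []
        for combo in itertools.combinations(range(len(masks)), n):
            union = 0
            for i in combo:
                union |= masks[i]         # set union as a bitwise OR
            if union == full:
                found.append(combo)
        if found:
            return found
    return []


A = set(range(1, 11))
subsets = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {4, 5, 6, 7}, {7, 8, 9}, {5, 8, 9, 10}]
print(smallest_covers_bitmask(A, subsets))
```

For the toy data this reports positions (1, 3, 5) and (2, 3, 5), that is b, d, f and c, d, f.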
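A rough sketch of that matrix style, assuming numpy is installed: build a boolean membership matrix, index it with every combination of k rows at once, and reduce with any/all. The function name `covers_of_size` is made up for this example, and the intermediate array has shape (number of combinations, k, number of elements), so it is only feasible for small inputs like the toy example.

```python
import itertools

import numpy as np


def covers_of_size(universe, subsets, k):
    """Return all k-subset combinations (as index tuples) whose union is the universe."""
    if not 1 <= k <= len(subsets):
        return []
    index = {e: i for i, e in enumerate(sorted(universe))}
    # membership[i, j] is True when subset i contains element j
    membership = np.zeros((len(subsets), len(index)), dtype=bool)
    for i, s in enumerate(subsets):
        for e in s:
            membership[i, index[e]] = True
    combos = np.array(list(itertools.combinations(range(len(subsets)), k)))
    # membership[combos] has shape (n_combos, k, n_elements);
    # OR over the k chosen subsets, then require every element to be covered
    covered = membership[combos].any(axis=1)
    complete = covered.all(axis=1)
    return [tuple(int(i) for i in c) for c in combos[complete]]


A = set(range(1, 11))
subsets = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {4, 5, 6, 7}, {7, 8, 9}, {5, 8, 9, 10}]
print(covers_of_size(A, subsets, 3))
```

On the toy data this finds the same two covers as the itertools version above, at positions (1, 3, 5) and (2, 3, 5).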

Answer 3

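Thanks for the question, I found it very interesting. I have tested the code below on Python 2.6, 2.7 and 3.3; you may find it interesting to run it yourself, and I made it easy to paste into an interpreter or run as a script.

Another solution here attempts to solve this by brute force, i.e. by going through every possible combination. That may be doable for ten elements, which the questioner gave as an example, but it won't offer a solution for the parameters the questioner actually asked about, i.e. choosing a combination of subsets (up to 1500 elements long, drawn from a superset of 15000 elements) out of a collection of 30,000 sets. For those parameters, I found that attempting to find a solution set where n = 40 (very unlikely) would mean searching many more combinations than one googol, which is quite impossible.

Setup

Here I import some modules used to benchmark my functions and create the data. I also create a timer decorator to wrap the functions so I can easily measure how much time passes before the function completes (or before I give up and interrupt it).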

import functools
import time
import logging
import random

# basic setup:
logging.basicConfig(level=logging.DEBUG) # INFO or DEBUG
random.seed(1)
PARENT_SIZE = 15000
MAX_SUBSET_SIZE = 1500
N_SUBSETS = 30000

def timer(f):
    '''
    timer wrapper modified, original obtained from:
    http://stackoverflow.com/questions/5478351/python-time-measure-function
    '''
    @functools.wraps(f)
    def wrap(*args):
        ret = None  # returned as-is if the call is interrupted
        time1 = time.time()
        try:
            ret = f(*args)
        except KeyboardInterrupt:
            time2 = time.time()
            logging.info('{0} function interrupted after {1:.3f} ms'.format(f.__name__, (time2-time1)*1000.0))
        else:
            time2 = time.time()
            logging.info('{0} function took {1:.3f} ms'.format(f.__name__, (time2-time1)*1000.0))
        return ret
    return wrap

Data Creation Functions

Next I have to create the data:

@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set, random.randint(1, MAX_SUBSET_SIZE))))
    return subsets


@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
                                          len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
                                          len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets

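Optional Preprocessing

I provide some preprocessing. It runs in a few seconds but doesn't help much, speeding things up by only fractions of a second; it is recorded here for the reader's edification: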

@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets

Actual Function

Then I actually perform the Greedy Algorithm:

@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: Will still have to calculate.
        # But custom max could shortcut subsets on uncovered more than len.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results

Here's the main():

# full set, use xrange instead of range in python 2 for space efficiency    
parent_set = range(PARENT_SIZE) 
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, s in enumerate(results, 1):  # 's' avoids shadowing the built-in set
    logging.debug('len of set {0} is {1}'.format(i, len(s)))

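Final Results

Given the original parameters the questioner provided, and run in Python 2, this gives a final result of 46(ish) subsets in just over 3 minutes.

Here's the output for seed(0):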

INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46

And here's the output for seed(1):

INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47

This was a lot of fun, thanks for the question.

PS: I decided to try benchmarking the naive brute-force approach:

INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms

Naturally, I interrupted it, as it would never get close in my lifetime, and perhaps not in our Sun's lifetime either:

>>> import math
>>> def combinations(n, k):
...     return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
... 
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
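The session above is Python 2 (note the trailing L). As a hedged Python 3 equivalent, assuming Python 3.8 or later where `math.comb` is available, the same count can be computed exactly and compared against a googol:

```python
import math

n_subsets, pick = 30000, 40
n_combos = math.comb(n_subsets, pick)  # exact integer, no float rounding
googol = 10 ** 100

print(len(str(n_combos)))   # number of decimal digits in the count
print(n_combos > googol)    # whether the search space exceeds a googol
```

Using `math.comb` avoids the `/` operator, which in Python 3 would turn the factorial quotient into a float.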

Set Cover or Hitting Set; Numpy, least element combinations to make up the full set
