Question

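~~I am looking for an algorithm to efficiently generate all three-value combinations of a dataset by picking 6 values at a time.~~

I am looking for an algorithm to efficiently generate a small set of 6-tuples that cumulatively express all possible 3-tuple combinations of a dataset.

For instance, computing hands of 6 playing cards that together express all possible 3-card combinations.

For example, given the dataset: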

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

the first "pick" of 6 values might be:

['a','b','c','d','e','f']

which covers the three-value combinations:

('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e'), ('a', 'b', 'f'), ('a', 'c', 'd'),
('a', 'c', 'e'), ('a', 'c', 'f'), ('a', 'd', 'e'), ('a', 'd', 'f'), ('a', 'e', 'f'),
('b', 'c', 'd'), ('b', 'c', 'e'), ('b', 'c', 'f'), ('b', 'd', 'e'), ('b', 'd', 'f'),
('b', 'e', 'f'), ('c', 'd', 'e'), ('c', 'd', 'f'), ('c', 'e', 'f'), ('d', 'e', 'f')
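
As a quick check of the listing above, a single pick of 6 values always covers C(6,3) = 20 triplets. A minimal sketch in Python 3 (the variable names are only illustrative):

```python
from itertools import combinations

pick = ['a', 'b', 'c', 'd', 'e', 'f']

# Every 3-element subset of the 6-value pick, in lexicographic order.
covered = list(combinations(pick, 3))

print(len(covered))   # 20
print(covered[0])     # ('a', 'b', 'c')
print(covered[-1])    # ('d', 'e', 'f')
```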

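Obviously this is possible by:

computing all triplet combinations
picking 6 values
computing all triplet combinations of those 6 values
removing these combinations from the first computation
repeating until all have been accounted for

In this example there are 2600 possible triplet combinations, (26*25*24)/(3*2*1) == 2600, and using the "brute-force" method above, all triplet combinations can be represented in around 301 6-value groups.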
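
For a sense of how far 301 is from the best case, a simple counting argument gives a floor: each 6-value group covers at most 20 triplets, so at least ceil(2600 / 20) = 130 groups are needed. This is only a lower bound from counting, not a claim that 130 is actually achievable. A sketch in Python 3.8+ (it relies on `math.comb`; the helper name `counting_lower_bound` is hypothetical):

```python
from math import comb

def counting_lower_bound(n, group=6, k=3):
    """Fewest groups that could possibly cover all k-subsets of n items."""
    total = comb(n, k)          # all k-value combinations
    per_group = comb(group, k)  # combinations covered by one group
    return -(-total // per_group)  # ceiling division

print(comb(26, 3))               # 2600
print(counting_lower_bound(26))  # 130
```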

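However, it feels like there ought to be a more efficient way of achieving this.

My preferred language is python, but I intend to implement this in C++.

Update

Here is my python code to "brute-force" it: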

from itertools import combinations
data_set = list('abcdefghijklmnopqrstuvwxyz')

def calculate(data_set):
  all_triplets = list(frozenset(x) for x in combinations(data_set,3))
  data = set(all_triplets)
  sextuples = []
  while data:
    sxt = set()
    for item in data:
      nxt = sxt | item
      if len(nxt) > 6:
        continue
      sxt = nxt
      if len(nxt) == 6:
        break
    sextuples.append(list(sxt))
    covers = set(frozenset(x) for x in combinations(list(sxt),3))
    data = data - covers
    print "%r\t%s" % (list(sxt),len(data))
  print "Completed %s triplets in %s sextuples" % (len(all_triplets),len(sextuples),)

calculate(data_set)

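Completed 2600 triplets in 301 sextuples

I am looking for a more computationally efficient approach than this.

Update

Senderle has provided an interesting solution: dividing the dataset into pairs and then generating all possible triplets of the pairs. This is definitely better than anything I'd come up with.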
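
To make the pairing idea concrete, here is a minimal sketch in Python 3. It is not Senderle's code: the function name `pair_sextuplets` is hypothetical, and it assumes an even number of items (at least 6). It works because any three values fall into at most three pairs, and every choice of three pairs is emitted.

```python
from itertools import combinations

def pair_sextuplets(items):
    """Yield 6-tuples built from every choice of 3 adjacent pairs.

    Assumes len(items) is even and at least 6 (illustrative only).
    """
    pairs = [tuple(items[i:i + 2]) for i in range(0, len(items), 2)]
    for a, b, c in combinations(pairs, 3):
        yield a + b + c

data_set = list('abcdefghijklmnopqrstuvwxyz')
sxts = list(pair_sextuplets(data_set))

# Check that every triplet appears in at least one sextuplet.
covered = {t for s in sxts for t in combinations(s, 3)}
print(len(sxts))                                   # 286
print(covered == set(combinations(data_set, 3)))   # True
```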

Here is a quick function to check whether all triplets are covered and to assess the redundancy of the triplet coverage:

from itertools import combinations
def check_coverage(data_set,sextuplets):
  all_triplets = dict.fromkeys(combinations(data_set,3),0)
  sxt_count = 0
  for sxt in sextuplets:
    sxt_count += 1
    for triplet in combinations(sxt,3):
      all_triplets[triplet] += 1
  total = len(all_triplets)
  biggest_overlap = overlap = nohits = onehits = morehits = 0
  for k,v in all_triplets.iteritems():
    if v == 0:
      nohits += 1
    elif v == 1:
      onehits += 1
    else:
      morehits += 1
      overlap += v - 1
    if v > biggest_overlap:
      biggest_overlap = v
  print "All Triplets in dataset: %6d" % (total,)
  print "Total triplets from sxt: %6d" % (total + overlap,)
  print "Number of sextuples:     %6d\n" % (sxt_count,)
  print "Missed  %6d of %6d: %6.1f%%" % (nohits,total,100.0*nohits/total)
  print "HitOnce %6d of %6d: %6.1f%%" % (onehits,total,100.0*onehits/total)
  print "HitMore %6d of %6d: %6.1f%%" % (morehits,total,100.0*morehits/total)
  print "Overlap %6d of %6d: %6.1f%%" % (overlap,total,100.0*overlap/total)
  print "Biggest Overlap: %3d" % (biggest_overlap,)

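Using Senderle's sextuplets generator, I'm fascinated to observe that the repeated triplets are localised and that, as the dataset grows, the repeats become proportionally more localised and the peak repeat count larger.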

>>> check_coverage(range(26),sextuplets(range(26)))
All Triplets in dataset:   2600
Total triplets from sxt:   5720
Number of sextuples:        286

Missed       0 of   2600:    0.0%
HitOnce   2288 of   2600:   88.0%
HitMore    312 of   2600:   12.0%
Overlap   3120 of   2600:  120.0%
Biggest Overlap:  11

>>> check_coverage(range(40),sextuplets(range(40)))
All Triplets in dataset:   9880
Total triplets from sxt:  22800
Number of sextuples:       1140

Missed       0 of   9880:    0.0%
HitOnce   9120 of   9880:   92.3%
HitMore    760 of   9880:    7.7%
Overlap  12920 of   9880:  130.8%
Biggest Overlap:  18

>>> check_coverage(range(80),sextuplets(range(80)))
All Triplets in dataset:  82160
Total triplets from sxt: 197600
Number of sextuples:       9880

Missed       0 of  82160:    0.0%
HitOnce  79040 of  82160:   96.2%
HitMore   3120 of  82160:    3.8%
Overlap 115440 of  82160:  140.5%
Biggest Overlap:  38
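
The three runs above are consistent with a simple count, assuming an even number n of items paired up as (0,1), (2,3), and so on: a triplet is hit more than once exactly when two of its values share a pair, and such a triplet sits in every sextuplet that contains its two pairs. A sketch in Python 3.8+ (the function name `expected_stats` is hypothetical):

```python
from math import comb

def expected_stats(n):
    """Predicted counts for the pair-based generator on n items (n even, n >= 6)."""
    pairs = n // 2
    sextuples = comb(pairs, 3)    # one sextuple per choice of 3 pairs
    hit_more = pairs * (n - 2)    # a whole pair plus one value from elsewhere
    biggest = pairs - 2           # choices for the third pair
    return sextuples, hit_more, biggest

for n in (26, 40, 80):
    print(n, expected_stats(n))
# 26 (286, 312, 11)
# 40 (1140, 760, 18)
# 80 (9880, 3120, 38)
```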

Answer 1

Try the combinations function in the itertools module:

from itertools import combinations

for triplet in combinations(dataset, 3):
    print triplet

Answer 2

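I believe the following produces the correct results. It relies on the intuition that to generate all necessary sextuplets, all that is necessary is to generate all possible combinations of arbitrary pairs of items. This "mixes" values together well enough that all possible triplets are represented.

There's a slight wrinkle. For an odd number of items, one pair isn't a pair at all, so you can't generate a sextuplet from it, but the value still needs to be represented. The code does some gymnastics to sidestep that problem; there might be a better way, but I'm not sure what it is.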

from itertools import izip_longest, islice, combinations

def sextuplets(seq, _fillvalue=object()):
    if len(seq) < 6:
        yield tuple(seq)
        return
    it = iter(seq)
    pairs = izip_longest(it, it, fillvalue=_fillvalue)
    sextuplets = (a + b + c for a, b, c in combinations(pairs, 3))
    for st in sextuplets:
        if st[-1] == _fillvalue:
            # replace fill value with valid item not in sextuplet
            # while maintaining original order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[0:i] + (y,) + st[i:-1]
                    break
        yield st
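
To see the wrinkle in isolation, here is what the pairing step produces for an odd-length input. This sketch uses Python 3, where `izip_longest` is named `zip_longest`; the `FILL` sentinel is an illustrative stand-in for `_fillvalue`:

```python
from itertools import zip_longest

FILL = object()  # placeholder marking the missing partner

seq = list(range(7))
it = iter(seq)
# Passing the same iterator twice consumes items two at a time.
pairs = list(zip_longest(it, it, fillvalue=FILL))

print(pairs[:3])             # [(0, 1), (2, 3), (4, 5)]
print(pairs[-1][0])          # 6
print(pairs[-1][1] is FILL)  # True
```

The last "pair" carries the fill value, which is why the generator has to swap in a real item before yielding a sextuplet built from it.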

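I tested this on sequences of items of length 10 to 80, and it generates correct results in all cases. I don't have a proof that this will give correct results for all sequences, though. I also don't have a proof that this is a minimal set of sextuplets. But I'd love to hear a proof of either, if anyone can come up with one.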

>>> def gen_triplets_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return set(t for trip in triplets for t in trip)
... 
>>> test_items = [xrange(n) for n in range(10, 80)]
>>> triplets = [set(combinations(i, 3)) for i in test_items]
>>> st_triplets = [gen_triplets_from_sextuplets(sextuplets(i)) 
                   for i in test_items]
>>> all(t == s for t, s in zip(triplets, st_triplets))
True

Having said that, I'll point out again that this is an inefficient way to actually generate triplets, since it produces duplicates.

>>> def gen_triplet_list_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return list(t for trip in triplets for t in trip)
... 
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(10)))
>>> len(tlist)
200
>>> len(set(tlist))
120
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(80)))
>>> len(tlist)
197600
>>> len(set(tlist))
82160

And in fact, although theoretically you should get a speedup...

>>> len(list(sextuplets(range(80))))
9880

... itertools.combinations still outperforms sextuplets for small sequences:

>>> %timeit list(sextuplets(range(20)))
10000 loops, best of 3: 68.4 us per loop
>>> %timeit list(combinations(range(20), 3))
10000 loops, best of 3: 55.1 us per loop

And it's still competitive with sextuplets for medium-sized sequences:

>>> %timeit list(sextuplets(range(200)))
10 loops, best of 3: 96.6 ms per loop
>>> %timeit list(combinations(range(200), 3))
10 loops, best of 3: 167 ms per loop

I'm not sure this is worth it unless you're working with very large sequences. (Still, it was an interesting problem.)

All triplet combinations, 6 values at a time
