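I'm looking for an algorithm that efficiently generates all three-value combinations of a dataset by picking 6 values at a time.
I am looking for an algorithm to efficiently generate a small set of sextuples (6-value groups) that cumulatively express all possible 3-value combinations of a dataset.
For instance, computing hands of 6 playing cards that between them express all possible 3-card combinations.
For example, given the dataset: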
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
A first "pick" of 6 values might be:
['a','b','c','d','e','f']
which covers these three-value combinations:
('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e'), ('a', 'b', 'f'), ('a', 'c', 'd'),
('a', 'c', 'e'), ('a', 'c', 'f'), ('a', 'd', 'e'), ('a', 'd', 'f'), ('a', 'e', 'f'),
('b', 'c', 'd'), ('b', 'c', 'e'), ('b', 'c', 'f'), ('b', 'd', 'e'), ('b', 'd', 'f'),
('b', 'e', 'f'), ('c', 'd', 'e'), ('c', 'd', 'f'), ('c', 'e', 'f'), ('d', 'e', 'f')
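This is obviously possible by brute force: keep picking 6 values and crossing off the triplets they cover until none remain.
In this example there are 2600 possible triplets, (26*25*24)/(3*2*1) == 2600, and the "brute force" method above represents all of them in about 301 6-value groups.
However, it feels like there ought to be a more efficient way to achieve this.
My preferred language is python, but I intend to implement this in C++.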
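A quick counting argument puts a floor under how few sextuples could possibly work: each group of 6 covers C(6,3) = 20 triplets, so 26 values need at least 2600 / 20 = 130 groups. The sketch below (Python 3.8+ for `math.comb`; the function names are made up for illustration) computes that floor and the slightly stronger Schönheim bound from covering-design theory. Neither bound says a cover of that size actually exists; they only indicate how far 301 could be from the best case.

```python
from math import comb

def counting_bound(n, k=6, t=3):
    """Fewest k-subsets that could cover every t-subset of n items,
    assuming no t-subset is ever covered twice (ceiling division)."""
    return -(-comb(n, t) // comb(k, t))

def schonheim_bound(n, k=6, t=3):
    """Schönheim lower bound: nested ceilings, evaluated innermost first."""
    bound = 1
    for i in reversed(range(t)):
        # ceil((n - i) * bound / (k - i)) using integer arithmetic only
        bound = -(-(n - i) * bound // (k - i))
    return bound

print(comb(26, 3))          # 2600 triplets in total
print(counting_bound(26))   # 130
print(schonheim_bound(26))  # 130
```

For larger inputs the two bounds can differ, for example `counting_bound(40)` is 494 while `schonheim_bound(40)` is 520.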
Update
Here is my python code for the "brute force" approach:
from itertools import combinations

data_set = list('abcdefghijklmnopqrstuvwxyz')

def calculate(data_set):
    # every triplet as a frozenset, so item order does not matter
    all_triplets = list(frozenset(x) for x in combinations(data_set, 3))
    data = set(all_triplets)
    sextuples = []
    while data:
        # grow a group of 6 values from triplets that are still uncovered
        sxt = set()
        for item in data:
            nxt = sxt | item
            if len(nxt) > 6:
                continue
            sxt = nxt
            if len(nxt) == 6:
                break
        sextuples.append(list(sxt))
        # cross off every triplet this group covers
        covers = set(frozenset(x) for x in combinations(list(sxt), 3))
        data = data - covers
        print("%r\t%s" % (list(sxt), len(data)))
    print("Completed %s triplets in %s sextuples" % (len(all_triplets), len(sextuples)))

calculate(data_set)
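Completed 2600 triplets in 301 sextuples
I'm looking for something more computationally efficient than this.
Update
Senderle has provided an interesting solution: divide the dataset into pairs, then generate all possible triplets of the pairs. This is definitely better than anything I had come up with.
Here is a quick function to check whether all triplets are covered and to assess the redundancy of the triplet coverage: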
from itertools import combinations

def check_coverage(data_set, sextuplets):
    # hit count per triplet; assumes each sextuplet lists its items in data_set order
    all_triplets = dict.fromkeys(combinations(data_set, 3), 0)
    sxt_count = 0
    for sxt in sextuplets:
        sxt_count += 1
        for triplet in combinations(sxt, 3):
            all_triplets[triplet] += 1
    total = len(all_triplets)
    biggest_overlap = overlap = nohits = onehits = morehits = 0
    for k, v in all_triplets.items():
        if v == 0:
            nohits += 1
        elif v == 1:
            onehits += 1
        else:
            morehits += 1
            overlap += v - 1
        if v > biggest_overlap:
            biggest_overlap = v
    print("All Triplets in dataset: %6d" % (total,))
    print("Total triplets from sxt: %6d" % (total + overlap,))
    print("Number of sextuples: %6d\n" % (sxt_count,))
    print("Missed %6d of %6d: %6.1f%%" % (nohits, total, 100.0 * nohits / total))
    print("HitOnce %6d of %6d: %6.1f%%" % (onehits, total, 100.0 * onehits / total))
    print("HitMore %6d of %6d: %6.1f%%" % (morehits, total, 100.0 * morehits / total))
    print("Overlap %6d of %6d: %6.1f%%" % (overlap, total, 100.0 * overlap / total))
    print("Biggest Overlap: %3d" % (biggest_overlap,))
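For anyone who wants the same figures as data rather than printed text, here is a minimal Python 3 sketch. It is not part of the original function: `coverage_stats` and its dictionary keys are names chosen for illustration, and it counts triplets as frozensets so the order of items inside each group does not matter.

```python
from collections import Counter
from itertools import combinations

def coverage_stats(data_set, sextuplets):
    """Count how often each 3-item combination of data_set is covered."""
    hits = Counter()
    n_groups = 0
    for sxt in sextuplets:
        n_groups += 1
        # frozenset keys make the count independent of item order in a group
        hits.update(frozenset(t) for t in combinations(set(sxt), 3))
    all_triplets = [frozenset(t) for t in combinations(data_set, 3)]
    counts = [hits[t] for t in all_triplets]
    return {
        "triplets": len(all_triplets),
        "groups": n_groups,
        "missed": sum(c == 0 for c in counts),
        "hit_once": sum(c == 1 for c in counts),
        "hit_more": sum(c > 1 for c in counts),
        "overlap": sum(c - 1 for c in counts if c > 1),
        "biggest_overlap": max(counts, default=0),
    }

# Two overlapping groups over 7 letters: the 5 triplets containing
# both 'a' and 'g' are missed, the 10 triplets inside 'bcdef' are hit twice.
stats = coverage_stats("abcdefg", ["abcdef", "bcdefg"])
print(stats["missed"], stats["hit_once"], stats["hit_more"])  # 5 20 10
```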
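Using Senderle's sextuplets generator, I was fascinated to observe that the repeated triplets are localized: as the dataset grows, the repeats become proportionally more localized and the peak repeat count gets larger.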
>>> check_coverage(range(26),sextuplets(range(26)))
All Triplets in dataset:   2600
Total triplets from sxt:   5720
Number of sextuples:    286

Missed      0 of   2600:    0.0%
HitOnce   2288 of   2600:   88.0%
HitMore    312 of   2600:   12.0%
Overlap   3120 of   2600:  120.0%
Biggest Overlap:  11

>>> check_coverage(range(40),sextuplets(range(40)))
All Triplets in dataset:   9880
Total triplets from sxt:  22800
Number of sextuples:   1140

Missed      0 of   9880:    0.0%
HitOnce   9120 of   9880:   92.3%
HitMore    760 of   9880:    7.7%
Overlap  12920 of   9880:  130.8%
Biggest Overlap:  18

>>> check_coverage(range(80),sextuplets(range(80)))
All Triplets in dataset:  82160
Total triplets from sxt: 197600
Number of sextuples:   9880

Missed      0 of  82160:    0.0%
HitOnce  79040 of  82160:   96.2%
HitMore   3120 of  82160:    3.8%
Overlap 115440 of  82160:  140.5%
Biggest Overlap:  38
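The localization seems to have a simple explanation, at least for an even number of items n split into m = n/2 pairs. A triplet whose three items come from three different pairs is covered exactly once, while a triplet that uses both halves of one pair is covered once for every choice of third pair, that is m - 2 times. The Python 3 sketch below is my own reading of the figures above, not something from Senderle's answer, and `predicted_overlap` is a made-up name.

```python
from math import comb

def predicted_overlap(n):
    """Expected check_coverage figures for the pair-based generator (even n only)."""
    if n % 2 or n < 8:
        raise ValueError("this sketch only handles even n >= 8")
    m = n // 2                   # number of pairs
    hit_more = m * (n - 2)       # a whole pair plus any one other item
    per_triplet = m - 2          # choices for the third pair in the sextuple
    return {
        "sextuples": comb(m, 3),
        "hit_once": comb(n, 3) - hit_more,
        "hit_more": hit_more,
        "overlap": hit_more * (per_triplet - 1),
        "biggest_overlap": per_triplet,
    }

print(predicted_overlap(26))
```

For n = 26 this gives 286 sextuples, 2288 triplets hit once, 312 hit more than once, an overlap of 3120 and a biggest overlap of 11, matching the first run above.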
Answer 0 (score: 2)
Try the combinations function in the itertools module:
from itertools import combinations

for triplet in combinations(dataset, 3):
    print(triplet)
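Answer 1 (score: 1)
I believe the following produces correct results. It relies on the intuition that, to generate all the necessary sextuplets, it is enough to split the items into arbitrary pairs and generate every possible combination of three pairs. This "mixes" the values well enough that every possible triplet is represented.
There is a slight wrinkle. With an odd number of items, one "pair" isn't a pair at all, so you can't build a full sextuplet from it, but its value still needs to be represented. The code does some gymnastics to sidestep that problem; there may be a better way, but I'm not sure what it is.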
from itertools import islice, combinations
try:
    from itertools import izip_longest                   # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest    # Python 3

def sextuplets(seq, _fillvalue=object()):
    if len(seq) < 6:
        # too few items for a full sextuplet: yield them all as one group
        yield tuple(seq)
        return
    it = iter(seq)
    # consecutive pairs; an odd item out is paired with the fill value
    pairs = izip_longest(it, it, fillvalue=_fillvalue)
    sextuplets = (a + b + c for a, b, c in combinations(pairs, 3))
    for st in sextuplets:
        if st[-1] == _fillvalue:
            # replace fill value with valid item not in sextuplet
            # while maintaining original order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[0:i] + (y,) + st[i:-1]
                    break
        yield st
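For readers on Python 3 only, here is a hedged port of the same idea, together with a small exhaustive check. `sextuplets_py3` is a name I made up; it copies the input into a list and compares against the sentinel with `is`, which are my choices rather than part of the answer above. It assumes the items in `seq` are distinct.

```python
from itertools import combinations, zip_longest

def sextuplets_py3(seq):
    """Yield 6-tuples built from every 3-combination of consecutive pairs."""
    seq = list(seq)
    if len(seq) < 6:
        yield tuple(seq)
        return
    fill = object()                      # sentinel for the odd item out
    it = iter(seq)
    pairs = zip_longest(it, it, fillvalue=fill)
    for a, b, c in combinations(pairs, 3):
        st = a + b + c
        if st[-1] is fill:
            # swap the sentinel for the first item of seq missing from st,
            # keeping the tuple in seq order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[:i] + (y,) + st[i:-1]
                    break
        yield st

# Check that every triplet of range(n) appears in some sextuplet.
for n in range(6, 31):
    groups = list(sextuplets_py3(range(n)))
    covered = {t for g in groups for t in combinations(g, 3)}
    assert covered == set(combinations(range(n), 3))
```

The check only covers the lengths it runs on; it is evidence, not a proof.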
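I tested this on sequences of lengths 10 through 79 (range(10, 80)), and it generated correct results in every case. I don't have a proof that it gives correct results for all sequences, though, nor a proof that this is a minimal set of sextuplets. I'd love to hear a proof of either, if anyone can come up with one.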
>>> def gen_triplets_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return set(t for trip in triplets for t in trip)
...
>>> test_items = [xrange(n) for n in range(10, 80)]
>>> triplets = [set(combinations(i, 3)) for i in test_items]
>>> st_triplets = [gen_triplets_from_sextuplets(sextuplets(i))
...                for i in test_items]
>>> all(t == s for t, s in zip(triplets, st_triplets))
True
I've said it already, but I'll point out again that this is an inefficient way to actually generate triplets, because it produces duplicates.
>>> def gen_triplet_list_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return list(t for trip in triplets for t in trip)
...
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(10)))
>>> len(tlist)
200
>>> len(set(tlist))
120
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(80)))
>>> len(tlist)
197600
>>> len(set(tlist))
82160
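Those counts follow directly from the construction: each sextuplet expands to C(6,3) = 20 triplets, and there is one sextuplet per combination of three pairs. A small Python 3 sketch of the arithmetic (the function name is hypothetical, and it assumes at least 6 items):

```python
from math import comb

def triplet_counts(n):
    """Return (triplets emitted via sextuplets, distinct triplets) for n items."""
    pairs = (n + 1) // 2           # an odd item out still forms its own "pair"
    emitted = comb(pairs, 3) * comb(6, 3)
    return emitted, comb(n, 3)

print(triplet_counts(10))   # (200, 120)
print(triplet_counts(80))   # (197600, 82160)
```

For large even n the ratio of emitted to distinct triplets approaches 2.5, so the duplication does not go away as the input grows.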
In fact, although in theory you should see a speedup...
>>> len(list(sextuplets(range(80))))
9880
... itertools.combinations still outperforms sextuplets for smallish sequences:
>>> %timeit list(sextuplets(range(20)))
10000 loops, best of 3: 68.4 us per loop
>>> %timeit list(combinations(range(20), 3))
10000 loops, best of 3: 55.1 us per loop
For medium-sized sequences, itertools.combinations is still competitive with sextuplets:
>>> %timeit list(sextuplets(range(200)))
10 loops, best of 3: 96.6 ms per loop
>>> %timeit list(combinations(range(200), 3))
10 loops, best of 3: 167 ms per loop
Unless you're working with very large sequences, I'm not sure this is worth it. (Still, it was an interesting problem.)