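I have a pool of samples and want to randomly draw a subset of a defined length from it, then repeat the process until every sample has appeared 3 times, with no sample appearing twice in any given row.
For example: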
samples=range(12)
l=6
repeats=3
I want 6 rows of 6 samples each. I would like something like:
[1, 2, 11, 7, 0, 3]
[2, 5, 0, 7, 10, 3]
[11, 0, 8, 7, 6, 1]
[4, 11, 5, 9, 3, 6]
[4, 9, 8, 1, 10, 2]
[9, 5, 6, 4, 8, 10]
I tried the following, but it only works when the samples happen (by chance) to be drawn evenly. Usually I get
ValueError: sample larger than population
Code:
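import random
samples = list(range(12))  # a list, so that remove() works
measured = {key: 0 for key in samples}
while len(samples) > 0:
    # Raises ValueError once fewer than 6 samples remain in the pool.
    sample = random.sample(samples, 6)
    print(sample)
    for s in sample:
        measured[s] += 1
        if measured[s] == 3:
            samples.remove(s)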
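I wondered whether there is a way to start from numpy.random.choice or itertools.permutations, but because of the constraints above these approaches don't work.
Am I overlooking a sampling method, or do I need to resort to nested loops/ifs?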
Answer 0 (score: 1)
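Now that you've clarified what you want, here is a revised version of my original answer: a pure-Python implementation based on the constraints. Changing the original answer was easy, so I also added code to limit the number of iterations and to print a short report at the end confirming that the result meets all the criteria.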
from collections import Counter
from itertools import chain
from pprint import pprint
import random
def pick_subset(population, length, repeat, max_iterations=1000000):
    iterations = 0
    while iterations < max_iterations:
        # Get subset where every sample value occurs exactly "repeat" times.
        while iterations < max_iterations:
            iterations += 1
            subset = [random.sample(population, length) for i in range(length)]
            measure = Counter(chain.from_iterable(subset))
            if all(count == repeat for count in measure.values()):
                break
        # Check that no sample occurs more than once in any row.
        if all(all(count < 2 for count in Counter(row).values())
               for row in subset):
            break
    if iterations >= max_iterations:
        raise RuntimeError("Couldn't match criteria after {:,d}".format(iterations))
    else:
        print('Succeeded after {:,d} iterations'.format(iterations))
        return subset
samples = range(12)
length = 6
repeat = 3
subset = pick_subset(samples, length, repeat)
print('')
print('Selected subset:')
pprint(subset)
# Show that each sample occurs exactly three times.
freq_counts = Counter(chain.from_iterable(subset))
print('')
print('Overall sample frequency counts:')
print(', '.join(
    '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in sorted(freq_counts.items())))
# Show that no sample occurs more than once in each row.
print('')
print('Sample frequency counts per row:')
for i, row in enumerate(subset):
    freq_counts = Counter(row)
    print('  row[{}]: {}'.format(i, ', '.join(
        '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in sorted(freq_counts.items()))))
Sample output:
Succeeded after 123,847 iterations
Selected subset:
[[4, 9, 10, 2, 5, 7],
[5, 8, 6, 0, 11, 1],
[1, 8, 3, 10, 7, 0],
[7, 3, 2, 4, 11, 9],
[0, 10, 11, 6, 1, 2],
[8, 3, 9, 4, 6, 5]]
Overall sample frequency counts:
0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3
Sample frequency counts per row:
row[0]: 2: 1, 4: 1, 5: 1, 7: 1, 9: 1, 10: 1
row[1]: 0: 1, 1: 1, 5: 1, 6: 1, 8: 1, 11: 1
row[2]: 0: 1, 1: 1, 3: 1, 7: 1, 8: 1, 10: 1
row[3]: 2: 1, 3: 1, 4: 1, 7: 1, 9: 1, 11: 1
row[4]: 0: 1, 1: 1, 2: 1, 6: 1, 10: 1, 11: 1
row[5]: 3: 1, 4: 1, 5: 1, 6: 1, 8: 1, 9: 1
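The rejection loop above can need many attempts (123,847 in the sample run). As a hedged alternative, and not part of the original answer, the rows can also be built directly so that no retry is needed. The sketch below assumes the population items are distinct and that len(population) * repeat is divisible by length; the function name draw_rows and its rng parameter are hypothetical. In each row it first takes every item that must appear in all remaining rows, then fills the rest at random from items that still have appearances left. It makes no claim to sample uniformly from all valid arrangements.

```python
import random
from collections import Counter


def draw_rows(population, length, repeat, rng=random):
    """Build rows of `length` distinct items; each item appears `repeat` times.

    Hypothetical helper: assumes the items in `population` are distinct.
    """
    population = list(population)
    total = len(population) * repeat
    if total % length:
        raise ValueError("len(population) * repeat must be divisible by length")
    n_rows = total // length
    if repeat > n_rows:
        raise ValueError("repeat cannot exceed the number of rows")

    remaining = {item: repeat for item in population}
    rows = []
    for rows_left in range(n_rows, 0, -1):
        # Items that still need one appearance in every remaining row
        # have to go into this row.
        forced = [item for item, c in remaining.items() if c == rows_left]
        # Everything else that still has appearances left is optional.
        optional = [item for item, c in remaining.items() if 0 < c < rows_left]
        row = forced + rng.sample(optional, length - len(forced))
        rng.shuffle(row)
        for item in row:
            remaining[item] -= 1
        rows.append(row)
    return rows


rows = draw_rows(range(12), 6, 3)
for row in rows:
    print(row)
```

Because every remaining count stays at or below the number of rows left, each row can always be filled, so the loop finishes in exactly len(population) * repeat / length steps.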
Answer 1 (score: 1)
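I may be misunderstanding, but based on your title, what you actually want is a grid of numbers drawn from samples that satisfies the following conditions: no value is repeated within a row or column, and each element of samples is repeated at most repeats times.
I don't think there is an easy way to do this, because each element in the grid depends on the other items in the grid.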
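One possible solution is to fill the grid one element at a time, from the first element (top left) to the last (bottom right). At each position, you choose randomly from a set of "valid" values: those that have not yet been chosen for that row or column and that have not yet been chosen repeats times.
However, this approach is not guaranteed to find a solution every time. You can define a function that keeps searching for such an arrangement until it finds one.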
Here is an implementation I came up with using numpy:
import numpy as np
samples=range(12)
l=6
repeats=3
def try_make_grid(samples, l, repeats, max_tries=10):
    try_number = 0
    while try_number < max_tries:
        try:
            # initialize l x l grid to nan
            grid = np.full((l, l), np.nan)
            counts = {s: 0 for s in samples}  # counts of each sample
            count_exhausted = set()  # which samples have been exhausted
            for i in range(l):
                for j in range(l):
                    # can't use values that already happened in this row or column
                    invalid_values = set(np.concatenate([grid[:,j], grid[i,:]]))
                    valid_values = [
                        v for v in samples if v not in invalid_values|count_exhausted
                    ]
                    # raises ValueError when valid_values is empty (dead end)
                    this_choice = np.random.choice(a=valid_values)
                    grid[i,j] = this_choice
                    # update the count and check to see if this_choice is now exhausted
                    counts[this_choice] += 1
                    if counts[this_choice] >= repeats:
                        count_exhausted.add(this_choice)
            print("Successful on try number %d" % try_number)
            return grid
        except ValueError:
            try_number += 1
    print("Unsuccessful")
Example grid:
np.random.seed(42)
grid = try_make_grid(samples, l, repeats)
#Successful on try number 6
print(grid)
#[[10. 5. 8. 11. 3. 0.]
# [ 0. 11. 4. 8. 2. 5.]
# [ 1. 6. 0. 2. 7. 3.]
# [ 3. 2. 7. 10. 11. 9.]
# [ 4. 1. 9. 6. 8. 7.]
# [ 6. 9. 10. 5. 1. 4.]]
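As you can see, every row and every column contains unique values, and each value is chosen no more than repeats times (in this case, each was chosen exactly repeats times).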
from collections import Counter
print(Counter(grid.ravel()))
#Counter({10.0: 3,
# 5.0: 3,
# 8.0: 3,
# 11.0: 3,
# 3.0: 3,
# 0.0: 3,
# 4.0: 3,
# 2.0: 3,
# 1.0: 3,
# 6.0: 3,
# 7.0: 3,
# 9.0: 3})
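The row and column claim can be checked in the same spirit as the Counter output. The following is a minimal sketch in plain Python, assuming the grid printed above is copied in as integers; the helper name has_no_duplicates is hypothetical.

```python
from collections import Counter

# Grid copied from the printed output above, written as plain ints.
grid = [
    [10, 5, 8, 11, 3, 0],
    [0, 11, 4, 8, 2, 5],
    [1, 6, 0, 2, 7, 3],
    [3, 2, 7, 10, 11, 9],
    [4, 1, 9, 6, 8, 7],
    [6, 9, 10, 5, 1, 4],
]


def has_no_duplicates(lines):
    """Return True if no line contains the same value twice."""
    return all(len(set(line)) == len(line) for line in lines)


rows_ok = has_no_duplicates(grid)
cols_ok = has_no_duplicates(list(zip(*grid)))  # zip(*grid) transposes the grid

# Each value may appear at most `repeats` (here 3) times overall.
counts = Counter(value for row in grid for value in row)
counts_ok = all(counts[s] <= 3 for s in range(12))

print(rows_ok, cols_ok, counts_ok)
```

The same three checks could be run on the array returned by try_make_grid after converting it with grid.astype(int).tolist().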