Question

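I have a number of samples and want to randomly draw subsets of a defined length from them, repeating the process until every sample has appeared 3 times, with no sample occurring twice in any given row.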

For example:

samples=range(12)
l=6
repeats=3

I would like to end up with 6 rows of 6 samples each, something like this:

[1, 2, 11, 7, 0, 3]
[2, 5, 0, 7, 10, 3]
[11, 0, 8, 7, 6, 1]
[4, 11, 5, 9, 3, 6]
[4, 9, 8, 1, 10, 2]
[9, 5, 6, 4, 8, 10]

I tried the following, but it only works when the samples happen (by chance) to be drawn evenly. Usually I get

ValueError: sample larger than population

Code:

import random
samples=range(12)
measured={key:0 for key in samples}
while len(samples)>0:
    sample=random.sample(samples,6)
    print sample
    for s in sample:
        measured[s]+=1
        if measured[s]==3:
            samples.remove(s)

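I wondered whether there is a way to do this with numpy.random.choice, or by starting from itertools.permutations, but because of the constraints above neither approach works as is.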

Am I overlooking a sampling method, or do I need to resort to nested loops/ifs?

Answer 1

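Now that you have clarified what you want, here is a revised version of my original answer: a pure-Python implementation based on your constraints. Changing the original answer was fairly easy, so I also added code to limit the number of iterations and to print a short report at the end confirming that the result meets all the criteria.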

from collections import Counter
from itertools import chain
from pprint import pprint
import random


def pick_subset(population, length, repeat, max_iterations=1000000):
    iterations = 0

    while iterations < max_iterations:
        # Get a subset where every sample value occurs exactly "repeat" times.
        while iterations < max_iterations:
            iterations += 1
            subset = [random.sample(population, length) for i in range(length)]
            measure = Counter(chain.from_iterable(subset))
            if all(count == repeat for count in measure.values()):
                break

        # Check that no sample occurs more than once in any row.
        if all(all(count < 2 for count in Counter(row).values())
               for row in subset):
            break

    if iterations >= max_iterations:
        raise RuntimeError("Couldn't match criteria after {:,d}".format(iterations))
    else:
        print('Succeeded after {:,d} iterations'.format(iterations))
        return subset


samples = range(12)
length = 6
repeat = 3

subset = pick_subset(samples, length, repeat)

print('')
print('Selected subset:')
pprint(subset)

# Show that each sample occurs exactly three times.
freq_counts = Counter(chain.from_iterable(subset))
print('')
print('Overall sample frequency counts:')
print(', '.join(
        '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in freq_counts.items()))


# Show that no sample occurs more than once in any row.
print('')
print('Sample frequency counts per row:')
for i, row in enumerate(subset):
    freq_counts = Counter(row)
    print('  row[{}]: {}'.format(i, ', '.join(
            '{:2d}: {:d}'.format(sample, cnt) for sample, cnt in freq_counts.items())))

Sample output:

Succeeded after 123,847 iterations

Selected subset:
[[4, 9, 10, 2, 5, 7],
 [5, 8, 6, 0, 11, 1],
 [1, 8, 3, 10, 7, 0],
 [7, 3, 2, 4, 11, 9],
 [0, 10, 11, 6, 1, 2],
 [8, 3, 9, 4, 6, 5]]

Overall sample frequency counts:
 0: 3,  1: 3,  2: 3,  3: 3,  4: 3,  5: 3,  6: 3,  7: 3,  8: 3,  9: 3, 10: 3, 11: 3

Sample frequency counts per row:
  row[0]:  2: 1,  4: 1,  5: 1,  7: 1,  9: 1, 10: 1
  row[1]:  0: 1,  1: 1,  5: 1,  6: 1,  8: 1, 11: 1
  row[2]:  0: 1,  1: 1,  3: 1,  7: 1,  8: 1, 10: 1
  row[3]:  2: 1,  3: 1,  4: 1,  7: 1,  9: 1, 11: 1
  row[4]:  0: 1,  1: 1,  2: 1,  6: 1, 10: 1, 11: 1
  row[5]:  3: 1,  4: 1,  5: 1,  6: 1,  8: 1,  9: 1
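
The rejection loop above can need many draws before it hits a valid result (123,847 iterations in the run shown). As a rough sketch of an alternative, and not part of the code above, the rows can be built one at a time while tracking how many appearances each sample still needs. Any sample whose remaining count equals the number of rows left is forced into the current row, and the rest of the row is drawn at random from samples that still have appearances left. The function name `deal_rows` is hypothetical. The sketch assumes the population values are distinct and that `len(population) * repeat` is divisible by `length`. It always finishes in one pass, but it makes no claim that every valid arrangement is equally likely.

```python
import random
from collections import Counter


def deal_rows(population, length, repeat):
    """Hypothetical helper: build rows of `length` distinct samples so that
    every sample appears exactly `repeat` times overall."""
    population = list(population)
    total = len(population) * repeat
    if total % length:
        raise ValueError("len(population) * repeat must be divisible by length")
    n_rows = total // length
    if repeat > n_rows:
        raise ValueError("repeat cannot exceed the number of rows")

    remaining = {s: repeat for s in population}  # appearances still needed
    rows = []
    for rows_left in range(n_rows, 0, -1):
        # Samples that must appear in every remaining row to reach `repeat`.
        forced = [s for s in population if remaining[s] == rows_left]
        # Samples that may appear in this row but do not have to.
        optional = [s for s in population if 0 < remaining[s] < rows_left]
        row = forced + random.sample(optional, length - len(forced))
        random.shuffle(row)
        for s in row:
            remaining[s] -= 1
        rows.append(row)
    return rows


rows = deal_rows(range(12), 6, 3)
for row in rows:
    print(row)
print(Counter(value for row in rows for value in row))
```

The forced/optional split is what keeps the construction from running out of candidates: the remaining counts always sum to `length * rows_left` and none exceeds `rows_left`, so there are always enough optional samples to complete a row.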

Answer 2

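I may be misunderstanding, but based on your title, what you actually want is a grid of numbers drawn from samples that satisfies the following conditions:

The entries in every row and every column are unique
Each element of samples is repeated at most repeats times

I don't think there is a simple way to do this, because every element of the grid depends on the other entries in the grid.

One possible solution is to fill the grid one element at a time, from the first element (top left) to the last (bottom right). At each position you choose at random from the set of "valid" values: those that have not yet been chosen for that row or column and that have not yet been chosen repeats times.

However, this approach is not guaranteed to find a solution every time. You can define a function that keeps trying arrangements until it finds one.

Here is an implementation I came up with using numpy: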

import numpy as np

samples=range(12)
l=6
repeats=3

def try_make_grid(samples, l, repeats, max_tries=10):
    try_number = 0
    while(try_number < max_tries):
        try:
            # initialize lxl grid to nan
            grid = np.zeros((l, l))*np.nan

            counts = {s: 0 for s in samples}  # counts of each sample
            count_exhausted = set()           # which samples have been exhausted
            for i in range(l):
                for j in range(l):
                    # can't use values that already happened in this row or column
                    invalid_values = set(np.concatenate([grid[:,j], grid[i,:]]))
                    valid_values = [
                        v for v in samples if v not in invalid_values|count_exhausted
                    ]
                    this_choice = np.random.choice(a=valid_values)
                    grid[i,j] = this_choice

                    # update the count and check to see if this_choice is now exhausted
                    counts[this_choice] += 1
                    if counts[this_choice] >= repeats:
                        count_exhausted.add(this_choice)
            print("Successful on try number %d" % try_number)
            return grid
        except ValueError:  # np.random.choice raises this when valid_values is empty
            try_number += 1
    print("Unsuccessful")

Example grid:

np.random.seed(42)
grid = try_make_grid(samples, l, repeats)
#Successful on try number 6
print(grid)
#[[10.  5.  8. 11.  3.  0.]
# [ 0. 11.  4.  8.  2.  5.]
# [ 1.  6.  0.  2.  7.  3.]
# [ 3.  2.  7. 10. 11.  9.]
# [ 4.  1.  9.  6.  8.  7.]
# [ 6.  9. 10.  5.  1.  4.]]

As you can see, every row and every column is unique, and no value was chosen more than repeats times (in this case, each was chosen exactly repeats times).

from collections import Counter
print(Counter(grid.ravel()))
#Counter({10.0: 3,
#         5.0: 3,
#         8.0: 3,
#         11.0: 3,
#         3.0: 3,
#         0.0: 3,
#         4.0: 3,
#         2.0: 3,
#         1.0: 3,
#         6.0: 3,
#         7.0: 3,
#         9.0: 3})
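
The uniqueness claims can also be checked programmatically instead of by eye. Below is a minimal sketch of such a check. The helper name `grid_is_valid` is hypothetical, and the sketch assumes the grid is passed as a nested list (for the numpy array above, `grid.tolist()` would produce one). The example grid is copied from the output shown earlier.

```python
from collections import Counter
from itertools import chain


def grid_is_valid(grid, repeats):
    """Hypothetical helper: True if no row or column contains a duplicate
    and no value occurs more than `repeats` times in the whole grid."""
    rows_ok = all(len(set(row)) == len(row) for row in grid)
    # zip(*grid) transposes the nested list, yielding the columns.
    cols_ok = all(len(set(col)) == len(col) for col in zip(*grid))
    counts = Counter(chain.from_iterable(grid))
    counts_ok = all(count <= repeats for count in counts.values())
    return rows_ok and cols_ok and counts_ok


example_grid = [
    [10, 5, 8, 11, 3, 0],
    [0, 11, 4, 8, 2, 5],
    [1, 6, 0, 2, 7, 3],
    [3, 2, 7, 10, 11, 9],
    [4, 1, 9, 6, 8, 7],
    [6, 9, 10, 5, 1, 4],
]
print(grid_is_valid(example_grid, 3))
```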

Pythonic way to create an array with unique entries in rows and columns
