Justification of the constants used in random.sample

Time: 2018-03-26 16:58:36

Tags: python python-3.x random sample

I am looking at the source code of the function sample in random.py (Python standard library).

The idea is simple:

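  • If a small sample (k) is needed from a large population (n): just pick k random indices, since with such a large population you are unlikely to pick the same index twice. If you do, just pick again.
  • If a relatively large sample (k) is needed compared to the total population (n): it is better to keep track of what you have already picked.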

My question

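It concerns a few constants: setsize = 21 and setsize += 4 ** _log(3*k,4). Ignoring the _ceil rounding in the actual code, the critical ratio is roughly k : 21 + 3k. The comments say "# size of a small set minus size of an empty list" and "# table size for big sets".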
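To make the threshold concrete, here is a minimal sketch that reproduces only the setsize computation from the source shown in the Code section, using math.ceil and math.log in place of the module's _ceil and _log aliases. The helper names set_size_estimate and uses_pool are made up for illustration and are not part of random.py.

```python
from math import ceil, log


def set_size_estimate(k):
    """Mirror the setsize computation in random.sample for sample size k."""
    setsize = 21                              # constant from the source
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))   # rounded up to a power of 4
    return setsize


def uses_pool(n, k):
    """True if sample() would take the list-based pool branch (n <= setsize)."""
    return n <= set_size_estimate(k)


for k in (5, 6, 21, 22):
    print(k, set_size_estimate(k))
# prints: 5 21, 6 85, 21 85, 22 277 (one pair per line)
```

Because of the rounding up to a power of 4, the table term lies between 3k and 12k, so k : 21 + 3k is a lower bound on the threshold rather than an exact ratio.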

  • Where do these specific numbers come from? What is the justification?

The comments shed some light, but I find they raise as many questions as they answer.

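  • I would understand "size of a small set", but I find "minus size of an empty list" confusing. Can someone shed some light on this?
  • What exactly is meant by "table" size, as opposed to, say, "set size"?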

Looking at the GitHub repository, a very old version appears to have simply used k : 6 * k as the critical ratio, but I find that equally mysterious.

The code

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result

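(I apologize if this question would be better placed on math.stackexchange. I can't think of any probabilistic or statistical reason for this particular ratio, and the comments sound as though it may have to do with the amount of space that sets and lists use, but I couldn't find any details anywhere.)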

1 answer:

Answer 0: (score: 2)

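This code is trying to determine whether using a list or a set would take more space (rather than trying to estimate the time cost, for some reason).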

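It looks like 21 was the difference between the size of an empty list and the size of a small set on the Python build where this constant was determined, expressed in multiples of the pointer size. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes (160 bytes with 8-byte pointers):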

>>> sys.getsizeof(set()) - sys.getsizeof([])
160

and comparing the 3.6.3 list and set struct definitions with the list and set definitions from the change that introduced this code, 21 seems plausible.
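To repeat the measurement on another build, a small sketch along these lines converts the byte difference into pointer units. It assumes CPython, where sys.getsizeof reports the object's own allocation; struct.calcsize("P") gives the pointer size of the running interpreter. The function name small_set_minus_empty_list is only illustrative, and the result will differ between versions and platforms.

```python
import struct
import sys

POINTER_SIZE = struct.calcsize("P")  # bytes per pointer on this build


def small_set_minus_empty_list():
    """Size difference between set() and [] in bytes and in pointer units."""
    diff_bytes = sys.getsizeof(set()) - sys.getsizeof([])
    return diff_bytes, diff_bytes / POINTER_SIZE


diff_bytes, diff_pointers = small_set_minus_empty_list()
print(f"{diff_bytes} bytes = {diff_pointers:g} pointers")
```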

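I said "the difference between the size of an empty list and a small set" because, both now and at the time, small sets used a hash table stored inside the set struct itself rather than an externally allocated one: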

setentry smalltable[PySet_MINSIZE];

if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets

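This check adds the size of the external table allocated for sets with more than 5 elements, again expressed as a number of pointers. The calculation assumes the set never shrinks, since the sampling algorithm never removes elements. I'm currently not sure whether this calculation is accurate.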
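One way to check the estimate empirically is to grow a set one element at a time and record where its reported size jumps. This is only a sketch: it assumes CPython, where sys.getsizeof on a set appears to include the externally allocated table, and set_growth_points is a hypothetical helper name. The element counts and sizes it reports depend on the interpreter version.

```python
import sys


def set_growth_points(max_items=100):
    """Return (element count, size in bytes) each time the set's size changes."""
    s = set()
    last = sys.getsizeof(s)
    points = []
    for i in range(max_items):
        s.add(i)
        size = sys.getsizeof(s)
        if size != last:              # the hash table was reallocated
            points.append((len(s), size))
            last = size
    return points


for count, size in set_growth_points():
    print(f"resized at {count} elements -> {size} bytes")
```

Comparing these jumps with 4 ** _ceil(_log(k * 3, 4)) pointers for the same k shows how closely the constant tracks a particular build.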

Finally,

if n <= setsize:

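compares the base overhead of a set, plus any space used by an external hash table, against the n pointers needed for a list of the input elements. (It doesn't seem to account for any extra allocation performed by list(population), so it may underestimate the cost of the list.)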
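Whether list(population) really costs more than n pointers can be measured directly. The sketch below, again assuming CPython and using a tuple of integers as a stand-in population, reports how many bytes the copied list occupies beyond an empty list plus n pointers. list_slack is a hypothetical helper name, and the result depends on the interpreter version (it may well be 0).

```python
import struct
import sys

POINTER_SIZE = struct.calcsize("P")  # bytes per pointer on this build


def list_slack(n):
    """Bytes list(population) occupies beyond an empty list plus n pointers."""
    population = tuple(range(n))     # stand-in for the caller's population
    pool = list(population)          # the same copy that sample() makes
    minimum = sys.getsizeof([]) + n * POINTER_SIZE
    return sys.getsizeof(pool) - minimum


for n in (10, 85, 277):
    print(n, list_slack(n))
```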