Question

I wrote a function that returns a dictionary mapping each element of the input list to the probability of picking that item from the list:

from collections import Counter

def proba(x):
    n = len(x)
    return {key: val/n for key, val in dict(Counter(x)).items()}

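Is there a faster solution? I don't need the output as key: value pairs, as long as the order of the output probabilities corresponds to the order of the input elements.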

Answer 1

In a comment on Eelco's answer, you wrote

If the input is np.random.randint(low=0, high=100, size=50000) ...

numpy_indexed has some powerful tools, but for data like this you can get better performance with numpy.bincount:

In [11]: import numpy as np

In [12]: import numpy_indexed as npi

In [13]: x = np.random.randint(low=0, high=100, size=50000)

Here is the computation using numpy.bincount. The result is an array of length x.max()+1.

In [14]: np.bincount(x)/len(x)
Out[14]: 
array([ 0.01066,  0.01022,  0.01048,  0.00994,  0.01026,  0.00972,
        0.0107 ,  0.00962,  0.0098 ,  0.00922,  0.00996,  0.01038,
        0.01024,  0.01118,  0.01012,  0.01098,  0.00988,  0.00996,
        0.00974,  0.0097 ,  0.00994,  0.01004,  0.0099 ,  0.01034,
        0.01066,  0.01032,  0.01042,  0.00896,  0.00958,  0.01008,
        0.01038,  0.00974,  0.01068,  0.00952,  0.00998,  0.00932,
        0.00964,  0.0103 ,  0.0099 ,  0.0093 ,  0.0101 ,  0.01012,
        0.0097 ,  0.00988,  0.0099 ,  0.01076,  0.01008,  0.0097 ,
        0.00986,  0.00998,  0.00976,  0.00984,  0.01008,  0.01008,
        0.00938,  0.00998,  0.00976,  0.0093 ,  0.00974,  0.00958,
        0.00984,  0.01032,  0.00988,  0.01014,  0.01088,  0.01006,
        0.0097 ,  0.01026,  0.00952,  0.01002,  0.00938,  0.01024,
        0.01024,  0.00984,  0.00922,  0.01044,  0.0101 ,  0.01052,
        0.01002,  0.00996,  0.0101 ,  0.00976,  0.00986,  0.01062,
        0.01064,  0.01008,  0.00992,  0.00972,  0.01006,  0.01026,
        0.01018,  0.01044,  0.0092 ,  0.00982,  0.00994,  0.00958,
        0.00958,  0.01012,  0.01024,  0.00996])
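
Because index i of the bincount result holds the probability of the value i, the array can be indexed with x itself to get one probability per input element, in input order, which is what the question asks for. The following is a minimal sketch, assuming x contains only non-negative integers (np.bincount requires this); the names probs and per_element are chosen here for illustration:

```python
import numpy as np

x = np.array([3, 1, 3, 0, 3, 1])

# probs[i] is the relative frequency of the value i in x
probs = np.bincount(x) / len(x)

# Indexing with x maps every input element to its probability,
# preserving the input order
per_element = probs[x]
```

Values that never occur in x (here, 2) simply get probability 0 in probs.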

Here is a timing comparison; note the change in units between the two results:

In [24]: %timeit npi.count(x)[1]/len(x)
1.35 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [25]: %timeit np.bincount(x)/len(x)
76.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 2

This approach is faster 97.6% of the time:

def proba_2(x):
    n = len(x)
    single_prob = 1/n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d

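The margin is not significant, though: the average difference over 1000 runs was 0.006 (seconds per 10,000 calls, as reported by timeit). Essentially, your code is already algorithmically optimal (it is O(n)); what remains is micro-optimization.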

The full test code:

from collections import Counter
from timeit import Timer
import random

def proba_1(x):
    n = len(x)
    return {key: val/n for key, val in dict(Counter(x)).items()}

def proba_2(x):
    n = len(x)
    single_prob = 1/n
    d = {}
    for i in x:
        if i in d:
            d[i] += single_prob
        else:
            d[i] = single_prob
    return d


t = Timer(lambda: proba_1(l))
t_2 = Timer(lambda: proba_2(l))

p1 = 0
p2 = 0

total_diff = 0.0

for i in range(1,1001):
    l = [random.randrange(1, 101) for _ in range(100)]
    if i % 2 == 0:
        proba_1_time = t.timeit(number=10000)
        proba_2_time = t_2.timeit(number=10000)
    else:
        proba_2_time = t_2.timeit(number=10000)
        proba_1_time = t.timeit(number=10000)

    print(proba_1(l),proba_1_time, proba_2(l), proba_2_time)
    if proba_1_time < proba_2_time:
        print("Proba_1 wins: " + str(proba_1_time))
        p1 += 1
    else:
        print("Proba_2 wins: " + str(proba_2_time))
        p2 += 1
    total_diff += abs(proba_1_time - proba_2_time)

    print(p1,p2, total_diff/i)
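
One caveat with proba_2 worth keeping in mind: it adds 1/n repeatedly instead of dividing a final count by n, so its values can differ from those of proba_1 in the last few bits because of floating-point rounding. Below is a small sketch of how one might compare the two outputs with a tolerance. proba_2 is condensed here with dict.get for brevity, and same_probabilities is a hypothetical helper, not part of the original answer:

```python
import math
from collections import Counter

def proba_1(x):
    n = len(x)
    return {key: val / n for key, val in Counter(x).items()}

def proba_2(x):
    single_prob = 1 / len(x)
    d = {}
    for i in x:
        # Same accumulation as the if/else version above
        d[i] = d.get(i, 0.0) + single_prob
    return d

def same_probabilities(a, b, rel_tol=1e-9):
    # Hypothetical helper: same keys, values equal up to rounding error
    return a.keys() == b.keys() and all(
        math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a
    )

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
print(same_probabilities(proba_1(data), proba_2(data)))
```

For this input, proba_1 gives exactly 0.3 for the key 3, while proba_2 gives 0.1 + 0.1 + 0.1, which is not equal to 0.3 in binary floating point. Comparing the dictionaries with == would therefore report a difference even though the two functions agree for practical purposes.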

Answer 3

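The numpy_indexed package (disclaimer: I am its author) provides a generalization of NumPy's arraysetops module, including utilities that solve this problem in an elegant, vectorized way: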

import numpy_indexed as npi
keys, counts = npi.count(x)
proba = counts / len(x)

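I'm not sure how it stacks up against Counter performance-wise; I believe Counter is quite well optimized. However, in cases where the elements of x can themselves be represented as an ndarray, I expect this approach to come out ahead.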
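
For readers without numpy_indexed installed, plain NumPy's np.unique offers a similar keys-and-counts interface. This is a swapped-in alternative, not what this answer's author uses, and no claim is made here about its relative speed. A sketch, assuming x is a one-dimensional array of sortable values:

```python
import numpy as np

x = np.array([7, -2, 7, 7, 5, -2])

# Sorted distinct values, the position of each input element among them,
# and how often each distinct value occurs
keys, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)

proba = counts / len(x)        # one probability per distinct key
per_element = proba[inverse]   # one probability per input element, in input order
```

Unlike np.bincount, this also handles negative or non-integer values, at the cost of a sort.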

Fastest way to get the probability of existence of an element in a list
