Question

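I am trying to calculate the 10th, 20th, ..., 100th percentiles for a list of chi-square distributed values. I am using chi-square because I think it looks closest to our real data.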

For now I am trying to do this step by step so that I don't miss anything.

import numpy as np

# Draw 1000 chi-square samples (6 degrees of freedom), truncate each to an integer, scale by 10
values = np.array([int(w) * 10 for w in np.random.chisquare(6, 1000)])
print('Min: ', np.min(values))
print('Max: ', np.max(values))
print('Mean: ', np.mean(values))

# Percentiles at 10%, 20%, ..., 100%
for p in range(10, 101, 10):
    percentile = np.percentile(values, p)
    print('Percent:', p, 'Percentile: ', percentile)

Here is sample output from the code above:

Min:  0
Max:  230
Mean:  55.49
Percent: 10 Percentile:  20.0
Percent: 20 Percentile:  30.0
Percent: 30 Percentile:  30.0
Percent: 40 Percentile:  40.0
Percent: 50 Percentile:  50.0
Percent: 60 Percentile:  60.0
Percent: 70 Percentile:  70.0
Percent: 80 Percentile:  80.0
Percent: 90 Percentile:  100.0
Percent: 100 Percentile:  230.0

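The point I am struggling with is:
Why do I get the same "percentile" for 20% and 30%?
I always thought that a 20/30 result means that 20% of the values lie below the following value (30 in this case), just as 100% of the values lie below 230, the maximum.

What idea am I missing?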

Answer 1

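Because values was created with the expression int(w)*10, all the values are integer multiples of 10. That means that most values are repeated many times. For example, I just ran that code and found that the value 30 occurred 119 times. It turns out that when you count the values, the quantile interval 20%-30% contains only the value 30. That is why the value 30 is repeated in your output.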
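To see the repetition directly, here is a minimal sketch that counts how often each distinct value occurs and what fraction of the data lies at or below it. The seeded generator (`np.random.default_rng(0)`) and the names `unique_values` and `cumulative_fraction` are my own choices for repeatability and are not part of the question's code, so the exact counts will differ from the runs quoted in this answer.

```python
import numpy as np

# Arbitrary seed, used only so this sketch is repeatable (an assumption).
rng = np.random.default_rng(0)

# Same truncate-then-scale step as int(w)*10 in the question.
values = rng.chisquare(6, 1000).astype(int) * 10

# Distinct values and how many times each one occurs.
unique_values, counts = np.unique(values, return_counts=True)

# Fraction of all samples that are <= each distinct value.
cumulative_fraction = np.cumsum(counts) / values.size

for v, c, f in zip(unique_values, counts, cumulative_fraction):
    print(f"value {v:4d}  count {c:4d}  cumulative {f:6.1%}")
```

Whenever a single value carries the cumulative fraction across two of the 10% marks, the two corresponding percentiles come out equal.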

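I can break my data set down as follows.

Sort the values and split them into groups of 100 (since you have 1000 values and you are looking at 10%, 20%, and so on).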

                                                np.percentile
Percent  Group       Values (counts)            (largest value in previous column)
-------  ---------   ------------------------   ----------------------------------
10       0 - 99      0 (14), 10 (72), 20 (16)    20
20       100 - 199   20 (84), 30 (16)            30
30       200 - 299   30 (100)                    30
40       300 - 399   30 (3), 40 (97)             40
etc.
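A breakdown like the table above could be produced along these lines. This is a sketch on freshly generated data with an arbitrary seed (my assumption), so the counts will not match the table. Note also that `np.percentile` interpolates linearly by default between the last value of one group and the first value of the next, so the "largest value in the group" reading is exact only when those two neighbours are equal, which is common here because of the heavy repetition.

```python
import numpy as np

# Arbitrary seed for repeatability (an assumption, not from the question).
rng = np.random.default_rng(0)
values = rng.chisquare(6, 1000).astype(int) * 10

# Sort, then split into 10 consecutive groups of 100 values each.
sorted_values = np.sort(values)
groups = sorted_values.reshape(10, 100)  # row i covers sorted positions 100*i .. 100*i + 99

group_maxima = []
for i, group in enumerate(groups):
    # Distinct values inside this group and how often each occurs.
    uniq, cnt = np.unique(group, return_counts=True)
    summary = ", ".join(f"{u} ({c})" for u, c in zip(uniq, cnt))
    group_maxima.append(int(group[-1]))  # group is sorted, so the last entry is its largest
    print(f"{(i + 1) * 10:3d}  {100 * i:3d} - {100 * i + 99:3d}  {summary:<32}  {group[-1]}")
```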

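Given the distribution that you are using, this output appears to be the most likely one, but if you rerun the code enough times you will encounter different output. I just ran it again and got output in which 20.0 and 50.0 are each repeated. The value counts for this run are: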

In [56]: values, counts = np.unique(values, return_counts=True)                                                             

In [57]: values                                                                                                             
Out[57]: 
array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120,
       130, 140, 150, 160, 170, 180, 190, 210])

In [58]: counts                                                                                                             
Out[58]: 
array([ 14,  73, 129, 134, 134, 119, 105,  67,  73,  33,  41,  21,  19,
        16,   8,   7,   1,   2,   2,   1,   1])
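As a cross-check, the percentiles for this run can be recomputed from the counts alone. The sketch below rebuilds a sorted data set with `np.repeat` from the two arrays printed above; the names `unique_values`, `reconstructed`, and `percentiles` are mine, and it relies on the default (linear) interpolation of `np.percentile`.

```python
import numpy as np

# Distinct values and their counts, copied from the session output above.
unique_values = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,
                          130, 140, 150, 160, 170, 180, 190, 210])
counts = np.array([14, 73, 129, 134, 134, 119, 105, 67, 73, 33, 41, 21, 19,
                   16, 8, 7, 1, 2, 2, 1, 1])

# Rebuild a sorted array with the same composition as that run (order does not
# matter for percentiles, only how many times each value occurs).
reconstructed = np.repeat(unique_values, counts)

percents = list(range(10, 101, 10))
percentiles = [float(np.percentile(reconstructed, p)) for p in percents]
for p, q in zip(percents, percentiles):
    print('Percent:', p, 'Percentile: ', q)
```

With these counts, the value 20 occupies sorted positions 87 through 215, which covers both the 10% and the 20% mark, and the value 50 occupies positions 484 through 602, which covers both the 50% and the 60% mark.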

Why do different percentiles give the same value?
