Question

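Suppose I have a vector of values and a vector of probabilities. I want to compute percentiles of the values, but weighted by the given probability vector.

For example,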

import numpy as np
vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])

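Ignoring probs, np.percentile(vector, 10) gives me 1.3. But clearly the lowest 10% here have the value 1, so that would be my desired output.

If the result falls between two data points, I would prefer linear interpolation, as documented for the original percentile function.

How can I solve this most conveniently in Python? As in my example, vector will not be sorted. probs always sums to 1. I would prefer solutions that don't require "non-standard" packages, by any reasonable definition.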

Answer 1

One solution is to use sampling, via numpy.random.choice and numpy.percentile:

N = 50 # number of samples to draw
samples = np.random.choice(vector, size=N, p=probs, replace=True)
interpolation = "nearest"
# NumPy 1.22+ names this keyword `method`; older releases call it `interpolation`
print("25th percentile", np.percentile(samples, 25, method=interpolation))
print("75th percentile", np.percentile(samples, 75, method=interpolation))

Depending on the type of your data (discrete or continuous), you may want to use a different value for the interpolation setting.
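As a rough illustration of the sampling idea, here is a self-contained sketch that uses the question's data with a seeded generator and a larger sample. The seed and sample size are arbitrary assumptions chosen for repeatability, not part of the original answer, and the default linear interpolation is used so that no version-specific keyword is needed.

```python
import numpy as np

vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])

# Seeded generator so the run is repeatable; seed and N are arbitrary choices.
rng = np.random.default_rng(0)
N = 100_000
samples = rng.choice(vector, size=N, p=probs, replace=True)

# Percentiles of the resampled data approximate the weighted percentiles.
p25, p75 = np.percentile(samples, [25, 75])
print("25th percentile", p25)
print("75th percentile", p75)
```

The cumulative probabilities of the sorted values 1, 2, 3, 4 are 0.1, 0.2, 0.3 and 1.0, so with a sample this large the 25th percentile should land on 3 and the 75th on 4. Because the samples only ever take the four original values, this approach yields a step function and does not interpolate between data points the way the question asks.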

Answer 2

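If you are prepared to sort the values, you can build an interpolation function that computes the inverse of the cumulative probability distribution. This is probably easier with scipy.interpolate than with pure numpy routines: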

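import scipy.interpolate
# Sort the values ascending and accumulate the probabilities in the same order
ordering = np.argsort(vector)
# Map cumulative probability -> value, interpolating linearly between points
distribution = scipy.interpolate.interp1d(np.cumsum(probs[ordering]), vector[ordering], bounds_error=False, fill_value='extrapolate')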

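If you query this distribution with a percentile (expressed as a fraction in the range 0..1), you should get the desired answers; for example, distribution(0.1) gives 1.0 and distribution(0.5) gives approximately 3.29.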

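Something similar can be done with numpy's interp() function, avoiding the extra dependency on scipy, but that would involve rebuilding the interpolation each time you want to compute a percentile. If you have a fixed list of percentiles that is known before you estimate the probability distribution, this may be fine.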
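A minimal sketch of the numpy-only variant might look like the following. The helper name weighted_percentile and the choice to take q in percent (0..100, mirroring np.percentile) are assumptions made for illustration. Note one behavioral difference: np.interp clamps to the end values outside the range of the cumulative probabilities, whereas the scipy version above extrapolates.

```python
import numpy as np

def weighted_percentile(values, probs, q):
    """Hypothetical helper: q is given in percent (0..100), like np.percentile."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Sort the values ascending and accumulate probabilities in the same order.
    ordering = np.argsort(values)
    cumulative = np.cumsum(probs[ordering])
    # Interpolate linearly from cumulative probability to value;
    # queries below cumulative[0] or above cumulative[-1] are clamped.
    return np.interp(np.asarray(q) / 100.0, cumulative, values[ordering])

vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])
print(weighted_percentile(vector, probs, [10, 50]))
```

For the question's data this gives 1.0 at the 10th percentile and roughly 3.29 at the 50th, matching the scipy-based result. Passing all desired percentiles in one call, as in the last line, avoids repeating the sort and cumulative sum.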

Computing percentiles given a distribution
