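I noticed that NumPy's numpy.percentile and TensorFlow Probability's tfp.stats.percentile give different results for their "nearest" interpolation method, even though both docstrings explain it in the same way:

    This optional parameter specifies the interpolation method to use when the desired percentile lies between two data points i < j: ...
    "nearest": i or j, whichever is nearest.

Below is a minimal working example of what I mean.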
$ "$(which python3)" --version
Python 3.7.5
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
numpy~=1.18
tensorflow~=2.1
tensorflow-probability~=0.9
black
(question) $ python -m pip install -r requirements.txt
# question.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp


def main():
    a = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    q = 50

    print(f"Flattened array: {a.flatten()}")

    print("NumPy:")
    print(f"\t{q}th percentile (linear): {np.percentile(a, q, interpolation='linear')}")
    print(
        f"\t{q}th percentile (nearest): {np.percentile(a, q, interpolation='nearest')}"
    )

    b = tf.convert_to_tensor(a)
    print("TensorFlow:")
    print(
        f"\t{q}th percentile (linear): {tfp.stats.percentile(b, q, interpolation='linear')}"
    )
    print(
        f"\t{q}th percentile (nearest): {tfp.stats.percentile(b, q, interpolation='nearest')}"
    )


if __name__ == '__main__':
    main()
When run, this gives different results for the "nearest" interpolation method:
(question) $ python question.py
Flattened array: [10. 7. 4. 3. 2. 1.]
NumPy:
	50th percentile (linear): 3.5
	50th percentile (nearest): 3.0
TensorFlow:
	50th percentile (linear): 3.5
	50th percentile (nearest): 4.0
After poking through the NumPy v1.18.2 source of the function that numpy.percentile calls, I am still confused as to why. It looks like this comes down to rounding decisions (given that NumPy uses numpy.around and TFP uses tf.round).
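Can someone explain to me what causes the difference? I would like to write a shim for these functions, but I need to understand their return behavior first.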
Answer 0 (score: 1)
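Stepping through the source of both, it seems this is not a rounding problem as I first thought. Instead, numpy.percentile does its final evaluation on an ndarray sorted in ascending order, while tfp.stats.percentile does it on a tensor sorted in descending order.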
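To see why rounding alone does not separate the two libraries, here is a minimal, library-free sketch of the index arithmetic for the example above. It assumes Python's built-in round() as a stand-in for numpy.around and tf.round (all three are documented to round exact halves to the nearest even value); the helper name fractional_index is hypothetical.

```python
# Library-free sketch: built-in round() stands in for np.around / tf.round,
# which are both documented to round exact halves to the nearest even value.


def fractional_index(n, fraction):
    """Position `fraction` of the way along a sorted sequence of length n."""
    return (n - 1) * fraction


n = 6   # number of elements in the flattened example array
q = 50  # requested percentile

# NumPy-style: measured from the start of an ascending sort.
numpy_style = fractional_index(n, q / 100)
# TFP-style: fraction at or above q, measured along a descending sort.
tfp_style = fractional_index(n, 1.0 - q / 100.0)

print(numpy_style, round(numpy_style))  # 2.5 2
print(tfp_style, round(tfp_style))      # 2.5 2
```

Both conventions land on the same tie (2.5) and round it to the same integer index (2), so any difference in the result has to come from what that index is applied to.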
# answer.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability.python.internal import tensorshape_util
from tensorflow_probability.python.internal import distribution_util


def numpy_src(input, q, axis=0, out=None):
    a = input
    q = np.true_divide(q, 100)  # 0.5
    q = np.asanyarray(q)  # array(0.5)
    q = q[None]  # array([0.5])
    ap = a.flatten()  # array([10., 7., 4., 3., 2., 1.])
    Nx = ap.shape[axis]  # 6
    indices = q * (Nx - 1)  # array([2.5])
    indices = np.around(indices).astype(np.intp)  # array([2])
    # In-place partial sort: ap[2] now holds the 3rd-smallest value (ascending)
    ap.partition(indices, axis=axis)  # array([ 1., 2., 3., 4., 7., 10.])
    indices = indices[0]  # 2
    r = np.take(ap, indices, axis=axis, out=out)  # 3.0
    print(f"Result of np.percentile source: {r}")


def tensorflow_src(input, q=50, axis=None):
    x = input
    name = "percentile"
    interpolation = "nearest"
    q = tf.cast(q, tf.float64)  # tf.Tensor(50.0, shape=(), dtype=float64)

    if axis is None:
        y = tf.reshape(
            x, [-1]
        )  # tf.Tensor([10. 7. 4. 3. 2. 1.], shape=(6,), dtype=float64)

    frac_at_q_or_above = 1.0 - q / 100.0  # tf.Tensor(0.5, shape=(), dtype=float64)

    # _sort_tensor(y)
    # N.B. Here is the difference. Note the sort order is never changed
    sorted_y, _ = tf.math.top_k(
        y, k=tf.shape(y)[-1]
    )  # tf.Tensor([10. 7. 4. 3. 2. 1.], shape=(6,), dtype=float64), _
    tensorshape_util.set_shape(
        sorted_y, y.shape
    )  # tf.Tensor([10. 7. 4. 3. 2. 1.], shape=(6,), dtype=float64)

    d = tf.cast(tf.shape(y)[-1], tf.float64)  # tf.Tensor(6.0, shape=(), dtype=float64)

    # _get_indices(interpolation)
    indices = tf.round(
        (d - 1) * frac_at_q_or_above
    )  # tf.Tensor(2.0, shape=(), dtype=float64)
    indices = tf.clip_by_value(
        tf.cast(indices, tf.int32), 0, tf.shape(y)[-1] - 1
    )  # tf.Tensor(2, shape=(), dtype=int32)

    # N.B. The sort order here is descending, causing a difference
    gathered_y = tf.gather(
        sorted_y, indices, axis=-1
    )  # tf.Tensor(4.0, shape=(), dtype=float64)

    result = distribution_util.rotate_transpose(gathered_y, tf.rank(q))  # 4.0
    print(f"Result of tf.percentile source: {result}")


def main():
    np_in = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    numpy_src(np_in, q=50)

    tf_in = tf.convert_to_tensor(np_in)
    tensorflow_src(tf_in, q=50)


if __name__ == "__main__":
    main()
which, when run, gives
$ python answer.py
Result of np.percentile source: 3.0
Result of tf.percentile source: 4.0
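The two traces can be condensed into a small sketch that uses only the standard library. It is an imitation of the index logic stepped through above, not either library's actual code, and the function names nearest_from_min and nearest_from_max are hypothetical. Built-in round() is assumed to behave like np.around and tf.round on exact halves.

```python
def nearest_from_min(values, q):
    """NumPy-like 'nearest': index measured along an ascending sort."""
    ordered = sorted(values)
    index = round((len(ordered) - 1) * (q / 100))
    return ordered[index]


def nearest_from_max(values, q):
    """TFP-like 'nearest': index measured along a descending sort."""
    ordered = sorted(values, reverse=True)
    index = round((len(ordered) - 1) * (1.0 - q / 100.0))
    return ordered[index]


data = [10.0, 7.0, 4.0, 3.0, 2.0, 1.0]

# Exact tie (fractional index 2.5): the two conventions pick different neighbours.
print(nearest_from_min(data, 50), nearest_from_max(data, 50))  # 3.0 4.0

# No tie (fractional index 2.0 from the minimum, 3.0 from the maximum): they agree.
print(nearest_from_min(data, 40), nearest_from_max(data, 40))  # 3.0 3.0
```

In this sketch the disagreement only shows up when the requested position falls exactly halfway between two elements: both conventions round the tie to index 2, but index 2 counts from opposite ends of the data.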
If instead the following were added to TensorFlow Probability's percentile, so that the evaluation is done in ascending sort order,
sorted_y = tf.reverse(
    sorted_y, [-1]
)  # tf.Tensor([ 1. 2. 3. 4. 7. 10.], shape=(6,), dtype=float64)
then the two results would be the same
$ python answer.py
Result of np.percentile source: 3.0
Result of tf.percentile source: 3.0
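One caveat worth checking before relying on the reversal as a general fix: in the trace above the index is still computed from frac_at_q_or_above (1 - q/100), which equals q/100 only at q = 50. The following library-free sketch imitates that situation; it is an assumption-laden illustration that has not been run against TensorFlow Probability itself, and the function names are hypothetical.

```python
def reversed_only(values, q):
    """Imitates the trace with only the reversal added (index still from 1 - q/100)."""
    descending = sorted(values, reverse=True)
    ascending = descending[::-1]  # stands in for tf.reverse(sorted_y, [-1])
    index = round((len(values) - 1) * (1.0 - q / 100.0))
    return ascending[index]


def ascending_nearest(values, q):
    """NumPy-like 'nearest' on an ascending sort, for comparison."""
    ordered = sorted(values)
    index = round((len(ordered) - 1) * (q / 100))
    return ordered[index]


data = [10.0, 7.0, 4.0, 3.0, 2.0, 1.0]

# q = 50: the fraction is symmetric, so the reversal alone matches.
print(reversed_only(data, 50), ascending_nearest(data, 50))  # 3.0 3.0

# q = 20: the unchanged fraction now selects the 80th-percentile element.
print(reversed_only(data, 20), ascending_nearest(data, 20))  # 7.0 2.0
```

Under these assumptions, a shim that reverses the sort order would also need to compute the index from q / 100 rather than 1 - q / 100 to match NumPy for percentiles other than 50.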
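Given that the docstring of TensorFlow Probability's tfp.stats.percentile states

    Given a vector x, the q-th percentile of x is the value q / 100 of the way from the minimum to the maximum in a sorted copy of x.

this seems wrong, because the implementation does the opposite. I have opened a discussion about this docstring.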