Question

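I noticed that even though NumPy's numpy.percentile and TensorFlow Probability's tfp.stats.percentile give the same docstring explanation for their "nearest" interpolation method

This optional parameter specifies the interpolation method to use when the desired percentile lies between two data points i < j:

...

"nearest": i or j, whichever is nearest.

they give different results. Below is a minimal working example of what I mean.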

Environment

$ "$(which python3)" --version
Python 3.7.5
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
numpy~=1.18
tensorflow~=2.1
tensorflow-probability~=0.9
black
(question) $ python -m pip install -r requirements.txt

Code

# question.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp


def main():
    a = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    q = 50
    print(f"Flattened array: {a.flatten()}")
    print("NumPy:")
    print(f"\t{q}th percentile (linear): {np.percentile(a, q, interpolation='linear')}")
    print(
        f"\t{q}th percentile (nearest): {np.percentile(a, q, interpolation='nearest')}"
    )

    b = tf.convert_to_tensor(a)
    print("TensorFlow:")
    print(
        f"\t{q}th percentile (linear): {tfp.stats.percentile(b, q, interpolation='linear')}"
    )
    print(
        f"\t{q}th percentile (nearest): {tfp.stats.percentile(b, q, interpolation='nearest')}"
    )


if __name__ == '__main__':
    main()

When run, this gives different results for the "nearest" interpolation method

(question) $ python question.py
Flattened array: [10.  7.  4.  3.  2.  1.]
NumPy:
    50th percentile (linear): 3.5
    50th percentile (nearest): 3.0
TensorFlow:
    50th percentile (linear): 3.5
    50th percentile (nearest): 4.0

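After poking through the NumPy v1.18.2 source of the function that numpy.percentile calls, I am still confused as to why. It seems to come down to a rounding decision (given that NumPy uses numpy.around and TFP uses tf.round).

Can someone explain to me what causes the difference? I would like to write a shim for these functions, but I need to understand the return behavior first.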

Answer 1

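Stepping through the source of both, it seems this is not a rounding issue as I first thought. Instead, numpy.percentile does its final evaluation on an ndarray sorted in ascending order, whereas tfp.stats.percentile does it on a tensor sorted in descending order.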

# answer.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability.python.internal import tensorshape_util
from tensorflow_probability.python.internal import distribution_util


def numpy_src(input, q, axis=0, out=None):
    a = input
    q = np.true_divide(q, 100)  # 0.5
    q = np.asanyarray(q)  # array(0.5)
    q = q[None]  # array([0.5])
    ap = a.flatten()  # array([10.,  7.,  4.,  3.,  2.,  1.])
    Nx = ap.shape[axis]  # 6
    indices = q * (Nx - 1)  # array([2.5])
    indices = np.around(indices).astype(np.intp)  # array([2])
    ap.partition(indices, axis=axis)  # array([ 1.,  2.,  3.,  4.,  7., 10.])
    indices = indices[0]  # 2
    r = np.take(ap, indices, axis=axis, out=out)  # 3.0
    print(f"Result of np.percentile source: {r}")


def tensorflow_src(input, q=50, axis=None):
    x = input
    name = "percentile"
    interpolation = "nearest"
    q = tf.cast(q, tf.float64)  # tf.Tensor(50.0, shape=(), dtype=float64)
    if axis is None:
        y = tf.reshape(
            x, [-1]
        )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64)
    frac_at_q_or_above = 1.0 - q / 100.0  # tf.Tensor(0.5, shape=(), dtype=float64)
    # _sort_tensor(y)
    # N.B. Here is the difference. Note the sort order is never changed
    sorted_y, _ = tf.math.top_k(
        y, k=tf.shape(y)[-1]
    )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64), _
    tensorshape_util.set_shape(
        sorted_y, y.shape
    )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64)
    d = tf.cast(tf.shape(y)[-1], tf.float64)  # tf.Tensor(6.0, shape=(), dtype=float64)
    # _get_indices(interpolation)
    indices = tf.round(
        (d - 1) * frac_at_q_or_above
    )  # tf.Tensor(2.0, shape=(), dtype=float64)
    indices = tf.clip_by_value(
        tf.cast(indices, tf.int32), 0, tf.shape(y)[-1] - 1
    )  # tf.Tensor(2, shape=(), dtype=int32)
    # N.B. The sort order here is descending, causing a difference
    gathered_y = tf.gather(
        sorted_y, indices, axis=-1
    )  # tf.Tensor(4.0, shape=(), dtype=float64)
    result = distribution_util.rotate_transpose(gathered_y, tf.rank(q))  # 4.0
    print(f"Result of tf.percentile source: {result}")


def main():
    np_in = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    numpy_src(np_in, q=50)
    tf_in = tf.convert_to_tensor(np_in)
    tensorflow_src(tf_in, q=50)


if __name__ == "__main__":
    main()

When run, this gives

$ python answer.py 
Result of np.percentile source: 3.0
Result of tf.percentile source: 4.0
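
The index arithmetic traced above can be condensed into a NumPy-only sketch. It is an illustration of the steps in answer.py for this one flattened input, not a reproduction of either library's full implementation, and the variable names are made up for the example. Both paths round 2.5 to the same index 2, but one counts from the bottom of the sorted data and the other from the top.

```python
import numpy as np

values = np.array([10.0, 7.0, 4.0, 3.0, 2.0, 1.0])
q = 50
n = values.size

ascending = np.sort(values)    # [ 1.  2.  3.  4.  7. 10.]
descending = ascending[::-1]   # [10.  7.  4.  3.  2.  1.]

# NumPy-style (as traced above): position q/100 along the ascending data.
idx_ascending = int(np.around(q / 100 * (n - 1)))         # around(2.5) -> 2

# TFP-style (as traced above): position (1 - q/100) along the descending data.
idx_descending = int(np.around((1 - q / 100) * (n - 1)))  # around(2.5) -> 2

numpy_style = ascending[idx_ascending]    # third smallest value, 3.0
tfp_style = descending[idx_descending]    # third largest value, 4.0
print(numpy_style, tfp_style)
```

Index 2 from the top of six elements is index 3 from the bottom, which is why the two "nearest" answers land on opposite sides of the midpoint here.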

If instead the following is added to TensorFlow Probability's percentile, so that the sort order used for the evaluation is ascending,

sorted_y = tf.reverse(
    sorted_y, [-1]
)  # tf.Tensor([ 1.  2.  3.  4.  7. 10.], shape=(6,), dtype=float64)

then the two results are the same

$ python answer.py 
Result of np.percentile source: 3.0
Result of tf.percentile source: 3.0
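
For the shim mentioned in the question, one possible starting point is a small NumPy helper that can follow either evaluation order. This is only a sketch under assumptions: nearest_percentile and its tfp_style flag are hypothetical names, it handles a scalar q over the flattened input only, and it mirrors just the steps traced in answer.py rather than everything the real functions do (axis handling, multiple q values, other interpolation modes).

```python
import numpy as np


def nearest_percentile(a, q, tfp_style=False):
    """Hypothetical helper: 'nearest' percentile of the flattened input.

    tfp_style=False indexes an ascending sort (the NumPy steps traced above).
    tfp_style=True indexes a descending sort (the TFP steps traced above).
    """
    flat = np.sort(np.asarray(a, dtype=float).ravel())  # ascending order
    n = flat.size
    if tfp_style:
        # Count (1 - q/100) of the way down from the largest value.
        idx = int(np.around((n - 1) * (1.0 - q / 100.0)))
        idx = min(max(idx, 0), n - 1)  # same role as the clip in the trace
        return flat[::-1][idx]
    # Count q/100 of the way up from the smallest value.
    idx = int(np.around(q / 100.0 * (n - 1)))
    return flat[idx]


data = [[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]]
print(nearest_percentile(data, 50))                  # NumPy-style evaluation
print(nearest_percentile(data, 50, tfp_style=True))  # TFP-style evaluation
```

Keeping the evaluation order behind an explicit flag makes the choice visible at the call site, so code that relies on one library's result does not silently get the other's.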

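Given that the docstring of TensorFlow Probability's tfp.stats.percentile says

Given a vector x, the q-th percentile of x is the value q / 100 of the way from the minimum to the maximum in a sorted copy of x.

this seems wrong, since the evaluation actually runs the opposite way. I have opened an issue to discuss the docstring.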
