Question

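I'm running into something odd with spearmanr from scipy.stats. I'm using the values of a polynomial to get some more interesting correlations, but if I enter the values by hand (as a list, converted to a numpy array), I get a different correlation from the one I get when I compute the values with a function. The code below should demonstrate what I mean: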

import numpy as np
from scipy.stats import spearmanr    
data = np.array([  0.4,   1.2,   1. ,   0.4,   0. ,   0.4,   2.2,   6. ,  12.4,  22. ])
axis = np.arange(0, 10, dtype=np.float64)

print(spearmanr(axis, data))# gives a correlation of 0.693...

# Use this polynomial
poly = lambda x:  0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0

data2 = poly(axis)
print(data2) # It is the same as data

print(spearmanr(axis, data2))# gives a correlation of 0.729...

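I have noticed that the arrays differ slightly (i.e. data - data2 is not exactly zero for every element), but the difference is tiny, on the order of 1e-16.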

Is such a tiny difference enough to throw off spearmanr?

Answer 1

Is such a tiny difference enough to throw off spearmanr?

Yes, because Spearman's r is based on the ranks of the samples. Such a tiny difference can change the ranks of values that would otherwise be tied:

from scipy.stats import rankdata

rankdata(data)
# array([  3.,   6.,   5.,   3.,   1.,   3.,   7.,   8.,   9.,  10.])
# Note that all three values of 0.4 get the same rank 3.

rankdata(data2)
# array([  2.5,   6. ,   5. ,   2.5,   1. ,   4. ,   7. ,   8. ,   9. ,  10. ])
# Note that two values 0.4 get the rank 2.5 and one gets 4.

If you add a small gradient (larger than the numerical difference you observed) to break these ties, both arrays give the same result.
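A minimal sketch of the tie-breaking idea, reusing the arrays from the question. The step size of 1e-12 is an assumed value, chosen only because it is much larger than the roughly 1e-16 rounding noise and much smaller than the gaps between genuinely different values; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

axis = np.arange(0, 10, dtype=np.float64)
# Hand-typed values and the same values computed from the polynomial.
data = np.array([0.4, 1.2, 1.0, 0.4, 0.0, 0.4, 2.2, 6.0, 12.4, 22.0])
data2 = 0.1 * (axis - 3.0) ** 3 + 0.1 * (axis - 1.0) ** 2 - axis + 3.0

# Assumed step size: dominates the floating-point noise, so the ties are
# broken in the same (index) order for both arrays.
gradient = np.arange(10) * 1e-12

rho_typed, _ = spearmanr(axis, data + gradient)
rho_computed, _ = spearmanr(axis, data2 + gradient)
print(rho_typed, rho_computed)  # both roughly 0.745
```

With every tie broken the same way, the two arrays produce identical ranks, so the correlation no longer depends on how the values were generated.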

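However, this breaks every tie, including ones that may be intentional, and can lead to over- or underestimating the correlation. If the data is expected to take discrete values, numpy.round may be the better solution.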
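A rough sketch of the rounding alternative, again using the arrays from the question. Rounding to 10 decimal places is an assumption made for this example; the right precision depends on the resolution the data is actually expected to have.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

axis = np.arange(0, 10, dtype=np.float64)
data = np.array([0.4, 1.2, 1.0, 0.4, 0.0, 0.4, 2.2, 6.0, 12.4, 22.0])
data2 = 0.1 * (axis - 3.0) ** 3 + 0.1 * (axis - 1.0) ** 2 - axis + 3.0

# Assumed precision: 10 decimals removes the ~1e-16 noise while keeping
# the three values of 0.4 tied.
data2_rounded = np.round(data2, 10)

print(rankdata(data2_rounded))  # the three 0.4 values share rank 3 again

rho_typed, _ = spearmanr(axis, data)
rho_rounded, _ = spearmanr(axis, data2_rounded)
print(rho_typed, rho_rounded)  # both roughly 0.693
```

Unlike the gradient, rounding preserves the ties, so the computed array reproduces the correlation of the hand-typed one.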

Different results from scipy.stats.spearmanr depending on how the data is generated
