我想将spearman相关性应用于列数相同(每对行的相关性)的两个熊猫数据帧。
我的目标是计算每对行(r,s)之间的spearman相关性分布,其中r是第一个数据帧的一行,s是第二个数据帧的一行。
我知道之前已经回答过类似的问题(请参见this)。但是,此问题有所不同,因为我想将第一个数据帧的每一行与第二个数据帧的所有行进行比较。此外,由于我的数据量大,因此这需要大量的计算,并且要花费数小时。我想对其进行并行化,并可能对其进行重写以加快速度。
我尝试使用numba,但是不幸的是它失败了(与this类似),因为它似乎无法识别scipy spearmanr。我的代码如下:
def corr(a, b):
dist = []
for i in range(a.shape[0]):
for j in range(b.shape[0]):
dist += [spearmanr(a.iloc[i, :], b.iloc[j, :])[0]]
return dist
答案 0 :(得分:2)
from numba import njit
import pandas as pd
import numpy as np
@njit
def mean1(a):
n = len(a)
b = np.empty(n)
for i in range(n):
b[i] = a[i].mean()
return b
@njit
def std1(a):
n = len(a)
b = np.empty(n)
for i in range(n):
b[i] = a[i].std()
return b
@njit
def c(a, b):
''' Correlation '''
n, k = a.shape
m, k = b.shape
mu_a = mean1(a)
mu_b = mean1(b)
sig_a = std1(a)
sig_b = std1(b)
out = np.empty((n, m))
for i in range(n):
for j in range(m):
out[i, j] = (a[i] - mu_a[i]) @ (b[j] - mu_b[j]) / k / sig_a[i] / sig_b[j]
return out
r = df_test.rank(1).values
df_test.T.corr('spearman') == c(r, r)
进行Spearman等级关联只是在进行等级关联。
我们可以利用argsort
获得排名。尽管argsort
中的argsort
确实获得了排名,但我们可以通过切片分配将自己限制为一种。
def rank(a):
i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')
s = a.argsort(1)
out = np.empty_like(s)
out[i, s] = j
return out
在等级相关的情况下,均值和标准差均由数组第二维的大小预先确定。
您可以不用numba就能完成同样的事情,但是我假设您想要它。
from numba import njit
@njit
def c(a, b):
n, k = a.shape
m, k = b.shape
mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5
out = np.empty((n, m))
a = a - mu
b = b - mu
for i in range(n):
for j in range(m):
out[i, j] = a[i] @ b[j] / k / sig ** 2
return out
为后代,我们可以完全避免内部循环,但这可能会导致内存问题。
@njit
def c1(a, b):
n, k = a.shape
m, k = b.shape
mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5
a = a - mu
b = b - mu
return a @ b.T / k / sig ** 2
np.random.seed([3, 1415])
a = np.random.randn(2, 10)
b = np.random.randn(2, 10)
rank_a = rank(a)
rank_b = rank(b)
c(rank_a, rank_b)
array([[0.32121212, 0.01818182],
[0.13939394, 0.55151515]])
如果您使用的是DataFrame
da = pd.DataFrame(a)
db = pd.DataFrame(b)
pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)
0 1
0 0.321212 0.018182
1 0.139394 0.551515
我们可以使用pandas.DataFrame.corr
pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)
0 1
0 True True
1 True True
答案 1 :(得分:1)
这里是scipy.stats.spearmanr
的基于行的未编译版本,它在大型数据集上使用了大约5%的时间,并显示了一个示例,它产生了相同的结果:
import numpy as np
import pandas as pd
def spearman_row(x, y):
x = np.asarray(x)
y = np.asarray(y)
rx = rankdata_average(x)
ry = rankdata_average(y)
# print(rx)
# print(ry)
return compute_corr(rx, ry)
def compute_corr(x, y):
# Thanks to https://github.com/dengemann
def ss(a, axis):
return np.sum(a * a, axis=axis)
x = np.asarray(x)
y = np.asarray(y)
mx = x.mean(axis=-1)
my = y.mean(axis=-1)
xm, ym = x - mx[..., None], y - my[..., None]
r_num = np.add.reduce(xm * ym, axis=-1)
r_den = np.sqrt(ss(xm, axis=-1) * ss(ym, axis=-1))
with np.errstate(divide='ignore', invalid="ignore"):
r = r_num / r_den
return r
def rankdata_average(data):
"""Row-based rankdata using method=mean"""
dc = np.asarray(data).copy()
sorter = np.apply_along_axis(np.argsort, 1, data)
inv = np.empty(data.shape, np.intp)
ranks = np.tile(np.arange(data.shape[1]), (len(data), 1))
np.put_along_axis(inv, sorter, ranks, axis=1)
dc = np.take_along_axis(dc, sorter, 1)
res = np.apply_along_axis(lambda r: r[1:] != r[:-1], 1, dc)
obs = np.column_stack([np.ones(len(res), dtype=bool), res])
dense = np.take_along_axis(np.apply_along_axis(np.cumsum, 1, obs), inv, 1)
len_r = obs.shape[1]
nonzero = np.count_nonzero(obs, axis=1)
obs = pd.DataFrame(obs)
nonzero = pd.Series(nonzero)
dense = pd.DataFrame(dense)
ranks = []
for _nonzero, nzdf in obs.groupby(nonzero, sort=False):
nz = np.apply_along_axis(lambda r: np.nonzero(r)[0], 1, nzdf)
_count = np.column_stack([nz, np.ones(len(nz)) * len_r])
_dense = dense.reindex(nzdf.index).values
_result = 0.5 * (np.take_along_axis(_count, _dense, 1) + np.take_along_axis(_count, _dense - 1, 1) + 1)
result = pd.DataFrame(_result, index=nzdf.index)
ranks.append(result)
final = pd.concat(ranks).sort_index()
return final
if __name__ == "__main__":
from scipy.stats import rankdata, spearmanr
from time import time
np.random.seed(0)
size = int(1e5), 5
d1 = np.random.randint(5, size=size)
d2 = np.random.randint(5, size=size)
start = time()
actual = spearman_row(d1, d2)
end = time()
print("actual", actual)
print("rowbased took", end - start)
start = time()
expected = []
for i in range(len(d1)):
expected.append(spearmanr(d1[i], d2[i]).correlation)
end = time()
print("scipy took", end - start)
expected = np.array(expected)
print("largest diff", pd.Series(expected - actual).abs().max())
它打印:
rowbased took 3.6308434009552
scipy took 53.552557945251465
largest diff 2.220446049250313e-16
答案 2 :(得分:0)
Pandas在corr
的支持下具有spearman
功能。它适用于列,因此我们只需转置dataFrame。
我们将df1附加到df2并通过迭代每一行来计算相关性
len_df1 = df1.shape[0]
df2_index = df2.index.values.tolist()
df = df2.append(df1).reset_index(drop=True).T
values = {i: [df.iloc[:,df2_index+[i]].corr(method='spearman').values] for i in range(len_df1)}