问题标题说明了这一点,什么是计算控制每个其他变量的矩阵的每列之间的成对偏相关的最有效方法?
基本上,类似于下面的cor
函数,但导致部分相关而不是简单相关。
#> cor(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
#Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
#Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
#Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
结果应与我们使用ppcor
库获得的结果相匹配:
#> ppcor::pcor(iris[,-5])$estimate
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
答案 0 :(得分:8)
我们知道控制每个其他变量的成对部分相关可以通过在O(n ^ 3)时间内反演相关或协方差矩阵(见here)来获得。所以一个可能的解决方案就是:
pcor.solve = function(x){
res = solve(cov(x))
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
return(res)
}
这基本上是ppcor::pcor
的精简版。结果是:
pcor.solve(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
但请注意,协方差矩阵(或相关矩阵,结果相同)必须是正定的。
由于这主要归结为有效的反转操作,我在stats.SE中查看了这个主题。可以在协方差矩阵中使用qr.solve
和chol2inv
来达到同样的效果。
pcor.qr = function(x){
res = qr.solve(cov(x))
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
return(res)
}
pcor.qr(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
pcor.chol = function(x){
res = chol2inv(chol(cov(x)))
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
return(res)
}
pcor.chol(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
SVD也可以用来解决。如果我们有一个正定方阵,那么它的SVD分解是A = UDU ^ T,它的逆是简单的A ^ -1 = UD ^ -1U ^ T.
pcor.svd = function(x){
res = svd(cov(x))
res = res$v %*% diag(1/res$d) %*% t(res$v)
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
return(res)
}
pcor.svd(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
microbenchmark
,重复10000次:
library(microbenchmark)
#iris
dt1 = iris[,-5]
microbenchmark(
ppcor = ppcor::pcor(dt1)$estimate,
solve = pcor.solve(dt1),
qr = pcor.qr(dt1),
chol = pcor.chol(dt1),
svd = pcor.svd(dt1),
times = 10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ppcor 247.728 267.790 314.8356 280.853 296.248 196962.601 10000 c
# solve 176.816 198.743 217.1298 205.274 221.603 2425.964 10000 b
# qr 240.264 258.459 282.7005 270.123 285.518 4015.438 10000 c
# chol 131.562 148.824 163.3567 154.423 167.019 1593.205 10000 a
# svd 179.615 199.675 219.2781 208.074 223.469 1920.710 10000 b
#random data
dt2 = cbind(rnorm(1E4), rnorm(1E4)+2)
microbenchmark(
ppcor = ppcor::pcor(dt2)$estimate,
solve = pcor.solve(dt2),
qr = pcor.qr(dt2),
chol = pcor.chol(dt2),
svd = pcor.svd(dt2),
times = 10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ppcor 243.063 267.323 306.4535 284.585 311.177 1833.936 10000 d
# solve 180.548 190.812 222.6685 198.277 216.004 84776.704 10000 a
# qr 229.068 248.662 282.8142 262.658 285.518 1954.301 10000 c
# chol 179.148 189.413 212.6551 198.277 216.005 1383.733 10000 a
# svd 213.672 230.933 262.5084 243.529 264.058 5261.543 10000 b
#uncorrelated data
dt3 = cbind(sin(seq(0, 2*pi, length.out = 1000L)), cos(seq(0, 2*pi, length.out = 1000L)))
microbenchmark(
ppcor = ppcor::pcor(dt3)$estimate,
solve = pcor.solve(dt3),
qr = pcor.qr(dt3),
chol = pcor.chol(dt3),
svd = pcor.svd(dt3),
times = 10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ppcor 142.759 162.354 188.7767 172.1500 191.745 2230.021 10000 d
# solve 80.711 89.108 102.8269 92.3740 101.704 1709.372 10000 a
# qr 130.629 145.092 168.0627 153.0220 169.351 4914.910 10000 c
# chol 79.777 87.709 102.2984 92.3740 101.238 6731.117 10000 a
# svd 112.901 127.363 147.1913 134.1285 148.358 1401.928 10000 b
[已更新] 或者换句话说,chol
< solve
< svd
< qr
< ppcor
暂时。协方差矩阵是对称的(chol
解决方案已经使用了这个事实),并且在协方差计算中也可以获得时间,因此可能会获得一些加速的优势。
当然,ppcor
库更通用,处理协方差矩阵不可逆的情况等,因此它在比较中处于劣势。人们也可以提出这样的观点,尽管在部分相关性将被详尽地计算并且知道协方差矩阵是正定的时,期望得到更简单的解决方案。