I have 22 variables and I want to get the correlation scores, not as a correlation matrix but in a data frame, pair by pair...
I mean... not like this:
v1 v2 v3 v4
v1 1 x x x
v2 x 1 x x
v3 x x 1 x
v4 x x x 1
But like this:
var1 var2 cor
v1 v2 x
v1 v3 x
v1 v4 x
v2 v3 x
v2 v4 x
v3 v4 x
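I'm new to R and have been researching this a lot. I finally ended up with code that is, honestly, not efficient at all... My code creates a huge data frame with all possible combinations of the 22 variables (4194304 combinations... a lot!!!)... Then the code assigns correlations only to the first 211 rows, which are the combinations of just 2 variables... Then I exclude everything I'm not interested in. Well... I got what I needed. But I'm sure this is a very silly way to do it, and I'd like to learn a better one... Any tips?
My code: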
#Getting the variable names from the data frame
av_variables<-variable.names(data.1)
#Creating a huge data frame for all possible combinations
corr_combinations <- as.data.frame(matrix(1,0,length(av_variables)))
for (i in 1:length(av_variables)){
  corr_combinations.i <- t(combn(av_variables,i))
  corr_combinations.new <- as.data.frame(matrix(1,length(corr_combinations.i[,1]),length(av_variables)))
  corr_combinations.new[,1:i] <- corr_combinations.i
  corr_combinations <- rbind(corr_combinations,corr_combinations.new)
}
#How many combinations for 0:2 variables?
comb_par_var<-choose(20, k=0:2)
##211
#A new column to receive the values
corr_combinations$cor <- 0
#Getting the correlations and assigning to the empty column
for (i in (length(av_variables)+1):(length(av_variables)+ sum(comb_par_var) +1)){
  print(i/length(corr_combinations[,1]))
  corr_combinations$cor[i] <- max(as.dist(abs(cor(data.1[,as.character(corr_combinations[i,which(corr_combinations[i,]!=0&corr_combinations[i,]!=1)])]))))
  # combinations$cor[i] <- max(as.dist(abs(cor(data.0[,as.character(combinations[i,combinations[i,]!=0&combinations[i,]!=1])]))))
}
#Keeping only the rows with the combinations of 2 variables
corr_combinations[1:(length(av_variables)+ sum(comb_par_var) +2),21]
corr_combinations<-corr_combinations[1:212,]
corr_combinations<-corr_combinations[21:210,]
#Keeping only the columns var1, var2 and cor
corr_combinations<-corr_combinations[,c(1,2,21)]
#Ordering to keep only the pairs with correlation >0.95,
#which was my purpose the whole time
corr_combinations <- corr_combinations[order(corr_combinations$cor),]
corr_combinations<-corr_combinations[corr_combinations$cor >0.95, ]
Answer 0 (score: 4)
You can compute the full correlation matrix in one go. Then you only need to reshape it. An example:
cr <- cor(mtcars)
# This is to remove redundancy as upper correlation matrix == lower
cr[upper.tri(cr, diag=TRUE)] <- NA
reshape2::melt(cr, na.rm=TRUE, value.name="cor")
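If reshape2 is not installed, roughly the same long format can be sketched in base R. This is only an illustration on mtcars: the names `idx`, `pairs` and `strong`, and the 0.8 cutoff, are my own choices and not part of the answer above.

```r
cr <- cor(mtcars)

# Row/column positions of the lower triangle (diagonal excluded),
# so every pair of variables appears exactly once
idx <- which(lower.tri(cr), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cr)[idx[, "row"]],
  var2 = colnames(cr)[idx[, "col"]],
  cor  = cr[idx]   # indexing with a two-column matrix picks one cell per row
)

# Keep only strongly correlated pairs (illustrative cutoff)
strong <- pairs[abs(pairs$cor) > 0.8, ]
strong[order(-abs(strong$cor)), ]
```

Filtering on `abs(cor)` keeps strong negative correlations as well; drop `abs()` if only positive ones matter.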
Answer 1 (score: 2)
A base R alternative is to use matrix subsetting on the row/column names pulled out with combn.
# get pairwise combination of variable names
vars <- t(combn(colnames(myMat), 2))
# build data.frame with matrix subsetting
data.frame(vars, myMat[vars])
X1 X2 myMat.vars.
1 V1 V2 0.8500071
2 V1 V3 -0.2828288
3 V1 V4 -0.2867921
4 V2 V3 -0.2698210
5 V2 V4 -0.2273411
6 V3 V4 0.9962044
You can also add the column names in the same line with setNames.
setNames(data.frame(vars, myMat[vars]), c("var1", "var2", "corr"))
Data
set.seed(1234)
myMat <- cor(matrix(rnorm(16), 4, dimnames=list(paste0("V", 1:4), paste0("V", 1:4))))
myMat
V1 V2 V3 V4
V1 1.0000000 0.8500071 -0.2828288 -0.2867921
V2 0.8500071 1.0000000 -0.2698210 -0.2273411
V3 -0.2828288 -0.2698210 1.0000000 0.9962044
V4 -0.2867921 -0.2273411 0.9962044 1.0000000
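Applied to the question's situation, the same idea might look like the sketch below. The `data.1` here is simulated, a hypothetical stand-in for the asker's 22-column data frame, with one highly correlated pair forced in so the filter has something to find. For 22 variables there are only choose(22, 2) = 231 pairs, so there is no need to enumerate all 4194304 subsets.

```r
set.seed(42)
# Hypothetical stand-in for data.1: 22 numeric columns named V1..V22
data.1 <- as.data.frame(matrix(rnorm(100 * 22), ncol = 22))
# Force one highly correlated pair (assumption, for illustration only)
data.1$V22 <- data.1$V1 + rnorm(100, sd = 0.01)

cm   <- cor(data.1)                   # full correlation matrix, computed once
vars <- t(combn(colnames(cm), 2))     # one row per unordered pair of names

res <- data.frame(var1 = vars[, 1],
                  var2 = vars[, 2],
                  cor  = cm[vars])    # character-matrix subsetting by dimnames

# The pairs the question was after: correlation above 0.95 in absolute value
high <- res[abs(res$cor) > 0.95, ]
high[order(-abs(high$cor)), ]
```

The question's code takes `abs()` of the correlations before comparing with 0.95, so the sketch does the same.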
Answer 2 (score: 1)
You can use tidyr to reshape the correlation matrix.
First, create a correlation matrix:
> d <- data.frame(x1=rnorm(10),
+ x2=rnorm(10),
+ x3=rnorm(10))
> x <- cor(d) # get correlations (returns matrix)
> x
x1 x2 x3
x1 1.0000000 0.3096685 -0.5358578
x2 0.3096685 1.0000000 -0.7497212
x3 -0.5358578 -0.7497212 1.0000000
Then, reshape it with tidyr:
> y <- as.data.frame(x)
> y$var1 <- row.names(y)
> library(tidyr)
> gather(data = y, key = "var2", value = "correlation", -var1)
var1 var2 correlation
1 x1 x1 1.0000000
2 x2 x1 0.3096685
3 x3 x1 -0.5358578
4 x1 x2 0.3096685
5 x2 x2 1.0000000
6 x3 x2 -0.7497212
7 x1 x3 -0.5358578
8 x2 x3 -0.7497212
9 x3 x3 1.0000000
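Note that the output above still contains the diagonal and both orderings of each pair. A possible way to get one row per pair is sketched below, using `pivot_longer`, which supersedes `gather` in recent tidyr versions. The seed and the names `long` and `unique_pairs` are my own additions, and the string comparison assumes the variable names are distinct.

```r
library(tidyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
d <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

y <- as.data.frame(cor(d))
y$var1 <- row.names(y)

# Same reshape as gather(), written with pivot_longer()
long <- pivot_longer(y, cols = -var1,
                     names_to = "var2", values_to = "correlation")

# Comparing the names drops the diagonal (var1 == var2)
# and keeps only one ordering of each pair
unique_pairs <- long[long$var1 < long$var2, ]
unique_pairs
```

With three variables this leaves three rows: x1/x2, x1/x3 and x2/x3.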