I have a data frame that looks like this:
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
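In reality there are many more columns, but the idea is that there are a number of traits (bmi, height and IQ in the example above), followed by the same number of columns holding the standardized residuals from regressing each trait on some variables (the columns named bmi.residuals, height.residuals and IQ.residuals in the example above). I want to create an object containing the correlation between each trait and its residuals, like this: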
trait correlation
bmi 0.85
height 0.90
IQ 0.75
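So the correlation for "bmi" is the correlation between bmi and bmi.residuals, the correlation for "height" is the correlation between height and height.residuals, the one for IQ is between IQ and IQ.residuals, and so on.
I could compute all the correlations one by one, but if I have many columns (many traits) in the data frame, there must be some way to automate this. Any ideas? I suspect lapply will come in handy, but I'm not sure how...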
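Answer 0 (score: 2)
Another solution, using dplyr and tidyr. The idea is to compute all correlations first, since that is simple and fast enough, then reshape the result into a long dataset and keep only the rows where the two variable names match but are not identical: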
df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)
library(dplyr)
library(tidyr)
# helper used later to filter rows: TRUE when x occurs in y
# (fixed = TRUE treats x as a literal string, so "." is not a regex wildcard)
f = function(x,y) grepl(x, y, fixed = TRUE)
f = Vectorize(f)
df %>%
  select(-ID) %>%                 # remove unnecessary columns
  cor() %>%                       # get all correlations (even ones you don't care about)
  data.frame() %>%                # save result as a dataframe
  mutate(v1 = row.names(.)) %>%   # add row names as a column
  gather(v2,cor, -v1) %>%         # reshape data to long format
  filter(f(v1,v2) & v1 != v2)     # keep pairs where v1 matches v2, but they are not the same
# v1 v2 cor
# 1 bmi bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3 IQ IQ.residuals 4.487375e-01
Another approach is to find the pairs of interest first and then compute only those correlations:
df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)
library(dplyr)
library(tidyr)
# helper used later to filter rows: TRUE when x occurs in y
# (fixed = TRUE treats x as a literal string, so "." is not a regex wildcard)
f = function(x,y) grepl(x, y, fixed = TRUE)
f = Vectorize(f)
# keep only the columns you want correlations for
df2 = df %>% select(-ID)
# function to get cor between two variables (columns of df2, selected by name)
f2 = function(x,y) cor(df2[,x], df2[,y])
f2 = Vectorize(f2)
# stringsAsFactors = FALSE keeps the names as character, so df2[,x] indexes by name
expand.grid(v1=names(df2), v2=names(df2), stringsAsFactors = FALSE) %>% # get all possible combinations of names
  filter(f(v1,v2) & v1 != v2) %>% # keep pairs of names where v1 matches v2, but are not the same
  mutate(cor = f2(v1,v2))         # for those pairs (only) obtain correlation value
# v1 v2 cor
# 1 bmi bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3 IQ IQ.residuals 4.487375e-01
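I suggest you pick whichever one is faster for your data, since the number of rows and columns you have may affect the speed of the two approaches above.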
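One way to decide is to time both strategies on data shaped like your own. The following is only a rough base-R sketch: the simulated data, the sizes, and the helper names `all_then_pick` and `pairs_only` are assumptions made for illustration, and it assumes every residual column is named `<trait>.residuals`.

```r
# Simulated data (illustrative only): 50 traits plus 50 residual columns
set.seed(1)
n_rows <- 200
n_traits <- 50
traits <- paste0("trait", seq_len(n_traits))
sim <- as.data.frame(matrix(rnorm(n_rows * n_traits * 2), nrow = n_rows))
names(sim) <- c(traits, paste0(traits, ".residuals"))

# Strategy 1: full correlation matrix, then pick the trait/residual cells by name
all_then_pick <- function(dat, traits) {
  m <- cor(dat)
  m[cbind(traits, paste0(traits, ".residuals"))]
}

# Strategy 2: compute only the pairs of interest
pairs_only <- function(dat, traits) {
  unname(sapply(traits, function(tr) {
    cor(dat[[tr]], dat[[paste0(tr, ".residuals")]])
  }))
}

print(system.time(r1 <- all_then_pick(sim, traits)))
print(system.time(r2 <- pairs_only(sim, traits)))
print(all.equal(r1, r2))  # both strategies should agree
```

The timings depend on your machine and on the shape of the data, so it is worth rerunning this with `n_rows` and `n_traits` set close to your real dimensions.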
Answer 1 (score: 1)
Maybe this works for you:
bmi <- c(26, 27, 23)
height <- c(187, 176, 189)
bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)
df <- data.frame(bmi, height, bmi.residuals, height.residuals)
corr_df <- data.frame(cor(df))
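# trait names = all column names that do not contain "residuals"
trait_names <- colnames(df)
trait_names <- trait_names[!grepl("residuals", trait_names)]
cors <- data.frame(
  traits = character(length(trait_names)),
  correlation = numeric(length(trait_names)),
  stringsAsFactors = FALSE
)
for (i in seq_along(trait_names)) {
  cors$traits[i] <- trait_names[i]
  # the second column matching the trait name is taken to be its .residuals column
  cors$correlation[i] <-
    corr_df[i, which(grepl(trait_names[i], names(corr_df)))[2]]
}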
Input:
> df
bmi height bmi.residuals height.residuals
1 26 187 0.1 0.3
2 27 176 0.3 0.2
3 23 189 0.4 0.1
Correlation matrix:
> corr_df
bmi height bmi.residuals height.residuals
bmi 1.0000000 -0.78920304 -0.57655666 0.7205767
height -0.7892030 1.00000000 -0.04676098 -0.1428571
bmi.residuals -0.5765567 -0.04676098 1.00000000 -0.9819805
height.residuals 0.7205767 -0.14285714 -0.98198051 1.0000000
Output:
> cors
traits correlation
1 bmi -0.5765567
2 height -0.1428571
Note that this only works if the original columns come before the .residuals columns.
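If the column order cannot be guaranteed, one option is to look the cells up by name rather than by position. This is a hedged variation on the answer, not part of it: it assumes each residual column is named exactly `<trait>.residuals`, and `corr_mat` and `trait_names` are names chosen here for illustration.

```r
# Same toy data as above, but with the residual columns deliberately placed first
df <- data.frame(
  bmi.residuals = c(0.1, 0.3, 0.4),
  height.residuals = c(0.3, 0.2, 0.1),
  bmi = c(26, 27, 23),
  height = c(187, 176, 189)
)

corr_mat <- cor(df)

# Traits are the columns whose name does not end in ".residuals"
trait_names <- colnames(df)[!grepl("\\.residuals$", colnames(df))]

# Index the matrix with (row name, column name) pairs, so order is irrelevant
cors <- data.frame(
  traits = trait_names,
  correlation = corr_mat[cbind(trait_names, paste0(trait_names, ".residuals"))],
  stringsAsFactors = FALSE
)
print(cors)
```

Indexing a matrix with a two-column character matrix selects one cell per row of the index, which is what makes the lookup independent of where the columns sit.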
Answer 2 (score: 1)
Here is a short solution:
Suppose you have a data frame with the variables a, a.resi, b and b.resi
df <- data.frame(a=c(1:10), b=c(1:10),
a.resi=c(-1:-10), b.resi=c(-1:-10))
First, create a vector (named "core") with all the core variables (i.e. those without the suffix .resi)
core <- names(df) [1:2]
Then, use paste0() to create another vector (named core.resi) containing the core variables with the suffix .resi
core.resi <- paste0(core, '.resi')
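Define a function with 3 arguments: the data frame (Data), x and y. This function computes the correlation between the given x and y in the data frame Data
MyFun <- function(Data, x,y) cor(Data[,x], Data[,y])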
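Finally, apply the function to the vectors core and core.resi (the %>% pipe needs magrittr, or a package that re-exports it such as dplyr, to be loaded)
library(magrittr)
mapply(MyFun, x=core, y=core.resi, MoreArgs = list(Data=df)) %>%
  data.frame()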
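With many columns, hard-coding the positions of the core variables becomes tedious. As a sketch under the assumption that every residual column ends in `.resi` and has a matching core column, the core names can be derived from the column names, and the result shaped into the trait/correlation layout the question asks for. The names `resi_cols`, `res` and `out` are illustrative only.

```r
df <- data.frame(a = 1:10, b = 1:10,
                 a.resi = -(1:10), b.resi = -(1:10))

# Find the residual columns by suffix, then strip the suffix to get the core names
resi_cols <- grep("\\.resi$", names(df), value = TRUE)
core <- sub("\\.resi$", "", resi_cols)

# One correlation per (core, residual) pair; mapply names the result after `core`
res <- mapply(function(x, y) cor(df[[x]], df[[y]]), core, resi_cols)

out <- data.frame(trait = names(res), correlation = unname(res),
                  stringsAsFactors = FALSE)
print(out)
```

In this toy data each residual column is the exact negative of its core column, so both correlations come out as -1.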
Answer 3 (score: 1)
You can try a tidyverse solution:
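library(tidyverse)
# d is the data frame from the question
cor(d[,-1]) %>%                      # all correlations, without the ID column
  as_tibble() %>%                    # as_tibble() replaces the deprecated as.tibble()
  add_column(Trait=colnames(.)) %>%  # add the variable names as a column
  gather(key, value, -Trait) %>%     # reshape to long format
  rowwise() %>%
  filter(grepl(paste(Trait, collapse = "|"), key)) %>% # keep rows where Trait occurs in key
  filter(Trait != key) %>%           # drop self-correlations
  ungroup()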
# A tibble: 3 x 3
Trait key value
<chr> <chr> <dbl>
1 bmi bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3 IQ IQ.residuals 4.487375e-01
Or you can start directly from the data.frame:
d %>%
  gather(key, value, -ID) %>%
  mutate(gr=strtrim(key,2)) %>%
  split(.$gr) %>%
  map(~spread(.,key, value)) %>%
  map(~cor(.[-1:-2])[,2]) %>%
  map(~data.frame(Trait1=names(.)[1], Trait2=names(.)[2], cor=.[1],stringsAsFactors = F)) %>%
  bind_rows()
Trait1 Trait2 cor
1 bmi bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3 IQ IQ.residuals 4.487375e-01
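One caveat with the second pipeline: `strtrim(key, 2)` groups columns by their first two characters, which works for bmi, height and IQ but would merge any traits that share a prefix. The sketch below shows the collision and one possible alternative key. The trait `heart_rate` is hypothetical and not in the original data, and the alternative assumes residual columns end in `.residuals`.

```r
# Hypothetical column names: two traits that share the prefix "he"
keys <- c("height", "heart_rate", "height.residuals", "heart_rate.residuals")

# Grouping key used in the answer: first two characters only
prefix_key <- strtrim(keys, 2)
print(prefix_key)   # every entry is "he", so the two traits collide

# Alternative grouping key: the column name with the ".residuals" suffix removed
suffix_key <- sub("\\.residuals$", "", keys)
print(suffix_key)
```

Under that assumption, replacing the grouping step with `mutate(gr = sub("\\.residuals$", "", key))` should keep each trait paired with its own residual column.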