Question

I have a data frame like the following:

ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4

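There are actually many more columns, but the idea is that there are a number of traits (bmi, height and IQ in the example above), followed by the same number of columns holding the standardized residuals after regressing each trait on some variables (the columns called bmi.residuals, height.residuals and IQ.residuals in the example above). I want to create an object holding the correlation between each trait and its residuals, like this: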

trait correlation 
bmi 0.85
height 0.90
IQ 0.75

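So the correlation for "bmi" is the correlation between bmi and bmi.residuals, the correlation for "height" is the correlation between height and height.residuals, the one for IQ is between IQ and IQ.residuals, and so on.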

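I could compute all the correlations one by one, but with many columns (many traits) in the data frame there must be some way to automate this. Any ideas? I suspect lapply will come in handy, but I'm not sure how...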

Answer 1

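Another solution, using dplyr and tidyr. The idea is to compute all the correlations first, since that is simple and fast enough, then reshape the result into a long data set and keep only the rows where the variable names match but are not identical: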

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)


df %>% 
  select(-ID) %>%                # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2,cor, -v1) %>%        # reshape data
  filter(f(v1,v2) & v1 != v2)    # keep pairs that v1 matches v2, but are not the same

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01

An alternative is to identify the pairs of interest first and then compute only those correlations:

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)

# keep only the columns you want correlations for
df2 = df %>% select(-ID)

# function to get cor between two variables (columns of df2, looked up by name)
f2 = function(x,y) cor(df2[,x], df2[,y])
f2 = Vectorize(f2)

expand.grid(v1=names(df2), v2=names(df2), stringsAsFactors=FALSE) %>%  # all possible pairs of names, kept as character strings
  filter(f(v1,v2) & v1 != v2) %>%              # keep pairs of names where v1 matches v2, but are not the same
  mutate(cor = f2(v1,v2))                      # for those pairs (only) obtain correlation value

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01

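I suggest you pick whichever is faster on your data, since the number of rows and columns you have may affect the speed of the two approaches above.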
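
One thing to watch with both versions: grepl(x, y) treats x as a regular expression and matches it anywhere in y, so a trait named bmi would also pair with a column such as bmi2.residuals. Here is a minimal base R sketch that pairs columns by exact name instead, assuming every residual column is the trait name plus the suffix .residuals (as in the question). The names num, traits, resid_cols and result are only illustrative:

```r
df <- read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header = TRUE)

# drop the ID column and compute the full correlation matrix once
num <- df[, names(df) != "ID"]
all_cor <- cor(num)

# traits are the columns that do NOT end in ".residuals" (assumed naming scheme)
traits <- names(num)[!grepl("\\.residuals$", names(num))]
resid_cols <- paste0(traits, ".residuals")

# a two-column character matrix selects one (row, column) cell per pair
result <- data.frame(
  trait = traits,
  correlation = all_cor[cbind(traits, resid_cols)],
  stringsAsFactors = FALSE
)
result
```

If a trait has no matching residual column, the matrix lookup fails with a subscript error, which is arguably better than silently pairing the wrong columns.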

Answer 2

Maybe this will work for you:

bmi <- c(26, 27, 23)
height <- c(187, 176, 189)

bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)

df <- data.frame(bmi, height, bmi.residuals, height.residuals)

corr_df <- data.frame(cor(df))

names <- colnames(df)
names <- names[!grepl("residuals", names)]

cors <- data.frame(
  traits = character(length(names)),
  correlation = numeric(length(names)),
  stringsAsFactors = FALSE
)

for (i in 1:length(names)) {
  cors$traits[i] <- names[i]
  cors$correlation[i] <-
    corr_df[i, which(grepl(names[i], names(corr_df)))[2]]
}

Input:

> df
  bmi height bmi.residuals height.residuals
1  26    187           0.1              0.3
2  27    176           0.3              0.2
3  23    189           0.4              0.1

Correlation matrix:

> corr_df
                        bmi      height bmi.residuals height.residuals
bmi               1.0000000 -0.78920304   -0.57655666        0.7205767
height           -0.7892030  1.00000000   -0.04676098       -0.1428571
bmi.residuals    -0.5765567 -0.04676098    1.00000000       -0.9819805
height.residuals  0.7205767 -0.14285714   -0.98198051        1.0000000

Output:

> cors
  traits correlation
1    bmi  -0.5765567
2 height  -0.1428571

Note that this only works if the original columns come before the .residuals columns.
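
If the column order cannot be guaranteed, the residual column can be looked up by name rather than by position. A rough sketch, assuming each residual column is named <trait>.residuals; the columns are deliberately shuffled here to show that the order no longer matters:

```r
bmi <- c(26, 27, 23)
height <- c(187, 176, 189)

bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)

# same data as above, but with the columns in a shuffled order
df <- data.frame(bmi.residuals, height, height.residuals, bmi)

# traits are the columns whose names do not contain "residuals"
traits <- names(df)[!grepl("residuals", names(df))]

cors <- data.frame(
  traits = traits,
  # for each trait, correlate it with the column named "<trait>.residuals"
  correlation = vapply(
    traits,
    function(tr) cor(df[[tr]], df[[paste0(tr, ".residuals")]]),
    numeric(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)
cors
```

The values are the same as in the output above; only the row order follows the shuffled column order.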

Answer 3

Here is a short solution:

Suppose you have a data frame with the variables a, a.resi, b and b.resi

df <- data.frame(a=c(1:10), b=c(1:10),
              a.resi=c(-1:-10), b.resi=c(-1:-10))

First, create a vector (named "core") with all the core variables, i.e. the ones without the suffix .resi

core <- names(df) [1:2]

Then use paste0() to create another vector (named core.resi) containing the core variable names with the suffix .resi

core.resi <- paste0(core, '.resi')

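Define a function with 3 arguments: the data frame (Data), x and y. This function computes the correlation between the given columns x and y of the data frame Data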

MyFun <- function(Data, x,y) cor(Data[,x], Data[,y])

Finally, apply the function to the vectors core and core.resi

library(magrittr)  # provides the %>% pipe used below
mapply(MyFun, x=core, y=core.resi, MoreArgs = list(Data=df)) %>% 
  data.frame()
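
With many variables, typing the positions in names(df)[1:2] by hand gets awkward. A possible variation, assuming every residual column ends in .resi and has a matching core column, derives both vectors from the column names (suffix and is_resi are names made up for this sketch):

```r
# same toy data as above, with core and residual columns interleaved
df <- data.frame(a = 1:10, a.resi = -(1:10),
                 b = 1:10, b.resi = -(1:10))

suffix <- ".resi"                            # assumed naming convention
is_resi <- endsWith(names(df), suffix)       # TRUE for the residual columns
core.resi <- names(df)[is_resi]

# strip the suffix to recover the matching core variable names
core <- substr(core.resi, 1, nchar(core.resi) - nchar(suffix))
stopifnot(all(core %in% names(df)))          # every residual needs its core column

# one correlation per (core, core.resi) pair, named after the core variable
result <- mapply(function(x, y) cor(df[[x]], df[[y]]), core, core.resi)
result
```

The result is a named numeric vector; wrap it in data.frame() as above if a table is preferred.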

Answer 4

You can try a tidyverse solution:

library(tidyverse)
# d is the data frame from the question, with ID in the first column
cor(d[,-1]) %>% 
  as_tibble() %>% 
  add_column(Trait=colnames(.)) %>% 
  gather(key, value, -Trait) %>% 
  rowwise() %>% 
  filter(grepl(paste(Trait, collapse = "|"), key)) %>% 
  filter(Trait != key) %>% 
  ungroup()
# A tibble: 3 x 3
   Trait              key         value
   <chr>            <chr>         <dbl>
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01

Or you can start directly from the data.frame:

d %>% 
  gather(key, value, -ID) %>% 
  mutate(gr=strtrim(key,2)) %>% 
  split(.$gr) %>% 
  map(~spread(.,key, value)) %>%
  map(~cor(.[-1:-2])[,2]) %>% 
  map(~data.frame(Trait1=names(.)[1], Trait2=names(.)[2], cor=.[1],stringsAsFactors = F)) %>% 
  bind_rows()  
  Trait1           Trait2           cor
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01
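
Note that strtrim(key, 2) groups columns by their first two characters, which is fine for bmi, height and IQ but would merge traits that share a prefix. A small sketch of the difference, using a hypothetical extra trait heart_rate that is not in the question's data, and assuming the .residuals suffix:

```r
# column names as they would appear in the key column after gather();
# heart_rate is a made-up trait added only to show the collision
keys <- c("bmi", "height", "heart_rate",
          "bmi.residuals", "height.residuals", "heart_rate.residuals")

# grouping by the first two characters lumps height and heart_rate together
by_prefix <- strtrim(keys, 2)
table(by_prefix)

# stripping the suffix instead gives exactly one group per trait
by_suffix <- sub("\\.residuals$", "", keys)
table(by_suffix)
```

In the pipeline above, the analogous change would be mutate(gr = sub("\\.residuals$", "", key)).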

R: correlation between pairs of variables
