Question

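I want to use dplyr to find the rank correlation between columns of a data.frame.

I'm sure there is a simple solution to this, but I think the problem is that I can't pass two inputs to the cor function when using it inside dplyr's summarize_each_.

For the following df: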


I want to get the rank correlations between all .x and .y combinations. My problem is in the function below ????

df <- data.frame(Universe = c(rep("A", 5), rep("B", 5)),
                 AA.x = rnorm(10), BB.x = rnorm(10), CC.x = rnorm(10),
                 AA.y = rnorm(10), BB.y = rnorm(10), CC.y = rnorm(10))

I want cor to include only the relevant correlation pairs for each Universe: AA.x.AA.y, AA.x.BB.y, ...

Please help!

Answer 1

Try this:

library(dplyr)                                                # %>%, mutate, full_join, summarise
library(data.table)                                           # needed for fast melt
setDT(df)                                                     # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))

mdf %>% 
  mutate(obs_set = substr(variable, 4, 4) ) %>%               # ".x" or ".y" subgroup
  full_join(.,., by=c('Universe', 'obs_set', 'id')) %>%       # see notes
  group_by(Universe, variable.x, variable.y) %>%
  filter(variable.x != variable.y) %>%
  dplyr::summarise(rank_corr = cor(value.x, value.y, 
                   method='spearman', use='pairwise.complete.obs'))

Produces:

   Universe variable.x variable.y rank_corr
     (fctr)     (fctr)     (fctr)     (dbl)
1         A       AA.x       BB.x      -0.9
2         A       AA.x       CC.x      -0.9
3         A       BB.x       AA.x      -0.9
4         A       BB.x       CC.x       0.8
5         A       CC.x       AA.x      -0.9
6         A       CC.x       BB.x       0.8
7         A       AA.y       BB.y      -0.3
8         A       AA.y       CC.y       0.2
9         A       BB.y       AA.y      -0.3
10        A       BB.y       CC.y      -0.3
..      ...        ...        ...       ...

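Notes:

Melt: converts the table to long form, one row per observation. To melt inside a dplyr chain I would have to use tidyr::gather, so pick your dependency. Using data.table is faster and not hard to follow. This step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr, as you wanted.
Full join: joins the melted table to itself to create paired observations for every pairing of variables, matched on the common Universe and observation id (edit: and now the '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired with themselves, since we know those correlations are 1. If you want to include them for a correlation matrix, comment out this step.
Summarise using Spearman correlation. Note that you should call dplyr::summarise explicitly, because if you also have plyr loaded you could accidentally call plyr::summarise.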
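
To make the self-join and filter steps concrete, here is a minimal base-R sketch on a tiny hand-made long table. The table `long`, its values, and the use of `merge()` in place of `full_join()` are assumptions chosen for illustration, not part of the answer's code.

```r
# Tiny long-format table (hypothetical values): two variables, three observations each
long <- data.frame(
  id       = rep(1:3, times = 2),
  variable = rep(c("AA.x", "BB.x"), each = 3),
  value    = c(1, 2, 3, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Self-join on the observation id; merge() suffixes the clashing
# columns with .x and .y, just as the dplyr join does
pairs <- merge(long, long, by = "id")

# Drop pairings of a variable with itself (their correlation is always 1)
pairs <- pairs[pairs$variable.x != pairs$variable.y, ]

# Spearman correlation for one ordered pair of variables
sub <- pairs[pairs$variable.x == "AA.x" & pairs$variable.y == "BB.x", ]
rho <- cor(sub$value.x, sub$value.y, method = "spearman")
```

Each id contributes every ordered pair of variables, which is why both (AA.x, BB.x) and (BB.x, AA.x) show up in the output above with the same correlation.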

Answer 2

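Another approach is to call the cor function just once, since that computes all the required correlations. Repeated calls to cor could be a performance problem for large data sets. Code to do this and extract the correlation pairs with labels could look like this: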

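library(dplyr)     # %>%, group_by, do, n_groups
library(reshape2)  # assumed source of melt() for matrices
#
# calculate correlations and display in matrix format
#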
cor_matrix <- df %>% group_by(Universe) %>%
              do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%  
              data.frame(row=rep(colnames(.)[-1], n_groups(.))) 
#
# calculate correlations and display in column format
#
num_col=ncol(df[,-1])
out_indx <-  which(upper.tri(diag(num_col))) 
cor_cols <- df %>% group_by(Universe) %>%
            do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"), value.name="cor")[out_indx,])
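
If only the .x versus .y pairs from the question are wanted, `cor()` also accepts two sets of columns and returns just that cross block. The following is a base-R sketch under stated assumptions: the fixed seed, the helper names `x_cols`, `y_cols` and `cross_cor`, and the use of `split()`/`lapply()` instead of dplyr grouping are all illustrative choices, not part of the answer above.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  Universe = c(rep("A", 5), rep("B", 5)),
  AA.x = rnorm(10), BB.x = rnorm(10), CC.x = rnorm(10),
  AA.y = rnorm(10), BB.y = rnorm(10), CC.y = rnorm(10)
)

# Column names ending in .x and .y
x_cols <- grep("\\.x$", names(df), value = TRUE)
y_cols <- grep("\\.y$", names(df), value = TRUE)

# One cor() call per Universe; rows are the .x columns, columns are the .y columns
cross_cor <- lapply(split(df, df$Universe), function(g) {
  cor(g[, x_cols], g[, y_cols],
      method = "spearman", use = "pairwise.complete.obs")
})

cross_cor$A  # 3 x 3 matrix of .x versus .y rank correlations for Universe A
```

This skips the .x versus .x and .y versus .y blocks entirely, so there is nothing to filter out afterwards.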

Answer 3

So, here is the winning (time-wise) solution to my problem:

library(dplyr)  # %>%, group_by, summarize, contains
library(tidyr)  # gather, unite
d <- df %>% gather(R1,R1v,contains(".x")) %>% gather(R2,R2v,contains(".y"),-Universe) %>% group_by(Universe,R1,R2) %>% 
       summarize(ICAC = cor(x=R1v, y=R2v,method = 'spearman',use = "pairwise.complete.obs")) %>% 
       unite(Pair, R1, R2, sep="_")

Although it takes 0.005 ms in this example, adding data increases the time.
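
One way to see how the run time grows with the amount of data is to time the correlation step on inputs of different sizes. This is only a rough sketch: `time_cor` is a hypothetical helper, the row counts and repetition count are arbitrary, and it times base `cor()` on random matrices rather than the full gather pipeline above.

```r
# Hypothetical helper: elapsed seconds for `reps` Spearman cross-correlations
# between two random n x 3 matrices
time_cor <- function(n, reps = 20) {
  x <- matrix(rnorm(n * 3), ncol = 3)
  y <- matrix(rnorm(n * 3), ncol = 3)
  system.time(
    for (i in seq_len(reps)) cor(x, y, method = "spearman")
  )[["elapsed"]]
}

set.seed(1)
timings <- sapply(c(10, 1e4), time_cor)  # small versus larger input
timings
```

The absolute numbers depend on the machine, so they are best used to compare candidate solutions on the same data rather than read as fixed costs.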

Correlation using funs in dplyr

3 answers: