Question

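I am trying to build an adjacency matrix or edge list from some presence/absence data in R. I have a very large data frame (~12k obs. of 196 variables) that looks a bit like this: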

test_input<-data.frame(sample_ID=c("samp1","samp2","samp3","samp4","samp5","samp6","samp7"),
                       sp1 = c(1,0,0,1,1,0,1),
                       sp2 = c(1,0,0,1,1,1,1),
                       sp3 = c(0,1,1,0,0,0,1),
                       sp4 = c(0,1,1,0,0,1,0), stringsAsFactors = FALSE)
> test_input
  sample_ID sp1 sp2 sp3 sp4
1     samp1   1   1   0   0
2     samp2   0   0   1   1
3     samp3   0   0   1   1
4     samp4   1   1   0   0
5     samp5   1   1   0   0
6     samp6   0   1   0   1
7     samp7   1   1   1   0

I would like to end up with something like this:

> test_output
  col1 col2 freq
1  sp1  sp2    4
2  sp3  sp4    2
3  sp2  sp4    1
4  sp1  sp3    1
5  sp2  sp3    1

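I have seen some nested for-loop approaches like the one here, but on a data frame the size of mine they are very slow (days or weeks of run time), and they generate a data frame of every possible presence/absence combination for each sample.

Any suggestions on how I could tackle this? Preferably in a vectorized/tidyverse kind of way.

Thanks!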

Answer 1

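You can try this approach with combn: take every combination of 2 of the sp columns and compute their inner product, which gives the co-occurrence frequency: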

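# All unordered pairs of species column names (a 2 x n_pairs character matrix)
sp_pairs <- combn(names(test_input[-1]), 2)
# Inner product of each pair of 0/1 columns = number of samples containing both
freq <- combn(test_input[-1], 2, function(x) sum(x[[1]] * x[[2]]))

data.frame(col1 = sp_pairs[1,], col2 = sp_pairs[2,], freq = freq)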

#  col1 col2 freq
#1  sp1  sp2    4
#2  sp1  sp3    1
#3  sp1  sp4    0
#4  sp2  sp3    1
#5  sp2  sp4    1
#6  sp3  sp4    2

Note: this includes pairs that never occur together; filter them out if you do not need them.
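
For a data frame as large as the one described in the question, it may be worth computing all the pairwise inner products in one matrix operation rather than pair by pair. The following is a minimal sketch of that idea, not part of the original answer: it assumes every column except sample_ID is a numeric 0/1 indicator, and the names m, co, idx and edges are just illustrative. crossprod(m) is equivalent to t(m) %*% m, so the result is already a weighted species-by-species adjacency matrix, and the edge list is read off its upper triangle.

```r
# Same toy data as in the question.
test_input <- data.frame(
  sample_ID = paste0("samp", 1:7),
  sp1 = c(1, 0, 0, 1, 1, 0, 1),
  sp2 = c(1, 0, 0, 1, 1, 1, 1),
  sp3 = c(0, 1, 1, 0, 0, 0, 1),
  sp4 = c(0, 1, 1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Samples x species matrix of 0/1 values (assumes all non-ID columns are numeric).
m <- as.matrix(test_input[-1])

# Species x species co-occurrence counts; the diagonal holds each species' total.
co <- crossprod(m)

# Row/column positions of the upper triangle, so each unordered pair appears once.
idx <- which(upper.tri(co), arr.ind = TRUE)

edges <- data.frame(
  col1 = rownames(co)[idx[, "row"]],
  col2 = colnames(co)[idx[, "col"]],
  freq = co[idx],
  stringsAsFactors = FALSE
)

# Drop pairs that never co-occur and sort by descending frequency.
edges <- edges[edges$freq > 0, ]
edges <- edges[order(-edges$freq), ]
```

Here co can be used directly if an adjacency matrix is wanted, while edges is the edge-list form with the zero-count pairs removed.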
