Table of all intersections between two data frames

Time: 2015-09-28 00:10:33

Tags: r

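I have two data frames. Each row of each data frame has a different number of elements (they are actually gene names). I read them with read.csv("file.csv",fill=TRUE), so some rows contain padding.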
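As a side note on the padding: a minimal sketch of how ragged rows might be read and normalized so that every padded cell is a real NA. The inline text stands in for a hypothetical file.csv, and header = FALSE is an assumption (the call in the question uses the default header).

```r
# Hypothetical file contents: one cluster of gene names per line, ragged lengths.
csv_text <- "a,b\nc,d,e,f\ng,h,i\nj"

# header = FALSE is an assumption; fill = TRUE pads the short rows.
df1 <- read.csv(text = csv_text, header = FALSE, fill = TRUE,
                stringsAsFactors = FALSE)

# Work on a character matrix so that each row is a plain character vector.
m1 <- as.matrix(df1)

# Treat both empty padding and literal "NA" strings as missing values.
m1[m1 %in% c("", "NA")] <- NA

print(m1)
```

With real NA values in place, the padding can later be dropped with na.omit() instead of being counted as a shared element.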

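Both data frames contain the same elements, but they are clustered differently, so the elements fall into different groups. I would like to output a cross-table of the intersections between the two data frames.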

So, if

df1<-data.frame(c("a","b","NA","NA"),c("c","d","e","f"),c("g","h","i","NA" ),c("j","NA","NA","NA"))
df2<-data.frame(c("c","e","i","NA"),c("f","g","h","NA"),c("a","b","d","j" ))

then I would like to get something like this:

        df1[1,] df1[2,] df1[3,] df1[4,]
df2[1,]       0       2       1       0
df2[2,]       0       1       2       0
df2[3,]       2       1       0       1

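This looks like something I should be able to do with intersect() and some kind of apply function, but I can't wrap my head around it. Using my google-fu, the closest I could find was: Finding an efficient way to count the number of overlaps between interval sets in two tables?, but as best I can tell that question deals with data tables and looks at numeric overlaps between line segments rather than lists of names.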

Does anyone know how to do this?

1 answer:

Answer 0: (score: 3)

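You can do this by looping over the rows of each data frame and computing the length of the intersection of each pair of rows, omitting missing values: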

apply(df1, 1, function(i) apply(df2, 1, function(j) length(na.omit(intersect(i, j)))))
#      [,1] [,2] [,3] [,4]
# [1,]    0    2    1    0
# [2,]    0    1    2    0
# [3,]    2    1    0    1
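To illustrate why the na.omit() step matters, and how the result could be labeled like the table in the question, here is a small sketch. The names m1, m2 and count_overlaps are hypothetical and chosen only for this example. The data mirror the answer's sample matrices.

```r
# Sample data: one cluster per row, padded with real NA values.
m1 <- rbind(c("a","b", NA, NA), c("c","d","e","f"),
            c("g","h","i", NA), c("j", NA, NA, NA))
m2 <- rbind(c("c","e","i", NA), c("f","g","h", NA),
            c("a","b","d","j"))

# intersect() treats NA like any other value, so two padded rows "share" an NA
# even when they have no gene in common.
shared <- intersect(m1[1, ], m2[1, ])
print(shared)

# Hypothetical wrapper around the nested apply() call, with row/column labels.
count_overlaps <- function(a, b) {
  res <- apply(a, 1, function(i)
    apply(b, 1, function(j) length(na.omit(intersect(i, j)))))
  # apply() returns a plain vector when b has a single row; force a matrix.
  res <- matrix(res, nrow = nrow(b), ncol = nrow(a))
  dimnames(res) <- list(paste0("df2[", seq_len(nrow(b)), ",]"),
                        paste0("df1[", seq_len(nrow(a)), ",]"))
  res
}

res <- count_overlaps(m1, m2)
print(res)
```

The outer apply() walks the rows of the first argument, and each inner result becomes one column, which is why the rows of the output correspond to the second argument.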

Sample data:

(df1<-rbind(c("a","b", NA, NA),c("c","d","e","f"),c("g","h","i", NA),c("j", NA, NA, NA)))
#      [,1] [,2] [,3] [,4]
# [1,] "a"  "b"  NA   NA  
# [2,] "c"  "d"  "e"  "f" 
# [3,] "g"  "h"  "i"  NA  
# [4,] "j"  NA   NA   NA  
(df2<-rbind(c("c","e","i", NA),c("f","g","h", NA),c("a","b","d","j")))
#      [,1] [,2] [,3] [,4]
# [1,] "c"  "e"  "i"  NA  
# [2,] "f"  "g"  "h"  NA  
# [3,] "a"  "b"  "d"  "j"
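Because both inputs contain the same elements, only grouped differently, the same counts can also be seen as a contingency table of cluster memberships. The following is a sketch of that alternative, not the method from the answer. It assumes every gene occurs exactly once in each input, and cluster_of, m1 and m2 are hypothetical names.

```r
# Sample data: one cluster per row, padded with real NA values.
m1 <- rbind(c("a","b", NA, NA), c("c","d","e","f"),
            c("g","h","i", NA), c("j", NA, NA, NA))
m2 <- rbind(c("c","e","i", NA), c("f","g","h", NA),
            c("a","b","d","j"))

# Hypothetical helper: map each gene name to the row (cluster) it occupies.
cluster_of <- function(m) {
  keep <- !is.na(m)
  setNames(row(m)[keep], m[keep])
}
g1 <- cluster_of(m1)
g2 <- cluster_of(m2)

# Assumption: each gene appears exactly once in each matrix.
genes <- names(g1)
stopifnot(!anyDuplicated(genes), setequal(genes, names(g2)))

# Cross-tabulate the two cluster assignments; explicit factor levels keep
# clusters that happen to be empty.
tab <- table(df2 = factor(g2[genes], levels = seq_len(nrow(m2))),
             df1 = factor(g1[genes], levels = seq_len(nrow(m1))))
print(tab)
```

This avoids the nested loop over row pairs, at the cost of relying on the one-occurrence-per-gene assumption checked by stopifnot().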