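I have two data frames. Each row of a data frame holds a different number of elements (gene names, in fact). I read them in with read.csv("file.csv",fill=TRUE),
so some rows contain padding.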
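Both data frames contain the same elements, but the elements are clustered differently, so they fall into different groups. I want to output a cross-tabulation of the two data frames.
So, if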
df1<-data.frame(c("a","b","NA","NA"),c("c","d","e","f"),c("g","h","i","NA" ),c("j","NA","NA","NA"))
df2<-data.frame(c("c","e","i","NA"),c("f","g","h","NA"),c("a","b","d","j" ))
then I would like to get something like this:
         df1[1,] df1[2,] df1[3,] df1[4,]
df2[1,]        0       2       1       0
df2[2,]        0       1       2       0
df2[3,]        2       1       0       1
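It seems like something I should be able to do with intersect() and some kind of apply function, but I can't work it out. With my google-fu, the closest thing I could find is: Finding an efficient way to count the number of overlaps between interval sets in two tables?, but that deals with data tables and, as far as I can tell, looks at numeric overlaps between line segments rather than lists of names.
Does anyone know how to do this?
Answer 0 (score: 3)
You can loop over the rows of each data frame and compute the length of the intersection of each pair of rows, omitting missing values: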
apply(df1, 1, function(i) apply(df2, 1, function(j) length(na.omit(intersect(i, j)))))
#      [,1] [,2] [,3] [,4]
# [1,]    0    2    1    0
# [2,]    0    1    2    0
# [3,]    2    1    0    1
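If the padding is not always a real NA (for instance empty strings, or the literal string "NA" as in the question's data), the same idea can be wrapped in a small helper that cleans each row first and labels the result. This is only a sketch: `overlap_counts`, `pad`, `m1` and `m2` are names made up for illustration, and the padding values are assumptions about what the input might contain.

```r
# Hypothetical helper: count shared elements between every row of `a`
# and every row of `b`. `pad` lists strings treated as padding (assumed).
overlap_counts <- function(a, b, pad = c("", "NA")) {
  # Turn each row into a set of real values, dropping NA and padding strings
  row_sets <- function(x) {
    lapply(seq_len(nrow(x)), function(k) {
      v <- unique(as.character(unlist(x[k, ])))
      v[!is.na(v) & !(v %in% pad)]
    })
  }
  sa <- row_sets(a)
  sb <- row_sets(b)
  # Rows of the result correspond to rows of b, columns to rows of a
  out <- sapply(sa, function(i) sapply(sb, function(j) length(intersect(i, j))))
  out <- matrix(out, nrow = length(sb), ncol = length(sa))
  dimnames(out) <- list(paste0("b[", seq_along(sb), ",]"),
                        paste0("a[", seq_along(sa), ",]"))
  out
}

# Example input mixing real NA and empty-string padding
m1 <- rbind(c("a","b", NA, NA), c("c","d","e","f"), c("g","h","i", NA), c("j", NA, NA, NA))
m2 <- rbind(c("c","e","i", ""), c("f","g","h", ""), c("a","b","d","j"))
res <- overlap_counts(m1, m2)
print(res)
```

The helper accepts either matrices or data frames, since each row is flattened with unlist() before the comparison.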
Sample data (one group per row, with real NA values rather than the string "NA"):
(df1<-rbind(c("a","b", NA, NA),c("c","d","e","f"),c("g","h","i", NA),c("j", NA, NA, NA)))
#      [,1] [,2] [,3] [,4]
# [1,] "a"  "b"  NA   NA
# [2,] "c"  "d"  "e"  "f"
# [3,] "g"  "h"  "i"  NA
# [4,] "j"  NA   NA   NA
(df2<-rbind(c("c","e","i", NA),c("f","g","h", NA),c("a","b","d","j")))
#      [,1] [,2] [,3] [,4]
# [1,] "c"  "e"  "i"  NA
# [2,] "f"  "g"  "h"  NA
# [3,] "a"  "b"  "d"  "j"
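As a possible alternative to the nested apply, the same counts can be obtained by reshaping each matrix into a long (element, group) table, joining on the element, and calling table(). This is a sketch of a different technique, not the method used in the answer; it assumes each element occurs at most once per data set, and `membership`, `gene`, `m1` and `m2` are illustrative names.

```r
# Sample matrices, same values as the sample data above
m1 <- rbind(c("a","b", NA, NA), c("c","d","e","f"), c("g","h","i", NA), c("j", NA, NA, NA))
m2 <- rbind(c("c","e","i", NA), c("f","g","h", NA), c("a","b","d","j"))

# For each non-missing element, record which row (group) it belongs to
membership <- function(m) {
  keep <- !is.na(m)
  data.frame(gene = m[keep], group = row(m)[keep])
}
long1 <- membership(m1)
long2 <- membership(m2)

# Join on the element name; the group columns become group1 and group2
both <- merge(long1, long2, by = "gene", suffixes = c("1", "2"))

# Cross-tabulate the group labels; explicit levels keep empty groups
tab <- table(df2 = factor(both$group2, levels = seq_len(nrow(m2))),
             df1 = factor(both$group1, levels = seq_len(nrow(m1))))
print(tab)
```

With many rows this avoids computing an intersection for every pair of rows, at the cost of building the joined table first.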