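I am looking for a way in R to combine matching rows of a large table (e.g. 1000 by 300). NAs should be ignored when deciding whether rows match, and the NA values in one row should be filled with the non-NA values from the other matching rows, so that each group ends up with a single consensus row. Some rows can match (belong to) more than one group and should be assigned to every such group.
For example: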
data <- rbind(c("A", "A", "B", "C", NA, NA),
              c("A", "A", "B", NA, NA, NA),
              c("B", "B", "C", "B", NA, NA),
              c(NA, NA, NA, NA, "D", NA),
              c(NA, NA, "B", "C", "D", "D"),
              c("B", NA, NA, NA, NA, "C"),
              c(NA, NA, NA, "B", "D", "C"))
data
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "A"  "B"  "C"  NA   NA
# [2,] "A"  "A"  "B"  NA   NA   NA
# [3,] "B"  "B"  "C"  "B"  NA   NA
# [4,] NA   NA   NA   NA   "D"  NA
# [5,] NA   NA   "B"  "C"  "D"  "D"
# [6,] "B"  NA   NA   NA   NA   "C"
# [7,] NA   NA   NA   "B"  "D"  "C"
Rows 1, 2, 4, 5 and rows 3, 4, 6, 7 should be grouped to form the following two consensus rows:
# (1,2,4,5) "A" "A" "B" "C" "D" "D"
# (3,4,6,7) "B" "B" "C" "B" "D" "C"
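I considered using dplyr's group_by, but unfortunately that does not work here, because every column contains NA values within the groups.
Any suggestions would be greatly appreciated.
Many thanks!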
Answer 0 (score: 0)
I don't know how or why you chose 1,2,4,5 and 3,4,6,7, but I will use them to give you the result you want.
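library(dplyr) # provides coalesce()
# turn each row of the matrix into a plain character vector
A=lapply(apply(data,1,list),unlist)
# for each group, splice its rows into coalesce() to take the first non-NA value per column
t(sapply(list(c(1,2,4,5),c(3,4,6,7)),function(x)coalesce(!!! A[x])))
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "A"  "B"  "C"  "D"  "D"
# [2,] "B"  "B"  "C"  "B"  "D"  "C"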
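For comparison, here is a minimal base R sketch of the same "first non-NA value per column" idea, with no dplyr dependency. The helper name `consensus` is hypothetical (it is not part of the answer above), and the groups are still supplied by hand, as in the answer.

```r
# Example matrix from the question (row 2 uses a real NA in column 4)
data <- rbind(c("A", "A", "B", "C", NA, NA),
              c("A", "A", "B", NA, NA, NA),
              c("B", "B", "C", "B", NA, NA),
              c(NA, NA, NA, NA, "D", NA),
              c(NA, NA, "B", "C", "D", "D"),
              c("B", NA, NA, NA, NA, "C"),
              c(NA, NA, NA, "B", "D", "C"))

# Hypothetical helper: for the given rows of m, return the first
# non-NA value found in each column (NA if the column is all NA).
consensus <- function(m, rows) {
  apply(m[rows, , drop = FALSE], 2, function(col) {
    vals <- col[!is.na(col)]
    if (length(vals) == 0) NA_character_ else vals[1]
  })
}

# Groups are assumed to be known in advance
groups <- list(c(1, 2, 4, 5), c(3, 4, 6, 7))
result <- t(sapply(groups, function(g) consensus(data, g)))
result
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "A"  "B"  "C"  "D"  "D"
# [2,] "B"  "B"  "C"  "B"  "D"  "C"
```

Like `coalesce()`, this takes the first non-NA value and does not check that the rows in a group actually agree, so it is only meaningful when the groups are already known to be consistent.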
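Answer 1 (score: 0)
I have worked out a solution, though I am sure there are other, probably more efficient, ways to get the same result.
The code below first finds which rows match each other. For each row, it forms a pair with every row that matches it, then uses dplyr to merge the two rows of each pair so that the NAs are filled in. The merged rows then replace the rows they were built from, and the process repeats until no more matches are found between rows.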
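To make the matching rule concrete, the following sketch applies it to two rows at a time. The function names `rows_match` and `merge_rows` are hypothetical and used only for illustration. The rule assumed here is the one the code below encodes: two rows match when they share at least one non-NA position and never disagree where both are non-NA.

```r
# Hypothetical helper: TRUE if a and b agree on every position where both
# are non-NA, and there is at least one such shared position.
rows_match <- function(a, b) {
  shared <- !is.na(a) & !is.na(b)
  any(shared) && all(a[shared] == b[shared])
}

# Hypothetical helper: fill the NAs of a with the values of b.
merge_rows <- function(a, b) ifelse(is.na(a), b, a)

r1 <- c("A", "A", "B", "C", NA, NA)   # row 1 of the example
r3 <- c("B", "B", "C", "B", NA, NA)   # row 3
r4 <- c(NA, NA, NA, NA, "D", NA)      # row 4
r5 <- c(NA, NA, "B", "C", "D", "D")   # row 5

rows_match(r1, r5)  # TRUE: they agree on columns 3 and 4
rows_match(r1, r3)  # FALSE: they conflict in column 1
rows_match(r1, r4)  # FALSE: no shared non-NA position
merge_rows(r1, r5)  # "A" "A" "B" "C" "D" "D"
```

Rows 1 and 4 do not match directly, which is why the merging has to be repeated: row 4 first merges with row 5, and the merged row then matches row 1.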
library(dplyr)
library(tidyr)
data <- rbind(c("A", "A", "B", "C", NA, NA),
c("A", "A", "B", NA, NA, NA),
c("B", "B", "C", "B", NA, NA),
c(NA, NA, NA, NA, "D", NA),
c(NA, NA, "B", "C", "D", "D"),
c("B", NA, NA, NA, NA, "C"),
c(NA, NA, NA, "B", "D", "C"))
data <- as_tibble(data)
data
# A tibble: 7 x 6
#   V1    V2    V3    V4    V5    V6
#   <chr> <chr> <chr> <chr> <chr> <chr>
# 1 A     A     B     C     <NA>  <NA>
# 2 A     A     B     <NA>  <NA>  <NA>
# 3 B     B     C     B     <NA>  <NA>
# 4 <NA>  <NA>  <NA>  <NA>  D     <NA>
# 5 <NA>  <NA>  B     C     D     D
# 6 B     <NA>  <NA>  <NA>  <NA>  C
# 7 <NA>  <NA>  <NA>  B     D     C
merge2x <- function(x, data_ident, data){ # pairwise merging of matching rows
  idx <- which(data_ident[,x]) # indices of the rows that match row x
  idx <- idx[idx != x] # drop x itself (also safe if x is not among the matches)
  if(length(idx)!=0){
    grp <- sort(c(1:length(idx),1:length(idx))) # pairwise group ids: 1,1,2,2,...
    idx <- as.vector(rbind(x, idx)) # interleave x with every matching row
    data2 <- cbind(grp, data[idx,])
    # use dplyr to merge rows and fill in NAs within groups
    data2 <- data2 %>%
      group_by(grp) %>%
      summarise_all(~ first(na.omit(.))) %>% # funs() is deprecated in recent dplyr
      mutate_all(as.character)
    return(data2[!duplicated(data2[,-1]),-1])
  }else{
    return(data[x,]) # nothing to merge with: keep the row unchanged
  }
}
repeat{ # keep merging pairwise matches until no more rows can be merged
  # logical matrix of which rows match: no conflicting values and at least one shared value
  data_ident <- apply(data, 1, function(x) (colSums(!(t(data)==x), na.rm=TRUE)==0 & colSums((t(data)==x), na.rm=TRUE)>=1) )
  if(sum(data_ident[lower.tri(data_ident)])==0){
    break
  }
  data2 <- bind_rows(lapply(c(1:ncol(data_ident)), merge2x, data_ident, data))
  data <- data2[!duplicated(data2),]
}
data
# A tibble: 2 x 6
#   V1    V2    V3    V4    V5    V6
#   <chr> <chr> <chr> <chr> <chr> <chr>
# 1 A     A     B     C     D     D
# 2 B     B     C     B     D     C
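The loop returns only the consensus rows, not which original rows belong to each one. One possible way to recover the membership afterwards is sketched below, assuming that an original row belongs to every consensus row it does not conflict with. The names `orig`, `cons` and `members` are hypothetical.

```r
# Original rows and the two consensus rows from the example
orig <- rbind(c("A", "A", "B", "C", NA, NA),
              c("A", "A", "B", NA, NA, NA),
              c("B", "B", "C", "B", NA, NA),
              c(NA, NA, NA, NA, "D", NA),
              c(NA, NA, "B", "C", "D", "D"),
              c("B", NA, NA, NA, NA, "C"),
              c(NA, NA, NA, "B", "D", "C"))
cons <- rbind(c("A", "A", "B", "C", "D", "D"),
              c("B", "B", "C", "B", "D", "C"))

# Hypothetical helper: indices of the rows of m whose non-NA values
# all agree with the given consensus row.
members <- function(consensus_row, m) {
  which(apply(m, 1, function(r) {
    known <- !is.na(r) & !is.na(consensus_row)
    all(r[known] == consensus_row[known])
  }))
}

groups <- lapply(seq_len(nrow(cons)), function(i) members(cons[i, ], orig))
groups
# [[1]]
# [1] 1 2 4 5
#
# [[2]]
# [1] 3 4 6 7
```

Row 4 appears in both groups, as the question requires. Under this assumption, a row consisting only of NAs would be assigned to every group.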