Given the following R data frame:
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))
> DF.a
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G <NA>
4 D <NA> <NA>
5 E <NA> <NA>
6 F <NA> <NA>
7 G H <NA>
8 H <NA> <NA>
I would like to reduce/reshape it into the following:
DF.b <- data.frame(ID1 = c("A","B","C","E"),
ID2 = c("D",NA,"G",NA),
ID3 = c("F",NA,"H",NA))
> DF.b
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G H
4 E <NA> <NA>
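This does not seem to be a simple reshape. The goal is to put all "connected" ID values together on one row. Note that the connection between "C" and "H" is indirect: both are connected to "G", but they never appear together on the same row of DF.a. The order of the ID values within a row of DF.b does not matter.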
Answer 0 (score: 4)
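You can think of this as finding all the connected components of a graph. The first step I would take is to convert your data into a more natural structure, a vector of nodes and a matrix of edges: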
(nodes <- as.character(sort(unique(unlist(DF.a)))))
# [1] "A" "B" "C" "D" "E" "F" "G" "H"
(edges <- do.call(rbind, apply(DF.a, 1, function(x) {
x <- x[!is.na(x)]
cbind(head(x, -1), tail(x, -1))
})))
# [,1] [,2]
# ID1 "A" "D"
# ID2 "D" "F"
# ID1 "C" "G"
# ID1 "G" "H"
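One caveat with the apply() call above: apply() only returns a list when the per-row results differ in length. If every row happened to hold the same number of non-NA IDs, apply() would simplify the result to a single matrix and the do.call(rbind, ...) step would no longer work as intended. A minimal sketch of a variant that sidesteps this by looping over rows with lapply(), assuming the same DF.a (the names row_edges and edges2 are made up for illustration):

```r
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
                   ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
                   ID3 = c("F",NA,NA,NA,NA,NA,NA,NA),
                   stringsAsFactors = FALSE)

# Hypothetical helper: link consecutive non-NA IDs found in one row
row_edges <- function(x) {
  x <- x[!is.na(x)]
  cbind(head(x, -1), tail(x, -1))  # 0-row matrix when the row holds a single ID
}

m <- as.matrix(DF.a)
# lapply() always returns a list, so rbind-ing the pieces is safe
edges2 <- do.call(rbind, lapply(seq_len(nrow(m)), function(i) row_edges(m[i, ])))
edges2 <- unname(edges2)  # drop the leftover ID1/ID2 row names
```

The resulting edges2 should hold the same four edges as edges, just without row names, and can be passed to the graph construction step in the same way.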
Now you are ready to build the graph and compute its components:
library(igraph)
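# graph_from_data_frame() is the current name for graph.data.frame() in igraph
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)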
(comp <- split(nodes, components(g)$membership))
# $`1`
# [1] "A" "D" "F"
#
# $`2`
# [1] "B"
#
# $`3`
# [1] "C" "G" "H"
#
# $`4`
# [1] "E"
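The output of the split function is a list in which each element contains all the nodes of one connected component of the graph. Personally I think this is the most useful representation of the output, but if you really want the NA-padded structure you described, you could try something like: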
max.len <- max(sapply(comp, length))
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x }))
# [,1] [,2] [,3]
# 1 "A" "D" "F"
# 2 "B" NA NA
# 3 "C" "G" "H"
# 4 "E" NA NA
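The result above is a character matrix, whereas the question asked for a data frame with columns ID1, ID2, ID3. A minimal sketch of that last conversion, assuming comp has the shape shown earlier (it is re-created by hand here so the snippet runs on its own; padded and DF.out are illustrative names):

```r
# Stand-in for the comp list computed with igraph above
comp <- list(`1` = c("A", "D", "F"), `2` = "B", `3` = c("C", "G", "H"), `4` = "E")

max.len <- max(lengths(comp))
# Indexing past the end of a vector yields NA, which pads short components
padded <- do.call(rbind, lapply(comp, function(x) x[seq_len(max.len)]))

DF.out <- as.data.frame(padded, stringsAsFactors = FALSE)
names(DF.out) <- paste0("ID", seq_len(max.len))  # ID1, ID2, ID3
rownames(DF.out) <- NULL                         # reset to 1..n
```

Within each row the IDs follow the sorted order of nodes, which may differ from the order in the original DF.a; the question states that this order does not matter.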