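I have the following problem. My tibble has two ID columns, and I need to create a single ID for each whole group of IDs that are linked to one another. The example should make this clear: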
library(dplyr)
df
# A tibble: 12 x 4
   id_1  id_2        val res_col
   <chr> <chr>     <dbl> <chr>
 1 G     NA     1.01     G
 2 G     NA    -0.255    G
 3 G     NA     0.595    G
 4 Z     G     -0.881    G
 5 Z     G     -0.127    G
 6 Z     G      0.399    G
 7 R     NA     0.749    R
 8 R     NA    -0.447    R
 9 R     NA    -1.70     R
10 D     Z      0.118    G
11 D     Z      0.000169 G
12 D     Z     -0.522    G
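This is the example, including the result column I need. id_1 is my original ID and id_2 is my secondary ID, which tells me which other ID the original ID is linked to. So G stands by itself, Z is linked to G, R stands by itself, and D is linked all the way to G through Z. For each group I would like to get the first id_1. I cannot sort the table in a way that would let me use some kind of lag/lead relationship. res_col is what I want.

Edit 1: there may be dozens of such links in my original data.
Edit 2: I have more than 100k records and no way of knowing the real links in advance.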
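To make the linkage concrete, here is a minimal base R sketch of the "follow id_2 until there is no parent" idea described above. The `parent` lookup and the `find_root` helper are hypothetical names introduced for illustration, and the sketch assumes that the links contain no cycles and that every id_2 value also occurs as an id_1.

```r
# Hypothetical lookup: one entry per distinct id_1, holding its id_2 (NA = no parent).
parent <- c(G = NA, Z = "G", R = NA, D = "Z")

# Follow the id_2 links upwards until an ID without a parent is reached.
# Assumes no cycles and that every id_2 also occurs as an id_1.
find_root <- function(id, parent) {
  while (!is.na(parent[[id]])) {
    id <- parent[[id]]
  }
  id
}

# Named character vector: each ID mapped to the root of its chain.
roots <- vapply(names(parent), find_root, character(1), parent = parent)
```

With the example data, indexing `roots` by `df$id_1` would give the values shown in res_col. Because the lookup is resolved once per distinct ID rather than once per row, the number of records matters less than the number of distinct IDs and the length of the chains.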
Data:
df <- tibble(id_1 = c(rep("G", 3), rep("Z", 3), rep("R", 3), rep("D", 3)),
             id_2 = c(rep(NA, 3), rep("G", 3), rep(NA, 3), rep("Z", 3)),
             val = rnorm(n = 12),
             res_col = c(rep("G", 6), rep("R", 3), rep("G", 3)))
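My idea, and what I am currently trying. I have cleaned up the igraph usage a little; there may be a better way to use it, but I will go with this for now. Thanks.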
library(dplyr)
library(igraph)
df <- tibble(id_1 = c(rep("G", 3), rep("Z", 3), rep("R", 3), rep("D", 3)),
             id_2 = c(rep(NA, 3), rep("G", 3), rep(NA, 3), rep("Z", 3)),
             val = rnorm(n = 12),
             res_col = c(rep("G", 6), rep("R", 3), rep("G", 3)))
# Turn rows without a link into self-links, build a graph from the
# (id_1, id_2) pairs and read off the component each ID belongs to
groups <- df %>%
  select(id_1, id_2) %>%
  mutate(id_2 = case_when(is.na(id_2) ~ id_1,
                          TRUE ~ id_2)) %>%
  graph_from_data_frame(.) %>%
  components(.) %>%
  .$membership %>%
  tibble(id_1 = names(.),
         group = .)
# Use the first ID listed in each component as the group ID and join it back
groups %>%
  group_by(group) %>%
  mutate(group_id = id_1[1]) %>%
  ungroup() %>%
  select(id_1, group_id) %>%
  right_join(df, by = "id_1")
# A tibble: 12 x 5
   id_1  group_id id_2     val res_col
   <chr> <chr>    <chr>  <dbl> <chr>
 1 G     G        NA     1.06  G
 2 G     G        NA    -0.908 G
 3 G     G        NA     0.320 G
 4 Z     G        G     -0.733 G
 5 Z     G        G      1.10  G
 6 Z     G        G      1.50  G
 7 R     R        NA    -2.28  R
 8 R     R        NA     0.201 R
 9 R     R        NA     0.641 R
10 D     G        Z      1.54  G
11 D     G        Z      0.160 G
12 D     G        Z     -0.430 G
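One detail of the attempt above: `id_1[1]` takes whichever ID happens to be listed first in a component, which matches res_col here only because G appears before Z and D in the data. The following is a hedged sketch of a variant that picks the root explicitly, taking the root to be the ID whose id_2 is NA. It assumes each component contains exactly one such ID; the names `edges`, `vertices`, `root_ids` and `root_of_component` are introduced for illustration only. It also avoids the self-links by passing the full vertex list to `graph_from_data_frame()`.

```r
library(igraph)

df <- data.frame(
  id_1 = rep(c("G", "Z", "R", "D"), each = 3),
  id_2 = rep(c(NA, "G", NA, "Z"), each = 3),
  stringsAsFactors = FALSE
)

# One row per distinct link; rows without a parent contribute no edge.
edges <- unique(df[!is.na(df$id_2), c("id_1", "id_2")])
# List all IDs explicitly so that unlinked ones (such as R) still become vertices.
vertices <- data.frame(name = unique(c(df$id_1, df$id_2[!is.na(df$id_2)])))

g <- graph_from_data_frame(edges, directed = TRUE, vertices = vertices)
mem <- components(g)$membership  # named vector: ID -> component number

# Assumption: the root of a component is the ID whose id_2 is NA.
root_ids <- unique(df$id_1[is.na(df$id_2)])
root_of_component <- setNames(root_ids, mem[root_ids])

# Map every row to the root of the component its id_1 belongs to.
df$group_id <- unname(root_of_component[as.character(mem[df$id_1])])
```

If a component could contain several IDs without a parent, or none at all, this mapping would need an explicit rule for which one to keep.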
Answer 0 (score: 0)
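I have approached your problem the same way you did, using a graph. Here I just offer an alternative way to wrangle the data. For the second half I use data.table; it is not strictly necessary, but I find it convenient.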
library(data.table)
library(igraph)
# convert data.frame to data.table
setDT(df)
# make a copy of id_2 column
df[ , id_22 := id_2]
# where id_2 is NA, set id_22 to id_1
# these vertices correspond to the 'end points' with loop edges in the graph
df[is.na(id_2), id_22 := id_1]
# convert 'edge list' of id_1 and id_22 to a graph
g <- graph_from_data_frame(df[!duplicated(id_1), .(id_1, id_22)])
# get graph components and their named membership id
mem <- components(g)$membership
# convert to data.table
d <- data.table(id_1 = names(mem), mem = mem)
# add membership id to original data
df[ , mem := d[.SD, on = .(id_1), mem]]
# create result column
# for each graph component:
# where id_22 equals id_1 (i.e. the loop edges in the graph), select first id_22 value
df[ , res := id_22[id_22 == id_1][1], by = mem]
If desired, remove the helper columns:
df[ , `:=`(id_22 = NULL, mem = NULL)]
df
#     id_1 id_2         val res_col res
#  1:    G <NA>  0.27665785       G   G
#  2:    G <NA>  0.81840992       G   G
#  3:    G <NA>  0.19928880       G   G
#  4:    Z    G -0.09706282       G   G
#  5:    Z    G -0.02744784       G   G
#  6:    Z    G  0.19084119       G   G
#  7:    R <NA>  0.59491323       R   R
#  8:    R <NA> -0.04785416       R   R
#  9:    R <NA>  0.55550640       R   R
# 10:    D    Z -0.76006272       G   G
# 11:    D    Z  0.33305465       G   G
# 12:    D    Z -0.04037541       G   G
plot(g, vertex.size = 20, edge.arrow.size = 0.5)
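
The step that adds the membership ID back to the original data is a lookup join. As a small sketch on made-up tables (`lookup` and `dt` are hypothetical names, not objects from the code above), the same effect can also be written as a data.table update join, where the `i.` prefix refers to the column coming from the lookup table:

```r
library(data.table)

# Hypothetical lookup table: one row per ID with its component number.
lookup <- data.table(id_1 = c("G", "Z", "R", "D"), mem = c(1L, 1L, 2L, 1L))

# Hypothetical data with repeated IDs, as in the original df.
dt <- data.table(id_1 = rep(c("G", "Z", "R", "D"), each = 2))

# Update join: for every row of dt, look up its id_1 in lookup and
# add the matching mem value by reference (no copy of dt is made).
dt[lookup, on = .(id_1), mem := i.mem]
```

Both spellings modify the table in place, which is convenient with a large number of rows.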