Linking IDs in R

Time: 2019-07-02 10:53:27

Tags: r dplyr grouping igraph

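I have the following problem. My tibble has two ID columns, and I need to create a single ID for each whole group of IDs that are linked to one another. The example should make this clear: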

library(dplyr)
df
# A tibble: 12 x 4
   id_1  id_2        val res_col
   <chr> <chr>     <dbl> <chr>  
 1 G     NA     1.01     G      
 2 G     NA    -0.255    G      
 3 G     NA     0.595    G      
 4 Z     G     -0.881    G      
 5 Z     G     -0.127    G      
 6 Z     G      0.399    G      
 7 R     NA     0.749    R      
 8 R     NA    -0.447    R      
 9 R     NA    -1.70     R      
10 D     Z      0.118    G      
11 D     Z      0.000169 G      
12 D     Z     -0.522    G 

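This is the example together with the result column I need. id_1 is my original ID, and id_2 is a secondary ID that says which other ID the original ID is linked to. So G stands on its own, Z is linked to G, R stands on its own, and D is linked through Z all the way to G. For each group I want the first id_1, that is, the ID at the end of the chain. I cannot sort the table in a way that would let me use some kind of lag/lead relationship. res_col is what I want.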
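
To make the linking rule concrete, here is a minimal base R sketch that follows each ID's link until it reaches an ID with no further link. It is not part of the original post: the lookup vector `links` and the helper `find_root` are hypothetical names, and the sketch assumes each ID has at most one link, that every linked ID also appears as an id_1, and that the links contain no cycles.

```r
# Toy lookup taken from the example: one entry per id_1,
# holding the id it links to (NA = no link).
links <- c(G = NA, Z = "G", R = NA, D = "Z")

# Hypothetical helper: follow links until an id without a link is reached.
# Assumes no cycles and that every linked id is itself a name in `links`.
find_root <- function(id, links) {
  while (!is.na(links[[id]])) {
    id <- links[[id]]
  }
  id
}

# Resolve every id to the id at the end of its chain.
roots <- vapply(names(links), find_root, character(1), links = links)
roots
```

On the toy data this maps G, Z and D to G and leaves R as R, which matches res_col. With many records and unknown links, a graph-based approach is likely easier to scale than a hand-written loop like this one.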

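Edit 1: my original data can contain dozens of such links. Edit 2: I have more than 100k records and have no real way of knowing the actual links in advance.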

Data:

df <- tibble(id_1 = c(rep("G", 3), rep("Z", 3), rep("R", 3), rep("D", 3)),
             id_2 = c(rep(NA, 3), rep("G", 3), rep(NA, 3), rep("Z", 3)),
             val = rnorm(n = 12),
             res_col = c(rep("G", 6), rep("R", 3), rep("G", 3)))

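Here is my idea and what I am currently trying. I have cleaned up the igraph usage a little; there may be a better way to do it, but I will stick with this for now. Thanks.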

library(dplyr)
library(igraph)

df <- tibble(id_1 = c(rep("G", 3), rep("Z", 3), rep("R", 3), rep("D", 3)),
             id_2 = c(rep(NA, 3), rep("G", 3), rep(NA, 3), rep("Z", 3)),
             val = rnorm(n = 12),
             res_col = c(rep("G", 6), rep("R", 3), rep("G", 3)))

groups <- df %>%
  select(id_1, id_2) %>% 
  mutate(id_2 = case_when(is.na(id_2) ~ id_1,
                          TRUE ~ id_2)) %>% 
  graph_from_data_frame(.) %>% 
  components(.) %>% 
  .$membership %>% 
  tibble(id_1 = names(.),
         group = .)

groups %>% 
  group_by(group) %>% 
  mutate(group_id = id_1[1]) %>% 
  ungroup() %>% 
  select(id_1, group_id) %>% 
  right_join(df, by = "id_1")

# A tibble: 12 x 5
   id_1  group_id id_2     val res_col
   <chr> <chr>    <chr>  <dbl> <chr>  
 1 G     G        NA     1.06  G      
 2 G     G        NA    -0.908 G      
 3 G     G        NA     0.320 G      
 4 Z     G        G     -0.733 G      
 5 Z     G        G      1.10  G      
 6 Z     G        G      1.50  G      
 7 R     R        NA    -2.28  R      
 8 R     R        NA     0.201 R      
 9 R     R        NA     0.641 R      
10 D     G        Z      1.54  G      
11 D     G        Z      0.160 G      
12 D     G        Z     -0.430 G  
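
To see what the graph step produces on its own, the following sketch builds the same four edges by hand and inspects the component membership. The data frame `edges` and its column names are made up for illustration; `graph_from_data_frame()` and `components()` are the igraph functions already used above.

```r
library(igraph)

# One edge per distinct id_1; ids without a link point to themselves (loop edge).
edges <- data.frame(from = c("G", "Z", "R", "D"),
                    to   = c("G", "G", "R", "Z"))

g <- graph_from_data_frame(edges)

# components() finds weakly connected components by default, so edge
# direction is ignored and D -> Z -> G ends up in a single component.
mem <- components(g)$membership
mem
```

G, Z and D share one membership id while R gets its own. Note that the membership id alone does not say which vertex is the end of the chain; taking `id_1[1]` per group, as above, depends on the order in which the vertices were created.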

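1 answer:

Answer 0: (score: 0)

I have approached your problem the same way you did: with a graph. Here I simply offer an alternative way of handling the data. For the second half I use data.table; it is not strictly necessary, but I find it convenient.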

library(data.table)
library(igraph)

# convert data.frame to data.table
setDT(df)

# make a copy of id_2 column
df[ , id_22 := id_2]

# where id_2 is NA, set id_22 to id_1
# these vertices correspond to the 'end points' with loop edges in the graph
df[is.na(id_2), id_22 := id_1]

# convert 'edge list' of id_1 and id_22 to a graph
g <- graph_from_data_frame(df[!duplicated(id_1), .(id_1, id_22)])

# get graph components and their named membership id 
mem <- components(g)$membership

# convert to data.table
d <- data.table(id_1 = names(mem), mem = mem)

# add membership id to original data
df[ , mem := d[.SD, on = .(id_1), mem]] 

# create result column 
# for each graph component:
# where id_22 equals id_1 (i.e. the loop edges in the graph), select first id_22 value  
df[ , res := id_22[id_22 == id_1][1], by = mem]

If desired, remove the helper columns:

df[ , `:=`(id_22 = NULL, mem = NULL)]

df
#     id_1 id_2         val res_col res
#  1:    G <NA>  0.27665785       G   G
#  2:    G <NA>  0.81840992       G   G
#  3:    G <NA>  0.19928880       G   G
#  4:    Z    G -0.09706282       G   G
#  5:    Z    G -0.02744784       G   G
#  6:    Z    G  0.19084119       G   G
#  7:    R <NA>  0.59491323       R   R
#  8:    R <NA> -0.04785416       R   R
#  9:    R <NA>  0.55550640       R   R
# 10:    D    Z -0.76006272       G   G
# 11:    D    Z  0.33305465       G   G
# 12:    D    Z -0.04037541       G   G

plot(g, vertex.size = 20, edge.arrow.size = 0.5)
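
Since data.table is not strictly necessary for the second half, here is a rough dplyr sketch of the same idea. It is an adaptation rather than the answer's own code: the names `edges`, `comp`, `comp_id` and `res` are chosen for illustration, and it uses `distinct(id_1, id_2)` where the answer uses `!duplicated(id_1)`, which only differs if one id_1 carries several different id_2 values.

```r
library(dplyr)
library(igraph)

df <- tibble(id_1 = c(rep("G", 3), rep("Z", 3), rep("R", 3), rep("D", 3)),
             id_2 = c(rep(NA, 3), rep("G", 3), rep(NA, 3), rep("Z", 3)))

# One edge per distinct (id_1, id_2) pair; NA links become loop edges.
edges <- df %>%
  distinct(id_1, id_2) %>%
  mutate(id_2 = coalesce(id_2, id_1))

# Named vector: vertex name -> component id.
comp <- components(graph_from_data_frame(edges))$membership

res <- df %>%
  mutate(comp_id = unname(comp[id_1])) %>%
  group_by(comp_id) %>%
  # Within each component, pick the first id_1 that has no link (id_2 is NA).
  mutate(res = id_1[is.na(id_2)][1]) %>%
  ungroup() %>%
  select(-comp_id)
```

Selecting the ID whose id_2 is NA, rather than the first ID in the group, avoids relying on vertex order. It does assume that every component contains exactly one such unlinked ID.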
