R: A pattern for transforming a data.table into weighted pairs

Time: 2017-09-13 19:50:45

Tags: r dataframe dplyr

My raw data, mydf (it contains no duplicate rows):

    group hed_pfnpi id
 1:    aa    111111 18
 2:    aa    111111 17
 3:    aa    222222 18
 4:    aa    333333 14
 5:    aa    444444 13
 6:    aa    555555 18
 7:    aa    555555 24
 8:    aa    222222 13
 9:    aa    222222 17
10:    aa    333333 17
11:    bb    666666  9
12:    bb    666666  3
13:    bb    888888  9
14:    bb    999999 14
15:    bb    444444 13
16:    bb    555555  9
17:    bb    555555 24
18:    bb    888888 13
19:    bb    888888  3
20:    bb    999999  3

I want to transform mydf into this result table:

   group    one    two weight id_list
 1    aa 111111 222222      2   17,18
 2    aa 111111 333333      1      17
 3    aa 111111 555555      1      18
 4    aa 222222 333333      1      17
 5    aa 222222 444444      1      13
 6    aa 222222 555555      1      18
 7    bb 444444 888888      1      13
 8    bb 555555 666666      1       9
 9    bb 555555 888888      1       9
10    bb 666666 888888      2     3,9
11    bb 666666 999999      1       3
12    bb 888888 999999      1       3

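First, the data is grouped by the group column.

Within a group, if two hed_pfnpi values share the same id, they form a pair (one, two) in the result table;

id_list: the ids that the pair shares;

weight: the length of id_list.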
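As a small illustration of the pairing rule, take the three hed_pfnpi values in group aa that share id 18 in mydf. The sketch below uses base R's combn to list every unordered pair; the variable names are only illustrative and are not part of the original code.

```r
# hed_pfnpi values in group "aa" that share id 18 (copied from mydf above)
shared_18 <- c(111111L, 222222L, 555555L)

# Every unordered pair of these values becomes one (one, two) row;
# sorting first guarantees one < two in each pair.
pairs_18 <- t(combn(sort(shared_18), 2))
colnames(pairs_18) <- c("one", "two")
pairs_18
```

Each of these three pairs contributes id 18 to its id_list; the pair (111111, 222222) also shares id 17, which is why its weight is 2 in the result table.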

library(data.table)
library(dplyr)   # also provides the %>% pipe used below


mydf1 <- data.table(
  group     = rep("aa", 10),
  hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
                555555L, 555555L, 222222L, 222222L, 333333L),
  id        = c(18L, 17L, 18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L)
)
mydf2 <- data.table(
  group     = rep("bb", 10),
  hed_pfnpi = c(666666L, 666666L, 888888L, 999999L, 444444L,
                555555L, 555555L, 888888L, 888888L, 999999L),
  id        = c(9L, 3L, 9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)
)
mydf <- rbind(mydf1, mydf2)


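# My attempt
# Self-join on id so that all rows sharing an id are lined up against each other
result <- merge(mydf, mydf, by = "id", allow.cartesian = TRUE) %>%
  filter(group.x == group.y) %>%                   # keep pairs within the same group
  transmute(group = group.x,
            one = pmin(hed_pfnpi.x, hed_pfnpi.y),  # order each pair so (a, b) and (b, a) coincide
            two = pmax(hed_pfnpi.x, hed_pfnpi.y),
            id) %>%
  filter(one != two) %>%                           # drop rows paired with themselves
  unique() %>%                                     # remove the mirrored duplicates
  group_by(group, one, two) %>%
  summarise(id_list = paste(id, collapse = ","),
            weight = n()) %>%
  select(group, one, two, weight, id_list)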

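That is my attempt. It produces the expected result, but it is inefficient (it crashes when the data is large). I hope someone can offer a better solution.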
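One plausible reason for the blow-up (an assumption, since the question includes no profiling) is that the self-join materialises every ordered combination of rows sharing an id, including self-pairs and mirrored pairs, before anything is filtered out. The sketch below uses a made-up toy table in which a single id is shared by k values; toy and k are hypothetical and only serve to compare the row counts.

```r
library(data.table)

# Hypothetical toy input: one group, one id shared by k distinct values
k   <- 200L
toy <- data.table(group = "aa", hed_pfnpi = seq_len(k), id = 1L)

# The self-join builds all k * k ordered combinations for that id
joined <- merge(toy, toy, by = "id", allow.cartesian = TRUE)
nrow(joined)    # 40000 rows materialised

# Only the unordered pairs of distinct values are wanted
choose(k, 2)    # 19900 pairs
```

The join therefore holds roughly twice as many rows as needed, and because it ignores group it also pairs rows across groups when an id occurs in more than one group.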

1 Answer:

Answer 0: (score: 2)

I would do this (with only data.table loaded and none of the other packages)...

mydf[, 
  CJ(one = hed_pfnpi, two = hed_pfnpi)[one < two]
, keyby=.(group, id)][, 
  .(n = .N, ids = toString(id))
, keyby=.(group, one, two)]

which gives

    group    one    two n    ids
 1:    aa 111111 222222 2 17, 18
 2:    aa 111111 333333 1     17
 3:    aa 111111 555555 1     18
 4:    aa 222222 333333 1     17
 5:    aa 222222 444444 1     13
 6:    aa 222222 555555 1     18
 7:    bb 444444 888888 1     13
 8:    bb 555555 666666 1      9
 9:    bb 555555 888888 1      9
10:    bb 666666 888888 2   3, 9
11:    bb 666666 999999 1      3
12:    bb 888888 999999 1      3
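The output above uses the column names n and ids, and toString separates the ids with ", ". If the exact layout from the question is needed (weight and id_list, with ids joined by a bare comma), the same approach could be adapted as in the sketch below. It is a variation on the answer rather than the answerer's own code; pairs and result are illustrative names.

```r
library(data.table)

# Same data as mydf in the question
mydf <- data.table(
  group     = rep(c("aa", "bb"), each = 10),
  hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
                555555L, 555555L, 222222L, 222222L, 333333L,
                666666L, 666666L, 888888L, 999999L, 444444L,
                555555L, 555555L, 888888L, 888888L, 999999L),
  id        = c(18L, 17L, 18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L,
                9L, 3L, 9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)
)

# Step 1: within each (group, id), cross the values with themselves via CJ
# and keep only the rows with one < two, so each pair appears exactly once.
pairs <- mydf[, CJ(one = hed_pfnpi, two = hed_pfnpi)[one < two],
              keyby = .(group, id)]

# Step 2: per pair, count the shared ids and collapse them into one string.
result <- pairs[, .(weight = .N, id_list = paste(id, collapse = ",")),
                keyby = .(group, one, two)]
result
```

Because the cross join is done per (group, id), pairs from different groups are never formed, and the one < two filter removes self-pairs and mirrored pairs inside each small block rather than after one large join.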