My original data, mydf (it contains no duplicate rows):
group hed_pfnpi id
1: aa 111111 18
2: aa 111111 17
3: aa 222222 18
4: aa 333333 14
5: aa 444444 13
6: aa 555555 18
7: aa 555555 24
8: aa 222222 13
9: aa 222222 17
10: aa 333333 17
11: bb 666666 9
12: bb 666666 3
13: bb 888888 9
14: bb 999999 14
15: bb 444444 13
16: bb 555555 9
17: bb 555555 24
18: bb 888888 13
19: bb 888888 3
20: bb 999999 3
I want to transform mydf into this result table:
group one two weight id_list
1 aa 111111 222222 2 17,18
2 aa 111111 333333 1 17
3 aa 111111 555555 1 18
4 aa 222222 333333 1 17
5 aa 222222 444444 1 13
6 aa 222222 555555 1 18
7 bb 444444 888888 1 13
8 bb 555555 666666 1 9
9 bb 555555 888888 1 9
10 bb 666666 888888 2 3,9
11 bb 666666 999999 1 3
12 bb 888888 999999 1 3
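First, the data is grouped by the group column.
Within a group, any two hed_pfnpi values that share the same id form a pair (one, two) in the result table.
id_list: the ids shared by that pair;
weight: the length of id_list, i.e. the number of shared ids.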
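To make the pairing rule concrete, here is a minimal sketch for a single id. It takes the three hed_pfnpi values that share id 18 in group aa (read off mydf above) and lists every unordered pair. The variable names members and pairs are only illustrative.

```r
# hed_pfnpi values in group "aa" that share id 18 (taken from mydf above)
members <- c(111111L, 222222L, 555555L)

# every unordered pair of distinct members; sorting first guarantees one < two
pairs <- t(combn(sort(members), 2))
colnames(pairs) <- c("one", "two")
pairs
```

Each row of pairs contributes id 18 to the id_list of that (one, two) pair; repeating this for every id in a group and counting the ids per pair gives the weight column.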
library(data.table)
library(dplyr)
library(magrittr)
library(tidyverse)
mydf1 <- data.table(structure(list(group = rep("aa",10),hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
555555L, 555555L, 222222L, 222222L, 333333L), id = c(18L, 17L,
18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L)), .Names = c("group","hed_pfnpi", "id"), class = "data.frame", row.names = c(NA, -10L)))
mydf2 <- data.table(structure(list(group = rep("bb",10),hed_pfnpi = c(666666L, 666666L, 888888L, 999999L, 444444L,
555555L, 555555L, 888888L, 888888L, 999999L), id = c(9L, 3L,
9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)), .Names = c("group","hed_pfnpi", "id"), class = "data.frame", row.names = c(NA, -10L)))
mydf <- rbind(mydf1,mydf2)
# my attempt: self-join on id, keep same-group pairs, then aggregate
result <- merge(mydf, mydf, by = "id", allow.cartesian = TRUE) %>%
  filter(group.x == group.y) %>%
  transmute(group = group.x,
            one = pmin(hed_pfnpi.x, hed_pfnpi.y),
            two = pmax(hed_pfnpi.x, hed_pfnpi.y),
            id) %>%
  filter(one != two) %>%
  unique() %>%
  group_by(group, one, two) %>%
  summarise(id_list = paste(id, collapse = ","),
            weight = n()) %>%
  select(group, one, two, weight, id_list)
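The code above is my attempt. It produces the expected result, but it is inefficient (it crashes when the data is large). I hope someone can suggest a better solution.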
Answer 0 (score: 2)
I would do this (loading only data.table, none of the other packages)...
mydf[,
CJ(one = hed_pfnpi, two = hed_pfnpi)[one < two]
, keyby=.(group, id)][,
.(n = .N, ids = toString(id))
, keyby=.(group, one, two)]
which gives
group one two n ids
1: aa 111111 222222 2 17, 18
2: aa 111111 333333 1 17
3: aa 111111 555555 1 18
4: aa 222222 333333 1 17
5: aa 222222 444444 1 13
6: aa 222222 555555 1 18
7: bb 444444 888888 1 13
8: bb 555555 666666 1 9
9: bb 555555 888888 1 9
10: bb 666666 888888 2 3, 9
11: bb 666666 999999 1 3
12: bb 888888 999999 1 3
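For readers who want to see what the two chained steps do, here is a small sketch that splits the same approach into intermediate results. The toy table and the names step1 and step2 are only illustrative; the rows are a subset of mydf (the three rows of group aa with id 18, plus the single row with id 14).

```r
library(data.table)

# toy subset of mydf: three rows sharing id 18, one row alone with id 14
toy <- data.table(group     = "aa",
                  hed_pfnpi = c(111111L, 222222L, 555555L, 333333L),
                  id        = c(18L, 18L, 18L, 14L))

# Step 1: within each (group, id), cross join the hed_pfnpi values with
# themselves and keep only rows with one < two, so each pair appears once
step1 <- toy[, CJ(one = hed_pfnpi, two = hed_pfnpi)[one < two],
             keyby = .(group, id)]

# Step 2: per (group, one, two), count the shared ids and collect them
step2 <- step1[, .(n = .N, ids = toString(id)),
               keyby = .(group, one, two)]
step2
```

An id held by only one hed_pfnpi (id 14 here) yields no row with one < two, so it drops out of step1 on its own. This relies on mydf having no duplicate rows, as the question states; with duplicates, the same id could be counted more than once for a pair.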