I have a data frame of the following kind (this is a simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
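In this data frame, some ids have more than one bank, e.g. id == 1 has bank = c(a, b, c).
What I want to extract from this data frame is the overlap of ids between the different banks, with counts.
For example, bank a has two people (unique ids): 1 and 4. For these people, I want to know which other banks they also have.
The other banks total 3: b = 1, c = 2.
So I want to create an overlap table like the following as output: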
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2
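The bank a case described above can be checked by hand with a few lines of base R. This is only a sketch of the counting rule for one bank, not a full solution; the names ids_a, other and counts_a are made up for illustration.

```r
id <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df <- data.frame(id, bank, stringsAsFactors = FALSE)

# unique ids that hold bank "a"
ids_a <- unique(df$id[df$bank == "a"])

# every other bank held by those ids
other <- df$bank[df$id %in% ids_a & df$bank != "a"]

# number of ids per other bank
counts_a <- table(other)
counts_a
```

Repeating this for every bank gives the rows of the overlap table.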
Answer 0 (score: 1)
It took me a while to get the result, so I'm posting it anyway. Not as sexy as Ronak Shah's, but the result is the same.
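library(data.table)  # rbindlist()
library(dplyr)       # %>%, group_by(), summarise()
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
dflist <- split(df, df$id)  # one data frame per id
for(i in seq_along(dflist)) {
  if(nrow(dflist[[i]]) < 2) {
    # an id with a single bank has no pairs
    resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
  } else {
    # all unordered pairs of banks held by this id
    resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
  }
}
# bind by position, since the empty frames have different column names
result <- setNames(rbindlist(resultlist, use.names = FALSE), c("bank", "overlap"))
result %>%
  group_by(bank, overlap) %>%
  summarise(amount = n())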
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
Answer 1 (score: 1)
One option is full_join:
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
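The same self-join idea can be sketched in base R, in case dplyr is not available. This is an illustrative alternative, not the answerer's code; it assumes merge() is an acceptable stand-in for full_join here, and the names pairs and overlap_tab are made up.

```r
id <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df <- data.frame(id, bank, stringsAsFactors = FALSE)

# self-join on id: every ordered pair of banks held by the same id
pairs <- merge(df, df, by = "id")

# drop the pairs of a bank with itself
pairs <- pairs[pairs$bank.x != pairs$bank.y, ]

# count the ids behind each ordered (bank, overlap) pair
overlap_tab <- aggregate(id ~ bank.x + bank.y, data = pairs, FUN = length)
names(overlap_tab) <- c("bank", "overlap", "amount")

# sort like the requested table
overlap_tab <- overlap_tab[order(overlap_tab$bank, overlap_tab$overlap), ]
rownames(overlap_tab) <- NULL
overlap_tab
```

Because the join keeps both (a, b) and (b, a), both directions appear in the result, as in the requested output.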
Answer 2 (score: 1)
We can use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
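Created on 2019-07-01 by the reprex package (v0.3.0)
Note that bank b for id==2 does not overlap with any other value, which is where the <NA> row comes from. If you don't want that row in the final result, just filter it out of the output.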
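One way to drop the <NA> row from a result like the one above, sketched in base R on a hand-copied version of that output. The data frame res is typed in by hand for illustration and is not produced by the data.table code.

```r
# hand-copied version of the output shown above
res <- data.frame(
  bank    = c("a", "a", "b", NA),
  overlap = c("b", "c", "c", "b"),
  N       = c(1L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

# keep only the rows where a real bank pair was found
clean <- res[!is.na(res$bank), ]
clean
```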
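Answer 3 (score: 0)
Do you need both directions for every pair of banks? In this case a -> b is the same as b -> a. We can use combn to create the unique combinations of bank taken 2 at a time, and for each combination take the length of the set of ids the two banks have in common.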
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
Data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
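If both directions are needed after all, the combn result can be mirrored. The following is a minimal sketch under that assumption; the names one_way and both are made up, and the count is kept in a separate numeric column because c(x, count) inside combn would coerce it to character.

```r
id <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df <- data.frame(id, bank, stringsAsFactors = FALSE)

# each unordered pair of banks, one pair per row
pairs <- t(combn(unique(df$bank), 2))

# number of ids shared by the two banks of each pair
amount <- apply(pairs, 1, function(p)
  length(intersect(df$id[df$bank == p[1]], df$id[df$bank == p[2]])))

one_way <- data.frame(bank = pairs[, 1], overlap = pairs[, 2],
                      amount = amount, stringsAsFactors = FALSE)

# add the mirrored direction (b -> a has the same count as a -> b)
both <- rbind(one_way,
              data.frame(bank = one_way$overlap, overlap = one_way$bank,
                         amount = one_way$amount, stringsAsFactors = FALSE))
both <- both[order(both$bank, both$overlap), ]
rownames(both) <- NULL
both
```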