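I have data of the following kind. At the moment there are only two groups, and a user can belong to one or more groups. There are about 100,000 users.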
id grp
1001 A
1001 B
1002 A
1002 A
1003 B
1003 B
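I want to create a table that shows which ids have records only from grp A, only from grp B, and so on.
Based on the data above, the conceptual output is as follows:
ONLY A 1002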
ONLY B 1003
BOTH 1001
Please share an R solution, preferably a data.table or SQL approach.
Following the answer from @Procrastinatus Maximus, here are some test results:
library(data.table)
N <- 100000
set.seed(1)
DT <- data.table(
id = sample(N/2, N, TRUE),
grp = sample(c('A','B'), N, TRUE)
)[order(id)]
DT
DT[,.N, by=grp]
grp N
1: B 50170
2: A 49830
This approach (suggested by @Procrastinatus Maximus) solved my problem. Why does it take so much more time than the others?
> system.time(DT[, .SD[uniqueN(grp)==1], by = id])
user system elapsed
31.064 0.147 31.244
> system.time(DT[, .(grps = toString(unique(grp))), by = id])
user system elapsed
2.022 0.011 1.987
> system.time(unique(DT)[order(grp), .(grps = toString(grp)), by = id])
user system elapsed
0.707 0.003 0.710
> system.time(DT[, list(grp = paste(grp, collapse = " | ")), by = id])
user system elapsed
0.244 0.001 0.245
> system.time(aggregate(grp ~ id, DT, function(x) toString(unique(x))))
user system elapsed
2.673 0.004 2.680
> system.time(sqldf('select id, group_concat(distinct grp) from DT group by id'))
user system elapsed
0.445 0.000 0.445
Answer 0 (score: 4)
Assuming your data is already a data.table (if not, convert it with setDT(name_of_your_dataframe)):
library(data.table)
# option 1
unique(DT)[, .(grps = toString(grp)), by = id]
# option 2
DT[, .(grps = toString(unique(grp))), by = id]
which gives:
id grps
1: 1001 A, B
2: 1002 A
3: 1003 B
As suggested by @Frank: to get the groups listed in the same order for every id, it is better to sort by the grp column first:
unique(DT)[order(grp), .(grps = toString(grp)), by = id]
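The options above return the group list per id ("A, B", "A", "B"). To get the ONLY A / ONLY B / BOTH labels from the question, something along these lines might work. It is a minimal sketch on the question's six example rows, assuming "BOTH" simply means the id occurs in more than one group; the column name membership and the object name res are made up for illustration.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  id  = c(1001, 1001, 1002, 1002, 1003, 1003),
  grp = c("A", "B", "A", "A", "B", "B")
)

# Drop duplicate (id, grp) pairs first, so .N counts distinct groups per id.
# More than one distinct group -> "BOTH"; otherwise "ONLY <grp>".
res <- unique(DT)[
  , .(membership = if (.N > 1L) "BOTH" else paste("ONLY", grp)),
  by = id
]
res
```

With more than two groups the "BOTH" label would become ambiguous, so in that case it is probably better to keep the toString(grp) output instead.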
A few other options:
1) Base R:
aggregate(grp ~ id, DT, function(x) toString(unique(x)))
2) dplyr:
library(dplyr)
DT %>% group_by(id) %>% summarise(grps = toString(unique(grp)))
3) sqldf:
library(sqldf)
sqldf('select id, group_concat(distinct grp) from DT group by id')
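On the timing question: a likely reason the .SD[uniqueN(grp)==1] variant is so much slower is that it builds and subsets .SD once per id, which means tens of thousands of small subsetting calls on the test data. A commonly used alternative is to collect the row numbers with .I and subset the table once. The sketch below uses the question's six example rows; the names keep and single are made up for illustration, and no timing is claimed for it here.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  id  = c(1001, 1001, 1002, 1002, 1003, 1003),
  grp = c("A", "B", "A", "A", "B", "B")
)

# Per id, return the row numbers (.I) only when the id has exactly one
# distinct group; ids with several groups contribute no rows.
keep <- DT[, .I[uniqueN(grp) == 1L], by = id]$V1

# Subset the original table once, using the collected row numbers
single <- DT[keep]
single
```

This returns the same rows as DT[, .SD[uniqueN(grp)==1], by = id], that is, all records of the ids that belong to a single group.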