A simplified version of my dataset can be reproduced with:
df <- data.frame(buyer = c("A","C","B"),
                 seller = c("B","D","E"),
                 amount = c(1,2,3))
I am looking for a nicer dplyr solution to achieve the following.
buyer seller amount
A B 1
C D 2
B E 3
An aggregated summary should be produced for each agent (A, B, C, D, E):
output
agent total_amount
A 1
B 4 #(=1+3)
C 2
D 2
E 3
I can group_by buyer and seller separately and then add the results together, but that is inelegant and somewhat cumbersome.
library(dplyr)
res_b <- df %>%
group_by(buyer) %>%
summarise(total_amount=sum(amount))
res_s <- df %>%
group_by(seller) %>%
summarise(total_amount=sum(amount))
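For reference, the "add the results" step could look roughly like the sketch below. It assumes the two summaries are first renamed to a common `agent` column; `res_all` is only an illustrative name, and `stringsAsFactors = FALSE` is set so the agent columns stay character in older R versions.

```r
library(dplyr)

df <- data.frame(buyer = c("A", "C", "B"),
                 seller = c("B", "D", "E"),
                 amount = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Totals per buyer and per seller, each renamed to a shared `agent` column
res_b <- df %>%
  group_by(buyer) %>%
  summarise(total_amount = sum(amount)) %>%
  rename(agent = buyer)
res_s <- df %>%
  group_by(seller) %>%
  summarise(total_amount = sum(amount)) %>%
  rename(agent = seller)

# Stack both summaries and add up the partial totals per agent
res_all <- bind_rows(res_b, res_s) %>%
  group_by(agent) %>%
  summarise(total_amount = sum(total_amount))
```

This works, but it repeats the same aggregation twice, which is what I would like to avoid.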
Thanks for your help. Other solutions (outside the tidyverse) are of course welcome too.
Edit: I should have mentioned that my original dataset has about 60 million observations.
Answer 0 (score: 5)
We can first convert to long format and then do a simple aggregation, i.e.
library(tidyverse)
df %>%
gather(var, agent, -amount) %>%
group_by(agent) %>%
summarise(total_amount = sum(amount))
which gives,
# A tibble: 5 x 2
  agent total_amount
  <chr>        <dbl>
1 A                1
2 B                4
3 C                2
4 D                2
5 E                3
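Since tidyr 1.0.0, `gather()` is superseded by `pivot_longer()`. A minimal sketch of the same reshaping with the newer function might look like this; the column name `role` and the object name `res_long` are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(buyer = c("A", "C", "B"),
                 seller = c("B", "D", "E"),
                 amount = c(1, 2, 3),
                 stringsAsFactors = FALSE)

res_long <- df %>%
  # Stack buyer and seller into one `agent` column; `role` records the origin
  pivot_longer(cols = c(buyer, seller),
               names_to = "role", values_to = "agent") %>%
  group_by(agent) %>%
  summarise(total_amount = sum(amount))
```

Each row of `df` appears twice in the long table, once per role, so every amount is credited to both the buyer and the seller.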
You can try data.table for efficiency. Here is a direct translation of the tidyverse code above:
library(data.table)
dt1 <- setDT(df)
melt(dt1, measure.vars = c('buyer', 'seller'), id.vars = 'amount', value.name = "agent"
)[, .(total_amount = sum(amount)), by = agent][]
# agent total_amount
#1: A 1
#2: C 2
#3: B 4
#4: D 2
#5: E 3
Answer 1 (score: 4)
Benchmark
library(bench)
library(data.table)
library(dplyr)
library(tidyr)
bnch <-
  press(
    n = 10^c(5, 6, 7, 8), {
      # same seed for both objects so they hold identical data
      set.seed(1); df_big <- data.frame(buyer = sample(LETTERS, n, replace = TRUE), seller = sample(LETTERS, n, replace = TRUE), amount = sample(1:10, n, replace = TRUE))
      set.seed(1); dt_big <- data.table(buyer = sample(LETTERS, n, replace = TRUE), seller = sample(LETTERS, n, replace = TRUE), amount = sample(1:10, n, replace = TRUE))
      mark(
        dplyr = {
          df_big %>%
            gather(var, agent, -amount) %>%
            group_by(agent) %>%
            summarise(total_amount = sum(amount))},
        dt_melt = {
          melt(dt_big, measure.vars = c('buyer', 'seller'), id.vars = 'amount')[
            , .(total_amount = sum(amount)), by = .(agent = value)][order(agent), ]},
        dt_rbind = {
          rbind(dt_big[ , .(x = sum(amount)), by = .(agent = buyer)],
                dt_big[ , .(x = sum(amount)), by = .(agent = seller)])[
                  order(agent), .(total_amount = sum(x)), by = agent]},
        # the results differ in class (tibble vs. data.table), so skip the equality check
        check = FALSE
      )})
bnch
# # A tibble: 12 x 15
# expression n min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
# 1 dplyr 1.00e5 15.75ms 16.4ms 15.85ms 22.7ms 61.0 6.88MB 0 31
# 2 dt_melt 1.00e5 6.34ms 8.39ms 8.48ms 9.2ms 119. 7.01MB 1 53
# 3 dt_rbind 1.00e5 7.45ms 7.82ms 7.75ms 8.9ms 128. 4.06MB 0 64
# 4 dplyr 1.00e6 149.07ms 159.32ms 160.07ms 168.06ms 6.28 68.68MB 0 4
# 5 dt_melt 1.00e6 49.85ms 58.88ms 60.52ms 62.58ms 17.0 69.34MB 1 7
# 6 dt_rbind 1.00e6 35.73ms 38.05ms 38.61ms 40.01ms 26.3 39.09MB 1 12
# 7 dplyr 1.00e7 1.78s 1.78s 1.78s 1.78s 0.560 686.66MB 2 1
# 8 dt_melt 1.00e7 648.77ms 648.77ms 648.77ms 648.77ms 1.54 692.61MB 1 1
# 9 dt_rbind 1.00e7 389.32ms 390.37ms 390.37ms 391.41ms 2.56 387.54MB 3 2
# 10 dplyr 1.00e8 18.73s 18.73s 18.73s 18.73s 0.0534 6.71GB 3 1
# 11 dt_melt 1.00e8 8.18s 8.18s 8.18s 8.18s 0.122 6.76GB 2 1
# 12 dt_rbind 1.00e8 4.15s 4.15s 4.15s 4.15s 0.241 3.78GB 1 1
ggplot2::autoplot(bnch)
Answer 2 (score: 4)
Since you mention "60 million observations", here is another data.table solution that uses rbind instead of melt:
library(data.table)
setDT(df)
rbind(df[ , .(x = sum(amount)), by = .(agent = buyer) ],
df[ , .(x = sum(amount)), by = .(agent = seller) ])[
, .(total_amount = sum(x)), by = agent]
# agent total_amount
# 1: A 1
# 2: C 2
# 3: B 4
# 4: D 2
# 5: E 3
Answer 3 (score: 2)
Visit the rows twice and group by c(buyer, seller):
# setup
library(data.table)
setDT(df)
df[, c("buyer", "seller") := .(as.character(buyer), as.character(seller))]
# aggregate
df[rep(1:.N, 2), .(total = sum(amount)), by=.(agent = c(df$buyer, df$seller))]
agent total
1: A 1
2: C 2
3: B 4
4: D 2
5: E 3
I think the df$ prefix is needed because of aggressive NSE parsing. I'm not sure whether by= or keyby= should be faster here.
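As a small illustration of the difference between the two: `by=` returns groups in order of first appearance, while `keyby=` additionally sorts the result by the grouping column and sets it as the key. The sketch below rebuilds the toy data under the name `dt` (an illustrative name) and does not measure speed.

```r
library(data.table)

dt <- data.table(buyer = c("A", "C", "B"),
                 seller = c("B", "D", "E"),
                 amount = c(1, 2, 3))

# Groups appear in order of first occurrence
res_by <- dt[rep(1:.N, 2), .(total = sum(amount)),
             by = .(agent = c(dt$buyer, dt$seller))]

# Same aggregation, but the result is sorted and keyed by `agent`
res_keyby <- dt[rep(1:.N, 2), .(total = sum(amount)),
                keyby = .(agent = c(dt$buyer, dt$seller))]

key(res_keyby)
```

Which of the two is faster on large data would have to be measured; the extra sort in `keyby=` is only over the aggregated result, which has one row per agent.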
Benchmarking: I tried it with zx8's data and found that, when rearranged as follows, it takes roughly twice as long as the rbind approach.
dt_big[, data.table(agent = c(buyer,seller), v = amount)][, sum(v), by=agent]
# 7.4 seconds vs 4.0 for dt_rbind with n = 10^8
Finally, another fast but verbose option:
groupingsets(dt_big,
             by = c("buyer", "seller"),
             sets = list("buyer", "seller"),
             j = sum(amount))[is.na(buyer), buyer := seller][, sum(V1), by = buyer]
# 4.2 seconds