I have the following data.table:
> dt = data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"), sales_amt = c(500,600,700,800), cost_ccy = c("GBP","USD","GBP","USD"), cost_amt = c(-100,-200,-300,-400))
> dt
   sales_ccy sales_amt cost_ccy cost_amt
1:       USD       500      GBP     -100
2:       EUR       600      USD     -200
3:       GBP       700      GBP     -300
4:       USD       800      USD     -400
My goal is to obtain the following data.table:
> dt
   ccy total_amt
1: EUR       600
2: GBP       300
3: USD       700
Basically, I want to aggregate all costs and sales by currency. In practice this data.table has > 500,000 rows, so I am looking for a fast and efficient way to sum these amounts.
Any ideas on how to do this quickly?
Answer 0 (score: 9)
With data.table v1.9.6+, the improved version of melt can melt multiple columns at once:
require(data.table) # v1.9.6+
melt(dt, measure = patterns("_ccy$", "_amt$")
)[, .(tot_amt = sum(value2)), keyby = .(ccy=value1)]
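For reference, here is roughly what the intermediate molten table looks like on the sample data before the aggregation step (a sketch, assuming the variable/value1/value2 column naming melt uses for multiple measure groups):

melt(dt, measure = patterns("_ccy$", "_amt$"))
#    variable value1 value2
# 1:        1    USD    500
# 2:        1    EUR    600
# 3:        1    GBP    700
# 4:        1    USD    800
# 5:        2    GBP   -100
# 6:        2    USD   -200
# 7:        2    GBP   -300
# 8:        2    USD   -400

The chained [ then simply sums value2 by value1, giving the EUR 600 / GBP 300 / USD 700 result from the question.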
Answer 1 (score: 7)
You can consider merged.stack from my "splitstackshape" package. Here, I've also used "dplyr" for convenience; you can skip it if you prefer.

library(dplyr)
library(splitstackshape)
dt %>%
mutate(id = 1:nrow(dt)) %>%
merged.stack(var.stubs = c("ccy", "amt"), sep = "var.stubs", atStart = FALSE) %>%
.[, .(total_amt = sum(amt)), by = ccy]
# ccy total_amt
# 1: GBP 300
# 2: USD 700
# 3: EUR 600
The development version of "data.table" should be able to handle melting multiple columns at once, and it should presumably also be faster than this approach.
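That dev-version approach is essentially the melt() answer above; with explicit column positions instead of patterns() it can be written as follows (a sketch, using the same call that appears in the benchmark further down):

melt(dt, measure = list(c(1, 3), c(2, 4))  # (sales_ccy, cost_ccy) and (sales_amt, cost_amt)
     )[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
#    ccy tot_amt
# 1: EUR     600
# 2: GBP     300
# 3: USD     700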
Answer 2 (score: 3)
Dirtier than @Pgibas's solution:
dt[,
   list(c(sales_ccy, cost_ccy), c(sum(sales_amt), sum(cost_amt))), # this will create two new columns with ccy and amt
   by = list(sales_ccy, cost_ccy) # number of rows reduced to only the unique combinations of sales_ccy, cost_ccy
   ][,
     sum(V2), # this will aggregate the new columns
     by = V1
     ]
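To see why this works, here is a sketch of what the first [...] produces on the sample data: each unique (sales_ccy, cost_ccy) pair contributes two rows, one with the summed sales and one with the summed costs, so only a handful of rows reach the second [...]:

dt[,
   list(c(sales_ccy, cost_ccy), c(sum(sales_amt), sum(cost_amt))),
   by = list(sales_ccy, cost_ccy)]
#    sales_ccy cost_ccy  V1   V2
# 1:       USD      GBP USD  500
# 2:       USD      GBP GBP -100
# 3:       EUR      USD EUR  600
# 4:       EUR      USD USD -200
# 5:       GBP      GBP GBP  700
# 6:       GBP      GBP GBP -300
# 7:       USD      USD USD  800
# 8:       USD      USD USD -400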
Benchmark
I ran some tests comparing my code against the data.table 1.9.5 solution suggested by Arun.
Just one observation: I generated the 500K+ rows simply by repeating the original data.table, which keeps the number of distinct sales_ccy/cost_ccy combinations small and therefore also reduces the number of rows the second data.table [...] has to process (only 8 rows are created in this case).
I don't think the number of rows returned would come anywhere near 500K+ in a real-world scenario (it is bounded by roughly N^2, where N is the number of currencies used), but keep that in mind when reading these results.
library(data.table)
library(microbenchmark)
rm(dt)
dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"), sales_amt = c(500,600,700,800), cost_ccy = c("GBP","USD","GBP","USD"), cost_amt = c(-100,-200,-300,-400))
dt
for (i in 1:17) dt <- rbind(dt,dt)
mycode <- function() {
  test1 <- dt[,
              list(c(sales_ccy, cost_ccy), c(sum(sales_amt), sum(cost_amt))), # this will create two new columns with ccy and amt
              keyby = list(sales_ccy, cost_ccy)
              ][,
                sum(V2), # this will aggregate the new columns
                by = V1
                ]
  rm(test1)
}

suggesteEdit <- function() {
  test2 <- dt[, .(c(sales_ccy, cost_ccy), c(sales_amt, cost_amt))  # combine cols
              ][, .(tot_amt = sum(V2)), keyby = .(ccy = V1)        # aggregate + reorder
                ]
  rm(test2)
}

meltWithDataTable195 <- function() {
  test3 <- melt(dt, measure = list(c(1, 3), c(2, 4)))[, .(tot_amt = sum(value2)), keyby = .(ccy = value1)]
  rm(test3)
}
microbenchmark(
  mycode(),
  suggesteEdit(),
  meltWithDataTable195()
)
Results
Unit: milliseconds
                    expr      min       lq     mean   median       uq      max neval
                mycode() 12.27895 12.47456 15.04098 12.80956 14.73432 45.26173   100
          suggesteEdit() 25.36581 29.56553 42.52952 33.39229 59.72346 69.74819   100
 meltWithDataTable195() 25.71558 30.97693 47.77700 58.68051 61.23996 66.49597   100
Answer 3 (score: 3)
EDITED: Another way of doing it, using aggregate():

df = data.frame(ccy = c(dt$sales_ccy, dt$cost_ccy), total_amt = c(dt$sales_amt, dt$cost_amt))
out= aggregate(total_amt ~ ccy, data = df, sum)
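On the sample data this gives the same totals as the data.table answers; aggregate() returns a plain data.frame ordered by the grouping variable, and it can be converted back to a data.table with setDT() if needed (a sketch):

out
#   ccy total_amt
# 1 EUR       600
# 2 GBP       300
# 3 USD       700

library(data.table)
setDT(out)  # convert the data.frame to a data.table by reference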
Answer 4 (score: 2)
Dirty but effective:
# Bind costs and sales
df <- rbind(dt[, list(ccy = cost_ccy, total_amt = cost_amt)],
            dt[, list(ccy = sales_ccy, total_amt = sales_amt)])
# Sum for every currency
df[, sum(total_amt), by = ccy]
   ccy  V1
1: GBP 300
2: USD 700
3: EUR 600
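If you want the output to match the target exactly (a column named total_amt, sorted by currency), the same idea can be written as follows (a sketch, using keyby to sort and key the result):

# Stack sales and costs into one long table, then aggregate; keyby sorts by ccy (EUR, GBP, USD)
rbind(dt[, .(ccy = sales_ccy, amt = sales_amt)],
      dt[, .(ccy = cost_ccy,  amt = cost_amt)]
      )[, .(total_amt = sum(amt)), keyby = ccy]
#    ccy total_amt
# 1: EUR       600
# 2: GBP       300
# 3: USD       700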