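A pandas groupby aggregation with a custom function is really slow once the number of keys exceeds 10k, which is quite a common situation. Is there any way to speed it up?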
import pandas as pd
n = 10*1000000
ngroup = 10000
m = n//ngroup
d = pd.DataFrame({"a":range(n), "b":list(range(ngroup))*m})
%timeit dagg = d.groupby("b")["a"].agg(["mean","std"]).reset_index()
#700 ms
#custom function
%timeit dagg = d.groupby("b")["a"].agg(lambda x: x.mean()+x.std()).reset_index()
#4.37 s
Comparison in R with data.table:
require(data.table)
n = 10*1000000
ngroup = 10000
m = n/ngroup
DT = data.table(a = 0:(n-1), b = rep(0:(ngroup-1), m))
system.time({dagg = DT[, .(m = mean(a), s = sd(a)), by = b]})
#0.42 sec
#custom function
f <- function(x)mean(x)+sd(x)
system.time({ dagg = DT[, .(k =f(a)), by = b] })
#0.81 sec
Answer 0 (score: 2)
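If all you need is the sum of the mean and the standard deviation, I think it is more efficient to let groupby compute the two built-in aggregations and then add them up: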
%timeit d.groupby("b")["a"].agg(["mean","std"])
1 loop, best of 3: 698 ms per loop
%timeit d.groupby("b")["a"].agg(["mean","std"]).sum(1)
1 loop, best of 3: 704 ms per loop
Yours:
%timeit d.groupby("b")["a"].agg(lambda x: x.mean()+x.std())
1 loop, best of 3: 2.89 s per loop
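To check that the two approaches above give the same numbers, here is a minimal sketch. It assumes a smaller frame than the one in the question (the sizes and the names `fast`, `slow` and `same` are chosen only for illustration) so that it runs quickly:

```python
import numpy as np
import pandas as pd

# Smaller frame than in the question so the check runs quickly
# (these sizes are an assumption made for illustration).
n = 100_000
ngroup = 1_000
d = pd.DataFrame({
    "a": np.arange(n),
    "b": np.tile(np.arange(ngroup), n // ngroup),
})

# Built-in aggregations followed by a row-wise sum.
fast = d.groupby("b")["a"].agg(["mean", "std"]).sum(axis=1)

# Custom Python function called once per group.
slow = d.groupby("b")["a"].agg(lambda x: x.mean() + x.std())

# Compare the two results group by group.
same = np.allclose(fast.to_numpy(), slow.to_numpy())
print(same)
```

One caveat: `sum(axis=1)` skips NaN by default, so for a group with a single row (where `std` is NaN) it returns the mean, while the lambda returns NaN. If that difference matters, adding the two columns explicitly (`agg["mean"] + agg["std"]` on the aggregated frame) keeps the NaN.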