Pandas agg with a custom function is slow with a large number of keys

Date: 2018-02-08 02:11:37

Tags: python pandas pandas-groupby

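A groupby aggregation with a custom function is really slow once the number of keys exceeds 10k, which is a very common situation. Is there a way to speed it up?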

import pandas as pd

n = 10*1000000
ngroup = 10000
m = n//ngroup

d = pd.DataFrame({"a":range(n), "b":list(range(ngroup))*m})

%timeit dagg = d.groupby("b")["a"].agg(["mean","std"]).reset_index()
#700 ms

#custom function
%timeit dagg = d.groupby("b")["a"].agg(lambda x: x.mean()+x.std()).reset_index()
#4.37 s

Comparison with data.table in R:
require(data.table)

n = 10*1000000
ngroup = 10000
m = n/ngroup
DT = data.table(a = 0:(n-1), b = rep(0:(ngroup-1), m))

system.time({dagg = DT[, .(m = mean(a), s = sd(a)), by = b]})
#0.42 sec

#custom function
f <- function(x)mean(x)+sd(x)
system.time({ dagg = DT[, .(k =f(a)), by = b] })
#0.81 sec

1 Answer:

Answer 0 (score: 2)

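If all you need is what you show (the sum of the mean and the std), I think it is more efficient to compute both with groupby's built-in aggregations and then add them: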

%timeit d.groupby("b")["a"].agg(["mean","std"])
1 loop, best of 3: 698 ms per loop


%timeit d.groupby("b")["a"].agg(["mean","std"]).sum(axis=1)
1 loop, best of 3: 704 ms per loop

Yours:

%timeit d.groupby("b")["a"].agg(lambda x: x.mean()+x.std())
1 loop, best of 3: 2.89 s per loop
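
The gap is most likely because the lambda is called once per group in Python, while the named aggregations "mean" and "std" run through pandas' optimized groupby paths. A minimal sketch, assuming much smaller sizes than in the question (the values of n and ngroup here are illustrative, not the benchmark's), checks that adding the two built-in results gives the same values as the lambda:

```python
import numpy as np
import pandas as pd

# Illustrative sizes only (assumption): far smaller than the question's
# 10M rows / 10k groups, so the check finishes quickly.
n = 100_000
ngroup = 1_000
m = n // ngroup

d = pd.DataFrame({"a": np.arange(n), "b": np.tile(np.arange(ngroup), m)})

g = d.groupby("b")["a"]

# Built-in aggregations, combined with one vectorized addition
fast = g.mean() + g.std()

# Same quantity via a Python-level function called once per group
slow = g.agg(lambda x: x.mean() + x.std())

# Both approaches should agree up to floating-point error
print(np.allclose(fast.to_numpy(), slow.to_numpy()))
```

This only works when the custom function can be expressed in terms of built-in aggregations; an arbitrary function still has to be applied group by group.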