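I wanted to use a UDAF (user-defined aggregate function) in pandas because there is no nunique function, but I found that it takes too much time. So I tested different ways of aggregating with an existing function (sum) and found large differences between them, as shown below: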
import numpy as np
import pandas as pd
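recordNum = 10   # number of rows
n = 2            # number of rows per ID
varNum = 1000    # number of value columns
# recordNum // n distinct IDs, each repeated n times
keys = ['ID%04d' % i for i in range(recordNum // n)] * n
varlst1 = ['x%04d' % i for i in range(varNum)]
# one column of random integers in [0, 1000) per variable name
dsDict1 = {k: np.random.choice(range(1000), recordNum) for k in varlst1}
dsDict1['ID'] = keys
df1 = pd.DataFrame(dsDict1)
dfg = df1.groupby(['ID'])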
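# 1.47 ms: groupby's own sum method
%timeit dfg.sum()
# 1.48 ms: Python's built-in sum passed to aggregate
%timeit dfg.aggregate(sum)
# 1.5 s: pd.Series.sum passed to aggregate
%timeit dfg.aggregate(pd.Series.sum)
# 361 ms: dict mapping every column to the built-in sum
%timeit dfg.aggregate(dict.fromkeys(varlst1, sum))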
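
The comparison above relies on the `%timeit` magic, so it only runs inside IPython. As a rough sketch for running it as a plain script, the version below uses the standard `timeit` module and also collects each result, so that one can confirm the variants return the same values before comparing their speed. Several things here are assumptions rather than part of the original test: the names `record_num`, `var_num`, `candidates`, `results` and `timings` are made up for illustration, the column count is reduced to 50 so the script finishes quickly, the random data is seeded, and the string alias `'sum'` stands in for the built-in `sum`. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical, smaller sizes than in the post so the sketch runs quickly.
record_num = 10
n = 2
var_num = 50

rng = np.random.default_rng(0)  # seeded only to make the run repeatable
keys = ['ID%04d' % i for i in range(record_num // n)] * n
cols = ['x%04d' % i for i in range(var_num)]
data = {c: rng.integers(0, 1000, record_num) for c in cols}
data['ID'] = keys
df = pd.DataFrame(data)
dfg = df.groupby('ID')

# The aggregation variants to compare; 'sum' (string alias) is an
# assumption standing in for the built-in sum used in the post.
candidates = {
    'dfg.sum()': lambda: dfg.sum(),
    "dfg.aggregate('sum')": lambda: dfg.aggregate('sum'),
    'dfg.aggregate(pd.Series.sum)': lambda: dfg.aggregate(pd.Series.sum),
    'dict of per-column sums': lambda: dfg.aggregate(dict.fromkeys(cols, 'sum')),
}

# Run each variant once to keep its result, then time a few repetitions.
results = {name: fn() for name, fn in candidates.items()}
timings = {name: timeit.timeit(fn, number=5) / 5 for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```

Comparing the entries of `results` first helps rule out the possibility that a slow variant is simply doing different work from the fast ones.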