Speed up groupby with apply on a single column

Time: 2018-07-30 10:51:21

Tags: python performance pandas numpy pandas-groupby

I want to group a table so that the first two columns remain as they are after grouping, the third becomes the group mean, and the fourth becomes a group spread, as defined in the code. This is how I currently do it:

x = pd.DataFrame(np.array(((1,1,1,1),(1,1,10,2),(2,2,2,2),(2,2,8,3))))

   0  1   2  3
0  1  1   1  1
1  1  1  10  2
2  2  2   2  2
3  2  2   8  3

g      = x.groupby(0)
res    = g.mean()
res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)  # group spread
res

     1    2    3
0               
1  1.0  5.5  6.0
2  2.0  5.0  5.5

In any case, I am looking for ways to speed this up. In particular, it would be nice if I could get rid of apply and only use g once.

For testing purposes, it runs with the following data sizes (a small generator sketch follows the list):

  1. a few to 60 rows
  2. 1-5 groups (possibly a single group)
  3. 4 columns
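
A generator along these lines reproduces data of roughly those sizes for benchmarking (a sketch; the exact sizes, the seed, and the make_sample helper are assumptions, not from the question):

import numpy as np
import pandas as pd

def make_sample(n_rows=60, n_groups=5, seed=0):
    # Build a 4-column frame: column 0 is the group key, columns 1-3 are floats.
    rng = np.random.default_rng(seed)
    keys = rng.integers(1, n_groups + 1, size=n_rows)
    vals = rng.random((n_rows, 3))
    return pd.DataFrame(np.column_stack([keys, vals]))

x = make_sample()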

Here is a medium-sized sample:

array([[  0.00000000e+000,   4.70221520e-003,   1.14943038e-003,
      3.44829114e-009],
   [  1.81557753e-011,   4.94065646e-324,   4.70221520e-003,
      1.14943038e-003],
   [  2.36416931e-008,   1.97231804e-011,   9.88131292e-324,
      8.43322640e-003],
   [  1.74911362e-003,   3.43575891e-009,   1.12130677e-010,
      1.48219694e-323],
   [  8.43322640e-003,   1.74911362e-003,   3.42014182e-009,
      1.11974506e-010],
   [  1.97626258e-323,   4.70221520e-003,   1.14943038e-003,
      3.48747627e-009],
   [  1.78945412e-011,   2.47032823e-323,   4.70221520e-003,
      1.14943038e-003],
   [  2.32498418e-008,   1.85476266e-010,   2.96439388e-323,
      4.70221520e-003],
   [  1.14943038e-003,   3.50053798e-009,   1.85476266e-011,
      3.45845952e-323],
   [  4.70221520e-003,   1.14943038e-003,   4.53241298e-008,
      3.00419304e-010],
   [  3.95252517e-323,   4.70221520e-003,   1.14943038e-003,
      3.55278482e-009],
   [  1.80251583e-011,   4.44659081e-323,   4.70221520e-003,
      1.14943038e-003],
   [  1.09587738e-008,   1.68496045e-011,   4.94065646e-323,
      4.70221520e-003],
   [  1.14943038e-003,   3.48747627e-009,   1.80251583e-011,
      5.43472210e-323],
   [  4.70221520e-003,   1.14943038e-003,   3.90545096e-008,
      2.63846519e-010],
   [  5.92878775e-323,   8.43322640e-003,   1.74911362e-003,
      3.15465136e-009],
   [  1.04009792e-010,   6.42285340e-323,   8.43322640e-003,
      1.74911362e-003],
   [  2.56120209e-010,   4.15414486e-011,   6.91691904e-323,
      8.43322640e-003],
   [  1.74911362e-003,   3.43575891e-009,   1.12286848e-010,
      7.41098469e-323],
   [  8.43322640e-003,   1.74911362e-003,   5.91887557e-009,
      1.45863583e-010],
   [  7.90505033e-323,   8.43322640e-003,   1.74911362e-003,
      3.34205639e-009],
   [  1.07133209e-010,   8.39911598e-323,   8.43322640e-003,
      1.74911362e-003],
   [  1.21188587e-009,   7.07453993e-011,   8.89318163e-323,
      8.43322640e-003],
   [  1.74911362e-003,   3.38890765e-009,   1.12130677e-010,
      9.38724727e-323],
   [  8.43322640e-003,   1.74911362e-003,   1.79596488e-009,
      8.38637515e-011]])

1 answer:

Answer 0 (score: 2):

You can use a bit of syntactic sugar: call .groupby directly on a Series, grouping it by another Series:

res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
print (res)
     1    2    3
0               
1  1.0  5.5  6.0
2  2.0  5.0  5.5
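
This works because a Series can be grouped by any other aligned Series, not just by a column name of its own frame. A minimal standalone sketch (the variable names here are only illustrative):

import pandas as pd

s   = pd.Series([1, 10, 2, 8])   # values to aggregate
key = pd.Series([1, 1, 2, 2])    # aligned grouping key
print(s.groupby(key).max())      # 1 -> 10, 2 -> 8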

These are the timings with your array:

In [279]: %%timeit
     ...: res    = x.groupby(0).mean()
     ...: res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
     ...: 
4.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [280]: %%timeit
     ...: g      = x.groupby(0)
     ...: res    = g.mean()
     ...: res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
     ...: 
11 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If possible, also turn off sorting by the grouping column:

In [283]: %%timeit
     ...: res    = x.groupby(0, sort=False).mean()
     ...: res[3] = ((x[2] + x[3]).groupby(x[0], sort=False).max() - (x[2] - x[3]).groupby(x[0], sort=False).min())*.5
     ...: 
4.1 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
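
If you also want to avoid calling groupby three times, one possible single-pass variant (a sketch of an alternative, not part of the answer above) precomputes the two helper columns and aggregates everything with one agg call:

import numpy as np
import pandas as pd

x = pd.DataFrame(np.array(((1, 1, 1, 1), (1, 1, 10, 2), (2, 2, 2, 2), (2, 2, 8, 3))))

# Add the helper columns once, then aggregate in a single groupby pass.
tmp = x.assign(hi=x[2] + x[3], lo=x[2] - x[3])
g   = tmp.groupby(0, sort=False)
res = g.agg({1: 'mean', 2: 'mean', 'hi': 'max', 'lo': 'min'})
res[3] = (res['hi'] - res['lo']) * 0.5
res = res.drop(columns=['hi', 'lo'])
print(res)

Whether this is actually faster at a few dozen rows would need to be timed; at that size the fixed per-call overhead of groupby tends to dominate.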