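I want to group a table so that the first two columns keep their values within each group, the third column becomes the group mean, and the fourth becomes the group dispersion, as defined in the code below. This is how I currently do it: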
import numpy as np
import pandas as pd
x = pd.DataFrame(np.array(((1,1,1,1),(1,1,10,2),(2,2,2,2),(2,2,8,3))))
   0  1   2  3
0  1  1   1  1
1  1  1  10  2
2  2  2   2  2
3  2  2   8  3
g = x.groupby(0)
res = g.mean()
res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
res
     1    2    3
0
1  1.0  5.5  6.0
2  2.0  5.0  5.5
In any case, I am looking to speed this up. In particular, it would be nice if I could get rid of apply and use g only once.
For testing purposes, it runs on data of the following size:
Here is a medium-sized sample:
array([[ 0.00000000e+000, 4.70221520e-003, 1.14943038e-003,
3.44829114e-009],
[ 1.81557753e-011, 4.94065646e-324, 4.70221520e-003,
1.14943038e-003],
[ 2.36416931e-008, 1.97231804e-011, 9.88131292e-324,
8.43322640e-003],
[ 1.74911362e-003, 3.43575891e-009, 1.12130677e-010,
1.48219694e-323],
[ 8.43322640e-003, 1.74911362e-003, 3.42014182e-009,
1.11974506e-010],
[ 1.97626258e-323, 4.70221520e-003, 1.14943038e-003,
3.48747627e-009],
[ 1.78945412e-011, 2.47032823e-323, 4.70221520e-003,
1.14943038e-003],
[ 2.32498418e-008, 1.85476266e-010, 2.96439388e-323,
4.70221520e-003],
[ 1.14943038e-003, 3.50053798e-009, 1.85476266e-011,
3.45845952e-323],
[ 4.70221520e-003, 1.14943038e-003, 4.53241298e-008,
3.00419304e-010],
[ 3.95252517e-323, 4.70221520e-003, 1.14943038e-003,
3.55278482e-009],
[ 1.80251583e-011, 4.44659081e-323, 4.70221520e-003,
1.14943038e-003],
[ 1.09587738e-008, 1.68496045e-011, 4.94065646e-323,
4.70221520e-003],
[ 1.14943038e-003, 3.48747627e-009, 1.80251583e-011,
5.43472210e-323],
[ 4.70221520e-003, 1.14943038e-003, 3.90545096e-008,
2.63846519e-010],
[ 5.92878775e-323, 8.43322640e-003, 1.74911362e-003,
3.15465136e-009],
[ 1.04009792e-010, 6.42285340e-323, 8.43322640e-003,
1.74911362e-003],
[ 2.56120209e-010, 4.15414486e-011, 6.91691904e-323,
8.43322640e-003],
[ 1.74911362e-003, 3.43575891e-009, 1.12286848e-010,
7.41098469e-323],
[ 8.43322640e-003, 1.74911362e-003, 5.91887557e-009,
1.45863583e-010],
[ 7.90505033e-323, 8.43322640e-003, 1.74911362e-003,
3.34205639e-009],
[ 1.07133209e-010, 8.39911598e-323, 8.43322640e-003,
1.74911362e-003],
[ 1.21188587e-009, 7.07453993e-011, 8.89318163e-323,
8.43322640e-003],
[ 1.74911362e-003, 3.38890765e-009, 1.12130677e-010,
9.38724727e-323],
[ 8.43322640e-003, 1.74911362e-003, 1.79596488e-009,
8.38637515e-011]])
Answer 0 (score: 2)
You can use some syntactic sugar: call .groupby directly on a Series, passing the grouping column as the key:
res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
print(res)
     1    2    3
0
1  1.0  5.5  6.0
2  2.0  5.0  5.5
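If the goal is also to call groupby only once, one possible variation is to precompute the two helper columns and aggregate everything in a single agg call. This is only a sketch and not part of the approach above; the helper column names hi and lo are hypothetical, and whether it is actually faster on your data would need to be measured.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.array(((1, 1, 1, 1), (1, 1, 10, 2), (2, 2, 2, 2), (2, 2, 8, 3))))

# Hypothetical helper columns: upper and lower edge of each row's interval.
tmp = x.assign(hi=x[2] + x[3], lo=x[2] - x[3])

# One groupby, one agg call: means for columns 1 and 2, extremes for the helpers.
res = tmp.groupby(0).agg({1: "mean", 2: "mean", "hi": "max", "lo": "min"})

# Half the distance between the largest upper edge and the smallest lower edge.
res[3] = (res.pop("hi") - res.pop("lo")) * 0.5
print(res)
```

The helper columns are removed again with pop, so the result has the same columns 1, 2 and 3 as before.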
Timings using your array:
In [279]: %%timeit
...: res = x.groupby(0).mean()
...: res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
...:
4.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [280]: %%timeit
...: g = x.groupby(0)
...: res = g.mean()
...: res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
...:
11 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
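The measurements above come from IPython's %%timeit. Outside IPython, a comparable check could be sketched with the standard timeit module. The random test data, its size, and the function names below are assumptions made for illustration, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 5000 rows, up to 50 distinct keys in column 0.
rng = np.random.default_rng(0)
x = pd.DataFrame({
    0: rng.integers(0, 50, size=5000),
    1: rng.random(5000),
    2: rng.random(5000),
    3: rng.random(5000),
})

def with_apply():
    # A Python-level lambda is called once per group.
    g = x.groupby(0)
    res = g.mean()
    res[3] = g[[2, 3]].apply(lambda d: ((d[2] + d[3]).max() - (d[2] - d[3]).min()) * 0.5)
    return res

def with_series_groupby():
    # Only built-in groupby reductions (mean, max, min) are used.
    res = x.groupby(0).mean()
    res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min()) * 0.5
    return res

# Both variants should produce the same numbers before being timed.
assert np.allclose(with_apply(), with_series_groupby())

for fn in (with_apply, with_series_groupby):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```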
If possible, also turn off sorting by the grouping column:
In [283]: %%timeit
...: res = x.groupby(0, sort=False).mean()
...: res[3] = ((x[2] + x[3]).groupby(x[0], sort=False).max() - (x[2] - x[3]).groupby(x[0], sort=False).min())*.5
...:
4.1 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
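Note that sort=False changes the order of the groups in the result: they appear in order of first occurrence instead of sorted by key. A small sketch with made-up data (the column names key and val are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"key": [2, 1, 2, 1], "val": [10.0, 20.0, 30.0, 40.0]})

sorted_res = df.groupby("key")["val"].max()
unsorted_res = df.groupby("key", sort=False)["val"].max()

print(sorted_res.index.tolist())    # [1, 2]
print(unsorted_res.index.tolist())  # [2, 1]

# Arithmetic between two grouped Series aligns on the index labels,
# so subtracting a max from a min still pairs up the right groups.
spread = unsorted_res - df.groupby("key", sort=False)["val"].min()
print(spread.to_dict())             # {2: 20.0, 1: 20.0}
```

If the sorted order is needed afterwards, calling sort_index() on the final result restores it.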