Computing the mean of a pandas df conditioned on a parameter

Time: 2018-03-15 13:31:29

Tags: python pandas dataframe

I have the following df:

  import numpy as np
  import pandas as pd
  a = [] 
  for i in range(5):
      tmp_df = pd.DataFrame(np.random.random((10,4)))
      tmp_df['lvl'] = i
      a.append(tmp_df) 
  df = pd.concat(a, axis=0)

df =

          0         1         2         3  lvl
0  0.928623  0.868600  0.854186  0.129116    0
1  0.667870  0.901285  0.539412  0.883890    0
2  0.384494  0.697995  0.242959  0.725847    0
3  0.993400  0.695436  0.596957  0.142975    0
4  0.518237  0.550585  0.426362  0.766760    0
5  0.359842  0.417702  0.873988  0.217259    0
6  0.820216  0.823426  0.585223  0.553131    0
7  0.492683  0.401155  0.479228  0.506862    0
..............................................   
3  0.505096  0.426465  0.356006  0.584958    3
4  0.145472  0.558932  0.636995  0.318406    3
5  0.957969  0.068841  0.612658  0.184291    3
6  0.059908  0.298270  0.334564  0.738438    3
7  0.662056  0.074136  0.244039  0.848246    3
8  0.997610  0.043430  0.774946  0.097294    3
9  0.795873  0.977817  0.780772  0.849418    3
0  0.577173  0.430014  0.133300  0.760223    4
1  0.916126  0.623035  0.240492  0.638203    4
2  0.165028  0.626054  0.225580  0.356118    4
3  0.104375  0.137684  0.084631  0.987290    4
4  0.934663  0.835608  0.764334  0.651370    4
5  0.743265  0.072671  0.911947  0.925644    4
6  0.212196  0.587033  0.230939  0.994131    4
7  0.945275  0.238572  0.696123  0.536136    4
8  0.989021  0.073608  0.720132  0.254656    4
9  0.513966  0.666534  0.270577  0.055597    4

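I am learning the neat features of pandas, so I am wondering: what is the simplest way to compute the average over the lvl column?

What I mean is:

(df[df.lvl == 0] + df[df.lvl == 1] + df[df.lvl == 2] + df[df.lvl == 3] + df[df.lvl == 4]) / 5

The desired output should be a table of shape (10,4), without the lvl column, where each element is the mean of 5 elements (one for each lvl in [0,1,2,3,4]). I hope this helps.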

3 Answers:

Answer 0: (score: 1)

I think you need:

np.random.seed(456)
a = [] 
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df) 
df = pd.concat(a, axis=0)
#print (df)
df1 = (df[df.lvl ==0 ] + df[df.lvl ==1 ] + 
       df[df.lvl ==2 ] + df[df.lvl ==3 ] + 
       df[df.lvl ==4 ]) / 5
print (df1)
          0         1         2         3  lvl
0  0.411557  0.520560  0.578900  0.541576    2
1  0.253469  0.655714  0.532784  0.620744    2
2  0.468099  0.576198  0.400485  0.333533    2
3  0.620207  0.367649  0.531639  0.475587    2
4  0.699554  0.548005  0.683745  0.457997    2
5  0.322487  0.316137  0.489660  0.362146    2
6  0.430058  0.159712  0.631610  0.641141    2
7  0.399944  0.511944  0.346402  0.754591    2
8  0.400190  0.373925  0.340727  0.407988    2
9  0.502879  0.399614  0.321710  0.715812    2

df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
          0         1         2         3
0  0.411557  0.520560  0.578900  0.541576
1  0.253469  0.655714  0.532784  0.620744
2  0.468099  0.576198  0.400485  0.333533
3  0.620207  0.367649  0.531639  0.475587
4  0.699554  0.548005  0.683745  0.457997
5  0.322487  0.316137  0.489660  0.362146
6  0.430058  0.159712  0.631610  0.641141
7  0.399944  0.511944  0.346402  0.754591
8  0.400190  0.373925  0.340727  0.407988
9  0.502879  0.399614  0.321710  0.715812
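
To see why the double groupby above works, here is a minimal sketch, assuming the same seeded frame as in this answer. The names `pos`, `by_position` and `manual` are introduced here only for illustration. The inner `groupby('lvl').cumcount()` numbers the rows inside each lvl block, and the outer groupby averages the rows that share a position.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame: 5 blocks of 10 rows, each tagged with its lvl.
np.random.seed(456)
df = pd.concat(
    [pd.DataFrame(np.random.random((10, 4))).assign(lvl=i) for i in range(5)],
    axis=0,
)

# cumcount() numbers the rows inside each lvl group 0, 1, 2, ...
# so it recovers the row position even if the index were not 0..9.
pos = df.groupby('lvl').cumcount()

# Group on that position and average the four value columns.
by_position = df.drop(columns='lvl').groupby(pos.to_numpy()).mean()

# Cross-check against the explicit sum over the five lvl blocks.
manual = sum(df[df.lvl == i].drop(columns='lvl') for i in range(5)) / 5

print(np.allclose(by_position.to_numpy(), manual.to_numpy()))
```

Passing the positions as a plain array is a choice made for this sketch; it keeps the grouping independent of how the index is labelled.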

EDIT:

If each subset of the DataFrame has an index running from 0 to len(subset):

df2 = df.mean(level=0)
print (df2)
          0         1         2         3  lvl
0  0.411557  0.520560  0.578900  0.541576    2
1  0.253469  0.655714  0.532784  0.620744    2
2  0.468099  0.576198  0.400485  0.333533    2
3  0.620207  0.367649  0.531639  0.475587    2
4  0.699554  0.548005  0.683745  0.457997    2
5  0.322487  0.316137  0.489660  0.362146    2
6  0.430058  0.159712  0.631610  0.641141    2
7  0.399944  0.511944  0.346402  0.754591    2
8  0.400190  0.373925  0.340727  0.407988    2
9  0.502879  0.399614  0.321710  0.715812    2
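
A note for readers on recent pandas: as far as I know, the `level` argument of `DataFrame.mean` was deprecated in pandas 1.3 and removed in 2.0, so `df.mean(level=0)` only runs on older versions. A minimal sketch of the equivalent call through `groupby(level=0)`, assuming the original frame (before `set_index('lvl')`); `df3` is a name added here for illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10, 4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)

# Group on index level 0 (the repeated 0..9 row labels) and average.
# The lvl column is averaged too, giving (0+1+2+3+4)/5 = 2 in every row.
df2 = df.groupby(level=0).mean()

# Drop lvl first if only the four value columns are wanted.
df3 = df.drop(columns='lvl').groupby(level=0).mean()

print(df2.shape, df3.shape)
```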

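Answer 1: (score: 1)

The groupby function is exactly what you want. It groups rows based on a condition, in this case rows that share the same 'lvl', and then applies the mean function to the values of each column within each group.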

df.groupby('lvl').mean()
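
It may help to check the shape this call returns. A minimal sketch, assuming the frame from the question with a fixed seed (`per_lvl` and `per_row` are illustrative names): grouping on lvl gives one row per lvl value, whereas grouping on the row labels gives one row per position.

```python
import numpy as np
import pandas as pd

np.random.seed(456)
df = pd.concat(
    [pd.DataFrame(np.random.random((10, 4))).assign(lvl=i) for i in range(5)],
    axis=0,
)

# One row per distinct lvl value: the mean of the 10 rows in each block.
per_lvl = df.groupby('lvl').mean()

# One row per row label 0..9: the mean across the 5 lvl blocks.
per_row = df.groupby(df.index)[[0, 1, 2, 3]].mean()

print(per_lvl.shape)  # (5, 4)
print(per_row.shape)  # (10, 4)
```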

Answer 2: (score: 1)

It looks like you want to group by the index and take the mean of every column except lvl:

df.groupby(df.index)[[0,1,2,3]].mean()

For the dataframe generated with

np.random.seed(456)
a = [] 
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df) 
df = pd.concat(a, axis=0)

df.groupby(df.index)[[0,1,2,3]].mean()

the output is:

          0         1         2         3
0  0.411557  0.520560  0.578900  0.541576
1  0.253469  0.655714  0.532784  0.620744
2  0.468099  0.576198  0.400485  0.333533
3  0.620207  0.367649  0.531639  0.475587
4  0.699554  0.548005  0.683745  0.457997
5  0.322487  0.316137  0.489660  0.362146
6  0.430058  0.159712  0.631610  0.641141
7  0.399944  0.511944  0.346402  0.754591
8  0.400190  0.373925  0.340727  0.407988
9  0.502879  0.399614  0.321710  0.715812

which is the same as the output of
df.groupby(df.groupby('lvl').cumcount()).mean()

without resorting to a double groupby.

IMO this reads more clearly, and it should be faster for large dataframes.
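
The speed claim is easy to check on your own machine. A rough sketch, assuming a larger frame built the same way; the sizes `n_rows` and `n_lvls` and the helper names are hypothetical, and the timings will vary with hardware and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(456)
n_rows, n_lvls = 1000, 50  # hypothetical sizes, chosen only for illustration
df = pd.concat(
    [pd.DataFrame(np.random.random((n_rows, 4))).assign(lvl=i)
     for i in range(n_lvls)],
    axis=0,
)

def single_groupby():
    # Group directly on the repeated row labels.
    return df.groupby(df.index)[[0, 1, 2, 3]].mean()

def double_groupby():
    # First number the rows inside each lvl block, then group on that.
    return df.groupby(df.groupby('lvl').cumcount())[[0, 1, 2, 3]].mean()

# Both approaches should agree before their timings are compared.
same = np.allclose(single_groupby().to_numpy(), double_groupby().to_numpy())

t_single = timeit.timeit(single_groupby, number=20)
t_double = timeit.timeit(double_groupby, number=20)
print(f"same result: {same}")
print(f"single groupby: {t_single:.4f}s, double groupby: {t_double:.4f}s")
```

Note that the single groupby relies on each block having the index 0..n_rows-1, while the cumcount version does not.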