Python: Pandas incorrectly excludes a column in groupby

Asked: 2018-02-22 08:29:10

Tags: python pandas

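I have seen that pandas silently excludes nuisance columns, as described here: Pandas Nuisance columns

It states that if an aggregation function cannot be applied to a column, that column is silently excluded.

Consider the following example:

I have a DataFrame: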

import numpy as np
import pandas as pd

df = pd.DataFrame({'C': {0: -0.91985400000000006, 1: -0.042379, 2: 1.2476419999999999, 3: -0.00992, 4: 0.290213, 5: 0.49576700000000001, 6: 0.36294899999999997, 7: 1.548106}, 'A': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'bar', 4: 'foo', 5: 'bar', 6: 'foo', 7: 'foo'}, 'B': {0: -1.131345, 1: -0.089328999999999992, 2: 0.33786300000000002, 3: -0.94586700000000001, 4: -0.93213199999999996, 5: 1.9560299999999999, 6: 0.017587000000000002, 7: -0.016691999999999999}})

df:
     A         B         C
0  foo -1.131345 -0.919854
1  bar -0.089329 -0.042379
2  foo  0.337863  1.247642
3  bar -0.945867 -0.009920
4  foo -0.932132  0.290213
5  bar  1.956030  0.495767
6  foo  0.017587  0.362949
7  foo -0.016692  1.548106

Let me combine the two columns B and C into a single column D whose values are numpy ndarrays:

df = df.assign(D=df[['B', 'C']].values.tolist())
df['D'] = df['D'].apply(np.array)

df:

     A         B         C                                            D
0  foo -1.131345 -0.919854             [-1.131345, -0.9198540000000001]
1  bar -0.089329 -0.042379            [-0.08932899999999999, -0.042379]
2  foo  0.337863  1.247642                         [0.337863, 1.247642]
3  bar -0.945867 -0.009920                        [-0.945867, -0.00992]
4  foo -0.932132  0.290213                        [-0.932132, 0.290213]
5  bar  1.956030  0.495767                          [1.95603, 0.495767]
6  foo  0.017587  0.362949  [0.017587000000000002, 0.36294899999999997]
7  foo -0.016692  1.548106                        [-0.016692, 1.548106]

Now I can apply mean to column D:

print(df['D'].mean())
print(df['B'].mean())
print(df['C'].mean())

[-0.10048563  0.3715655 ]
-0.100485625
0.3715655

But when I group by A and take the mean, column D is dropped:

df.groupby('A').mean()

            B         C
A
bar  0.306945  0.147823
foo -0.344944  0.505811

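My question is: why is column D excluded even though the aggregation function can be applied to it successfully?

And, more generally, how can I use aggregation functions such as mean or sum when the column of interest holds numpy arrays?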

1 Answer:

Answer 0: (score: 0)

It is possible, but you need an if-else in a custom function:

def f(x):
    a = x.mean()
    # keep a scalar mean as-is; convert an array-valued mean to a list
    return a if isinstance(a, (float, int)) else list(a)

df1 = df.groupby('A').agg(f)
print (df1)
            B         C                                 D
A                                                        
bar  0.306945  0.147823  [0.306944666667, 0.147822666667]
foo -0.344944  0.505811           [-0.3449438, 0.5058112]

By contrast, returning the array-valued mean directly gives NaN for D:

df2 = df.groupby('A').agg(lambda x: x.mean())
print (df2)
            B         C   D
A                          
bar  0.306945  0.147823 NaN
foo -0.344944  0.505811 NaN
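
For the more general part of the question (mean or sum over a column of numpy arrays), one alternative that sidesteps agg is to iterate over the groups and reduce each group's arrays with numpy. This is only a sketch and not part of the answer above: the name group_means is made up for illustration, and it assumes every array in D has the same shape so that np.stack can combine them.

```python
import numpy as np
import pandas as pd

# Same data as in the question, with values rounded as printed there.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': [-1.131345, -0.089329, 0.337863, -0.945867,
          -0.932132, 1.956030, 0.017587, -0.016692],
    'C': [-0.919854, -0.042379, 1.247642, -0.009920,
          0.290213, 0.495767, 0.362949, 1.548106],
})

# One numpy array per row, built from columns B and C.
df['D'] = [np.array(pair) for pair in df[['B', 'C']].to_numpy()]

# Iterating a grouped Series yields (group key, sub-Series) pairs.
# np.stack turns each group's arrays into a 2-D array (rows = group members),
# and mean(axis=0) averages element-wise across those rows.
# group_means is a hypothetical name; swap .mean for .sum to get sums.
group_means = {
    key: np.stack(group.to_list()).mean(axis=0)
    for key, group in df.groupby('A')['D']
}

for key, value in group_means.items():
    print(key, value)
```

The result is a plain dict keyed by group label, so it can be turned back into a Series with pd.Series(group_means) if an indexed result is needed.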