Filling NaN values in pandas after grouping

Time: 2020-09-06 04:04:49

Tags: pandas dataframe group-by

This question is slightly different from the usual NaN-filling problem.

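Suppose I have a DataFrame that I group by some category. Now I want to fill the NaN values of one column using that group's mean, but with the mean taken from a different column. Let me give an example: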

import numpy as np
import pandas as pd

a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
a['diff'] = a.salary - a.expenditure

    Occupation  salary  expenditure diff
0   driver      100     20.0        80.0
1   driver      150     40.0        110.0
2   mechanic    70      10.0        60.0
3   teacher     300     100.0       200.0
4   mechanic    90      NaN         NaN
5   teacher     250     80.0        170.0
6   unemployed  10      0.0         10.0
7   driver      90      NaN         NaN
8   mechanic    110     40.0        70.0
9   teacher     350     120.0       230.0

So in the case above, I want to fill the NaN values in expenditure with: salary - mean(diff) for each group.

How can I do this with pandas?

1 Answer:

Answer 0 (score: 2)

You can use groupby.transform to create a new Series holding the required values, and then use it to update the target column.

Assuming you want to group by Occupation:

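# Broadcast each group's mean diff back to every row of that group
a['mean_diff'] = a.groupby('Occupation')['diff'].transform('mean')
# Where expenditure is NaN, replace it with salary - mean_diff; assign the
# result back, since an inplace call on a.expenditure may not update a
a['expenditure'] = a['expenditure'].mask(
    a['expenditure'].isna(),
    a['salary'] - a['mean_diff'],
)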

Output

   Occupation  salary  expenditure   diff  mean_diff
0      driver     100         20.0   80.0       95.0
1      driver     150         40.0  110.0       95.0
2    mechanic      70         10.0   60.0       65.0
3     teacher     300        100.0  200.0      200.0
4    mechanic      90         25.0    NaN       65.0
5     teacher     250         80.0  170.0      200.0
6  unemployed      10          0.0   10.0       10.0
7      driver      90         -5.0    NaN       95.0
8    mechanic     110         40.0   70.0       65.0
9     teacher     350        120.0  230.0      200.0
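
For comparison, here is a minimal self-contained sketch of the same idea that skips the helper column and uses Series.fillna in place of mask. This is a swapped-in variant, not the approach from the answer above; the names mean_diff and filled are only illustrative, and the data is the example frame from the question.

```python
import numpy as np
import pandas as pd

# Example frame from the question
a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
})
a['diff'] = a['salary'] - a['expenditure']

# Group mean of diff, repeated on every row of its group.
# 'mean' skips NaN, so only rows with a known diff contribute.
mean_diff = a.groupby('Occupation')['diff'].transform('mean')

# fillna aligns on the index and only changes rows where expenditure is NaN.
filled = a['expenditure'].fillna(a['salary'] - mean_diff)
print(filled)
```

Assigning filled back with a['expenditure'] = filled would update the frame. If a group had no known diff at all, its mean would itself be NaN and the missing expenditure in that group would stay NaN.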