Question

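Let me explain my problem:

Suppose I have training data and test data, and both have NaN values in the same column. My NaN imputation strategy is to group by some column and fill each NaN with the mean of its group. Example: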

import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})

    Occupation  salary  expenditure
0   driver      100     20.0    
1   driver      150     40.0    
2   mechanic    70      10.0
3   teacher     300     100.0
4   mechanic    90      NaN 
5   teacher     250     80.0
6   unemployed  10      0.0  
7   driver      90      NaN   
8   mechanic    110     40.0    
9   teacher     350     120.0

For the training data, I can do this:

x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x: x.fillna(x.mean()))

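But how do I do the same for the test data, where the fill values must be the group means from the training data? I am trying to do it with a for loop, but it takes a long time.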

Answer 1

Compute the group means as a Series:

mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver         30.0
mechanic       25.0
teacher       100.0
unemployed      0.0
Name: expenditure, dtype: float64

Then fill the missing values using Series.map and Series.fillna:

x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
   Occupation  salary  expenditure
0      driver     100         20.0
1      driver     150         40.0
2    mechanic      70         10.0
3     teacher     300        100.0
4    mechanic      90         25.0
5     teacher     250         80.0
6  unemployed      10          0.0
7      driver      90         30.0
8    mechanic     110         40.0
9     teacher     350        120.0

Handle the test data the same way, reusing the mean Series computed from x_train:

x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
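
One case the line above does not cover: if the test data contains an Occupation that never appears in the training data, map returns NaN for it and the value stays missing. A minimal sketch of one possible fallback, the overall training mean, follows. The small x_test frame and the 'pilot' category are made up for illustration, and the fallback itself is an assumption, not part of the original answer.

```python
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic'],
    'expenditure': [20, 40, 10, 100, np.nan]})

# Hypothetical test set; 'pilot' never appears in x_train.
x_test = pd.DataFrame({
    'Occupation': ['driver', 'pilot', 'teacher'],
    'expenditure': [np.nan, np.nan, 90.0]})

# All statistics come from the training data only.
group_mean = x_train.groupby('Occupation')['expenditure'].mean()
overall_mean = x_train['expenditure'].mean()

# Per-row fill value: the group mean where the group is known,
# otherwise the overall training mean.
fill_values = x_test['Occupation'].map(group_mean).fillna(overall_mean)
x_test['expenditure'] = x_test['expenditure'].fillna(fill_values)
print(x_test)
```

Existing non-missing values in x_test are left untouched; only the NaN entries are replaced.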

Edit:

For multiple columns, use DataFrame.join instead of map:

x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
    'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'col':list('aabbddeehh')})
    
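# numeric_only=True skips the string column 'col'; pandas 2.0+ raises a TypeError without it
mean = x_train.groupby('Occupation').mean(numeric_only=True)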
print (mean)
                salary  expenditure  expenditure1
Occupation                                       
driver      113.333333         30.0          30.0
mechanic     90.000000         25.0          25.0
teacher     300.000000        100.0         100.0
unemployed   10.000000          0.0           0.0

x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
   Occupation  salary  expenditure  expenditure1 col
0      driver     100         20.0          20.0   a
1      driver     150         40.0          40.0   a
2    mechanic      70         10.0          10.0   b
3     teacher     300        100.0         100.0   b
4    mechanic      90         25.0          25.0   d
5     teacher     250         80.0          80.0   d
6  unemployed      10          0.0           0.0   e
7      driver      90         30.0          30.0   e
8    mechanic     110         40.0          40.0   h
9     teacher     350        120.0         120.0   h

x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))
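
The join approach can be wrapped in two small helpers so that the means are computed once from the training data and then applied to any other frame. This is only a sketch: the function names fit_group_means and apply_group_means, and the tiny example frames, are hypothetical and not from the original answer.

```python
import numpy as np
import pandas as pd

def fit_group_means(train, key, cols):
    """Per-group means of `cols`, computed from the training data only."""
    return train.groupby(key)[cols].mean()

def apply_group_means(df, key, means):
    """Fill NaNs in df with the stored group means, matched on `key`."""
    # Same row index as df, one fill value per row and column.
    fill = df[[key]].join(means, on=key)
    return df.fillna(fill)

train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'mechanic'],
    'expenditure': [20, 40, 10, np.nan],
    'expenditure1': [5, np.nan, 15, 25]})

# Hypothetical test frame for illustration.
test = pd.DataFrame({
    'Occupation': ['mechanic', 'driver'],
    'expenditure': [np.nan, 50.0],
    'expenditure1': [np.nan, np.nan]})

means = fit_group_means(train, 'Occupation', ['expenditure', 'expenditure1'])
train_filled = apply_group_means(train, 'Occupation', means)
test_filled = apply_group_means(test, 'Occupation', means)
print(test_filled)
```

Passing an explicit column list keeps non-numeric columns out of the groupby and makes it clear which columns are imputed.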

Fill NaN values in pandas using train data statistics
