我将解释我的问题陈述:
假设我有训练数据和测试数据。 对于训练和测试,我在同一列中有NaN值。现在,我对南插补的策略是: 对某列进行分组,并用该组的平均值填充nan。示例:
x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 NaN
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 NaN
8 mechanic 110 40.0
9 teacher 350 120.0
对于火车数据,我可以这样:
x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x:x.fillna(x.mean())
但是我该如何对测试数据执行类似的操作。平均值就是训练组的平均值。 我正在尝试使用for循环来完成此操作,但它要花很多时间。
答案 0 :(得分:1)
创建mean
至Series
:
mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver 30.0
mechanic 25.0
teacher 100.0
unemployed 0.0
Name: expenditure, dtype: float64
然后用Series.map
和Series.fillna
替换缺失值:
x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 25.0
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 30.0
8 mechanic 110 40.0
9 teacher 350 120.0
以同样的方式处理test
数据:
x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
编辑:
多列解决方案-map
使用DataFrame.join
:
x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'col':list('aabbddeehh')})
mean = x_train.groupby('Occupation').mean()
print (mean)
salary expenditure expenditure1
Occupation
driver 113.333333 30.0 30.0
mechanic 90.000000 25.0 25.0
teacher 300.000000 100.0 100.0
unemployed 10.000000 0.0 0.0
x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
Occupation salary expenditure expenditure1 col
0 driver 100 20.0 20.0 a
1 driver 150 40.0 40.0 a
2 mechanic 70 10.0 10.0 b
3 teacher 300 100.0 100.0 b
4 mechanic 90 25.0 25.0 d
5 teacher 250 80.0 80.0 d
6 unemployed 10 0.0 0.0 e
7 driver 90 30.0 30.0 e
8 mechanic 110 40.0 40.0 h
9 teacher 350 120.0 120.0 h
x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))