Question

I have a dataset with a column types whose values are Primary and Secondary.

df

   ID    types        C   D
0  101   Primary      2   3
1  103   Primary      6   3
2  108   Primary     10   ?
3  109   Primary      3  12
4  118   Secondary    5   2
5  122   Secondary    ?   6
6  123   Secondary    5   6
7  125   Secondary    2   5

I want to replace the missing values with the median of each type, like this:

result_df

   ID    types        C   D
0  101   Primary      2   3
1  103   Primary      6   3
2  108   Primary     10   3
3  109   Primary      3  12
4  118   Secondary    5   2
5  122   Secondary    5   6
6  123   Secondary    5   6
7  125   Secondary    2   5

How can I do this in Python?

Answer 1

Something like this should work:

First, replace the ? values in df with np.nan:

In [1268]: df = df.replace('?',np.nan)
In [1273]: df
Out[1273]: 
    ID      types    C    D
0  101    Primary    2    3
1  103    Primary    6    3
2  108    Primary   10  NaN
3  109    Primary    3   12
4  118  Secondary    5    2
5  122  Secondary  NaN    6
6  123  Secondary    5    6
7  125  Secondary    2    5

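For me, the dtypes of columns C and D showed up as object, so I converted them to numeric before computing the median. If that does not apply to you, skip this step and go straight to the transform commands further down.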

In [1274]: df.dtypes
Out[1274]: 
ID        int64
types    object
C        object
D        object
dtype: object

To find the median, convert columns C and D to a pandas numeric type:

In [1256]: df.C = pd.to_numeric(df.C)
In [1258]: df.D = pd.to_numeric(df.D)

In [1279]: df.dtypes
Out[1279]: 
ID         int64
types     object
C        float64
D        float64
dtype: object
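
As a possible shortcut, and assuming the missing entries really are the literal string '?' as in the question, pd.to_numeric with errors='coerce' turns anything it cannot parse into NaN, so the replace and the conversion can be done in one step. The sample frame below is rebuilt from the question's table purely for illustration. Note that coercion would also silently blank out any other malformed value, which may or may not be what you want.

```python
import pandas as pd

# Sample data rebuilt from the question; '?' marks a missing value.
df = pd.DataFrame({
    "ID": [101, 103, 108, 109, 118, 122, 123, 125],
    "types": ["Primary"] * 4 + ["Secondary"] * 4,
    "C": ["2", "6", "10", "3", "5", "?", "5", "2"],
    "D": ["3", "3", "?", "12", "2", "6", "6", "5"],
})

# errors="coerce" converts every value that cannot be parsed as a number to NaN.
for col in ["C", "D"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```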

Now you can use groupby and transform to fill the null values in columns C and D with the median of each type, as follows:

In [1265]: df.C = df.C.fillna(df.groupby('types')['C'].transform('median'))

In [1266]: df.D = df.D.fillna(df.groupby('types')['D'].transform('median'))

In [1267]: df
Out[1267]: 
    ID      types     C     D
0  101    Primary   2.0   3.0
1  103    Primary   6.0   3.0
2  108    Primary  10.0   3.0
3  109    Primary   3.0  12.0
4  118  Secondary   5.0   2.0
5  122  Secondary   5.0   6.0
6  123  Secondary   5.0   6.0
7  125  Secondary   2.0   5.0
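
If there are more than a couple of columns to fill, the same idea can probably be applied to all of them at once. This is only a sketch: the frame is rebuilt from the question with NaN already in place, and cols and medians are names I made up for the example. transform('median') returns a frame with the same shape as the selected columns, so fillna can align it row by row.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question, with NaN for the missing values.
df = pd.DataFrame({
    "ID": [101, 103, 108, 109, 118, 122, 123, 125],
    "types": ["Primary"] * 4 + ["Secondary"] * 4,
    "C": [2, 6, 10, 3, 5, np.nan, 5, 2],
    "D": [3, 3, np.nan, 12, 2, 6, 6, 5],
})

cols = ["C", "D"]  # columns to impute (illustrative name)

# Per-group medians, broadcast back to one value per original row.
medians = df.groupby("types")[cols].transform("median")

# fillna aligns on index and column labels, so only the NaN cells change.
df[cols] = df[cols].fillna(medians)

print(df)
```

A group that contains only NaN in some column has no median, so those cells would stay NaN.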

Let me know if this helps.

Answer 2

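As @Mayank Porwal mentioned, first convert the missing values to np.nan; you can then apply the imputation with sklearn's imputer.

SimpleImputer

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Mark the '?' placeholders as missing values
df.replace('?', np.nan, inplace=True)

# Fit a separate median imputer on each group and write the result back
for types, group in df.groupby('types'):
    imp = SimpleImputer(missing_values=np.nan, strategy='median')
    df.loc[df['types'] == types, ['C', 'D']] = imp.fit_transform(group[['C', 'D']])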
