How do I handle missing data with respect to the type of the dataset?

Time: 2018-12-03 08:23:30

Tags: python pandas dataframe

I have a dataset in which the column types takes the values Primary and Secondary.

  

df

   ID    types        C   D
0  101   Primary      2   3
1  103   Primary      6   3
2  108   Primary     10   ?
3  109   Primary      3  12
4  118   Secondary    5   2
5  122   Secondary    ?   6
6  123   Secondary    5   6
7  125   Secondary    2   5 

I want to replace each missing value with the median of its type. For example:

  

result_df

   ID    types        C   D
0  101   Primary      2   3
1  103   Primary      6   3
2  108   Primary     10   3
3  109   Primary      3  12
4  118   Secondary    5   2
5  122   Secondary    5   6
6  123   Secondary    5   6
7  125   Secondary    2   5 

How can I do this in Python?

2 Answers:

Answer 0: (score: 2)

Something like this should work:

First, replace the ? values in df with actual np.nan:

In [1268]: df = df.replace('?',np.nan)
In [1273]: df
Out[1273]: 
    ID      types    C    D
0  101    Primary    2    3
1  103    Primary    6    3
2  108    Primary   10  NaN
3  109    Primary    3   12
4  118  Secondary    5    2
5  122  Secondary  NaN    6
6  123  Secondary    5    6
7  125  Secondary    2    5
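If the data is loaded from a CSV file (an assumption; the question does not say where df comes from), the ? marker can also be declared as missing at read time through the na_values parameter of pd.read_csv. A minimal sketch, with an in-memory string standing in for the hypothetical file:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the file the data was loaded from.
csv_text = """ID,types,C,D
101,Primary,2,3
103,Primary,6,3
108,Primary,10,?
109,Primary,3,12
118,Secondary,5,2
122,Secondary,?,6
123,Secondary,5,6
125,Secondary,2,5
"""

# na_values="?" makes the parser treat '?' as NaN, so C and D are read
# as numeric columns instead of object columns.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(df.dtypes)
```

With the marker handled by the parser, neither the replace step nor a later numeric conversion should be needed.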
  

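For me, the dtypes of columns C and D show up as object, so I convert them to numeric before finding the median. If that is not the case for you, skip this step and go straight to the transform commands below.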

In [1274]: df.dtypes
Out[1274]: 
ID        int64
types    object
C        object
D        object
dtype: object

To find the median, convert columns C and D to a pandas numeric type:

In [1256]: df.C = pd.to_numeric(df.C)
In [1258]: df.D = pd.to_numeric(df.D)

In [1279]: df.dtypes
Out[1279]: 
ID         int64
types     object
C        float64
D        float64
dtype: object
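As a possible shortcut, pd.to_numeric accepts errors="coerce", which turns every value it cannot parse into NaN. The sketch below rebuilds the question's frame by hand (the string values are an assumption about how the raw data looks) and does the ? replacement and the conversion in one step:

```python
import pandas as pd

# Hand-built copy of the question's frame; '?' marks the missing values.
df = pd.DataFrame({
    "ID": [101, 103, 108, 109, 118, 122, 123, 125],
    "types": ["Primary"] * 4 + ["Secondary"] * 4,
    "C": ["2", "6", "10", "3", "5", "?", "5", "2"],
    "D": ["3", "3", "?", "12", "2", "6", "6", "5"],
})

# errors="coerce" maps anything unparseable (here only '?') to NaN,
# so a separate replace('?', np.nan) is not required.
for col in ["C", "D"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

Note that coerce silently converts any other malformed entry to NaN as well, so it is worth checking the raw values first if the data may contain other stray tokens.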

Now you can use groupby with the transform function to fill the null values in columns C and D with the median of each type, as follows:

In [1265]: df.C = df.C.fillna(df.groupby('types')['C'].transform('median'))

In [1266]: df.D = df.D.fillna(df.groupby('types')['D'].transform('median'))

In [1267]: df
Out[1267]: 
    ID      types     C     D
0  101    Primary   2.0   3.0
1  103    Primary   6.0   3.0
2  108    Primary  10.0   3.0
3  109    Primary   3.0  12.0
4  118  Secondary   5.0   2.0
5  122  Secondary   5.0   6.0
6  123  Secondary   5.0   6.0
7  125  Secondary   2.0   5.0
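The same fill can be written once for several columns, since transform also works on a column selection and fillna aligns a DataFrame argument by index and column label. A self-contained sketch, with the frame rebuilt by hand and the cols list introduced only for illustration:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the frame after '?' has become NaN and C, D are numeric.
df = pd.DataFrame({
    "ID": [101, 103, 108, 109, 118, 122, 123, 125],
    "types": ["Primary"] * 4 + ["Secondary"] * 4,
    "C": [2, 6, 10, 3, 5, np.nan, 5, 2],
    "D": [3, 3, np.nan, 12, 2, 6, 6, 5],
})

cols = ["C", "D"]  # columns to impute (illustrative name)

# transform('median') returns a frame shaped like df[cols], where each row
# holds the median of that row's group; NaN values are skipped.
group_medians = df.groupby("types")[cols].transform("median")

# fillna only touches the cells that are missing.
df[cols] = df[cols].fillna(group_medians)
print(df)
```

If every value of a column is missing within one group, that group's median is itself NaN and those cells stay empty, so a fallback such as the overall column median may be needed in that case.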

Let me know if this helps.

Answer 1: (score: 1)

As @Mayank Porwal mentioned, first convert the missing values to np.nan; you can then apply imputation with sklearn's impute module.

SimpleImputer

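import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Mark '?' as missing, then make C and D numeric so a median can be computed
df.replace('?', np.nan, inplace=True)
df[['C', 'D']] = df[['C', 'D']].apply(pd.to_numeric)

# Fit a separate median imputer on each group and write the result back
for types, group in df.groupby('types'):
    imp = SimpleImputer(missing_values=np.nan, strategy='median')
    df.loc[df['types'] == types, ['C', 'D']] = imp.fit_transform(group[['C', 'D']])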