How do I fill null values in a dataset with Python, matching on two other columns?

Time: 2017-06-16 09:58:14

Tags: python pandas machine-learning scikit-learn missing-data

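I have a huge dataset. The attributes I am working with are: 1. Age 2. Embarked (the port the passenger boarded from; there are 3 ports: S, Q and C) 3. Survived (0 means did not survive, 1 means survived)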

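I am filtering out the useless data. Then I need to fill in the null values in Age. So I counted how many passengers survived and how many did not survive for each port of embarkation, i.e. S, Q and C.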

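I found the mean age of the passengers who survived and of those who did not survive for each of the ports S, Q and C. But now I don't know how to fill these 6 values into the original titanic Age column (3 for survivors from S, Q and C, and 3 for non-survivors from S, Q and C, so 6 in total). If I just do titanic.Age.fillna('one of the six values'), it fills every null value in Age with that single value, which is not what I want.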

After a while, I tried this.

titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)

This shows no error, but it still doesn't work. Any idea what I should do?

1 Answer:

Answer 0: (score: 0)

I think you need groupby with apply, fillna and mean:

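# group_keys=False keeps the original row index so the result aligns on assignment
titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                         .apply(lambda x: x.fillna(x.mean())))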
import seaborn as sns

titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
    survived  pclass     sex  age  sibsp  parch      fare embarked   class  \
5          0       3    male  NaN      0      0    8.4583        Q   Third   
17         1       2    male  NaN      0      0   13.0000        S  Second   
19         1       3  female  NaN      0      0    7.2250        C   Third   
26         0       3    male  NaN      0      0    7.2250        C   Third   
28         1       3  female  NaN      0      0    7.8792        Q   Third   
29         0       3    male  NaN      0      0    7.8958        S   Third   
31         1       1  female  NaN      1      0  146.5208        C   First   
32         1       3  female  NaN      0      0    7.7500        Q   Third   
36         1       3    male  NaN      0      0    7.2292        C   Third   
42         0       3    male  NaN      0      0    7.8958        C   Third   

      who  adult_male deck  embark_town alive  alone  
5     man        True  NaN   Queenstown    no   True  
17    man        True  NaN  Southampton   yes   True  
19  woman       False  NaN    Cherbourg   yes   True  
26    man        True  NaN    Cherbourg    no   True  
28  woman       False  NaN   Queenstown   yes   True  
29    man        True  NaN  Southampton    no   True  
31  woman       False    B    Cherbourg   yes  False  
32  woman       False  NaN   Queenstown   yes   True  
36    man        True  NaN    Cherbourg   yes   True  
42    man        True  NaN    Cherbourg    no   True 
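# remember which rows had a missing age before filling
idx = titanic[titanic['age'].isnull()].index
# group_keys=False keeps the original row index so the result aligns on assignment
titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                         .apply(lambda x: x.fillna(x.mean())))

#check that the values were replaced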
print (titanic.loc[idx].head(10))
    survived  pclass     sex        age  sibsp  parch      fare embarked  \
5          0       3    male  30.325000      0      0    8.4583        Q   
17         1       2    male  28.113184      0      0   13.0000        S   
19         1       3  female  28.973671      0      0    7.2250        C   
26         0       3    male  33.666667      0      0    7.2250        C   
28         1       3  female  22.500000      0      0    7.8792        Q   
29         0       3    male  30.203966      0      0    7.8958        S   
31         1       1  female  28.973671      1      0  146.5208        C   
32         1       3  female  22.500000      0      0    7.7500        Q   
36         1       3    male  28.973671      0      0    7.2292        C   
42         0       3    male  33.666667      0      0    7.8958        C   

     class    who  adult_male deck  embark_town alive  alone  
5    Third    man        True  NaN   Queenstown    no   True  
17  Second    man        True  NaN  Southampton   yes   True  
19   Third  woman       False  NaN    Cherbourg   yes   True  
26   Third    man        True  NaN    Cherbourg    no   True  
28   Third  woman       False  NaN   Queenstown   yes   True  
29   Third    man        True  NaN  Southampton    no   True  
31   First  woman       False    B    Cherbourg   yes  False  
32   Third  woman       False  NaN   Queenstown   yes   True  
36   Third    man        True  NaN    Cherbourg   yes   True  
42   Third    man        True  NaN    Cherbourg    no   True  
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived  embarked
0         C           33.666667
          Q           30.325000
          S           30.203966
1         C           28.973671
          Q           22.500000
          S           28.113184
Name: age, dtype: float64
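
As for why the attempt in the question has no visible effect: boolean indexing such as titanic[titanic.Survived==1] returns a new object, so fillna(..., inplace=True) most likely modifies that temporary copy and never touches titanic itself. Below is a minimal sketch of an alternative that avoids the six separate statements. The small DataFrame and its values are made up purely for illustration; only the column names follow the seaborn sample above. It computes each row's group mean with transform and uses it to fill only the missing ages.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
df = pd.DataFrame({
    "survived": [1, 1, 1, 0, 0, 0],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
})

# transform("mean") returns one value per row: the mean age of that row's
# (survived, embarked) group, aligned with the original index.
group_mean = df.groupby(["survived", "embarked"])["age"].transform("mean")

# fillna with a Series fills each missing age from the matching index label,
# and assigning the result back actually updates df.
df["age"] = df["age"].fillna(group_mean)

print(df)
```

Because fillna only touches rows where age is missing, ages that were already present are left unchanged.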