我有一个巨大的数据集。它有属性,我正在努力工作 1.Age 2.Embark(乘客从哪个港口出发..共有3个港口...... S,Q和C) 3.Survived(0表示未存活,1表示存活)
我正在过滤无用的数据。然后我需要填写Age中出现的Null值。所以我算了一下有多少乘客幸存下来并且在每次登船时都没有幸存,即S,Q和C
我发现从每个S,Q和C口岸出发后幸存下来并且没有幸存的乘客的平均年龄。但是现在我不知道如何在原来的泰坦时代专栏中填写这6个值(3个从S,Q和C中幸存下来,3个从未从S,Q和C中幸存下来......所以总共6个) 。如果我只做titanic.Age.fillna('使用六个值中的一个')它将填充Age的所有Null值和我不想要的那个值。
经过一段时间后,我尝试了这个。
titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
这显示没有错误,但仍然无法正常工作。知道我该怎么办?
答案 0 :(得分:0)
我认为您需要groupby
apply
fillna
mean
{/ 3>}
titanic['age'] = titanic.groupby(['survived','embarked'])['age']
.apply(lambda x: x.fillna(x.mean()))
import seaborn as sns
titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
survived pclass sex age sibsp parch fare embarked class \
5 0 3 male NaN 0 0 8.4583 Q Third
17 1 2 male NaN 0 0 13.0000 S Second
19 1 3 female NaN 0 0 7.2250 C Third
26 0 3 male NaN 0 0 7.2250 C Third
28 1 3 female NaN 0 0 7.8792 Q Third
29 0 3 male NaN 0 0 7.8958 S Third
31 1 1 female NaN 1 0 146.5208 C First
32 1 3 female NaN 0 0 7.7500 Q Third
36 1 3 male NaN 0 0 7.2292 C Third
42 0 3 male NaN 0 0 7.8958 C Third
who adult_male deck embark_town alive alone
5 man True NaN Queenstown no True
17 man True NaN Southampton yes True
19 woman False NaN Cherbourg yes True
26 man True NaN Cherbourg no True
28 woman False NaN Queenstown yes True
29 man True NaN Southampton no True
31 woman False B Cherbourg yes False
32 woman False NaN Queenstown yes True
36 man True NaN Cherbourg yes True
42 man True NaN Cherbourg no True
idx = titanic[titanic['age'].isnull()].index
titanic['age'] = titanic.groupby(['survived','embarked'])['age']
.apply(lambda x: x.fillna(x.mean()))
#check if values was replaced
print (titanic.loc[idx].head(10))
survived pclass sex age sibsp parch fare embarked \
5 0 3 male 30.325000 0 0 8.4583 Q
17 1 2 male 28.113184 0 0 13.0000 S
19 1 3 female 28.973671 0 0 7.2250 C
26 0 3 male 33.666667 0 0 7.2250 C
28 1 3 female 22.500000 0 0 7.8792 Q
29 0 3 male 30.203966 0 0 7.8958 S
31 1 1 female 28.973671 1 0 146.5208 C
32 1 3 female 22.500000 0 0 7.7500 Q
36 1 3 male 28.973671 0 0 7.2292 C
42 0 3 male 33.666667 0 0 7.8958 C
class who adult_male deck embark_town alive alone
5 Third man True NaN Queenstown no True
17 Second man True NaN Southampton yes True
19 Third woman False NaN Cherbourg yes True
26 Third man True NaN Cherbourg no True
28 Third woman False NaN Queenstown yes True
29 Third man True NaN Southampton no True
31 First woman False B Cherbourg yes False
32 Third woman False NaN Queenstown yes True
36 Third man True NaN Cherbourg yes True
42 Third man True NaN Cherbourg no True
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived embarked
0 C 33.666667
Q 30.325000
S 30.203966
1 C 28.973671
Q 22.500000
S 28.113184
Name: age, dtype: float64