Question

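I am new to Pandas and NumPy. I am trying to solve the Kaggle Titanic dataset. I have to fix two columns, "Age" and "Embarked", because they contain NaN.

I tried fillna without any success, and soon discovered that I was missing inplace = True.

So I added it. The first imputation succeeded, but the second one did not. I searched on SO and Google but did not find anything useful. Please help me.

Here is the code I am trying.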

# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)

# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)

# count the values that are still missing in each column
print(titanic_df["Age"][titanic_df["Age"].isnull()].size)
print(titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size)

The output I get is

0
2

However, I managed to get what I wanted without using inplace=True:

titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())

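But I am curious what is going on with the second usage of inplace=True.

If I am asking something very silly, please bear with me; I am completely new and may be missing small things. Any help is appreciated. Thanks in advance.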

Answer 1

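pd.Series.mode returns a Series.

A variable has a single arithmetic mean and a single median, but it may have several modes. If more than one value has the highest frequency, there are multiple modes.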
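As a small illustration of the point above, here is a sketch on made-up numbers (the Series `values` is hypothetical, not taken from the Titanic data). It shows that mean and median come back as single numbers while mode comes back as a Series that can hold several entries.

```python
import pandas as pd

# Hypothetical toy data: 1 and 2 both occur twice, so there are two modes.
values = pd.Series([1, 1, 2, 2, 3])

mean_value = values.mean()      # a single float
median_value = values.median()  # a single float
modes = values.mode()           # a Series, here with two entries

print(type(modes).__name__)
print(modes.tolist())
```

Because the result is always a Series, even a column with a single mode yields a one-element Series rather than a bare value.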

pandas operates on labels.

titanic_df.mean()
Out: 
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

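If I were to use titanic_df.fillna(titanic_df.mean()), it would return a new DataFrame in which the column PassengerId is filled with 446.0, the column Survived is filled with 0.38, and so on.

However, if I call the mean method on a Series, the return value is a float: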

titanic_df['Age'].mean()
Out: 29.69911764705882

There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()), all the missing values in all the columns will be filled with 29.699.
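The difference between filling with a labelled Series and filling with a bare float can be sketched on a tiny frame. The frame `df` and its numbers below are hypothetical stand-ins for titanic_df, and `numeric_only=True` is passed on the assumption that recent pandas versions need it when non-numeric columns are present.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_df with two numeric columns.
df = pd.DataFrame({"Age": [20.0, np.nan, 40.0],
                   "Fare": [10.0, 30.0, np.nan]})

col_means = df.mean(numeric_only=True)   # Series labelled "Age" and "Fare"
by_label = df.fillna(col_means)          # each column gets its own mean
by_scalar = df.fillna(df["Age"].mean())  # every NaN gets the Age mean

print(by_label)
print(by_scalar)
```

In `by_label` the missing Fare becomes the Fare mean, while in `by_scalar` it becomes the Age mean, because a float carries no label to match against.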

Why the first attempt was not successful

You tried to fill the entire DataFrame, titanic_df, with titanic_df["Embarked"].mode(). Let's check the output first:

titanic_df["Embarked"].mode()
Out: 
0    S
dtype: object

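It is a Series with a single element. The index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value. Here, we only have one label, so values are filled only if there is a column named 0. Try adding titanic_df[0] = np.nan and executing your code again. You will see that the new column is filled with S.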
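The label matching described above can be sketched on a toy frame (the frame `df` and the Series `col` are hypothetical). The last two lines go one step beyond the answer: to my understanding, when fillna is called on a single column with a Series argument, the labels are matched against the row index instead of the column names, which would also leave most missing rows untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df.
df = pd.DataFrame({"Embarked": ["S", np.nan, "C", "S"]})
mode = df["Embarked"].mode()             # one-element Series, label 0 -> "S"

# DataFrame.fillna matches the Series labels against column names.
unchanged = df.fillna(mode)              # no column is named 0
remaining = int(unchanged["Embarked"].isnull().sum())

# Add an all-missing column named 0 (object dtype so it can hold strings).
df[0] = pd.Series([np.nan] * len(df), dtype=object)
with_zero = df.fillna(mode)              # only column 0 gets filled

# Series.fillna matches the labels against the row index instead.
col = pd.Series([np.nan, "S", "S", np.nan])
row_filled = col.fillna(col.mode())      # only the row labelled 0 is filled
```

In both cases the one-element Series only reaches whatever carries the label 0, so the missing Embarked values elsewhere stay missing.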

Why the second attempt was (not) successful

The right-hand side of the assignment, titanic_df.fillna(titanic_df["Embarked"].mode()), returns a new DataFrame. In this new DataFrame, the Embarked column still has NaNs:

titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2

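However, you did not assign it back to the entire DataFrame. You assigned this DataFrame to a Series, titanic_df['Embarked']. It did not actually fill the missing values in the Embarked column; it just used the index values of the DataFrame. If you check the new column, you will see the numbers 1, 2, ... instead of S, C and Q.

What you should do instead

You are trying to fill a single column with a single value. First, disassociate that value from its label: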

titanic_df['Embarked'].mode()[0]
Out: 'S'

Now it does not matter whether you use inplace=True or assign the result back. Both

titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

and

titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)

will fill the missing values in the Embarked column with S.
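A minimal end-to-end sketch of the fix, on a made-up six-row frame that reuses the name titanic_df for readability (the data is hypothetical). It uses the assignment form, on the assumption that recent pandas releases may warn when an inplace method is called on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Embarked column with two missing values.
titanic_df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# Take the first mode as a plain scalar so that no label is attached.
most_common = titanic_df["Embarked"].mode()[0]

# Fill the single column with the scalar and assign the result back.
titanic_df["Embarked"] = titanic_df["Embarked"].fillna(most_common)

print(titanic_df["Embarked"].isnull().sum())
```

Since `most_common` is an ordinary string, fillna applies it to every missing row rather than trying to align labels.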

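Of course, this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example, randomly select one of the values if there are multiple modes).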
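One possible way to pick randomly among several modes is sketched below. The Series `embarked` is hypothetical, and the fixed `random_state` is only an assumption made here to keep the run reproducible; this is one option, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which "S" and "C" are tied for most frequent.
embarked = pd.Series(["S", "C", np.nan, "S", "C", "Q"])

modes = embarked.mode()  # Series holding both tied values

# Draw one of the modes at random and unwrap it to a scalar.
fill_value = modes.sample(n=1, random_state=0).iloc[0]

filled = embarked.fillna(fill_value)
print(fill_value, int(filled.isnull().sum()))
```

Dropping `random_state` gives a different draw on each run, which may or may not be desirable for a reproducible analysis.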
