Updating column values in a Pandas DataFrame the right way

Time: 2019-12-24 12:01:49

Tags: python pandas kaggle

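I'm working through the common introductory competition on Kaggle and realized that adding age to the classifier would help. The problem is that the Age column contains NaN values, and I don't want to fill every NaN in the whole df, only those in the Age column. I applied the solution below (taking the median), then targeted the column and updated it, e.g. X_train['Age'] = X_train['Age'].fillna(X_train_median)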

I know this isn't good practice. It works, but it produces the following warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Is there a better way to update a specific column for all values in a df that match a particular condition? Sample code below.

import pandas as pd
from sklearn.model_selection import train_test_split

# IMPORT DATA
train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

# ASSIGN TO VAR
X_test = test_data
X = train_data
y = train_data["Survived"]

# SPLIT 
X_train, X_val, Y_train, Y_val = train_test_split(X, y, random_state=1)

# SELECTED FEATURES 
features = ["Pclass", "Sex", "SibSp", "Parch", "Embarked", "Age"]


# REMOVE NA's BY POPULATING WITH MEDIAN VAL
X_train_median = X_train['Age'].median()
X_val_median = X_val['Age'].median()
X_test_median = X_test['Age'].median()

X_train['Age'] = X_train['Age'].fillna(X_train_median)
X_val['Age'] = X_val['Age'].fillna(X_val_median)
X_test['Age'] = X_test['Age'].fillna(X_test_median)


# ONE HOT FOR CATEGORICAL VALS
X_train = pd.get_dummies(X_train[features])
X_val = pd.get_dummies(X_val[features])
X_test = pd.get_dummies(X_test[features])

2 Answers:

Answer 0 (score: 1)

I think this should work:

X_train['Age'] = X_train.loc[:, 'Age'].fillna(X_train_median)
X_val['Age'] = X_val.loc[:, 'Age'].fillna(X_val_median)
X_test['Age'] = X_test.loc[:, 'Age'].fillna(X_test_median)

Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
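If the warning still shows up, a likely cause is that X_train is a row subset of another DataFrame (as the output of train_test_split is), and pandas cannot tell whether the assignment is meant to reach the original. One way around this is to take an explicit copy before assigning. This is only a minimal sketch: the toy data and the iloc slice standing in for the real split are assumptions, not the Titanic files.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic training set.
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1],
    "Age": [22.0, None, 35.0, None, 54.0],
})

# A row subset, similar in spirit to what train_test_split returns.
# .copy() makes it an independent DataFrame, so later assignments
# are unambiguous and do not touch df.
X_train = df.iloc[:4].copy()

# Median of the non-missing ages in the subset.
X_train_median = X_train["Age"].median()

# Fill only the Age column; other columns are left alone.
X_train["Age"] = X_train["Age"].fillna(X_train_median)
```

The same pattern would apply to X_val and X_test. The original df keeps its NaN values, since only the copy is modified.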

Answer 1 (score: 1)

You can try this, where X is some DataFrame:

X = X.assign(Age=X['Age'].fillna(value=X_median))
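To make that concrete, here is a small self-contained sketch of the assign approach. The toy frame and the names filled and filled_via_dict are made up for illustration. It also shows DataFrame.fillna with a dict, which restricts the fill to the named column and seems to fit the "only the Age column" requirement as well.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the question.
X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [30.0, None, 40.0, None],
    "Embarked": ["S", None, "C", "Q"],
})

# Median of the non-missing ages.
X_median = X["Age"].median()

# assign returns a new DataFrame with the Age column replaced;
# X itself is not modified.
filled = X.assign(Age=X["Age"].fillna(X_median))

# Alternative: a dict passed to fillna limits filling to the Age column,
# so the missing value in Embarked is left as it is.
filled_via_dict = X.fillna({"Age": X_median})
```

Because both calls return a new object instead of writing into a slice, there is no chained assignment for pandas to warn about.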