feature_names mismatch error in xgboost despite identical columns

Time: 2018-09-30 12:46:14

Tags: python xgboost

I set up the training data (X) and the test data (test_data_process) to have the same columns in the same order (the original post showed this in a screenshot).

But when I run

predictions = my_model.predict(test_data_process)    

I get the following error:

  

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34'] ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'YrMoSold']
expected f22, f25, f0, f34, f32, f5, f20, f3, f33, f15, f24, f31, f28, f9, f8, f19, f14, f18, f17, f2, f13, f4, f27, f16, f1, f29, f11, f26, f10, f7, f21, f30, f23, f6, f12 in input data
training data did not have the following fields: OpenPorchSF, BsmtFinSF1, LotFrontage, GrLivArea, YrMoSold, FullBath, TotRmsAbvGrd, GarageCars, YearRemodAdd, BedroomAbvGr, PoolArea, KitchenAbvGr, LotArea, HalfBath, MiscVal, MSSubClass, [...], ScreenPorch, 3SsnPorch, TotalBsmtSF, GarageYrBlt, MasVnrArea, OverallQual, Fireplaces, WoodDeckSF, 2ndFlrSF, BsmtFinSF2, BsmtHalfBath, LowQualFinSF, OverallCond, GarageArea

So it complains that the training data (X) does not have these fields, even though it does.

How can I fix this?

[Update]:

My code:

X = data.select_dtypes(exclude=['object']).drop(columns=['Id'])
X['YrMoSold'] = X['YrSold'] * 12 + X['MoSold']
X = X.drop(columns=['YrSold', 'MoSold', 'SalePrice'])
X = X.fillna(0.0000001)

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.2)

my_model = XGBRegressor(n_estimators=100, learning_rate=0.05, booster='gbtree')
my_model.fit(train_X, train_y, early_stopping_rounds=5, 
    eval_set=[(val_X, val_y)], verbose=False)

test_data_process = test_data.select_dtypes(exclude=['object']).drop(columns=['Id'])
test_data_process['YrMoSold'] = test_data_process['YrSold'] * 12 + test_data['MoSold']
test_data_process = test_data_process.drop(columns=['YrSold', 'MoSold'])
test_data_process = test_data_process.fillna(0.0000001)
test_data_process = test_data_process[X.columns]

predictions = my_model.predict(test_data_process)    

1 Answer:

Answer 0 (score: 3)

It's an honest mistake.

When you feed in the training data, you are using a NumPy array:

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.2)

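(X.values is a np.array), so no column names are defined, and the model ends up with the placeholder names f0, f1, ..., f34 that appear in the error message.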
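To make the mismatch concrete, here is a simplified, standard-library-only sketch of this kind of name check. It is not XGBoost's actual implementation: default_feature_names and check_feature_names are hypothetical helpers, and the three column names are just a small sample taken from the error above.

```python
def default_feature_names(n_features):
    """Placeholder names used when the training input carries no column names."""
    return [f"f{i}" for i in range(n_features)]


def check_feature_names(trained_names, input_names):
    """Raise ValueError when the two name lists differ (hypothetical helper)."""
    if trained_names == input_names:
        return
    missing = sorted(set(trained_names) - set(input_names))
    unexpected = sorted(set(input_names) - set(trained_names))
    raise ValueError(
        "feature_names mismatch: "
        f"expected {missing} in input data; "
        f"training data did not have the following fields: {unexpected}"
    )


# Trained on a bare array: only placeholder names are known to the model.
trained = default_feature_names(3)

# Predicting with a DataFrame: the real column names are presented instead.
frame_columns = ["LotArea", "OverallQual", "YrMoSold"]

try:
    check_feature_names(trained, frame_columns)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)

# Predicting with a bare array as well: placeholders on both sides, no error.
check_feature_names(trained, default_feature_names(3))
```

The point of the sketch is that the columns can be identical in content and order and the check still fails, because one side only knows placeholder names.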

When you feed in the prediction dataset, you are using a DataFrame, which does carry column names.

You should use:

predictions = my_model.predict(test_data_process.values)  

(add .values)
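As a rough illustration of what .values changes, the sketch below uses only pandas and NumPy. The two tiny frames are made-up stand-ins for the question's X and test_data_process, not the real data. It shows that the array has no column names, and that selecting X.columns first keeps the test columns in training order before the names are dropped.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's X and test_data_process (values are made up).
X = pd.DataFrame({"LotArea": [8450, 9600], "YrMoSold": [24098, 24089]})
test_data_process = pd.DataFrame({"YrMoSold": [24126], "LotArea": [11622]})

# .values returns a plain NumPy array, which has no column names.
train_array = X.values
print(type(train_array).__name__)       # ndarray
print(hasattr(train_array, "columns"))  # False

# Reorder the test columns to match X, then drop the names the same way.
test_array = test_data_process[X.columns].values
print(test_array.tolist())              # [[11622, 24126]]
```

Because the names are gone after .values, column order is the only thing tying the test array to the training array, so the test_data_process[X.columns] step from the question is still worth keeping. Another option, not covered in the answer, would be to pass DataFrames to both fit and predict so that the names match on both sides.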