XGBRegressor prediction: feature mismatch

Time: 2018-09-19 05:17:34

Tags: python xgboost

I want to use XGBRegressor to predict some data, so I load the training data and the test data.

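import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

iowa_file_path = '../input/train.csv'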
test_data_path = '../input/test.csv'

data = pd.read_csv(iowa_file_path)
test_data = pd.read_csv(test_data_path)

Contents of data:

[image not available]

Contents of test_data:

[image not available]

Then I do some data cleaning:

data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size =0.25)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
val_X = my_imputer.transform(val_X)

my_model = XGBRegressor(n_estimators=100, learning_rate=0.1)
my_model.fit(train_X, train_y, early_stopping_rounds=None, 
    eval_set=[(val_X, val_y)], verbose=False)

test_data_process = test_data.select_dtypes(exclude=['object'])
predictions = my_model.predict(test_data_process)

But when I run the predict function, I get the following error message:

  
     

ValueError                                Traceback (most recent call last)
in <module>()
      1 test_data_process = test_data.select_dtypes(exclude=['object'])
----> 2 predictions = my_model.predict(test_data_process)

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/sklearn.py in predict(self, data, output_margin, ntree_limit, validate_features)
    395                                           output_margin=output_margin,
    396                                           ntree_limit=ntree_limit,
--> 397                                           validate_features=validate_features)
    398
    399     def apply(self, X, ntree_limit=0):

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
   1206
   1207         if validate_features:
-> 1208             self._validate_features(data)
   1209
   1210         length = c_bst_ulong()

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in _validate_features(self, data)
   1508
   1509                 raise ValueError(msg.format(self.feature_names,
-> 1510                                             data.feature_names))
   1511
   1512     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36'] ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
expected f9, f6, f14, f27, f18, f7, f8, f23, f17, f22, f35, f0, f28, f29, f20, f31, f36, f25, f11, f21, f12, f24, f34, f10, f5, f32, f15, f26, f30, f1, f2, f16, f19, f3, f4, f33, f13 in input data
training data did not have the following fields: BsmtUnfSF, 1stFlrSF, LowQualFinSF, MSSubClass, WoodDeckSF, GrLivArea, MiscVal, YearBuilt, BsmtFinSF1, Fireplaces, MoSold, BsmtHalfBath, GarageYrBlt, FullBath, PoolArea, ..., YrSold, ..., EnclosedPorch, ScreenPorch, GarageArea, BsmtFullBath, MasVnrArea, TotRmsAbvGrd, OverallCond, BedroomAbvGr, GarageCars, OpenPorchSF, YearRemodAdd, TotalBsmtSF, BsmtFinSF2, LotFrontage, 3SsnPorch, ...

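It complains that the features do not match and that the training data does not have these fields. But when I check the contents of data, those columns are there. How can I fix this?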

1 Answer:

Answer 0: (score: 0)

Just to close out this question:

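The problem is that SimpleImputer was applied to the training and validation data but not to the test data. The model was fit on NumPy arrays (X.values, then the imputer output), so it knows its features only as f0 to f36, whereas test_data_process is still a DataFrame with named columns. Running the test features through my_imputer.transform before calling predict gives an array with the same layout as the training data.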
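A minimal sketch of that fix, under stated assumptions: train_df and test_df are small synthetic frames standing in for the numeric columns of the real CSV files (they are not from the question), and the XGBRegressor call is left as a comment so the snippet runs with only pandas, NumPy and scikit-learn installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for the numeric columns of train.csv / test.csv.
train_df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})
test_df = pd.DataFrame({
    "LotArea": [11622.0, 14267.0],
    "LotFrontage": [80.0, np.nan],
})

# Fit the imputer on the training features only, as in the question.
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_df.values)

# The missing step: push the test features through the same fitted imputer.
# Selecting train_df.columns keeps the column order identical to training,
# and the result is a plain NumPy array, just like train_X.
test_X = my_imputer.transform(test_df[train_df.columns].values)

# With the fitted model from the question, prediction would then be:
# predictions = my_model.predict(test_X)
```

Because test_X is an unnamed array like the one the model was trained on, the feature-name check in predict should no longer see a mismatch, and any missing values in the test set are filled with the training means.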

A discussion of what causes this kind of error can be found here: https://github.com/dmlc/xgboost/issues/2334#issuecomment-333195491
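Once the model is trained on bare arrays, features can only be matched by position, so it may be worth checking that the test frame has the same columns in the same order before converting it. The helper below is a hypothetical sketch (check_same_columns is not part of pandas or xgboost), shown on made-up data:

```python
import pandas as pd

def check_same_columns(train_df, test_df):
    """Compare column layouts; return (missing, extra, same_order)."""
    train_cols = list(train_df.columns)
    test_cols = list(test_df.columns)
    missing = [c for c in train_cols if c not in test_cols]  # in train only
    extra = [c for c in test_cols if c not in train_cols]    # in test only
    same_order = train_cols == test_cols
    return missing, extra, same_order

# Made-up frames: same columns, but the test frame has a different order.
train_df = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "YrSold": [2008, 2007]})
test_df = pd.DataFrame({"Id": [1461], "YrSold": [2010], "LotArea": [11622]})

missing, extra, same_order = check_same_columns(train_df, test_df)

# If nothing is missing and only the order differs, reorder the test
# columns to match training before calling .values or the imputer.
if not missing and not same_order:
    test_df = test_df[train_df.columns]
```

If missing is non-empty, reordering is not enough and the preprocessing of the two files needs to be reconciled first.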