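I am trying to run a Python notebook (link). At the line below in In [446], where the author trains XGBoost, I get this error:
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment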
# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Here is a minimal test case:
import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
with open('train_store', 'rb') as f:
    train_store = pickle.load(f)
train_store.shape
predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day',
'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth',
'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen',
'PromoOpen']
y = np.log(train_store.Sales) # log transformation of Sales
X = train_store
# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.3, # 30% for the evaluation set
random_state = 42)
# base parameters
params = {
'booster': 'gbtree',
'objective': 'reg:linear', # regression task
'subsample': 0.8, # 80% of data to grow trees and prevent overfitting
'colsample_bytree': 0.85, # 85% of features used
'eta': 0.1,
'max_depth': 10,
'seed': 42} # for reproducible results
num_round = 60 # default 300
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
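Note that the minimal code calls `rmspe_xg`, which is defined in the notebook and not shown here. A rough sketch of what such a function might look like, assuming it computes the root mean square percentage error on values transformed back from the log scale and that no true sales value is zero (the notebook's actual definition may differ):

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error; assumes y_true has no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

def rmspe_xg(preds, dmatrix):
    """Custom metric for xgb.train: takes (predictions, DMatrix), returns (name, value)."""
    # The target was log-transformed before training, so undo that with exp.
    y_true = np.exp(dmatrix.get_label())
    y_pred = np.exp(preds)
    return "rmspe", rmspe(y_true, y_pred)
```

This function is unrelated to the dtype error; it is only needed so that the snippet runs on its own.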
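Answer 0 (score: 1)
I ran into exactly the same problem while working on the Rossmann sales prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. You can check the data types, as Mykhailo Lisovyi suggested, with
print(test_train.dtypes)
Here you need to replace test_train with X_train.
You might get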
DayOfWeek int64
Promo int64
StateHoliday int64
SchoolHoliday int64
StoreType object
Assortment object
CompetitionDistance float64
CompetitionOpenSinceMonth float64
CompetitionOpenSinceYear float64
Promo2 int64
Promo2SinceWeek float64
Promo2SinceYear float64
Year int64
Month int64
Day int64
The error is raised for the columns of type object. You can convert them with
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))
After these steps, everything should run smoothly.
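The same idea can be sketched with pandas alone, using category codes in place of `LabelEncoder`. The toy frame and its values below are invented for illustration; the sketch finds every column that is neither numeric nor boolean and replaces it with integer codes:

```python
import pandas as pd

# Toy stand-in for X_train[predictors]; the values are invented.
df = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "StoreType": ["a", "c", "a", "d"],
    "Assortment": ["a", "c", "c", "a"],
    "CompetitionDistance": [1270.0, 570.0, 14130.0, 620.0],
})

# Columns xgboost would reject: anything that is not numeric or boolean.
bad_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Replace each such column with integer codes (categories are sorted, so 'a' -> 0).
encoded = df.copy()
for col in bad_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(bad_cols)
print(encoded.dtypes)
```

Encoding each column separately keeps the codes independent per column. If the same encoding has to be applied later to a test set, the category list from the training data would need to be reused rather than recomputed.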
Answer 1 (score: 1)
Try this:
train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
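One caveat worth checking first: `pd.to_numeric` raises a `ValueError` when a column contains strings that are not numbers. The question does not show the actual values, so the letter codes below are an assumption made for illustration. The sketch shows the failure, the effect of `errors="coerce"`, and an explicit mapping as an alternative:

```python
import pandas as pd

# Invented values; assumes the column mixes "0" with letter codes.
s = pd.Series(["0", "a", "0", "b"], name="StateHoliday")

# Plain to_numeric fails on the letters.
try:
    pd.to_numeric(s)
    failed = False
except ValueError:
    failed = True

# errors="coerce" avoids the exception but turns the letters into NaN.
coerced = pd.to_numeric(s, errors="coerce")

# An explicit mapping keeps the information (the codes chosen here are hypothetical).
mapping = {"0": 0, "a": 1, "b": 2, "c": 3}
mapped = s.map(mapping)

print(failed)
print(coerced.tolist())
print(mapped.tolist())
```

If the columns really do hold only digit strings, `pd.to_numeric` as written in this answer is enough.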
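Answer 2 (score: 0)
As the error message says, xgboost is unhappy because you are feeding it types it does not recognize. It cannot handle categorical or datetime features directly. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (e.g. one-hot encoding, label encoding (fine for tree-based models), or target encoding).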
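As a rough illustration of two of the encodings mentioned above, here is a sketch of one-hot and target (mean) encoding with pandas. The frame, its values, and the `Assortment_te` column name are invented for the example:

```python
import pandas as pd

# Invented toy data.
df = pd.DataFrame({
    "Assortment": ["a", "c", "a", "b"],
    "Sales": [100.0, 300.0, 200.0, 400.0],
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["Assortment"], prefix="Assortment", dtype=int)

# Target encoding: replace each category with the mean target for that category.
# In practice the means should come from the training rows only, to avoid leakage.
means = df.groupby("Assortment")["Sales"].mean()
df["Assortment_te"] = df["Assortment"].map(means)

print(one_hot)
print(df)
```

One-hot encoding adds a column per category, so it suits low-cardinality features such as these; target encoding keeps a single column but needs care to avoid leaking the target into the features.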
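Answer 3 (score: 0)
The version of XGBoost bundled with the H2O package can handle categorical variables (though not too many!), but XGBoost as a standalone package does not seem to.
I tried this with a pandas DataFrame, but xgboost did not like it: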
categoricals = ['StoreType']  # etc.
pdf[categoricals] = pdf[categoricals].astype('category')
To use categoricals in H2O, you must first convert the strings to factors:
h2odf[categoricals] = h2odf[categoricals].asfactor()
Also note that H2O has its own data frame type, which is different from the pandas one.