Trouble training XGBoost

Time: 2019-05-11 07:41:54

Tags: python jupyter-notebook xgboost

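I am trying to run a Python notebook (link). In cell In [446], where the author trains XGBoost with the code below, I get this error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment
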
# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is minimal test code:

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
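
The minimal code above passes feval = rmspe_xg but never defines it. The notebook's own definition is not shown here, so the following is only a sketch of a common form of this metric, assuming the labels are log-transformed sales as in the code above. The helper name rmspe is an assumption; rmspe_xg follows the (predictions, DMatrix) -> (name, value) convention that feval expects.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; rows where y_true == 0 are skipped."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # avoid division by zero
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

def rmspe_xg(preds, dtrain):
    """Custom evaluation metric: returns a (name, value) pair."""
    # Labels were log-transformed, so undo the log before scoring.
    labels = np.exp(dtrain.get_label())
    return 'rmspe', rmspe(labels, np.exp(preds))
```

Defining this function makes the name available, but it does not affect the dtype error, which is raised earlier, when xgb.DMatrix is built.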

Links to the train_store data file: Link 1 Link 2

4 answers:

Answer 0: (score: 1)

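I ran into exactly the same problem while working on the Rossmann Sales Prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. You can check the data types, as Mykhailo Lisovyi suggested, with:
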
print(test_train.dtypes)

Replace test_train with X_train here.

You may get something like:

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64

The error comes from the columns of type object. You can convert them with:

from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

After these steps, everything should run smoothly.
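
If you would rather not list the offending columns by hand, the same idea can be generalized. The sketch below finds every column that is neither numeric nor bool and label-encodes it. It uses pandas category codes instead of LabelEncoder; the function name encode_non_numeric and the toy frame demo are made up for illustration.

```python
import pandas as pd

def encode_non_numeric(df):
    """Return (encoded copy, list of converted columns)."""
    out = df.copy()
    # Columns that are neither numeric nor bool are the ones DMatrix rejects.
    bad_cols = out.select_dtypes(exclude=['number', 'bool']).columns
    for col in bad_cols:
        # Cast to str first so mixed values such as 0 and '0' share one code.
        out[col] = out[col].astype(str).astype('category').cat.codes
    return out, list(bad_cols)

# Toy data (hypothetical) that mimics the problematic columns.
demo = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'StateHoliday': ['0', 'a', 0, 'b'],
    'StoreType': ['c', 'a', 'a', 'd'],
    'Assortment': ['a', 'c', 'a', 'b'],
})
encoded, converted = encode_non_numeric(demo)
print(converted)
print(encoded.dtypes)
```

Apply the encoding before train_test_split, or fit it on the training data and reuse the same mapping for the test data, so that both sets share the same codes.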

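Answer 1: (score: 1)

Try this:

import pandas as pd

# Note: pd.to_numeric raises ValueError for values it cannot parse as numbers.
train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])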
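
Whether pd.to_numeric works depends on what the columns contain. The sketch below assumes StateHoliday holds a mix of the number 0, the string '0', and letter codes; that is an assumption about the data, and the mapping codes is hypothetical. It shows what errors='coerce' does with the letters and how an explicit mapping avoids losing them.

```python
import pandas as pd

# Assumed contents: a mix of 0, '0', and letter codes.
state_holiday = pd.Series(['0', 'a', 0, 'b', 'c'])

# errors='coerce' turns unparseable entries into NaN instead of raising.
coerced = pd.to_numeric(state_holiday, errors='coerce')

# An explicit mapping (hypothetical codes) keeps every category distinct.
codes = {'0': 0, 'a': 1, 'b': 2, 'c': 3}
mapped = state_holiday.astype(str).map(codes)

print(coerced.isna().sum())  # number of entries that became NaN
print(mapped.tolist())
```

With coercion, all letter codes collapse into NaN, so the distinction between holiday types is lost; the mapping preserves it.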

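Answer 2: (score: 0)

As the error message says, xgboost is unhappy that you are trying to feed it unknown types. It says it cannot handle categorical or datetime features. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (e.g., one-hot encoding, label encoding (which works for tree-based models), or target encoding).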
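
As a rough illustration of the three encodings mentioned above, here is a minimal pandas sketch on a toy frame. The values are made up, and the column names simply mirror the ones in the question.

```python
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame({
    'Assortment': ['a', 'c', 'a', 'b', 'c', 'a'],
    'Sales': [100.0, 300.0, 120.0, 200.0, 340.0, 110.0],
})

# 1) One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df['Assortment'], prefix='Assortment', dtype=int)

# 2) Label encoding: one integer per category (alphabetical order here).
label = df['Assortment'].astype('category').cat.codes

# 3) Target encoding: replace each category with the mean target for it.
#    In practice, compute the means on training rows only to avoid leakage.
target = df.groupby('Assortment')['Sales'].transform('mean')

print(one_hot.columns.tolist())
print(label.tolist())
print(target.tolist())
```

Label encoding imposes an arbitrary order on the categories, which tree-based models can usually split around; one-hot encoding avoids that order at the cost of more columns.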

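Answer 3: (score: 0)

The version of XGBoost in the H2O package can handle categorical variables (but not too many!), but XGBoost as its own package does not seem to.

I tried this with a pandas DataFrame, but xgboost did not like it: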

categoricals = ['StoreType', ]  # etc.
pdf[categoricals] = pdf[categoricals].astype('category')

To use categoricals in H2O, you must first convert the strings to factors:

h2odf[categoricals] = h2odf[categoricals].asfactor()

Also note that h2o has its own data frame type, which is different from the pandas DataFrame.