XGBoost gives the same prediction for all test data

Time: 2018-10-30 13:10:15

Tags: xgboost

I am working on a problem of predicting an output label from a set of input values. Since I don't have real data yet, I am creating some dummy data so that the code is ready by the time the data arrives. Below is what the sample data looks like. There are a number of input values, and the last column, "output", is the output label to be predicted.

input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27

Since this is dummy data, I set the output label to the input that holds the maximum value. For example, in the first row the maximum value sits at the 12th position, so the output is set to loc12. My expectation is that the XGBoost algorithm should learn this pattern on its own and predict the output labels correctly.
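For context, here is a minimal sketch of how such dummy data could be generated. The column names and the 1-based loc&lt;n&gt; labelling follow the sample above; the value distribution, row count, and random seed are assumptions.

import numpy as np
import pandas as pd

np.random.seed(0)
n_rows, n_cols = 10000, 32

# Random input values; the label is the 1-based position of the
# per-row maximum, e.g. loc12 when input_12 holds the largest value
X = np.random.normal(loc=100, scale=15, size=(n_rows, n_cols)).round(1)
df = pd.DataFrame(X, columns=['input_{}'.format(i + 1) for i in range(n_cols)])
df['output'] = ['loc{}'.format(i + 1) for i in np.argmax(X, axis=1)]
df.to_csv('data.txt', index=False)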

I wrote the code below to train and test XGBoost.

from __future__ import division
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelBinarizer

df=pd.read_csv("data.txt", sep=',')

# Create training and validation sets
sz = df.shape
train = df.iloc[:int(sz[0] * 0.7), :]
test = df.iloc[int(sz[0] * 0.7):, :]

# Separate X & Y for training
train_X = train.iloc[:, :32].values
train_Y = train.iloc[:, 32].values

# Separate X & Y for test
test_X = test.iloc[:, :32].values
test_Y = test.iloc[:, 32].values

# Get the count of unique output labels
num_classes = df.output.nunique()

# One-hot encode the labels; fit the binarizer on the training labels
# and reuse the same encoding for the test labels (re-fitting on the
# test set could reorder the classes)
lb = LabelBinarizer()
train_Y = lb.fit_transform(train_Y.tolist())
test_Y = lb.transform(test_Y.tolist())

# Normalize the training data
#train_X -= np.mean(train_X, axis=0)
#train_X /= np.std(train_X, axis=0)
#train_X /= 255

# Normalize the test data
#test_X -= np.mean(test_X, axis=0)
#test_X /= np.std(test_X, axis=0)
#test_X /= 255

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_classes

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)
#bst.dump_model('dump.raw.txt')
# get prediction
pred = bst.predict(xg_test)
actual = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred != actual) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))

# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)
# Note: this convention has been changed since xgboost-unity
# get prediction, this is in 1D array, need reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
pred_label = np.argmax(pred_prob, axis=1)
actual_label = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))

But what I observe is that it always predicts label 0, i.e., the first index of the one-hot encoded output.

Output:

[0] train-merror:0.11081    test-merror:0.111076
[1] train-merror:0.11081    test-merror:0.111076
[2] train-merror:0.11081    test-merror:0.111076
[3] train-merror:0.111216   test-merror:0.111076
[4] train-merror:0.11081    test-merror:0.111076
Test error using softmax = 0.64846954875355
[0] train-merror:0.11081    test-merror:0.111076
[1] train-merror:0.11081    test-merror:0.111076
[2] train-merror:0.11081    test-merror:0.111076
[3] train-merror:0.111216   test-merror:0.111076
[4] train-merror:0.11081    test-merror:0.111076
Test error using softprob = 0.64846954875355

Predictions:

pred_prob[0:10]
array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
        0.07965304, 0.07965304, 0.07965304, 0.07965304],
       [0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
        0.07961877, 0.07961877, 0.07961877, 0.07961877],
       [0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
        0.08058234, 0.08058234, 0.08058234, 0.08058234],
       [0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
        0.07947975, 0.07947975, 0.07947975, 0.07947975],
       [0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
        0.08021881, 0.08021881, 0.08021881, 0.08021881],
       [0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
        0.07970817, 0.07970817, 0.07970817, 0.07970817],
       [0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
        0.07897293, 0.07897293, 0.07897293, 0.07897293],
       [0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
        0.07948799, 0.07948799, 0.07948799, 0.07948799],
       [0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
        0.07956778, 0.07956778, 0.07956778, 0.07956778],
       [0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
        0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)

Whatever accuracy I am getting is only because label 0, the one it always predicts, accounts for about 35% of the data.
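That class share can be confirmed directly from the data frame, e.g. with a one-liner against the same df loaded above:

print(df.output.value_counts(normalize=True))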

Is my expectation even correct here? Or are there too many input features and too little data for the model to learn this properly?

Full code: Here

Test data: Here

1 answer:

Answer 0 (score: 0)

For others like me: check the 'num_boost_round' argument of your xgb.train call and make sure it is equal to, or roughly the same as, the one used with xgb.cv. I think the problem is that the model was not trained long enough and stopped too early.
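A minimal sketch of that suggestion, reusing train_X, train_Y, and param from the question. One assumption to note: xgb.cv with multi:softmax needs integer class labels (0 to num_class-1) rather than one-hot rows, so train_Y is first converted back with argmax; the 200 rounds, 5 folds, and early-stopping window are also assumptions.

import numpy as np
import xgboost as xgb

# Integer class labels (0 .. num_class-1) instead of one-hot rows
xg_train = xgb.DMatrix(train_X, label=np.argmax(train_Y, axis=1))

# Let cross-validation pick the number of boosting rounds, then
# train the final model with (roughly) the same number
cv = xgb.cv(param, xg_train, num_boost_round=200, nfold=5,
            metrics='merror', early_stopping_rounds=10)
best_rounds = len(cv)  # rows left in the cv result = rounds kept
bst = xgb.train(param, xg_train, num_boost_round=best_rounds)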