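I have a stock returns dataset in which the Y label is the direction of the price change (2 if the price moved up, 1 if it moved down, 0 if it did not move). Some of the X features are lagged label values (i.e., the previous day's price direction).
I am trying to fit an XGBoost classification model, but my data is highly imbalanced: most Y labels are 0, meaning the stock price did not move.
How can I account for this imbalance in a multiclass XGBoost classification problem?
My code is as follows: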
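import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# df is an existing pandas DataFrame holding the lagged features and the label
X = df[["ret_D_lag_1", "ret_D_lag_2", "ret_D_lag_3"]]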
y = df["ret_D_t1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Wrap the data in DMatrix, XGBoost's native data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# set xgboost params
param = {
    'max_depth': 3,  # maximum depth of each tree
    'eta': 0.3,  # learning rate (step-size shrinkage applied at each boosting round)
    'verbosity': 0,  # silent logging; replaces the deprecated 'silent' parameter
    'objective': 'multi:softprob',  # multiclass objective that outputs per-class probabilities
    'num_class': 3,  # number of classes in this dataset
}
num_round = 20  # number of boosting rounds
# Train the model
bst = xgb.train(param, dtrain, num_round)
# Predict class probabilities and pick the most probable class for each row
preds = bst.predict(dtest)  # shape: (n_samples, num_class)
best_preds = np.argmax(preds, axis=1)
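
One common way to account for this kind of imbalance is to weight each training row inversely to the frequency of its class, so that the rare "up" and "down" rows count for more than the frequent "no move" rows. The following is only a minimal sketch of that idea, assuming the "balanced" heuristic n_samples / (n_classes * class_count); `balanced_sample_weights` is a hypothetical helper written for illustration and is not part of XGBoost.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per row: n_samples / (n_classes * count of that row's class)."""
    labels = list(labels)
    counts = Counter(labels)   # number of rows per class
    n_samples = len(labels)
    n_classes = len(counts)    # classes actually present in the labels
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Toy labels: four "no move" rows (0), one "down" row (1), one "up" row (2)
weights = balanced_sample_weights([0, 0, 0, 0, 1, 2])
print(weights)  # [0.5, 0.5, 0.5, 0.5, 2.0, 2.0]
```

With the training code above, such weights could be supplied through the `weight` argument of `xgb.DMatrix`, for example `xgb.DMatrix(X_train, label=y_train, weight=balanced_sample_weights(y_train))`. Whether this particular weighting helps on this dataset is an assumption that would need to be checked with per-class metrics rather than overall accuracy.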