Question

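I am trying to apply XGBRegressor() to an imbalanced dataset (97% / 3%) and evaluate the results, but I am having trouble producing correct evaluation metrics.

I chose SMOTE to oversample the minority class of the target variable.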

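import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X = multiSdata.filter(['col1', 'col2','col3','col4', 'col5','col6','col7','col8',
                       'col9','col10','col11','col12','col13','col14','col15','col16','col17',
                       'col18','col19','col20','col21','col22','col23','col24'])
# retain the original feature labels
feature_labels = pd.Series(X.columns.values)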

X.head(5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=27)

print("Predictor - Training : ", X_train.shape, "Predictor - Testing : ", X_test.shape,
      "Target - Training : ", y_train.shape, "Target - Testing : ", y_test.shape)

Output: Predictor - Training :  (876742, 24) Predictor - Testing :  (375747, 24) Target - Training :  (876742,) Target - Testing :  (375747,)

y_train.value_counts()

Output:
0    824518
1     52224
Name: target, dtype: int64

# imbalanced-learn >= 0.4 names; older releases spelled these ratio= and fit_sample()
sm = SMOTE(random_state=27, sampling_strategy=1.0)
X_train, y_train = sm.fit_resample(X_train.values, y_train.values)

np.bincount(y_train)

Output: array([824518, 824518])

# with objective='binary:logistic', predict() returns probabilities in [0, 1]
xreg = XGBRegressor(learning_rate=0.1,
                    n_estimators=1000,
                    max_depth=5,
                    min_child_weight=1,
                    gamma=0.1,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    objective='binary:logistic',
                    nthread=4,
                    scale_pos_weight=1,
                    seed=21,
                    eval_metric=['auc', 'error'])

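# fit on the oversampled training data; the name avoids shadowing the SMOTE class
smote_model = xreg.fit(X_train, y_train)

X_test = X_test.values  # DataFrame.as_matrix() was removed in pandas 1.0
smote_pred = smote_model.predict(X_test)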

import xgboost as xgb

# parameters for the native xgboost API
params = {'learning_rate': 0.1,
          'n_estimators': 1000,
          'max_depth': 5,
          'min_child_weight': 1,
          'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic',
          'nthread': 4, 'scale_pos_weight': 1, 'seed': 21, 'eval_metric': ['auc', 'error']}
# DMatrix built from the oversampled training data
xg_train = xgb.DMatrix(data=X_train, label=y_train)
# 5-fold cross-validation
cv_results = xgb.cv(params, xg_train, num_boost_round=10, nfold=5, early_stopping_rounds=10)
cv_results

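I am trying to use cross-validation but could not get it to work with XGBRegressor, so I used the native xgboost API instead and built a DMatrix from X_train and y_train. I am not sure whether this is what causes the 100% accuracy, which is definitely wrong.

I would appreciate suggestions on how to troubleshoot further why the model does not produce correct predictions.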

Answer 1

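Oversampling can create new cases that leak, that is, cases that are effectively duplicates of test-set cases. Keeping the test set untouched, as you did, may not prevent this. Only deduplicating the combined train plus test data will.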
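
As a rough illustration of that check, the sketch below flags training rows whose values exactly match some test row, using pandas. The function name `rows_shared_with_test` and the toy frames are hypothetical, and exact matching is an assumption: a synthetic SMOTE point that is merely very close to a test row would not be caught.

```python
import pandas as pd


def rows_shared_with_test(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Return a boolean mask over `train`: True where the same row also occurs in `test`."""
    # Unique test rows only, so the left merge keeps exactly one output row per training row.
    test_keys = test.drop_duplicates()
    flagged = train.merge(test_keys, how="left", on=list(train.columns), indicator=True)
    return pd.Series((flagged["_merge"] == "both").to_numpy(), index=train.index)


# Toy stand-ins for the (oversampled) training features and the held-out test features.
train = pd.DataFrame({"col1": [1, 2, 3, 3], "col2": [10, 20, 30, 30]})
test = pd.DataFrame({"col1": [3, 4], "col2": [30, 40]})

leaked = rows_shared_with_test(train, test)
train_clean = train[~leaked]  # keep only training rows that never appear in the test set
print(int(leaked.sum()), "training rows duplicate a test row")
```

With the arrays from the question, both sides would first have to be wrapped in DataFrames that share the same column names.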

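If feasible, consider undersampling instead: it introduces no new leakage, although leakage that is already present in the data can remain. Done correctly, neither sampling method should affect accuracy much, so deduplication is strongly recommended.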
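
For reference, random undersampling can be sketched with NumPy alone: keep every row of the rarest class and draw an equally sized random subset of each larger class. The helper name `undersample_indices` and the toy 97% / 3% data are made up for illustration, and the helper is meant for the training split only.

```python
import numpy as np


def undersample_indices(y, seed=27):
    """Return sorted row indices that keep the same number of rows for every class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the rarest class
    kept = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False) for c in classes]
    return np.sort(np.concatenate(kept))


# Toy training data with a 97% / 3% class split.
y_toy = np.array([0] * 97 + [1] * 3)
X_toy = np.arange(200).reshape(100, 2)

idx = undersample_indices(y_toy)
X_bal, y_bal = X_toy[idx], y_toy[idx]
print(np.bincount(y_bal))
```

Because rows are only removed, no new near-copies of test rows are created, but duplicates that already exist across the split stay where they are.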

For cross-validation, make sure to deduplicate first, for the same reason.
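
One possible way to do this is sketched below with pandas and scikit-learn's `StratifiedKFold`: exact duplicate rows are dropped before the folds are built. The function name `deduplicated_folds`, the column names, and the toy data are assumptions for illustration, and rows that share features but differ in label are not treated as duplicates here.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def deduplicated_folds(data: pd.DataFrame, target: str, n_splits: int = 5, seed: int = 27):
    """Drop exact duplicate rows, then yield (train_part, valid_part) frames per fold."""
    unique = data.drop_duplicates().reset_index(drop=True)
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    features = unique.drop(columns=target)
    for train_idx, valid_idx in splitter.split(features, unique[target]):
        yield unique.iloc[train_idx], unique.iloc[valid_idx]


# Toy data containing two exact duplicate rows.
data = pd.DataFrame({
    "col1":   [1, 1, 2, 3, 4, 5, 6, 6],
    "col2":   [0, 0, 1, 1, 0, 1, 0, 0],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})

folds = list(deduplicated_folds(data, target="target", n_splits=3))
for train_part, valid_part in folds:
    # No validation row should also be present in the training part of the same fold.
    overlap = train_part.merge(valid_part, how="inner", on=list(data.columns))
    print(len(train_part), len(valid_part), len(overlap))
```

Any resampling would then be fit on `train_part` only, so that synthetic rows cannot end up in `valid_part`.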

XGBRegressor keeps returning 100% accuracy
