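I am building a GBM to model a low-probability event, and my model performs no better than it would if my features were random numbers (i.e. poorly), so I am trying to use SMOTE to counter the dominance of one outcome class (98.55% 0, 1.45% 1).
The solution here seems to suggest that my problem comes from the input not being an array, but my code suggests that it is one.
My data looks like this: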
X = num_df.drop(columns=[u'Has Claim'])
y = num_df[u'Has Claim']
X
   Underwriting Year  Public Liability Limit  Employers Liability Limit  \
0               2014                 1000000                          0
1               2014                 5000000                          0
2               2014                 5000000                   10000000
3               2014                 2000000                          0
4               2014                 1000000                          0

   Tools Sum Insured  Professional Indemnity Limit  \
0                0.0                         50000
1                0.0                             0
2             4000.0                             0
3             2000.0                             0
4                0.0                       1000000

   Contract Works Sum Insured  Hired in Plan Sum Insured  Manual EE  \
0                           0                          0          1
1                           0                          0          1
2                           0                          0          1
3                           0                          0          6
4                           0                          0          1

   Clerical EE  Subcontractor EE  rand_1  rand_2  rand_3  rand_4  rand_5  \
0            0                 0       1       2       2       1       5
1            0                 0       4       3       1       2       2
2            7                 0       2       2       4       1       5
3            4                 0       5       4       1       2       2
4            0                 0       4       3       4       5       2

   rand_6  rand_7  rand_8  rand_9  rand_10
0       2       3       5       1        1
1       4       3       1       1        5
2       2       5       3       1        5
3       1       5       1       3        2
4       5       2       5       4        3
y
0 0
1 0
2 0
3 0
4 0
Name: Has Claim, dtype: int64
I do a train/test split:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)
When I fit my model, it works:
model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
       subsample=0.8)
But if I use
smt = SMOTE()
X_train, y_train = smt.fit_sample(X_train, y_train)
then refit my model and use
y_pred = model.predict(X_test)
then I get
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] [u'Underwriting Year', u'Public Liability Limit', u'Employers Liability Limit', u'Tools Sum Insured', u'Professional Indemnity Limit', u'Contract Works Sum Insured', u'Hired in Plan Sum Insured', u'Manual EE', u'Clerical EE', u'Subcontractor EE', u'rand_1', u'rand_2', u'rand_3', u'rand_4', u'rand_5', u'rand_6', u'rand_7', u'rand_8', u'rand_9', u'rand_10']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f18, f19, f12, f13, f10, f11, f16, f17, f14, f15 in input data
training data did not have the following fields: rand_6, rand_7, rand_4, rand_5, rand_2, rand_3, rand_1, Public Liability Limit, Subcontractor EE, Professional Indemnity Limit, rand_8, rand_9, Manual EE, Employers Liability Limit, rand_10, Contract Works Sum Insured, Underwriting Year, Tools Sum Insured, Clerical EE, Hired in Plan Sum Insured
I would like to be able to predict using the refitted model.
Have I misunderstood how SMOTE works? Am I not applying it correctly?
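One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: in the imbalanced-learn version used here, fit_sample (named fit_resample in later releases) appears to return plain NumPy arrays, so the refitted model sees generic feature names (f0 ... f19) while X_test is still a DataFrame with the original column names. The sketch below uses only pandas and NumPy to show that the column names are lost and how they could be restored. The toy values and the name X_train_res are hypothetical, and the np.vstack line is only a stand-in for the SMOTE output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train (hypothetical values; only the column names matter).
X_train = pd.DataFrame({
    "Underwriting Year": [2014, 2014, 2014],
    "Manual EE": [1, 1, 6],
    "rand_1": [1, 4, 2],
})

# Stand-in for what SMOTE().fit_sample(X_train, y_train) is assumed to return
# in older imbalanced-learn versions: a bare NumPy array without column names.
X_resampled = np.vstack([X_train.values, X_train.values[:1]])

# Restore the original column names before fitting, so the model is trained
# and later asked to predict on identically named features.
X_train_res = pd.DataFrame(X_resampled, columns=X_train.columns)

print(list(X_train_res.columns))
```

With the names restored, model.fit(X_train_res, y_train) followed by model.predict(X_test) would see the same feature names on both sides. An alternative under the same assumption is to leave the resampled arrays as they are and call model.predict(X_test.values), so that both training and prediction use plain arrays.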