How can I avoid a feature_names mismatch after using SMOTE?

Asked: 2019-03-31 13:06:22

Tags: python arrays numpy xgboost oversampling

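I am building a GBM to model a low-probability event. The model does no better with my real features than with random numbers (i.e. it performs poorly), so I am trying to use SMOTE to counter the dominance of the majority class in my target (98.55% 0, 1.45% 1).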

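The solution here seems to suggest that my problem comes from passing a type that is not an array, but my code suggests that it is an array.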

My data looks like this:

X = num_df.drop(columns=[u'Has Claim'])
y = num_df[u'Has Claim']

X
   Underwriting Year  Public Liability Limit  Employers Liability Limit  \
0               2014                 1000000                          0   
1               2014                 5000000                          0   
2               2014                 5000000                   10000000   
3               2014                 2000000                          0   
4               2014                 1000000                          0   
   Tools Sum Insured  Professional Indemnity Limit  \
0                0.0                         50000   
1                0.0                             0   
2             4000.0                             0   
3             2000.0                             0   
4                0.0                       1000000   

   Contract Works Sum Insured  Hired in Plan Sum Insured  Manual EE  \
0                           0                          0          1   
1                           0                          0          1   
2                           0                          0          1   
3                           0                          0          6   
4                           0                          0          1   

   Clerical EE  Subcontractor EE  rand_1  rand_2  rand_3  rand_4  rand_5  \
0            0                 0       1       2       2       1       5   
1            0                 0       4       3       1       2       2   
2            7                 0       2       2       4       1       5   
3            4                 0       5       4       1       2       2   
4            0                 0       4       3       4       5       2   

   rand_6  rand_7  rand_8  rand_9  rand_10  
0       2       3       5       1        1  
1       4       3       1       1        5  
2       2       5       3       1        5  
3       1       5       1       3        2  
4       5       2       5       4        3  

Y
0    0
1    0
2    0
3    0
4    0
Name: Has Claim, dtype: int64

I do a train/test split:

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2, 
                                                    random_state=42)

When I fit my model, it works:

model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
   colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
   max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
   n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
   reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
   subsample=0.8)

But if I use

smt = SMOTE()
X_train, y_train = smt.fit_sample(X_train,
                                  y_train)

then refit my model and call

y_pred = model.predict(X_test)

I get

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] [u'Underwriting Year', u'Public Liability Limit', u'Employers Liability Limit', u'Tools Sum Insured', u'Professional Indemnity Limit', u'Contract Works Sum Insured', u'Hired in Plan Sum Insured', u'Manual EE', u'Clerical EE', u'Subcontractor EE', u'rand_1', u'rand_2', u'rand_3', u'rand_4', u'rand_5', u'rand_6', u'rand_7', u'rand_8', u'rand_9', u'rand_10']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f18, f19, f12, f13, f10, f11, f16, f17, f14, f15 in input data
training data did not have the following fields: rand_6, rand_7, rand_4, rand_5, rand_2, rand_3, rand_1, Public Liability Limit, Subcontractor EE, Professional Indemnity Limit, rand_8, rand_9, Manual EE, Employers Liability Limit, rand_10, Contract Works Sum Insured, Underwriting Year, Tools Sum Insured, Clerical EE, Hired in Plan Sum Insured
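
One way to read the message above: the first list holds the names the model was trained with, and the second holds the names found in the data passed to predict. The generic `f0` ... `f19` names are, as far as I can tell, what XGBoost assigns when it is fitted on an array with no column labels. The sketch below reproduces that comparison with plain Python sets. It is only an illustration of the check the message appears to report, not XGBoost's own code, and `describe_mismatch` is a hypothetical helper name.

```python
# Names the model was trained with: generic labels, which suggests the
# training data reached the model as a plain array without column names.
trained_names = ["f%d" % i for i in range(20)]

# Names in the data passed to predict(): the DataFrame columns from the error.
predict_names = [
    "Underwriting Year", "Public Liability Limit", "Employers Liability Limit",
    "Tools Sum Insured", "Professional Indemnity Limit",
    "Contract Works Sum Insured", "Hired in Plan Sum Insured", "Manual EE",
    "Clerical EE", "Subcontractor EE", "rand_1", "rand_2", "rand_3", "rand_4",
    "rand_5", "rand_6", "rand_7", "rand_8", "rand_9", "rand_10",
]


def describe_mismatch(trained, supplied):
    """Hypothetical helper: return (expected but absent, supplied but unknown)."""
    missing_from_input = sorted(set(trained) - set(supplied))
    unknown_to_model = sorted(set(supplied) - set(trained))
    return missing_from_input, unknown_to_model


missing, unknown = describe_mismatch(trained_names, predict_names)

# Both sides have the same number of features; only the labels differ.
print(len(trained_names) == len(predict_names))
print(len(missing), len(unknown))
```

Both lists contain 20 entries and share none of them, which points to a labelling difference between training and prediction data rather than missing or extra columns.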

I would like to be able to make predictions with the refitted model.

Have I misunderstood how SMOTE works? Am I applying it incorrectly?
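
A possible workaround, sketched under the assumption that `fit_sample` in the installed imbalanced-learn version returns plain NumPy arrays: the refitted model would then only know generic names, while `X_test` is still a labelled DataFrame. Wrapping the resampled array back into a DataFrame with the original columns should make both sides agree. `restore_feature_names` is a hypothetical helper, not part of any library, and the small arrays are stand-in data rather than real SMOTE output.

```python
import numpy as np
import pandas as pd


def restore_feature_names(X_resampled, columns):
    """Hypothetical helper: give resampler output the original column names."""
    if isinstance(X_resampled, pd.DataFrame):
        # Already labelled: just enforce the original column order.
        return X_resampled[list(columns)]
    # Plain array: attach the original labels, position by position.
    return pd.DataFrame(np.asarray(X_resampled), columns=list(columns))


# Stand-in for the labelled training frame (two of the real columns).
X_train = pd.DataFrame({"Manual EE": [1, 1, 6], "Clerical EE": [0, 7, 4]})

# Stand-in for an unlabelled array returned by the resampler; the last row
# plays the role of a synthetic sample and is made up for illustration.
X_res = np.array([[1, 0], [1, 7], [6, 4], [3, 2]])

X_res_named = restore_feature_names(X_res, X_train.columns)
print(list(X_res_named.columns))
```

With the names restored, refitting on `X_res_named` and predicting on the original `X_test` DataFrame should present the same labels to the model. The opposite approach, passing arrays in both places (for example `X_test.values` at prediction time), should also keep the two sides consistent. In later imbalanced-learn releases `fit_sample` was renamed `fit_resample`.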

0 answers:

No answers yet.