I have 33 million rows in my dataframe, and this is a classification problem. The dataframe looks like:
id  prodA  prodB  prodC  Single  Married  age_20_30  age_40_50  is_purchase
1   .9461  .0539  0      0       1        0          1          0
2   .55    .44    .01    1       0        1          0          1
3   .65    .25    .10    0       0        1          0          1
4   .79    .21    0      0       1        1          1          0
prodA and prodB are product-affinity scores.
What I have done so far:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

df = pd.read_csv('final_data.csv')

# Column roles
label = 'is_purchase'
id_column = 'id'
features = ['prodA', 'prodB', 'prodC', 'Single', 'Married', 'age_20_30', 'age_40_50']
# Shuffle, then split into 80% train / 15% validation / 5% test
train, valid, test = np.split(df.sample(frac=1), [int(.8*len(df)), int(.95*len(df))])
X_train, y_train = train[features], train[label]
X_valid, y_valid = valid[features], valid[label]
X_test, y_test = test[features], test[label]
# Wrap each split in a DMatrix for XGBoost's native training API
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
dtest = xgb.DMatrix(X_test, label=y_test)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
params = {
    'num_class': 2,                 # required by the multi-class objective
    'learning_rate': 0.05,
    'max_depth': 12,
    'min_child_weight': 1,
    'gamma': 2,
    'subsample': 0.8,
    'colsample_bytree': 0.5,
    'objective': 'multi:softprob',
    'nthread': 4,
    'seed': 27}
# n_estimators is a scikit-learn-wrapper parameter and is ignored by xgb.train;
# the number of boosting rounds is num_round below.
num_round = 100
model = xgb.train(params, dtrain, num_round, watchlist, verbose_eval=1)
# multi:softprob returns an (n_rows, num_class) array of class probabilities;
# take the argmax across classes to get hard labels
valid_pred = model.predict(dvalid)
best_valid_preds = np.argmax(valid_pred, axis=1)
print(precision_score(y_valid, best_valid_preds, average='macro'))
print(recall_score(y_valid, best_valid_preds, average='macro'))
print(f1_score(y_valid, best_valid_preds, average='macro'))
print(accuracy_score(y_valid, best_valid_preds))
test_pred = model.predict(dtest)
best_test_preds = np.argmax(test_pred, axis=1)
print(precision_score(y_test, best_test_preds, average='macro'))
print(recall_score(y_test, best_test_preds, average='macro'))
print(f1_score(y_test, best_test_preds, average='macro'))
print(accuracy_score(y_test, best_test_preds))
My test set only gives me 54% accuracy, while I was hoping for at least 70%. The dataset has no NA values. How can I improve the model's accuracy with XGBoost (with the help of parameter tuning)?
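For context, this is roughly the tuning loop I have in mind (a minimal sketch that reuses the params and dtrain defined above; the grid values are arbitrary starting points, not settings I have validated):

import itertools

# Illustrative grid; these values are guesses, not tuned results
grid = {
    'max_depth': [4, 6, 8, 12],
    'min_child_weight': [1, 5, 10],
    'colsample_bytree': [0.5, 0.8],
}

best_score, best_params = float('inf'), None
for values in itertools.product(*grid.values()):
    trial = dict(params)                    # start from the base params above
    trial.update(zip(grid.keys(), values))
    # 3-fold CV with early stopping: stop adding rounds once validation
    # mlogloss has not improved for 20 consecutive rounds
    cv = xgb.cv(trial, dtrain, num_boost_round=500, nfold=3,
                metrics='mlogloss', early_stopping_rounds=20, seed=27)
    score = cv['test-mlogloss-mean'].min()
    if score < best_score:
        best_score, best_params = score, trial

print(best_score, best_params)

With 33 million rows a full grid search over dtrain would be expensive, so I would presumably run it on a sample of the data first. Is this the right approach, or is there a better way to tune?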