Improving precision/recall with class imbalance?

Asked: 2019-04-22 19:48:01

Tags: python machine-learning scikit-learn classification

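I'm trying to improve precision/recall for both classes... any tips?

  • I have several kinds of features [a few numeric variables, a few categorical variables, and 2 text variables]
  • The target is binary classification with class imbalance [roughly 85% class 1 and 15% class 0]
  • There isn't much training data (only about 17,000 rows)

Here is my pipeline: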

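from scipy.stats import randint as sp_randint
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder, StandardScaler

# CAT_FEATURES, NUM_FEATURES, TEXT_FEATURES, SPLIT_PATTERN, stopwords and the
# X_train / y_train / X_test / y_test splits are defined elsewhere (not shown).

cat_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])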

num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

text_transformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

text_transformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

FE = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, CAT_FEATURES),
        ('num', num_transformer, NUM_FEATURES),
        ('text0', text_transformer_0, TEXT_FEATURES[0]),
        ('text1', text_transformer_1, TEXT_FEATURES[1])])

pipe = Pipeline(steps=[('feature_engineer', FE),
                     ("scales", MaxAbsScaler()),
                     ('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])

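# "auto" was an alias of "sqrt" for classifiers and has been removed in
# recent scikit-learn releases, so it is dropped from max_features here.
random_grid = {"rand_forest__max_depth": [3, 10, 100, None],
               "rand_forest__n_estimators": sp_randint(10, 100),
               "rand_forest__max_features": ["sqrt", "log2", None],
               "rand_forest__bootstrap": [True, False],
               "rand_forest__criterion": ["gini", "entropy"]}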

strat_shuffle_fold = StratifiedKFold(n_splits=5,
                                     random_state=123,
                                     shuffle=True)

cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
cv_train.fit(X_train, y_train)

from sklearn.metrics import classification_report, confusion_matrix
preds = cv_train.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
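
One detail of the search above: with no `scoring` argument, `RandomizedSearchCV` ranks candidates by the estimator's default score, which is plain accuracy for classifiers, and accuracy here is dominated by class 1. Below is a minimal sketch of scoring the search on class 0 recall instead. It runs on synthetic data from `make_classification` as a stand-in for the real features, and the parameter ranges and `n_iter=5` are illustrative assumptions, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the real data: roughly 15% class 0, 85% class 1.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.15, 0.85], random_state=0)

# Rank candidates by recall on class 0 rather than by overall accuracy.
recall_class0 = make_scorer(recall_score, pos_label=0)

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={
        "max_depth": [3, 10, None],
        "n_estimators": randint(10, 100),
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=5,                      # illustrative; kept small so it runs fast
    scoring=recall_class0,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Optimizing recall alone can favor models that over-predict class 0, so `scoring="balanced_accuracy"` or an F1 scorer built with `pos_label=0` may be a better fit, depending on how much class 0 precision matters.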

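Across the many different combinations I've tried, the classification report averages out to roughly:

  • class 1 => about 95% precision; 98% recall
  • class 0 => about 80-85% precision; 57-66% recall

When I use a stratified k-fold with shuffling and add class_weight='balanced', class 0 recall can reach 66%, but I'd like to get it to around 75%-80%.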
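
Since class 0 precision (80-85%) is higher than its recall, one option is to trade some of that precision for recall by lowering the probability cutoff at which class 0 is predicted, instead of relying on the default 0.5. The sketch below illustrates this on synthetic data; the dataset, the forest settings and the list of thresholds are all assumptions for illustration, and in practice the cutoff should be chosen on a validation split, not on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 15% class 0, 85% class 1.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           weights=[0.15, 0.85], flip_y=0.05, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# Pick the predict_proba column that corresponds to class 0.
col0 = list(clf.classes_).index(0)
proba0 = clf.predict_proba(X_val)[:, col0]

results = []
for threshold in (0.5, 0.4, 0.3, 0.2):
    # Predict class 0 whenever its estimated probability reaches the cutoff.
    preds = np.where(proba0 >= threshold, 0, 1)
    precision = precision_score(y_val, preds, pos_label=0, zero_division=0)
    recall = recall_score(y_val, preds, pos_label=0)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the cutoff can only keep or increase class 0 recall, while class 0 precision typically falls, so the printed table shows how much precision each extra point of recall costs.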

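Questions:

  1. Are there any other feature engineering techniques I could use to improve predictions for class 0? [I've tried TFIDF, the hashing trick, SelectKBest and SVD() on the text, and MaxAbsScaler() on all features]
  2. Should I try other algorithms? [I've only tried a random forest classifier]
  3. Is this recall actually low?
  4. So far this has mostly been "plug and play"... is there anything obvious I'm missing?
  5. Would oversampling help? If so, how is it done in python / sklearn?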
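
On the oversampling question, a minimal sketch of random oversampling using only scikit-learn's `resample` utility is shown here. It assumes synthetic data in place of the real features and duplicates class 0 rows until both classes are the same size. The resampling is applied to the training split only, so the test set keeps the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data: roughly 15% class 0, 85% class 1.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.15, 0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

minority = y_train == 0
n_majority = int((~minority).sum())

# Draw class 0 rows with replacement until they match the class 1 count.
X_min_up, y_min_up = resample(X_train[minority], y_train[minority],
                              replace=True, n_samples=n_majority,
                              random_state=0)

X_balanced = np.vstack([X_train[~minority], X_min_up])
y_balanced = np.concatenate([y_train[~minority], y_min_up])

print("before:", np.bincount(y_train), "after:", np.bincount(y_balanced))
```

Inside cross-validation the resampling has to happen within each training fold, otherwise duplicated rows leak into the validation folds. The separate imbalanced-learn package provides `RandomOverSampler`, `SMOTE` and a pipeline class for that purpose (not shown here). Because `class_weight='balanced'` is already in use, plain random oversampling may give similar results.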

Any help would be greatly appreciated!

0 answers:

No answers yet