Improving prediction scores by using classifier confidence on individual instances

Date: 2018-03-21 02:00:43

Tags: python machine-learning boolean text-classification

I have used three classifiers (a RandomForestClassifier, a KNeighborsClassifier, and an SVC), which you can see below:

>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=41, shrinking=True,
  tol=0.001, verbose=False)

>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')

>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

During training, the RandomForestClassifier gave the best f1_score on the data, followed by the KNeighborsClassifier, and then the SVC. Here are my X_train (standard-scaled values; you can ask me how I got them if needed, and a guess at the scaling is sketched after the data) and y_train:

>> X_train
array([[-0.11034393, -0.72380296,  0.15254572, ...,  0.4166148 ,
        -0.91095473, -0.91095295],
       [ 1.6817184 ,  0.40040944, -0.6770607 , ..., -0.2403781 ,
         0.02962478,  0.02962424],
       [ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
        -0.15848331, -0.15847739],
       ..., 
       [-1.18666853,  0.87297522,  0.47136779, ..., -0.19599824,
         0.72417473,  0.72416714],
       [ 1.6835304 ,  0.40605067, -0.63383059, ..., -0.37094083,
         0.09505496,  0.09505389],
       [ 0.19950709, -1.04624152, -0.18351693, ...,  0.4362658 ,
        -0.77994791, -0.77994176]])

>> y_train_sl
874     0
1863    0
1493    0
288     1
260     0
495     0
1529    0
1704    1
75      1
1792    0
626     0
99      1
222     0
774     0
52      1
1688    1
1770    0
53      1
1814    0
488     0
230     0
481     0
132     1
831     0
1166    1
1593    0
771     0
1785    0
616     0
207     0
       ..
155     1
1506    0
719     0
547     0
613     0
652     0
1351    0
304     0
1689    1
1693    1
1128    0
1323    0
763     0
701     0
467     0
917     0
329     0
375     0
1721    0
928     0
1784    0
1200    0
832     0
986     0
1687    1
643     0
802     0
280     1
1864    0
1045    0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8

As you can see, my y_train is in boolean form (i.e., each instance is either True (1) or False (0)).
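On the scaling mentioned above: the post does not show how X_train was standard-scaled, but it was presumably done with something like sklearn's StandardScaler. A minimal sketch, assuming hypothetical unscaled arrays X_train_raw and X_test_raw:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit the scaler on the training data only, then reuse its statistics
X_train = scaler.fit_transform(X_train_raw)          # X_train_raw is hypothetical
X_test_prepared = scaler.transform(X_test_raw)       # X_test_raw is hypothetical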

I want to further improve prediction accuracy using predict_proba. The idea: take the classifiers in order of their training f1_score and look at the prediction confidence on each instance. When the first classifier predicts a particular instance with low confidence (< 60%), I first need to find those instances, then move to the next classifier and check its confidence on those same instances. If the next classifier has high confidence (> 60%) where the previous one did not, accept its prediction for those instances; if it is still unconfident (< 60%) on them, move on and do the same with the third classifier.

Finally, if the third classifier also has low confidence (< 60%) on an instance, I need to accept the prediction from whichever of the three classifiers has the highest confidence for that instance. (A sketch of this cascade follows.)
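For concreteness, here is a minimal sketch of that cascade as I understand it. This is an interpretation, not code from the post: it assumes the three classifiers are already fitted, are passed best-first by training f1_score, and all expose predict_proba (the SVC needs probability=True, as configured above); cascade_predict and the 0.6 threshold are illustrative names.

import numpy as np

def cascade_predict(classifiers, X, threshold=0.6):
    n = len(X)
    # -1 marks "not yet decided" (assumes nonnegative integer class labels, as here)
    preds = np.full(n, -1)
    best_conf = np.zeros(n)              # highest confidence seen so far, per instance
    best_pred = np.zeros(n, dtype=int)   # the prediction that achieved it

    for clf in classifiers:
        proba = clf.predict_proba(X)                 # shape (n_samples, n_classes)
        conf = proba.max(axis=1)                     # top-class probability
        pred = clf.classes_[proba.argmax(axis=1)]    # the top class itself

        # remember the single most confident answer per instance as a fallback
        better = conf > best_conf
        best_conf[better] = conf[better]
        best_pred[better] = pred[better]

        # accept confident answers for instances no earlier classifier decided
        accept = (preds == -1) & (conf >= threshold)
        preds[accept] = pred[accept]

    # instances no classifier was confident about get the overall best answer
    undecided = preds == -1
    preds[undecided] = best_pred[undecided]
    return preds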

Since I am new to machine learning, I apologize in advance for any confusing statements; please correct me wherever I am wrong.

EDIT: X_test and y_test are shown below. I need to predict on X_test_prepared and evaluate the predictions against y_test_sl. The predicted y must have passed through all three classifiers and carry the best confidence for every instance. (A guess at this evaluation is sketched after the data.)

>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
         0.32099948,  0.32099952],
       [ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
        -0.54261829, -0.54261947],
       [ 1.67447042,  0.24530384, -1.0113221 , ..., -0.54844942,
        -0.26066608, -0.26066032],
       ..., 
       [ 0.28104683,  1.52670909,  0.62653301, ..., -1.15596295,
         2.05859487,  2.05859247],
       [ 1.50595496,  0.84507934, -0.44109634, ..., -0.71277072,
         0.14474518,  0.14474398],
       [-1.63423112, -0.12690448,  0.48577783, ..., -0.36025459,
         0.29137477,  0.29137047]])

>> y_test_sl
1321    0
1433    0
1859    0
1496    0
492     0
736     0
996     0
1001    0
634     0
1486    0
910     0
1579    0
373     0
1750    0
1563    0
1584    0
51      1
349     0
1162    1
594     0
1121    0
1637    0
1116    0
106     1
1533    0
993     0
960     0
277     0
142     1
1010    0
       ..
1104    1
1404    0
1646    0
1009    0
61      1
444     0
10      1
704     0
744     0
418     0
998     0
740     0
465     0
97      1
1550    1
1738    0
978     0
690     0
1071    0
1228    1
1539    0
145     1
1015    0
1371    0
1758    0
315     0
71      1
1090    0
1766    0
33      1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8
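A guess at the evaluation being asked for, reusing the hypothetical cascade_predict sketched earlier with the already-fitted classifiers from the post, ordered best-first by training f1_score:

from sklearn.metrics import f1_score

fitted = [for_clf_sl_GS, knn_clf_sl_GS, svm_clf_sl_GS]
y_preds = cascade_predict(fitted, X_test_prepared, threshold=0.6)
print(f1_score(y_test_sl, y_preds))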


1 Answer:

Answer 0 (score: 0):

The goal here is to create an ensemble of classifiers and take the most "confident" (highest-probability-class) prediction across all classifiers. The code is below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_features=4) # Put your training data here instead

# parameters for random forest
rfclf_params = {
    'bootstrap': True, 
    'class_weight':None, 
    'criterion':'entropy',
    'max_depth':None, 
    'max_features':'auto', 
    # ... fill in the rest you want here
}

# Fill in svm params here
svm_params = {
    'probability':True
}

# KNeighbors params go here
kneighbors_params = {

}

params = [rfclf_params, svm_params, kneighbors_params]
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]

def ensemble(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    # one probability column per class (avoids hardcoding 2, so multiclass also works)
    best_preds = np.zeros((len(X_test), len(classes)))

    for i in range(len(classifiers)):
        # Construct the classifier by unpacking params 
        # store classifier instance
        clf = classifiers[i](**params[i])
        # Fit the classifier as usual and call predict_proba
        clf.fit(X_train, y_train)
        y_preds = clf.predict_proba(X_test)
        # Take maximum probability for each class on each classifier 
        # This is done for every instance in X_test
        # see the docs of np.maximum here: 
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds = np.maximum(best_preds, y_preds)

    # map the maximum probability for each instance back to its corresponding class
    preds = np.array([classes[np.argmax(pred)] for pred in best_preds])
    return preds

# Test your predictions (here on the training data itself;
# substitute your X_test / y_test for a proper evaluation)
from sklearn.metrics import accuracy_score, f1_score
y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))

If you want the algorithm to return the highest probability instead of the predicted class, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] instead of preds.
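For completeness, a full variant along those lines might look like this (ensemble_confidences is an illustrative name, not from the answer); it is identical to ensemble() above except that the final line returns the winning probability per instance:

import numpy as np

def ensemble_confidences(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    best_preds = np.zeros((len(X_test), len(classes)))
    for clf_cls, clf_params in zip(classifiers, params):
        # construct, fit, and fold in each classifier's probabilities
        clf = clf_cls(**clf_params)
        clf.fit(X_train, y_train)
        best_preds = np.maximum(best_preds, clf.predict_proba(X_test))
    # return the highest probability seen for each instance
    return [np.amax(pred_probs) for pred_probs in best_preds]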