I have used three classifiers (RandomForestClassifier, KNeighborsClassifier and an SVM classifier), which you can see below:
>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=41, shrinking=True,
  tol=0.001, verbose=False)
>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')
>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
During training, the RandomForestClassifier gave the best f1_score on the data, followed by the KNeighborsClassifier and then the SVM classifier. Here are my X_train (standard-scaled values; you can ask me how I got them if needed) and y_train:
>> X_train
array([[-0.11034393, -0.72380296,  0.15254572, ...,  0.4166148 ,
        -0.91095473, -0.91095295],
       [ 1.6817184 ,  0.40040944, -0.6770607 , ..., -0.2403781 ,
         0.02962478,  0.02962424],
       [ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
        -0.15848331, -0.15847739],
       ...,
       [-1.18666853,  0.87297522,  0.47136779, ..., -0.19599824,
         0.72417473,  0.72416714],
       [ 1.6835304 ,  0.40605067, -0.63383059, ..., -0.37094083,
         0.09505496,  0.09505389],
       [ 0.19950709, -1.04624152, -0.18351693, ...,  0.4362658 ,
        -0.77994791, -0.77994176]])
>> y_train_sl
874 0
1863 0
1493 0
288 1
260 0
495 0
1529 0
1704 1
75 1
1792 0
626 0
99 1
222 0
774 0
52 1
1688 1
1770 0
53 1
1814 0
488 0
230 0
481 0
132 1
831 0
1166 1
1593 0
771 0
1785 0
616 0
207 0
..
155 1
1506 0
719 0
547 0
613 0
652 0
1351 0
304 0
1689 1
1693 1
1128 0
1323 0
763 0
701 0
467 0
917 0
329 0
375 0
1721 0
928 0
1784 0
1200 0
832 0
986 0
1687 1
643 0
802 0
280 1
1864 0
1045 0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8
As you can see, my y_train is in boolean form (i.e., each instance is labelled either True (1) or False (0)).
I want to further improve the accuracy of the predictions using predict_proba. Starting with the classifier that had the best f1_score during training (the RandomForestClassifier): when I look at its predictions and it has a low confidence level (<60%) for a particular instance (these are the instances I should find first), move on to the next classifier by f1_score (the KNeighborsClassifier) and check its confidence level on those same instances. If it has high confidence (>60%) compared to the previous classifier, accept the prediction from this classifier; if it still has low confidence (<60%) on those instances, move on and do the same with the third classifier (the SVM).
Finally, if the third classifier also has a low confidence level (<60%), I need to accept the prediction from whichever of the three classifiers has the highest confidence.
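In other words, something like the following minimal sketch of that cascade (the function name cascade_predict and the hard-coded 0.6 threshold are illustrative; it assumes the three fitted classifiers are passed in order of their training f1_score):

import numpy as np

def cascade_predict(ordered_clfs, X, threshold=0.6):
    # predict_proba returns one row of class probabilities per instance
    probas = [clf.predict_proba(X) for clf in ordered_clfs]
    # confidence = highest class probability, per classifier and instance
    confidences = np.stack([p.max(axis=1) for p in probas])        # (n_clfs, n_samples)
    labels = np.stack([clf.classes_[p.argmax(axis=1)]
                       for clf, p in zip(ordered_clfs, probas)])   # (n_clfs, n_samples)
    above = confidences > threshold
    # first classifier above the threshold; if none is, the most confident one
    chosen = np.where(above.any(axis=0), above.argmax(axis=0),
                      confidences.argmax(axis=0))
    return labels[chosen, np.arange(X.shape[0])]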
Since I am new to machine learning, I apologise if some of these statements are confused; please correct me where I am wrong.
EDIT
X_test and y_test are shown below. I need to make predictions on X_test_prepared and evaluate them against y_test_sl:
>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
         0.32099948,  0.32099952],
       [ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
        -0.54261829, -0.54261947],
       [ 1.67447042,  0.24530384, -1.0113221 , ..., -0.54844942,
        -0.26066608, -0.26066032],
       ...,
       [ 0.28104683,  1.52670909,  0.62653301, ..., -1.15596295,
         2.05859487,  2.05859247],
       [ 1.50595496,  0.84507934, -0.44109634, ..., -0.71277072,
         0.14474518,  0.14474398],
       [-1.63423112, -0.12690448,  0.48577783, ..., -0.36025459,
         0.29137477,  0.29137047]])
>> y_test_sl
1321 0
1433 0
1859 0
1496 0
492 0
736 0
996 0
1001 0
634 0
1486 0
910 0
1579 0
373 0
1750 0
1563 0
1584 0
51 1
349 0
1162 1
594 0
1121 0
1637 0
1116 0
106 1
1533 0
993 0
960 0
277 0
142 1
1010 0
..
1104 1
1404 0
1646 0
1009 0
61 1
444 0
10 1
704 0
744 0
418 0
998 0
740 0
465 0
97 1
1550 1
1738 0
978 0
690 0
1071 0
1228 1
1539 0
145 1
1015 0
1371 0
1758 0
315 0
71 1
1090 0
1766 0
33 1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8
The predicted y must pass through all three classifiers and carry the best confidence level for every instance.
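Concretely, a hedged sketch of that evaluation, reusing the illustrative cascade_predict above with the fitted classifiers in f1_score order (the variable names come from the question):

from sklearn.metrics import accuracy_score, f1_score

y_pred = cascade_predict([for_clf_sl_GS, knn_clf_sl_GS, svm_clf_sl_GS],
                         X_test_prepared, threshold=0.6)
print(accuracy_score(y_test_sl, y_pred), f1_score(y_test_sl, y_pred))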
Answer 0 (score: 0)
The goal here is to create an ensemble of classifiers and take the most "confident" (highest class probability) prediction across all of the classifiers. The code is below:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_features=4)  # Put your training data here instead

# parameters for random forest
rfclf_params = {
    'bootstrap': True,
    'class_weight': None,
    'criterion': 'entropy',
    'max_depth': None,
    'max_features': 'auto',
    # ... fill in the rest you want here
}

# Fill in svm params here; probability=True is required for predict_proba
svm_params = {
    'probability': True
}

# KNeighbors params go here
kneighbors_params = {
}

params = [rfclf_params, svm_params, kneighbors_params]
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]

def ensemble(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    # one column of probabilities per class, one row per test instance
    best_preds = np.zeros((len(X_test), len(classes)))
    for i in range(len(classifiers)):
        # Construct the classifier by unpacking params
        # and store the classifier instance
        clf = classifiers[i](**params[i])
        # Fit the classifier as usual and call predict_proba
        clf.fit(X_train, y_train)
        y_preds = clf.predict_proba(X_test)
        # Take the maximum probability for each class over all classifiers,
        # for every instance in X_test
        # see the docs of np.maximum here:
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds = np.maximum(best_preds, y_preds)
    # map the maximum probability for each instance back to its corresponding class
    preds = np.array([classes[np.argmax(pred)] for pred in best_preds])
    return preds

# Test your predictions
from sklearn.metrics import accuracy_score, f1_score
y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))
If you would rather have the algorithm return the highest probability instead of the predicted class, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] instead of preds.
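That is, a minimal sketch of the swap, replacing only the last two lines of ensemble:

    # per-instance top probability instead of the predicted class
    return [np.amax(pred_probs) for pred_probs in best_preds]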