I have a class imbalance problem. The classes are 0, 1 and 2.
Class 0 is heavily imbalanced relative to classes 1 and 2.
Here is my code:
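import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Two grids: a linear kernel, and an RBF kernel with several gamma values
parameters = [
    {'kernel': ['linear'], 'C': [1, 10, 100]},
    {'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4], 'C': [1, 10, 1000, 5000]},
]
tfidf = TfidfVectorizer(ngram_range=(1, 20))
clf = GridSearchCV(SVC(class_weight='balanced'), parameters, cv=2, refit=True)
model = make_pipeline(tfidf, clf)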
model.fit(X_train, y_train)
print("Best parameters set:",clf.best_params_)
print("Grid scores on every set of parameters:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.04f) for %r" % (mean, std * 2, params))
print()
print("Classification report:")
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Test accuracy:",accuracy_score(y_test, y_pred))
labels = model.classes_
matrix = confusion_matrix(y_test,y_pred)
print(pd.DataFrame(matrix,columns=labels, index=labels))
plot_confusion_matrix(matrix,labels)
From the confusion matrix, I can see that the predictions are poorly balanced across the classes. What should I do?
Thanks
Answer 0: (score: 1)
What do you mean by "not balanced"? Could the original dataframe itself be imbalanced as well?
You should also look at the distribution of y (both y_train and y_test); I suspect most of the data is in class 0.
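One way to check the label distribution is sketched below. The helper name class_distribution and the toy labels are made up for illustration; in the question's setup you would pass y_train and y_test instead.

```python
from collections import Counter


def class_distribution(y):
    """Return {label: (count, fraction)} for a sequence of class labels."""
    counts = Counter(y)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Hypothetical toy labels standing in for y_train (not the asker's data)
y_train_demo = [0] * 80 + [1] * 12 + [2] * 8

for label, (n, frac) in class_distribution(y_train_demo).items():
    print(f"class {label}: {n} samples ({frac:.1%})")
```

If one class dominates in both the training and test labels, a confusion matrix concentrated on that class is to be expected.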
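You should also run the dummy classifier from scikit-learn (DummyClassifier with strategy "most_frequent") to see what accuracy and confusion matrix that strategy gives. I would guess the test accuracy will then be around 0.8.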
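A minimal sketch of such a baseline follows. The arrays are hypothetical toy data with an 80/12/8 split chosen only for illustration, not the asker's data; DummyClassifier ignores the feature values, so placeholder features are used here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy data; the feature values are ignored by DummyClassifier
X_train_demo = np.zeros((100, 1))
y_train_demo = np.array([0] * 80 + [1] * 12 + [2] * 8)
X_test_demo = np.zeros((50, 1))
y_test_demo = np.array([0] * 40 + [1] * 6 + [2] * 4)

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
y_base = baseline.predict(X_test_demo)

baseline_acc = accuracy_score(y_test_demo, y_base)
baseline_cm = confusion_matrix(y_test_demo, y_base, labels=[0, 1, 2])
print("Baseline accuracy:", baseline_acc)
print(baseline_cm)
```

If the tuned SVC does not clearly beat this baseline, accuracy alone says little about the minority classes, and the per-class recall in the classification report is the number to watch.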