So I have a very challenging dataset to work with, but even with that in mind, the ROC curves I am getting look strange and just plain wrong.
Here is my code. I am using the scikit-plot library (skplt) to plot the ROC curves after passing in my predictions and the ground-truth labels, so I can't plausibly have gotten that step wrong. Is there something glaring here that I am missing?
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn import feature_selection
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# My dataset - note that m (number of examples) is 115. These are histograms that
# already sum to 1, so I am doubtful that further preprocessing is necessary.
# (load_new_dataset is my own helper.)
X, y = load_new_dataset(positives, positive_files, m=115, upper=21, range_size=10, display_plot=False)
# Partition - class balance is 0.87 : 0.13 for the negative and positive classes respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y)
# Pick a baseline classifier - Naive Bayes
nb = GaussianNB()
# Very large class imbalance, so use stratified K-fold cross-validation
cross_val = StratifiedKFold(n_splits=10)
# Use RFE for feature selection, with a linear SVR as the underlying estimator
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
# Create the pipeline, nothing fancy here
clf = Pipeline(steps=[("feature selection", selector), ("classifier", nb)])
# Score using F1 due to the class imbalance - accuracy is unlikely to be meaningful
scores = cross_val_score(clf, X_train, y_train, cv=cross_val,
                         scoring=make_scorer(f1_score, average='micro'))
print(scores)
# Fit and make predictions, then use these to plot the ROC curves
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()
And here is the resulting, conspicuously binary-looking ROC curve:
I understand that I can't expect stellar performance on such a challenging dataset, but even so, I cannot figure out why I am getting such a binary result, particularly for the per-class ROC curves. And no, I cannot get more data, much as I genuinely wish I could. If this really is valid code, then I will just have to live with it and perhaps report the micro-averaged F1 score, which doesn't look too bad.
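For intuition on the staircase shape, here is a minimal, self-contained sketch (synthetic numbers, not my data): with test_size=0.10 and m=115, the test set only holds about 12 samples, and sklearn's roc_curve can return at most len(y_true) + 1 thresholds, so the curve has very few corners to draw with.

import numpy as np
from sklearn.metrics import roc_curve

# 12 test samples at roughly the 0.87 : 0.13 balance above
y_true = np.array([0] * 10 + [1] * 2)
# Stand-in predicted probabilities for the positive class
y_score = np.random.RandomState(0).rand(12)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Only a handful of thresholds, hence the coarse staircase look
print(len(thresholds), fpr, tpr)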
For reference, using sklearn's make_classification function in the snippet below, I get the following ROC curve:
# Randomly generate a dataset with similar characteristics (size, class balance,
# number of features)
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=103, n_features=21, random_state=0, n_classes=2,
                           weights=[0.87, 0.13], n_informative=5, n_clusters_per_class=3)
# Split out the minority and majority classes (not actually used further below)
positives = np.where(y == 1)
X_minority, X_majority = np.take(X, positives, axis=0), np.delete(X, positives, axis=0)
y_minority, y_majority = np.take(y, positives, axis=0), np.delete(y, positives, axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y)
# Cross-validation again
cross_val = StratifiedKFold(n_splits=10)
# Use Naive Bayes again for consistency
clf = GaussianNB()
# Likewise for the evaluation metric
scores = cross_val_score(clf, X_train, y_train, cv=cross_val,
                         scoring=make_scorer(f1_score, average='micro'))
print(scores)
# Fit, predict, plot results
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()
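As an aside on the scoring choice (my own check, not part of the original discussion): for a single-label binary problem, micro-averaged F1 pools the true/false positives over both classes and collapses to plain accuracy, so it does not actually compensate for the imbalance; average='binary' or average='macro' would reflect the minority class better.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 1, 0, 1, 0])
# Micro-averaged F1 equals accuracy for single-label predictions
assert f1_score(y_true, y_hat, average='micro') == accuracy_score(y_true, y_hat)
print(f1_score(y_true, y_hat, average='binary'))  # scores only the positive class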
Am I doing something wrong? Or is this just what I should expect with these data characteristics?
Answer (score: 0)
Thanks to Stev for the suggestion to increase the test size. The curves I now get are much smoother and show far less variance. Using SMOTE was also very helpful in this case, and I would recommend it (via imblearn, perhaps) to anyone with a similar problem, as sketched below.
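A minimal sketch of both fixes, assuming the imbalanced-learn (imblearn) package is installed; the variables X, y, selector, and nb are carried over from the question's code:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# A larger test split gives the ROC curve more points to work with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)

# Putting SMOTE inside the pipeline confines the oversampling to the
# training folds, so the test data is never synthetically inflated.
# With only ~13 positives in total, the default k_neighbors=5 can fail
# in small folds, so keep it small.
clf = ImbPipeline(steps=[
    ("smote", SMOTE(k_neighbors=3, random_state=0)),
    ("feature selection", selector),
    ("classifier", nb),
])
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, y_pred)
plt.show()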