Question

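I need to do the following:

Apply a logistic regression classifier.
Report the ROC for each class, together with its AUC.
Use the estimated probabilities from the logistic regression to guide the construction of the ROC.
Use 5-fold cross-validation to train the model.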

For this, my approach was to follow this really nice tutorial:

Following its idea and method, I only changed the way the raw data is obtained:

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.tail()

# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values

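I simply run the code. If I run it with metrics such as accuracy or balanced_accuracy, everything works fine (as it does with many other metrics). My problem is that when I run it with the metric roc_auc, I get this error: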

"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case."

This error has been discussed here1, here2, here3 and here4. However, I could not solve my problem with any of the "solutions"/workarounds they offer.

My whole code is:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
from sklearn.preprocessing import StandardScaler
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'qt')
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.tail()

# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values

#print(X)
#print(Y)


seed = 7

# prepare models
models = []
models.append(('LR', LogisticRegression()))

# evaluate each model in turn
results = []
names = []
scoring = 'roc_auc'
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)



# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Answer 1

The iris dataset is usually sorted by class. Hence, when you split it without shuffling, a test fold may end up containing only one class.

One simple solution is to use the shuffle parameter.

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
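The effect of shuffle can be checked by counting how many distinct classes land in each test fold. This is only a minimal sketch: it assumes that sklearn.datasets.load_iris is an acceptable stand-in for the UCI CSV (both are sorted by class), uses 5 folds as in the question, and classes_per_test_fold is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

# Assumption: load_iris replaces the UCI download; labels are 0, 1, 2 and sorted.
X, y = load_iris(return_X_y=True)


def classes_per_test_fold(splitter):
    """Return the number of distinct classes found in each test fold."""
    return [len(np.unique(y[test_idx])) for _, test_idx in splitter.split(X)]


# Contiguous blocks of 30 rows: some test folds contain a single class.
unshuffled = classes_per_test_fold(KFold(n_splits=5))

# Shuffled indices: rows from different classes are mixed into every fold.
shuffled = classes_per_test_fold(KFold(n_splits=5, shuffle=True, random_state=7))

print("without shuffle:", unshuffled)
print("with shuffle:   ", shuffled)
```

Any test fold with a count of 1 is one on which a ROC AUC cannot be computed.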

Even then, roc_auc does not directly support the multiclass format (the iris dataset has three classes).

Go through this link to learn more about how to use roc_auc in the multiclass case.
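One common way to get a ROC per class is one-vs-rest: binarize the labels and score each class against its own predicted probability. The sketch below is one possible approach, not the method of the linked material. It assumes load_iris as a stand-in for the CSV, pools the out-of-fold probabilities from a shuffled stratified 5-fold split, and raises max_iter only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import label_binarize

# Assumption: load_iris replaces the UCI download; labels are 0, 1, 2.
X, y = load_iris(return_X_y=True)
classes = np.unique(y)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities: every row is predicted by a model that never saw it.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")

# One 0/1 indicator column per class (one-vs-rest).
y_bin = label_binarize(y, classes=classes)

per_class_auc = {}
for k, cls in enumerate(classes):
    # fpr and tpr are the points of the ROC curve for class `cls` vs. the rest.
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    per_class_auc[int(cls)] = auc(fpr, tpr)
    print(f"class {cls}: AUC = {per_class_auc[int(cls)]:.3f}")
```

The fpr and tpr arrays can be passed to plt.plot to draw each curve. Recent scikit-learn releases (0.22 and later, as far as I know) also accept scoring='roc_auc_ovr' in cross_val_score if a single averaged number is enough.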

Answer 2

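Ideally, for classification tasks, stratified k-fold iteration is used, which preserves the class balance in both the training and test folds.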

In scikit-learn's cross_val_score, the default cross-validation behaviour depends on the task. The documentation says:

cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
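None, to use the default 3-fold cross validation,
integer, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.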
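The rule quoted above can be inspected with sklearn.model_selection.check_cv, the public helper that turns a cv argument into a splitter object. A small sketch, assuming load_iris as a stand-in for the CSV data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Assumption: load_iris replaces the UCI download; y is a multiclass target.
X, y = load_iris(return_X_y=True)

# Integer cv + classifier + multiclass y -> a stratified splitter is chosen.
cv_for_classifier = check_cv(cv=5, y=y, classifier=True)

# Same integer, but the estimator is not a classifier -> plain KFold.
cv_for_regressor = check_cv(cv=5, y=y, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

Passing a KFold instance yourself, as in the question, bypasses this automatic choice.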

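Now, the iris dataset is a set of 150 samples ordered by class (Iris setosa, Iris versicolor and Iris virginica). So a plain K-fold iterator with 5 folds and no shuffling cuts the data into consecutive blocks of 30 rows; in the last split, for example, the first 120 samples form the training set and the last 30 samples form the test set. Those last 30 samples all belong to the single Iris-virginica class.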
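The split described above can be inspected directly. This sketch assumes load_iris as a stand-in for the CSV (labels 0, 1, 2 in the same sorted order, with 2 being virginica) and compares the last unshuffled KFold split with the per-class counts that StratifiedKFold produces.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: load_iris replaces the UCI download; 50 sorted rows per class.
X, y = load_iris(return_X_y=True)

# Last split of an unshuffled 5-fold KFold: rows 120..149 are the test set.
train_idx, last_test_idx = list(KFold(n_splits=5).split(X))[-1]
last_fold_classes = np.unique(y[last_test_idx]).tolist()
print(len(train_idx), len(last_test_idx), last_fold_classes)

# StratifiedKFold keeps the class proportions in every test fold.
stratified_counts = [
    np.bincount(y[test_idx], minlength=3).tolist()
    for _, test_idx in StratifiedKFold(n_splits=5).split(X, y)
]
print(stratified_counts)
```

With stratification every test fold holds 10 samples of each class, so no fold is left with a single class.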

So, if you don't have any specific reason to use KFold, you can do this:

cv_results = model_selection.cross_val_score(model, X, Y, cv=5, scoring=scoring)

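But now comes the issue of scoring. You are using 'roc_auc', which is only defined for binary classification tasks. So either choose a different metric instead of roc_auc, or specify which class you want to treat as positive and treat the other classes as negative.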
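The second option, treating one class as positive and the rest as negative, might look like the sketch below. The choice of class 2 (virginica in load_iris) as the positive class is arbitrary, load_iris stands in for the CSV, and max_iter is raised only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: load_iris replaces the UCI download; labels are 0, 1, 2.
X, y = load_iris(return_X_y=True)

# Assumption: virginica (label 2) is the positive class, everything else negative.
positive_class = 2
y_binary = (y == positive_class).astype(int)

# cv=5 with a classifier gives stratified folds, so each test fold has both labels.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y_binary, cv=5, scoring="roc_auc"
)
print("%f (%f)" % (scores.mean(), scores.std()))
```

Repeating this once per class gives a one-vs-rest AUC for each of the three classes.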

Using ROC AUC score with logistic regression and the Iris dataset
