Question

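Hello, I am running random forest recursive feature elimination (RFECV) with 10-fold cross-validation, so I need to report the mean, standard deviation, and p-value of the accuracy across all 10 folds (for each dataset I use). For the life of me, I can't figure out how to do this.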

Here is the code I am running:

# random forest#######################################
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Build a classification task using 8 informative features
# If you want to reproduce the problem
X, y = make_classification(n_samples=1000, n_features=75, n_informative=8,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

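# Split the data into training and test sets.
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed from scikit-learn.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3)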

# Create the RFE object and compute a cross-validated score.
rfc = RandomForestClassifier(n_estimators=128)
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(10),
              scoring='accuracy')
rfecv.fit(X_train, y_train)

print("Optimal number of features : %d" % rfecv.n_features_)
print(rfecv.ranking_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
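plt.ylabel("Cross validation score (accuracy)")
# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the
# mean cross-validated score for each number of features
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores) + 1), mean_scores)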
plt.show()

# Evaluate the selected feature subset on the held-out test set
ranking = rfecv.ranking_
y_hats = rfecv.predict(X_test)
# predict() already returns class labels, so no rounding is needed
accuracy = accuracy_score(y_test, y_hats)
print("Test Accuracy: %.2f%%" % (accuracy*100.0))

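I added make_classification so that you can reproduce the problem. I am actually using a different dataset. I expect the code to work the same way on it, but I'm not sure, and I included the synthetic data mainly to follow the SO guidelines for posting questions. Apologies in advance if that's not right. Thanks!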

How to get the mean, standard deviation, and p-value of the accuracy score in 10-fold cross-validation
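
One possible way to get the three numbers asked about is sketched below. It is a minimal sketch, not the only valid approach: it assumes the p-value is meant to test whether the cross-validated accuracy beats chance, and it uses scikit-learn's `permutation_test_score` for that. The helper name `summarize_cv_accuracy`, the smaller demo dataset, and the reduced `n_estimators` and `n_permutations` values are illustrative choices made to keep the run short, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     permutation_test_score)


def summarize_cv_accuracy(estimator, X, y, n_splits=10, n_permutations=100,
                          random_state=0):
    """Hypothetical helper: per-fold accuracy, its mean/std, and a p-value."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                         random_state=random_state)
    # One accuracy value per fold
    fold_scores = cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    # Permutation test: refit on label-shuffled data to estimate how often
    # chance alone reaches the observed cross-validated accuracy
    _, _, p_value = permutation_test_score(
        estimator, X, y, cv=cv, scoring="accuracy",
        n_permutations=n_permutations, random_state=random_state)
    return {
        "fold_scores": fold_scores,
        "mean": float(np.mean(fold_scores)),
        "std": float(np.std(fold_scores, ddof=1)),  # sample standard deviation
        "p_value": float(p_value),
    }


# Small synthetic problem (assumed sizes, chosen only for a quick demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_redundant=2,
                                     n_classes=3, n_clusters_per_class=1,
                                     random_state=0)
rfc_demo = RandomForestClassifier(n_estimators=20, random_state=0)
summary = summarize_cv_accuracy(rfc_demo, X_demo, y_demo, n_permutations=20)
print("Accuracy: %.3f +/- %.3f (p = %.3f)"
      % (summary["mean"], summary["std"], summary["p_value"]))
```

The permutation p-value cannot go below 1 / (n_permutations + 1), so a real analysis would likely use more permutations than this demo does. To score the feature-selection step as well, the same helper could be given a fitted-per-fold estimator such as an `RFECV` or a pipeline instead of the plain forest, at a much higher compute cost. In recent scikit-learn versions the per-fold scores of a fitted `RFECV` should also be available in `rfecv.cv_results_` under keys of the form `split0_test_score`.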

0 answers: