I have a dataset with about 800 rows and 16 columns. The dataset is further split into several smaller datasets. Here is a sample of my data:
Score Col1 Col2 Paper Score1
1 2 6 1 1
2 5 0 1 2
3 1 13 1 3
4 1 0 0 4
5 2 6 1 5
6 5 0 1 6
7 1 13 1 7
8 1 0 0 8
9 2 6 1 9
10 5 0 1 10
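To make the sample reproducible, it can be loaded straight from text. This is only a sketch: the single-space separator and the `io.StringIO` wrapper are assumptions about how the table above is laid out, not how my real file is read.

```python
import io

import pandas as pd

# The sample rows from the table above, assumed to be space-separated.
SAMPLE = """Score Col1 Col2 Paper Score1
1 2 6 1 1
2 5 0 1 2
3 1 13 1 3
4 1 0 0 4
5 2 6 1 5
6 5 0 1 6
7 1 13 1 7
8 1 0 0 8
9 2 6 1 9
10 5 0 1 10
"""

df1 = pd.read_csv(io.StringIO(SAMPLE), sep=" ")
print(df1.shape)             # (10, 5)
print(df1.columns.tolist())  # ['Score', 'Col1', 'Col2', 'Paper', 'Score1']
```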
Here is my code:
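import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

df1 = pd.read_csv("C:File.csv")
y = df1["Paper"]                # target column to predict
df_arr = df1['Score1'].values   # values substituted into X one at a time
X = df1.iloc[:, :-2].copy()     # every column except the last two (.ix was removed from pandas)

f1_score_value, Recall_score, Precision_score = [], [], []

for i in df_arr:
    X['score'] = i              # fill the whole column with the current value
    cv = StratifiedKFold(n_splits=3)
    classifier2 = RandomForestClassifier(n_estimators=100,
                                         class_weight="balanced",  # replaces the deprecated "auto"
                                         criterion='gini',
                                         bootstrap=True,
                                         max_features=0.5,
                                         min_samples_split=2,      # current scikit-learn rejects 1
                                         min_samples_leaf=5,
                                         max_depth=10,
                                         n_jobs=1)
    for j, (train, test) in enumerate(cv.split(X, y)):
        # index with the fold's row positions rather than a first:last slice
        x_train, x_test = X.iloc[train], X.iloc[test]
        y_train, y_test = y.iloc[train], y.iloc[test]
        probas1_ = classifier2.fit(x_train, y_train).predict(x_test)
        f1_scoree = f1_score(y_test, probas1_, average='binary')
        p_score = precision_score(y_test, probas1_, average='binary')
        r_score = recall_score(y_test, probas1_, average='binary')
        f1_score_value.append(f1_scoree)
        Recall_score.append(r_score)
        Precision_score.append(p_score)

# np.hstack returns a new array, so keep the result
f1_score_value = np.hstack(f1_score_value)
Recall_score = np.hstack(Recall_score)
Precision_score = np.hstack(Precision_score)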
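The code stores the "Paper" column of the file in y, because that is what we need to predict, and keeps the remaining columns in X. I create an array from the 'Score1' column and exclude that column from the data frame X.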
for i in df_arr:
    X['score'] = i
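This loop replaces the entire first column with the current array value. As a result, I end up with 10 values in each of the Precision, Recall and F1 arrays (because 10 data frames are created). I don't want these individual values, though. What I want is a precision-recall curve for each of these 10 data frames. I can plot the precision-recall curve for a single data frame like this, but I don't know how to produce precision-recall curves for all 10 data frames with this code.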
# y_real1.append(y_test)
# y_proba1.append(probas1_[:, 1])
#
# y_real1 = numpy.concatenate(y_real1)
# y_proba1 = numpy.concatenate(y_proba1)
#
# precision1, recall1, _ = precision_recall_curve(y_real1, y_proba1)
#
# lab2 = 'Random Forest(area = %0.2f)' % (auc(recall1, precision1))
#
# plt.plot(recall1, precision1, label=lab2, lw=2, color='red')
#
# plt.xlim([0, 1.000000001])
# plt.ylim([0, 1.05])
# plt.grid(True)
# plt.xlabel('Recall')
# plt.ylabel('Precision')
# plt.title('Precision Recall curve')
# plt.rcParams['axes.facecolor']='white'
# plt.legend(loc="upper left", bbox_to_anchor=(1,1))
# plt.show()
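One way the single-curve snippet might be extended to all 10 data frames is sketched below. It is a rough outline under several assumptions, not a tested solution for my real data: the helper names `pr_curves_per_value` and `plot_pr_curves` are hypothetical, the labels in "Paper" are assumed to be 0/1 with both classes present in every training fold, and `predict_proba` is used instead of `predict` because `precision_recall_curve` needs scores rather than hard labels. The forest settings are also simplified compared with my code above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import StratifiedKFold


def pr_curves_per_value(X, y, values, column="score", n_splits=3, seed=0):
    """Return one (value, precision, recall, area) tuple per substituted value.

    Hypothetical helper: for each value, the column is filled with that
    value, out-of-fold probabilities are pooled over the folds, and a
    single precision-recall curve is computed from the pooled predictions.
    """
    cv = StratifiedKFold(n_splits=n_splits)
    curves = []
    for value in values:
        X_variant = X.copy()        # leave the caller's data frame untouched
        X_variant[column] = value   # same substitution as X['score'] = i
        y_real, y_proba = [], []
        for train, test in cv.split(X_variant, y):
            clf = RandomForestClassifier(n_estimators=100,
                                         class_weight="balanced",
                                         random_state=seed)
            clf.fit(X_variant.iloc[train], y.iloc[train])
            # column 1 holds the probability of the positive class (label 1)
            y_proba.append(clf.predict_proba(X_variant.iloc[test])[:, 1])
            y_real.append(y.iloc[test].to_numpy())
        precision, recall, _ = precision_recall_curve(np.concatenate(y_real),
                                                      np.concatenate(y_proba))
        curves.append((value, precision, recall, auc(recall, precision)))
    return curves


def plot_pr_curves(curves):
    """Draw every curve on one set of axes and return the figure."""
    fig, ax = plt.subplots()
    for value, precision, recall, area in curves:
        ax.plot(recall, precision, lw=2,
                label="score = %s (area = %0.2f)" % (value, area))
    ax.set_xlim([0, 1.0])
    ax.set_ylim([0, 1.05])
    ax.grid(True)
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_title("Precision Recall curve")
    ax.legend(loc="upper left", bbox_to_anchor=(1, 1))
    return fig
```

With the variables from my code this would be called as `plot_pr_curves(pr_curves_per_value(X, y, df_arr))` followed by `plt.show()`. One caveat I am fairly sure of: a column that holds the same value in every row gives the trees nothing to split on, so the 10 curves produced this way may turn out nearly identical.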