Question

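This is my first StackOverflow question and I need help! I have searched extensively for an answer and experimented on my own, but I hope someone in the community can help.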

This is work for my university dissertation, so any help would be greatly appreciated.

I'll try to summarise:

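I am using scikit-learn classifiers and trying to tune/streamline them with GridSearchCV, to form a baseline for future work with Keras / TensorFlow.
My current problem is with RandomForestClassifier / GridSearchCV.
I am working with a large amount of data: the credit card fraud dataset from Kaggle.
The data is imbalanced, so I oversample with SMOTE so that class 0 and class 1 (fraud) are equal in the training split, about 200,000 each.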

Now to explain the problem:

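When I run this GridSearchCV on a RandomForestClassifier for this data, the recall score is always 1. This means no particular parameter set is selected as "best". I also don't understand why this always happens. The search takes about 6-8 hours to run, so it becomes pointless if every iteration gives recall = 1.
However, when I do a single fit on the data (without GridSearchCV) and predict on the test set, I get scores of 80-84% (again, recall is the metric I care about). That is of course more realistic.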

My thoughts / experiments:

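I tried sampling the data down to 492 per class, and each GSCV iteration then gave about 90%. That seems better, but it is still clearly above average.
I also tried different training set sizes (50,000, 10,000, ...), and every iteration gave recall = 1.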

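My guess is that too much data, overfitting, or something similar is causing this. Alternatively, I think GridSearch may be using the overall / non-fraud classification metric, which is close to 1 in these cases.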

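Here is a picture of the output from running GSCV on the {0: 200,000, 1: 200,000} training set: [image: GSCV each iteration recall=1]. You can see that every fold scores 1, but after testing / predicting with the model we get seemingly valid 80%-ish metrics in the classification report.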
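As a text alternative to reading the verbose log, the per-fold scores are also stored in `cv_results_` after fitting. Below is a minimal sketch on synthetic data; the dataset, the small grid, and the names `X_demo`, `y_demo`, `demo_search` and `per_fold` are assumptions for illustration, not the setup from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced stand-in for the real data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_features": [2, 4], "criterion": ["gini", "entropy"]},
    scoring="recall",  # recall of class 1
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, one column per validation fold.
fold_cols = ["split0_test_score", "split1_test_score", "split2_test_score"]
per_fold = pd.DataFrame(demo_search.cv_results_)[
    ["params"] + fold_cols + ["mean_test_score"]
]
print(per_fold)
```

If every cell in the fold columns is exactly 1.0, as in the screenshot, the validation folds are probably too easy for the model, which points at how the folds were built rather than at the parameter grid.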

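I know the number of fraud cases in the test set is quite small (only a few hundred). But that is because I only oversampled the training data, to keep the test data new (unseen).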

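So, looking at the classification report, I thought GridSearchCV might be using the wrong value (we are interested in the class = 1 metrics). But according to the docs, pos_label=1 is the default for scorers in scikit-learn, so this should not be the problem.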
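The `pos_label` point can be checked directly on a handful of labels. This is a small sketch with made-up toy labels (the lists `y_true` and `y_pred` are assumptions, not data from the question); it only shows that the default recall is the recall of class 1.

```python
from sklearn.metrics import recall_score

# Toy labels for illustration: 4 fraud cases (1) and 6 non-fraud cases (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

recall_default = recall_score(y_true, y_pred)             # pos_label defaults to 1
recall_fraud = recall_score(y_true, y_pred, pos_label=1)  # explicit, same value
recall_legit = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0

print(recall_default, recall_fraud, recall_legit)
```

Here 3 of the 4 fraud cases are found, so both class-1 values are 0.75, while class 0 gets 5/6. So a scorer built with `make_scorer(recall_score, pos_label=1)` is measuring the fraud class, and the recall of 1 must come from somewhere else.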

I have tried custom scorers, the default scorers, etc.

Here is my code (a bit messy, but it should be clear what is going on! Note the commented-out single RF classifier, without GridSearch):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import itertools

data = pd.read_csv("creditcard.csv")

# Standardise the Amount column (zero mean, unit variance; values are not bounded to [-1, 1])
from sklearn.preprocessing import StandardScaler
data['norm_Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the old Amount column and also the Time column as we don't want to include this at this stage
data = data.drop(['Time', 'Amount'], axis=1)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

########################################################
# MODEL SETUP

# Assign variables X and y corresponding to row data and its class value
X = data.drop('Class', axis=1)
y = data['Class']

# Whole dataset, training-test data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from collections import Counter
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=1)
# Note: the whole training split is oversampled here, before cross-validation
X_res, y_res = sm.fit_resample(X_train, y_train)
print('Original dataset shape {}'.format(Counter(data['Class'])))
print('Training dataset shape {}'.format(Counter(y_train)))
print('Resampled training dataset shape {}'.format(Counter(y_res)))



print('Random Forest: ')
from sklearn.ensemble import RandomForestClassifier

# rf = RandomForestClassifier(n_estimators=250, criterion="gini", max_features=3, max_depth=10)

rf = RandomForestClassifier()
param_grid = { "n_estimators"      : [250, 500, 750],
           "criterion"         : ["gini", "entropy"],
           "max_features"      : [3, 5]}

from sklearn.metrics import recall_score, make_scorer
scorer = make_scorer(recall_score, pos_label=1)


grid_search = GridSearchCV(rf, param_grid, n_jobs=1, cv=3, scoring=scorer, verbose=50)
grid_search.fit(X_res, y_res)
print(grid_search.best_params_, grid_search.best_estimator_)

# rf.fit(X_res, y_res)
# y_pred = rf.predict(X_test)
y_pred = grid_search.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
print('Test recall score: ', recall_score(y_test, y_pred))

Thanks,

Harry

Answer 1

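This is an overfitting problem. When you combine cross-validation with oversampling, the oversampling must be applied only to the training data and not to the validation data. For example, in 10-fold cross-validation, nine folds are oversampled and used as the training set, and the remaining fold is used as the validation set without oversampling.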
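The fold-wise idea can be sketched as follows. To keep the example dependent on scikit-learn only, it swaps SMOTE for plain random oversampling via `sklearn.utils.resample`; the helper names `oversample_minority` and `cv_recall_oversampling_inside_folds`, the synthetic data, and the assumption that class 1 is the minority are all illustrative and not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Randomly duplicate class-1 rows until both classes have the same size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_min_up = resample(
        X_min, replace=True, n_samples=len(X_maj), random_state=random_state
    )
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate(
        [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
    )
    return X_bal, y_bal


def cv_recall_oversampling_inside_folds(X, y, n_splits=3, random_state=0):
    """Recall per fold, oversampling only the training part of each split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], random_state)
        model = RandomForestClassifier(n_estimators=50, random_state=random_state)
        model.fit(X_bal, y_bal)
        # The validation fold keeps its original, imbalanced class ratio.
        scores.append(recall_score(y[val_idx], model.predict(X[val_idx])))
    return scores


# Synthetic imbalanced data standing in for the fraud dataset (an assumption).
X_toy, y_toy = make_classification(
    n_samples=900, n_features=10, weights=[0.95, 0.05], random_state=0
)
fold_scores = cv_recall_oversampling_inside_folds(X_toy, y_toy)
print(fold_scores)
```

With imbalanced-learn, a similar effect should be obtainable by putting `SMOTE` and the classifier into an `imblearn.pipeline.Pipeline` and passing that pipeline to GridSearchCV, since samplers in that pipeline are applied only when fitting and not when predicting on the validation fold.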

Problem using GridSearchCV with RandomForestClassifier on large data: the recall score is always 1, so the best parameters become redundant
