显示异常行为的RandomForestClassifier的随机超参数搜索

时间:2019-04-02 01:03:27

标签: python machine-learning scikit-learn

我正在尝试构建一个自定义的随机搜索类,该类模仿scikit-learn中的 RandomizedSearchCV 的某些功能。它并不意味着比scikit-learn内置的更好。我主要是想学习。

我的数据具有以下维度。大量功能可以归因于一种热编码值。

CODE:
print(X.shape)
print(y.shape)

OUT:
(2572, 258)
(2572,)

我已经创建了一个参数网格,如下所示:

CODE:
# define the range of hyperparameters
n_estimators = [np.int64(x) for x in np.linspace(100, 1000, 10)]
max_depth = [np.int64(x) for x in np.linspace(1, 50, 10)]
min_samples_split = np.linspace(0.1, 1, 10)
min_samples_leaf = np.linspace(0.1, 0.5, 10, endpoint=False)
# max_features = [np.int64(x) for x in np.linspace(1, X.shape[1], 10, endpoint=False)]
max_features = [np.int64(x) for x in np.linspace(1, 30, 10, endpoint=False)]

# create a dictionary with all the hyperparameters 
hp_dict = {
           'n_estimators':n_estimators,
           'max_depth':max_depth,
           'min_samples_split':min_samples_split,
           'min_samples_leaf':min_samples_leaf,
           'max_features':max_features
          }

# print the differnt range of hyper parameters used 
for key, value in hp_dict.items():
  print('{:<20} : {}'.format(key, value))


OUT:
n_estimators         : [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
max_depth            : [1, 6, 11, 17, 22, 28, 33, 39, 44, 50]
min_samples_split    : [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
min_samples_leaf     : [0.1  0.14 0.18 0.22 0.26 0.3  0.34 0.38 0.42 0.46]
max_features         : [1, 3, 6, 9, 12, 15, 18, 21, 24, 27]

这是我对RandomizedSearchCV类的实现:

CODE:
class CustomRandomSearch:
  def __init__(self, X, y, clf, hp_dict, n_settings=100, cv=3, best_metric='accuracy', random_seed=None):
    # set the random seed 
    np.random.seed(seed=random_seed)

    # initalize a DataFrame of the metrics
    col_names =['Classifier', 'Accuracy', 'Precision', 'Recall', 'fit_time', 'score_time']
    metrics_df = pd.DataFrame(columns=col_names[1:], index=np.arange(n_settings))
    scoring = {name:metric for name, metric in zip(col_names[1:], ['accuracy', 'precision_macro', 'recall_macro'])}


    self.data_ = X
    self.labels_ = y
    self.settings_cartesian = pd.Series(list(itertools.product(*[value for _, value in hp_dict.items()])))
    self.settings_keys_ = [key for key, _ in hp_dict.items()]
    self.settings_random_ = self.random_settings(self.settings_cartesian, n_settings, random_seed)
    self.clf_name_ = type(clf).__name__
    self.best_metric_ = 0
    self.best_settings_ = None
    self.best_classifier_ = None

    # START the random search process 
    # use k-fold cross validation to evaluate the baseline performance of different models
    print('Random Search| Model: {}| {}-Fold Cross Validation\n'.format(self.clf_name_ ,cv))
    for n, setting in enumerate(self.settings_random_):
      setting_dict = {key:value for key, value in zip(self.settings_keys_, setting)}

      # initalize the classifier using these settings 
      clf.set_params(**setting_dict)
      cv_results = cross_validate(clf, X, y, scoring=scoring, cv=cv, return_train_score=False)
      print('Settings: {}'.format(setting_dict))
      for result_name, result in cv_results.items():
        result_pattern = re.compile('.*(Accuracy|Precision|Recall)')
        result_name_ = result_pattern.findall(result_name)
        # find the mean value of the metric for all cross validation folds
        result_mean = result.mean()
        # if the result one of the scoring metrics we define, add the result to the metrics_df
        if result_name_:
          metrics_df[result_name_[0]][n] = result_mean
          print('\t{}: {}'.format(result_name_[0], result_mean))
          # find the best metric and classifier
          if result_name_[0].lower() == best_metric:
            if result_mean > self.best_metric_:
              self.best_metric_ = result_mean
              self.best_setting_ = setting_dict
              self.best_classifier_ = clf
        else:
          print('\t{}: {}'.format(result_name, result.mean().round(2)))
          metrics_df[result_name][n] = result_mean
      self.metrics_df_ = metrics_df
      print('='*50)
    # END the random search process


  def random_settings(self, settings_cartesian, n_settings, random_seed):
    '''A function to randomly sample the hyperparameters from n_settings which will be used to build models'''
    # find random index which whill be used to get the hyperparameters
    sampled_index = np.random.choice(np.arange(0, len(settings_cartesian)), n_settings, replace=False)
    return settings_cartesian[sampled_index]  

实验1(预期结果)

当我运行以下代码时,我得到了预期的结果。请注意,我仅使用5个随机设置(超参数的组合)来限制总运行时间。

CODE:
hp_dict = {
           'n_estimators':n_estimators,
           'max_depth':max_depth,
#            'min_samples_split':min_samples_split,
#            'min_samples_leaf':min_samples_leaf,
           'max_features':max_features
          }

rsearch = CustomRandomSearch(X, y, RandomForestClassifier(), hp_dict, 5, random_seed=1)
rsearch.metrics_df_


OUT:
Random Search| Model: RandomForestClassifier| 3-Fold Cross Validation

{'n_estimators': 600, 'max_depth': 1, 'max_features': 21}
Settings: {'n_estimators': 600, 'max_depth': 1, 'max_features': 21}
    fit_time: 0.87
    score_time: 0.83
    Accuracy: 0.5236672225712568
    Precision: 0.006341609686704505
    Recall: 0.01196739485509539
==================================================
{'n_estimators': 900, 'max_depth': 6, 'max_features': 24}
Settings: {'n_estimators': 900, 'max_depth': 6, 'max_features': 24}
    fit_time: 3.94
    score_time: 1.35
    Accuracy: 0.5409239034777237
    Precision: 0.024122440405242623
    Recall: 0.018964838417488486
==================================================
{'n_estimators': 500, 'max_depth': 28, 'max_features': 6}
Settings: {'n_estimators': 500, 'max_depth': 28, 'max_features': 6}
    fit_time: 2.82
    score_time: 0.99
    Accuracy: 0.5643549612162997
    Precision: 0.09548924947833685
    Recall: 0.08020801637285205
==================================================
{'n_estimators': 400, 'max_depth': 33, 'max_features': 24}
Settings: {'n_estimators': 400, 'max_depth': 33, 'max_features': 24}
    fit_time: 4.36
    score_time: 0.76
    Accuracy: 0.5645611295019107
    Precision: 0.10931388963435491
    Recall: 0.09537162949195259
==================================================
{'n_estimators': 300, 'max_depth': 22, 'max_features': 6}
Settings: {'n_estimators': 300, 'max_depth': 22, 'max_features': 6}
    fit_time: 1.59
    score_time: 0.61
    Accuracy: 0.5598910737210692
    Precision: 0.10028259405181572  Recall: 0.0844580862573192
==================================================


    Accuracy    Precision   Recall  fit_time    score_time
0   0.523667    0.00634161  0.0119674   0.836494    0.810416
1   0.540924    0.0241224   0.0189648   3.8934  1.38452
2   0.564355    0.0954892   0.080208    2.61455 0.97889
3   0.564561    0.109314    0.0953716   4.24907 0.762426
4   0.559891    0.100283    0.0844581   1.54386 0.589665

我的实施工作完全符合预期。当使用任何未注释的超级参数或其组合时,它工作正常。所谓“效果很好”,是指我的准确度值是不同的。

实验2(奇怪的结果)

当我尝试使用网格中的 min_samples_split min_samples_leaf 调整参数网格时(即使单独使用而没有其他参数),所有精度值都将变为即使其他超级参数差异很大,也是如此。 注意,我已经用 RandomizedSearchCV 进行了检查,似乎正在发生同样的事情。

CODE:
hp_dict = {
           'n_estimators':n_estimators,
           'max_depth':max_depth,
           'min_samples_split':min_samples_split,
           'min_samples_leaf':min_samples_leaf,
           'max_features':max_features
          }

rsearch = CustomRandomSearch(X, y, RandomForestClassifier(), hp_dict, 5, random_seed=1)
rsearch.metrics_df_


OUT:
Random Search| Model: RandomForestClassifier| 3-Fold Cross Validation

{'n_estimators': 500, 'max_depth': 17, 'min_samples_split': 0.7000000000000001, 'min_samples_leaf': 0.33999999999999997, 'max_features': 1}
Settings: {'n_estimators': 500, 'max_depth': 17, 'min_samples_split': 0.7000000000000001, 'min_samples_leaf': 0.33999999999999997, 'max_features': 1}
    fit_time: 0.51
    score_time: 0.71
    Accuracy: 0.5236672225712568
    Precision: 0.006341609686704505
    Recall: 0.01196739485509539
==================================================
{'n_estimators': 900, 'max_depth': 39, 'min_samples_split': 0.30000000000000004, 'min_samples_leaf': 0.38, 'max_features': 24}
Settings: {'n_estimators': 900, 'max_depth': 39, 'min_samples_split': 0.30000000000000004, 'min_samples_leaf': 0.38, 'max_features': 24}
    fit_time: 0.86
    score_time: 1.25
    Accuracy: 0.5236672225712568
    Precision: 0.006341609686704505
    Recall: 0.01196739485509539
==================================================
{'n_estimators': 200, 'max_depth': 22, 'min_samples_split': 0.4, 'min_samples_leaf': 0.14, 'max_features': 21}
Settings: {'n_estimators': 200, 'max_depth': 22, 'min_samples_split': 0.4, 'min_samples_leaf': 0.14, 'max_features': 21}
    fit_time: 0.3
    score_time: 0.28
    Accuracy: 0.5236672225712568
    Precision: 0.006341609686704505
    Recall: 0.01196739485509539
==================================================
{'n_estimators': 900, 'max_depth': 6, 'min_samples_split': 1.0, 'min_samples_leaf': 0.22, 'max_features': 6}
Settings: {'n_estimators': 900, 'max_depth': 6, 'min_samples_split': 1.0, 'min_samples_leaf': 0.22, 'max_features': 6}
    fit_time: 0.83
    score_time: 1.21
    Accuracy: 0.5236672225712568
    Precision: 0.006341609686704505
    Recall: 0.01196739485509539
==================================================
{'n_estimators': 1000, 'max_depth': 28, 'min_samples_split': 0.4, 'min_samples_leaf': 0.18, 'max_features': 3}
Settings: {'n_estimators': 1000, 'max_depth': 28, 'min_samples_split': 0.4, 'min_samples_leaf': 0.18, 'max_features': 3}
    fit_time: 1.05
    score_time: 1.35
    Accuracy: 0.5236672225712568
    Precision: 0.006341609686704505
    Recall: 0.01196739485509539
==================================================


Accuracy    Precision   Recall  fit_time    score_time
0   0.523667    0.00634161  0.0119674   0.506414    0.714159
1   0.523667    0.00634161  0.0119674   0.860566    1.25096
2   0.523667    0.00634161  0.0119674   0.302913    0.28406
3   0.523667    0.00634161  0.0119674   0.834416    1.20777
4   0.523667    0.00634161  0.0119674   1.04618 1.35358

有人,请帮助我了解这到底是怎么回事?

0 个答案:

没有答案