Random forest classifier overfits badly

Time: 2017-12-26 10:02:48

Tags: python random-forest

I have 4 variables as input to a random forest, namely ['superType1', 'superType2', 'superType3', 'superTypeProbability']. The first three columns are Wikidata item IDs and the last is a probability column. I use GridSearchCV for best-parameter selection. However, despite trying all the available options, the model overfits badly. The feature 'superTypeProbability' is what overfits here. Still, I want to use this feature, because in my case it is the only one that can improve the RF's performance.

roc_auc_score_train:0.994399847095
roc_auc_score_validation:0.402392359246

Using only the 'superTypeProbability' feature with logistic regression gives the following ROC:

roc_auc_score_train for only superTypeProbability feature: 0.762852724493
roc_auc_score_validation for only superTypeProbability feature: 0.691760825723

Using only the first 3 features, the RF gives this ROC:

roc_auc_score_train:0.974928760078
roc_auc_score_validation:0.790185294454

My RF code is as follows:

    # (assumes module-level: import pandas as pd, import sklearn.grid_search,
    #  from sklearn import metrics, from sklearn.ensemble import RandomForestClassifier,
    #  and a configured logger)
    def train_model(self):
        logger.info("Using random forest classifier......")
        train = self.feature_preprocessing(self.train)
        # Feature list must match the one used in predict_probabilities() below;
        # 'superTypeProbability' is the column that appears to cause the overfitting
        X_train = pd.DataFrame(data=train, columns=['superType1', 'superType2', 'superType3', 'superTypeProbability'])
        logger.info("Using features: %s", X_train.columns)
        y_train = train['ROLLBACK_REVERTED']

        rfc = RandomForestClassifier(n_jobs=-1, max_features=None, n_estimators=1000, oob_score=True,
                                     random_state=50, min_samples_leaf=1, max_depth=9)

        param_grid = {
            'n_estimators': [500, 600, 700, 800],
            'max_depth': [8, 9, 10, 11],
            'min_samples_leaf': [1],
        }

        search = sklearn.grid_search.GridSearchCV(rfc, param_grid, n_jobs=-1, verbose=0, scoring='roc_auc', cv=3)
        search.fit(X_train, y_train)

        logger.info("All Scores: %s", search.grid_scores_)
        logger.info("Best Score: %s", search.best_score_)
        logger.info("Best Params: %s", search.best_params_)

        predictedProbVal = search.predict_proba(X_train)
        roc_auc_score_train = metrics.roc_auc_score(y_train, predictedProbVal[:, 1])
        logger.info("roc_auc_score_train:%s", roc_auc_score_train)

        validationProb = self.predict_probabilities(search)
        return validationProb

    def predict_probabilities(self, rfModel):
        validation = self.feature_preprocessing(self.validation)
        X_val = pd.DataFrame(data=validation, columns=['superType1', 'superType2', 'superType3', 'superTypeProbability'])
        y_val = validation['ROLLBACK_REVERTED']

        # Predict the result for test data
        predictedProbVal = rfModel.predict_proba(X_val)
        validation['vandalismScore'] = predictedProbVal[:, 1]
        roc_auc_score_val = metrics.roc_auc_score(y_val, predictedProbVal[:, 1])
        logger.info("roc_auc_score_validation:%s", roc_auc_score_val)
        return validation
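As a side note, `sklearn.grid_search` was deprecated in scikit-learn 0.18 and removed in 0.20. A hedged sketch of the same grid search on the modern API (synthetic data stands in for the asker's features; `grid_scores_` became `cv_results_`, while `best_score_` and `best_params_` are unchanged):

```python
# Sketch: equivalent grid search via sklearn.model_selection.GridSearchCV.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(50)
X_train = rng.rand(300, 3)
y_train = (X_train[:, 0] + 0.1 * rng.randn(300) > 0.5).astype(int)

rfc = RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=50)
param_grid = {
    'n_estimators': [100, 200],   # smaller grid than the question's, for speed
    'max_depth': [8, 9],
    'min_samples_leaf': [1],
}
search = GridSearchCV(rfc, param_grid, n_jobs=-1, scoring='roc_auc', cv=3)
search.fit(X_train, y_train)

print(search.best_score_)   # mean cross-validated roc_auc of the best combo
print(search.best_params_)
print(search.cv_results_['mean_test_score'])  # replaces grid_scores_
```

The cross-validated `best_score_` is a far better overfitting check than rescoring on the training set.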

1 Answer:

Answer 0 (score: 0):

The problem occurs with superTypeProbability. I solved it by making some changes to the superTypeProbability feature; now ROC = 0.82. Before computing superTypeProbability, I compute a typeProbability feature. Using typeProbability with the random forest gives ROC = 0.74, and I wanted to improve on that. This feature has many NaN values, e.g. 500 NaNs out of 1000 rows. To reduce this number I derived the new feature superTypeProbability: if both typeProbability and superTypeProbability exist, the higher value is assigned to superTypeProbability. With this, the NaN count for superTypeProbability drops to 300 out of 1000. Previously, to fill those NaNs, I was substituting the average superTypeProbability value, which is smaller than the average typeProbability value. That is what caused the problem. So I am now filling the NaNs with the average typeProbability:
features['superTypeProbability'] = features['superTypeProbability'].fillna(features['typeProbability'][features.typeProbability!='None'].mean())
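The full derivation described in the answer can be sketched as follows. This is a hedged reconstruction on a tiny toy frame, not the answerer's actual code; the "take the higher value where both exist" step is implemented with a boolean mask and row-wise `max`:

```python
# Sketch: derive superTypeProbability, then fill remaining NaNs with the
# mean typeProbability (the answer's fix), not the mean superTypeProbability.
import numpy as np
import pandas as pd

features = pd.DataFrame({
    'typeProbability':      [0.4, 0.6, np.nan, 0.2],
    'superTypeProbability': [0.5, np.nan, np.nan, np.nan],
})

# Where both probabilities exist, keep the higher one in superTypeProbability
both = features['typeProbability'].notna() & features['superTypeProbability'].notna()
features.loc[both, 'superTypeProbability'] = (
    features.loc[both, ['typeProbability', 'superTypeProbability']].max(axis=1)
)

# Fill the remaining NaNs with the average typeProbability
fill_value = features['typeProbability'].mean()
features['superTypeProbability'] = features['superTypeProbability'].fillna(fill_value)
```

On this toy frame the mean typeProbability is 0.4, so the last three rows end up at 0.4 while the first keeps its higher value of 0.5.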