Question

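I am new to machine learning and Python. I am trying to apply a random forest to predict a binary target. My data has 24 predictors (1000 observations); one is categorical (gender) and all the others are numeric. The numeric features are of two kinds: monetary amounts in euros (highly skewed and scaled) and counts (number of ATM transactions). I have transformed the large-scale features and done imputation. Finally, I checked correlation and collinearity and removed some features on that basis (which is how I ended up with 24 features). Now, when I fit the RF, it is always perfect on the training set, while the cross-validation scores are not so good. When I apply it to the test set, it also gives very low recall. How should I fix this?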

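import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def classification_model(model, data, predictors, outcome):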
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on training set:
    predictions = model.predict(data[predictors])

    # Print accuracy on the training set (optimistic: the model has already seen this data)
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        # Filter training data
        train_predictors = (data[predictors].iloc[train, :])

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record the accuracy (model.score) from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])



outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model,train,predictor_var,outcome_var)

#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score=True)
predictor_var = ['fet1','fet2','fet3','fet4']
classification_model(model,train,predictor_var,outcome_var)

Answer 1

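Random forests can overfit easily. To address this, you need to do a more rigorous parameter search to find the best parameters to use. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html) is a link showing how to do that (from the scikit-learn docs).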
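To see the gap between training and cross-validated scores for yourself, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` as a stand-in for your data (1000 rows, 24 features, an imbalanced target), and the `max_depth` and `min_samples_leaf` values are only illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Assumption: synthetic stand-in for the real data (1000 rows, 24 features).
X, y = make_classification(n_samples=1000, n_features=24, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def train_vs_cv(model):
    """Return (mean training recall, mean held-out recall) across the folds."""
    scores = cross_validate(model, X, y, cv=cv, scoring="recall",
                            return_train_score=True)
    return scores["train_score"].mean(), scores["test_score"].mean()

# Fully grown trees versus trees limited in depth and leaf size.
deep = RandomForestClassifier(n_estimators=20, random_state=0)
shallow = RandomForestClassifier(n_estimators=20, max_depth=4,
                                 min_samples_leaf=10, random_state=0)

deep_train, deep_cv = train_vs_cv(deep)
shallow_train, shallow_cv = train_vs_cv(shallow)
print(f"unconstrained: train recall {deep_train:.3f}, CV recall {deep_cv:.3f}")
print(f"constrained:   train recall {shallow_train:.3f}, CV recall {shallow_cv:.3f}")
```

A large difference between the two numbers for a model is the usual sign of overfitting. Only the held-out score says anything about how the model will behave on your test set.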

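You are overfitting, so you need to search for the best parameters for your model. The link provides implementations of both grid search and randomized search for hyperparameter estimation. It is also worth going through this MIT artificial intelligence lecture for deeper theoretical grounding: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s.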
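A randomized search along the lines of the linked example could look like the sketch below. The data is again a synthetic stand-in, and the parameter ranges, `n_iter=10`, and the choice of `scoring="recall"` (because recall is the metric you said is low) are my assumptions, so adjust them to your problem.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Assumption: synthetic stand-in for the real data (1000 rows, 24 features).
X, y = make_classification(n_samples=1000, n_features=24, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)

# Illustrative search space; each candidate is sampled at random from these.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 20),
    "max_features": ["sqrt", "log2", None],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                 # number of sampled parameter settings
    scoring="recall",          # optimise the metric that matters here
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated recall:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the forest refit on all the data with the best parameters found. Evaluate it once on your held-out test set rather than on the data used for the search.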

Hope this helps!

How to apply random forest properly?
