How do I apply a random forest correctly?

Date: 2017-05-23 16:49:59

Tags: python-3.x scikit-learn random-forest

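I am new to machine learning and Python. I am trying to apply a random forest to predict a binary outcome. My data has 24 predictors (1000 observations); one is categorical (gender) and all the others are numeric. The numeric ones come in two kinds: monetary amounts in euros (heavily skewed and on a large scale) and counts (number of ATM transactions). I have transformed the large-scale features and done imputation. Finally, I checked correlation and collinearity and removed some features based on that (which is how I ended up with 24 features). Now, when I fit the RF, it is always perfect on the training set, while the cross-validated scores are not so good. When I apply it to the test set, it also gives a very low recall. How should I fix this?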

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold


def classification_model(model, data, predictors, outcome):
    # Fit the model on the full data:
    model.fit(data[predictors], data[outcome])

    # Make predictions on the training set (the same rows the model was fit on):
    predictions = model.predict(data[predictors])

    # Print training accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : {0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in kf.split(data):
        # Select the training rows for this fold
        train_predictors = data[predictors].iloc[train_idx, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train_idx]

        # Train the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record the accuracy on the held-out rows of this fold
        scores.append(model.score(data[predictors].iloc[test_idx, :], data[outcome].iloc[test_idx]))

    print("Cross-Validation Score : {0:.3%}".format(np.mean(scores)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])



outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model,train,predictor_var,outcome_var)

#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score = True)
predictor_var = ['fet1','fet2','fet3','fet4']
classification_model(model,train,predictor_var,outcome_var)

1 answer:

Answer 0 (score: 0)

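A random forest can overfit easily. To address this, you need a more rigorous parameter search to find the best parameters to use. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html) is a link showing how to do that (from the scikit-learn docs).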

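To deal with overfitting, you need to search for the parameters that work best for your model. The linked example implements both grid search and randomized search for hyperparameter estimation. It is also worth going through the MIT artificial intelligence lecture for deeper theoretical grounding: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s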
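
As a rough illustration of such a search, here is a minimal sketch using `RandomizedSearchCV`. Everything specific in it is an assumption rather than something from the question: the data is a synthetic stand-in generated with `make_classification` (1000 rows, 24 features, an imbalanced binary target), the parameter ranges are arbitrary starting points, and `scoring="recall"` and `class_weight="balanced"` are picked only because the question complains about low recall.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)

# Synthetic stand-in for the asker's data (assumption): 1000 rows,
# 24 features, roughly 20% positives.
X, y = make_classification(n_samples=1000, n_features=24, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical search ranges: shallower trees and larger leaves
# restrict how closely each tree can fit the training rows.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(3, 12),
    "min_samples_leaf": randint(1, 20),
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                 # number of sampled parameter settings
    scoring="recall",          # optimise the metric the question cares about
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated recall: %.3f" % search.best_score_)

# The held-out test set is touched only once, after the search is done.
test_recall = recall_score(y_test, search.best_estimator_.predict(X_test))
print("Held-out recall: %.3f" % test_recall)
```

The score to compare between settings is the cross-validated one, not the training accuracy: a forest with unrestricted tree depth will usually score close to 100% on the rows it was fit on, whatever its performance on new data.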

Hope this helps!