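I am new to machine learning and Python. I am trying to apply a random forest to predict a binary target. My data has 24 predictors (1000 observations); one is categorical (gender) and all the others are numeric. The numeric features are of two kinds: monetary amounts in euros (heavily skewed and on a large scale) and counts (the number of ATM transactions). I have transformed the large-scale features and imputed missing values. Finally, I checked correlation and collinearity and dropped some features on that basis (which is how I ended up with 24 features). Now, when I fit the random forest, it is always perfect on the training set, while the cross-validation scores are not nearly as good. When I apply it to the test set, recall is also very low. How should I fix this?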
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold  # scikit-learn >= 0.18 API


def classification_model(model, data, predictors, outcome):
    # Fit the model on the full data set
    model.fit(data[predictors], data[outcome])
    # Make predictions on the same (training) data
    predictions = model.predict(data[predictors])
    # Print training accuracy (optimistic: the model has already seen these rows)
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))
    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in kf.split(data):
        # Select the training rows for this fold
        train_predictors = data[predictors].iloc[train_idx, :]
        # The target we're using to train the algorithm
        train_target = data[outcome].iloc[train_idx]
        # Train the algorithm on this fold's predictors and target
        model.fit(train_predictors, train_target)
        # Record the accuracy on this fold's held-out rows
        scores.append(model.score(data[predictors].iloc[test_idx, :], data[outcome].iloc[test_idx]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))
    # Fit the model again so that it can be referred to outside the function
    model.fit(data[predictors], data[outcome])
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# `train` is the training DataFrame (its construction is not shown here)
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model, train, predictor_var, outcome_var)

# Create a series with feature importances
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

# Refit using a subset of the features and a depth limit
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score=True)
predictor_var = ['fet1', 'fet2', 'fet3', 'fet4']
classification_model(model, train, predictor_var, outcome_var)
Answer 0 (score: 0)
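A random forest can overfit easily, and a perfect score on the training set combined with a much weaker cross-validation score is the typical symptom. To address it, search the hyperparameters more rigorously to find the settings that generalize best. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html) is a link showing how to do that (from the scikit-learn docs).
The linked example implements both grid search and randomized search for hyperparameter estimation. For a deeper theoretical grounding, the MIT artificial intelligence lecture is also worth watching: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s.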
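As a rough illustration of the randomized search described above, here is a minimal sketch. It makes several assumptions that are not in the question: the data is replaced by a synthetic, imbalanced set from `make_classification` (1000 rows, 24 features), the parameter ranges are arbitrary starting points, and the search is scored on recall because that is the metric the asker reports as low. Trying `class_weight="balanced"` is likewise only a guess that class imbalance contributes to the low recall.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the real data (assumption): 1000 rows, 24 features,
# roughly 20% positives.
X, y = make_classification(n_samples=1000, n_features=24, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Illustrative search space (assumption): shallower trees and larger leaves
# constrain the forest and usually narrow the train/validation gap.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": randint(1, 20),
    "max_features": ["sqrt", "log2", 0.5],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,                      # number of sampled parameter settings
    scoring="recall",              # optimize the metric of interest
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated recall: %.3f" % search.best_score_)

# The held-out test set is used once, after the search has finished.
test_recall = recall_score(y_test, search.best_estimator_.predict(X_test))
print("Held-out recall: %.3f" % test_recall)
```

With real data, `X` and `y` would come from `train[predictor_var]` and `train['Sold']`. The stratified, shuffled folds keep the class ratio similar in every fold, which matters more for recall than for accuracy.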
Hope this helps!