Binary classification objective specifically targeting false positives

Date: 2017-08-30 11:06:12

Tags: optimization machine-learning scikit-learn classification

I'm a bit confused about using sklearn models: how do I set a specific optimization function? For example, when using RandomForestClassifier, how can I tell the model "I want to maximize recall", or "F1 score", or "AUC", rather than "accuracy"?

Any suggestions? Thanks.

2 answers:

Answer 0 (score: 3)

What you are looking for is parameter tuning. Basically, you first select an estimator, then define a hyper-parameter space (i.e. all the parameters and their respective candidate values that you want to tune), a cross-validation scheme, and a scoring function. Then, depending on how you want to search the parameter space, you can choose one of the following:

Exhaustive Grid Search: In this approach, sklearn creates a grid of all possible combinations of the hyper-parameter values defined by the user, via the GridSearchCV method. For instance:

# Illustrative parameter grid; the exact values here are assumptions,
# but the keys match the pipeline-step parameters discussed below
param_grid = {
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__max_leaf_nodes': [10, 50, 100],
    'classifier__max_depth': [3, 5, None],
}

In this case, the grid specified is a cross-product of values of classifier__min_samples_split, classifier__max_leaf_nodes and classifier__max_depth. The documentation states that:

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.

An example of using GridSearchCV (a sketch; the estimator, grid values, and training data here are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, None], 'min_samples_split': [2, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring='recall', cv=5)
search.fit(X_train, y_train)  # X_train, y_train: your training data
print(search.best_params_)

You can read more in its documentation here, including the various attributes and methods for retrieving the best parameters.

Randomized Search: Instead of exhaustively checking the hyper-parameter space, sklearn implements RandomizedSearchCV to perform a randomized search over the parameters. The documentation states that:

RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.

You can read more about it from here.

You can read more about other approaches here.

Edit: In your case, if you want to maximize recall for the model, you simply pass scoring='recall' (which corresponds to recall_score from sklearn.metrics) as the scoring function to the search.

If you wish to optimize for 'False Positives', as stated in your question, you can refer to this answer to extract the false positives from the confusion matrix. Then wrap that in the make_scorer function and pass it to the GridSearchCV object for tuning.
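As a sketch of that idea (the scorer below is an assumption about how one might penalize false positives, not code from the linked answer; GridSearchCV maximizes the score, so the count is negated):

```python
from sklearn.metrics import confusion_matrix, make_scorer

def neg_false_positives(y_true, y_pred):
    # labels=[0, 1] fixes the matrix layout even if a CV fold lacks a class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return -fp  # negate: fewer false positives -> higher score

fp_scorer = make_scorer(neg_false_positives)
# Then pass it to the search, e.g.:
# GridSearchCV(estimator, param_grid, scoring=fp_scorer, cv=5)
```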

Answer 1 (score: -2)

I suggest you grab a cup of coffee and read (and understand) the following:

http://scikit-learn.org/stable/modules/model_evaluation.html

You need to use something like

from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y, scoring='f1')

Possible choices are (check the documentation):

['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 
'average_precision', 'completeness_score', 'explained_variance', 
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 
'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 
'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 
'neg_mean_squared_log_error', 'neg_median_absolute_error', 
'normalized_mutual_info_score', 'precision', 'precision_macro', 
'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 
'recall', 'recall_macro', 'recall_micro', 'recall_samples', 
'recall_weighted', 'roc_auc', 'v_measure_score']
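For instance, a minimal sketch of scoring a model by F1 (the dataset and estimator here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring='f1', cv=5)
print(scores.mean())  # mean F1 across the 5 folds
```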

Have fun!