How to create a custom scoring function in scikit-learn that scores a set of instances based on their individual attributes?

Time: 2018-01-26 19:18:20

Tags: python numpy scikit-learn decision-tree scoring

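I am trying to run GridSearchCV to optimize the hyperparameters of a classifier, and this should be done by optimizing a custom scoring function. The problem is that the scoring function assigns a specific cost to each instance, and that cost differs from instance to instance (the cost is also a feature of each instance). As the example below shows, an additional array, test_amt, is needed to hold the cost of each instance (in addition to the "normal" scoring-function arguments y and y_pred):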

    def calculate_costs(y_test, y_test_pred, test_amt):
        # All three arguments are pandas Series of equal length
        cost = 0

        # Start at 0 so that the first instance is not skipped
        for i in range(len(y_test)):
            y = y_test.iloc[i]
            y_pred = y_test_pred.iloc[i]
            x_amt = test_amt.iloc[i]

            if y == 0 and y_pred == 0:
                cost -= x_amt * 1.1
            elif y == 0 and y_pred == 1:
                cost += x_amt
            elif y == 1 and y_pred == 0:
                cost += x_amt * 1.1
            elif y == 1 and y_pred == 1:
                cost += 0
            else:
                print("ERROR! No cost could be assigned to the instance: " + str(i))
        return cost

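When I call this function with the three arrays after training, it calculates the total cost produced by the model perfectly. Integrating it into GridSearchCV is difficult, however, because a scoring function only takes two arguments. Although it is possible to pass additional kwargs to the scorer, I don't know how to pass a subset that depends on the split GridSearchCV is currently working on.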
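
For reference, the same cost rules can be written without the Python loop. This is only a sketch: `calculate_costs_vectorized` is a hypothetical name, and unlike the loop above it silently assigns zero cost to labels other than 0 and 1 instead of printing an error. It does not solve the splitting problem by itself, since it still needs the amounts that belong to the current fold.

```python
import numpy as np


def calculate_costs_vectorized(y_true, y_pred, amounts):
    """Total cost under the same rules as calculate_costs, using NumPy arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amounts = np.asarray(amounts, dtype=float)

    per_instance = np.select(
        [
            (y_true == 0) & (y_pred == 0),  # cost is reduced by 1.1 * amount
            (y_true == 0) & (y_pred == 1),  # cost grows by the amount
            (y_true == 1) & (y_pred == 0),  # cost grows by 1.1 * amount
        ],
        [-1.1 * amounts, amounts, 1.1 * amounts],
        default=0.0,  # y_true == 1 and y_pred == 1: no cost
    )
    return per_instance.sum()
```

It accepts lists, NumPy arrays, or pandas Series, because every input is converted with `np.asarray` first.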

What I have / have tried so far:

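  1. Wrapping the whole pipeline in a class that holds a globally stored pandas.Series object containing the cost of each instance, keyed by index. In theory, the cost of an instance could then be looked up by calling it with the same index. Unfortunately this does not work, because scikit-learn converts everything to NumPy arrays.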

    def calculate_costs_class(self, y_test, y_test_pred):
        # Method of the wrapper class; self.test_amt is the pandas.Series of costs
        cost = 0
        for index, _ in y_test.items():  # Series.iteritems() was removed in pandas 2.0
            y = y_test.loc[index]
            y_pred = y_test_pred.loc[index]
            x_amt = self.test_amt.loc[index]

            if y == 0 and y_pred == 0:
                cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
            elif y == 0 and y_pred == 1:
                cost += x_amt  # +revenue
            elif y == 1 and y_pred == 0:
                cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +charge cost
            elif y == 1 and y_pred == 1:
                cost += 0  # nothing
            else:
                print("ERROR! No cost could be assigned to the instance: " + str(index))
        return cost
    
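  2. Creating a custom PseudoInt class to use as the data type of the labels. It inherits all properties of int but can also store the cost of the instance (while keeping all of its properties so that logical operations still apply). This works outside of scikit-learn, but the check_classification_targets method in scikit-learn raises ValueError: Unknown label type: 'unknown'.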

    class PseudoInt(int):
        def __new__(cls, x, cost, *args, **kwargs):
            instance = int.__new__(cls, x, *args, **kwargs)
            instance.cost = cost
            return instance
    
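  3. Not tried yet, but considered: since the cost is also a feature in the instance set X, it is also available in __call__. If I reprogrammed the call function to pass the cost array, as a subset of X, to score_func, I would have the costs there as well.

  4. Alternatively: I could implement everything myself.

  5. Do you have an easier solution? Thanks!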
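
A possible sketch of idea 3 that avoids reprogramming any scikit-learn internals: `scoring` in GridSearchCV also accepts a plain callable with the signature `(estimator, X, y)`, and that callable receives the X of the current validation fold. The names `amount_cost_scorer` and `AMOUNT_COL` are hypothetical, and the sketch assumes the amount sits in a known column of a numeric X and that the cost rules are those of `calculate_costs` above.

```python
import numpy as np

AMOUNT_COL = 0  # assumption: position of the amount feature within X


def amount_cost_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature; returns the negative total cost."""
    X = np.asarray(X)
    y = np.asarray(y)
    y_pred = np.asarray(estimator.predict(X))
    amounts = X[:, AMOUNT_COL].astype(float)  # costs of exactly the rows in this fold

    cost = np.select(
        [
            (y == 0) & (y_pred == 0),
            (y == 0) & (y_pred == 1),
            (y == 1) & (y_pred == 0),
        ],
        [-1.1 * amounts, amounts, 1.1 * amounts],
        default=0.0,  # y == 1 and y_pred == 1: no cost
    ).sum()
    return -cost  # model selection maximizes the score, so the cost is negated
```

It would then be passed as `scoring=amount_cost_scorer` when constructing GridSearchCV. Because the scorer is handed the fold's own X, the amounts always line up with the labels of that split. The amount column is still used as a feature by the classifier, which matches the setup described in the question.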

1 Answer:

Answer 0 (score: 0)

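I found a way to solve the problem by following the second approach from the question: passing a PseudoInt to scikit-learn that has all the properties of a normal integer when it is compared or used in mathematical operations. It also acts as a wrapper around int and can store instance variables (such as the cost of the instance). As already stated in the question, this causes scikit-learn to recognize that the values in the passed label array are actually of type object rather than int. So I simply replaced the test at line 273 of the type_of_target(y) method in scikit-learn's multiclass.py so that it returns 'binary' even though the input does not pass the test. Scikit-learn then simply treats the whole problem as a binary classification problem (which it should be). Lines 269-273 of the type_of_target(y) method in multiclass.py now look like this: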

# Invalid inputs
if y.ndim > 2 or (y.dtype == object and len(y) and
                  not isinstance(y.flat[0], string_types)):
    # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    return 'binary' # Sneaky, modified to force binary classification.

My code looks like this:

import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer  # sklearn.metrics.scorer was removed in later releases


class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass


def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
    y = np.empty(len(y_normal), dtype=PseudoInt)
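    # Series.iteritems() was removed in pandas 2.0; fill y by position so a non-default index also works
    for position, (index, value) in enumerate(y_normal.items()):
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[position] = new_int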

    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()

    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)  # Custom cost function (Lower cost is better)

    # Define pipeline
    pipe = Pipeline([('clf', clf)])

    # Grid search grid with any hyper parameters or other settings
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}  # the pipeline's only step is named 'clf'
    ]

    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)

    # run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.to_numpy(), y_train)  # DataFrame.as_matrix() was removed in pandas 1.0
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Score (negative cost): ' + str(gs.best_score_))

    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.to_numpy())

    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)


def get_scores(y_test, y_test_pred):
    print("Getting scores")

    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision      " + str(precision))
    print("Recall         " + str(recall))
    print("Accuracy       " + str(accuracy))
    print("F1_Score       " + str(f1_score))

    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings   " + str(-cost))

    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)


def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0

    for index in range(len(y_test)):
        # print(index)
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost

        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))

    # print("Cost: " + str(cost))
    return cost
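
To illustrate why the cost survives the cross-validation split, here is a small stand-alone check. It is a sketch with made-up amounts, not part of the grid search above: the labels sit in an object array, so selecting rows by index (as a CV splitter does) hands back the very same PseudoInt objects, each still carrying its cost.

```python
import numpy as np


class PseudoInt(int):
    """An int that can also carry instance attributes such as a cost."""


labels = [0, 1, 0, 1]
amounts = [10.0, 20.0, 30.0, 40.0]  # made-up example amounts

# An object array keeps the PseudoInt instances instead of converting them
y = np.empty(len(labels), dtype=object)
for i, (label, amount) in enumerate(zip(labels, amounts)):
    value = PseudoInt(label)
    value.cost = amount
    y[i] = value

fold = y[[1, 3]]  # index-based selection, like the rows of one CV fold
print(fold[0] == 1, fold[0].cost)  # True 20.0
print(fold[1] == 1, fold[1].cost)  # True 40.0
```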

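Update

Instead of changing the file in the package directly (which is a bit dirty), I now add the following as the first import lines of my project: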

import sklearn.utils.multiclass

def return_binary(y):
    return "binary"

sklearn.utils.multiclass.type_of_target = return_binary

This overrides the type_of_target(y) method in sklearn.utils.multiclass so that it always returns 'binary'. Note that it has to come before all other sklearn imports.
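
The ordering requirement is presumably due to modules that bind the function with a `from ... import type_of_target` statement: such a module keeps whatever function object existed at the moment it was imported. The sketch below shows the effect with stand-in modules (`fake_utils` and `fake_consumer` are hypothetical names, not scikit-learn code).

```python
import types

# Stand-in for a library's utility module (hypothetical, not sklearn itself)
utils = types.ModuleType("fake_utils")
utils.type_of_target = lambda y: "unknown"


def import_consumer():
    """Mimic a module that runs `from fake_utils import type_of_target`."""
    consumer = types.ModuleType("fake_consumer")
    consumer.type_of_target = utils.type_of_target  # name is bound at import time
    return consumer


early = import_consumer()                    # "imported" before the patch
utils.type_of_target = lambda y: "binary"    # the monkey patch
late = import_consumer()                     # "imported" after the patch

print(early.type_of_target([1]))  # unknown
print(late.type_of_target([1]))   # binary
```

Only the consumer created after the patch sees the replacement, which is why the patch has to run first.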