Question

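I am trying to run GridSearchCV to optimize the hyperparameters of a classifier, and this should be done by optimizing a custom scoring function. The problem is that the scoring function is cost-based, and the cost differs for each instance (the cost is also a feature of each instance). As the example below shows, an additional array test_amt holding the cost of each instance is needed, on top of the "normal" scoring function arguments y and y_pred.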

    def calculate_costs(y_test, y_test_pred, test_amt):
        cost = 0

        # Start at 0 so that the first instance is counted as well
        for i in range(len(y_test)):
            y = y_test.iloc[i]
            y_pred = y_test_pred.iloc[i]
            x_amt = test_amt.iloc[i]

            if y == 0 and y_pred == 0:
                cost -= x_amt * 1.1
            elif y == 0 and y_pred == 1:
                cost += x_amt
            elif y == 1 and y_pred == 0:
                cost += x_amt * 1.1
            elif y == 1 and y_pred == 1:
                cost += 0
            else:
                print("ERROR! No cost could be assigned to the instance: " + str(i))
        return cost
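
For illustration, the same four cases can be written without an explicit loop. This is only a sketch: it assumes NumPy is available and that labels are strictly 0 or 1, and the name calculate_costs_vectorized and the toy data are made up for this example.

```python
import numpy as np


def calculate_costs_vectorized(y_true, y_pred, amounts):
    """Total cost over all instances, using the same four cases as calculate_costs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amounts = np.asarray(amounts, dtype=float)

    cost = np.zeros_like(amounts)
    tn = (y_true == 0) & (y_pred == 0)  # correctly accepted: cost goes down
    fp = (y_true == 0) & (y_pred == 1)  # wrongly rejected: amount is added
    fn = (y_true == 1) & (y_pred == 0)  # wrongly accepted: amount plus 10 percent
    cost[tn] = -1.1 * amounts[tn]
    cost[fp] = amounts[fp]
    cost[fn] = 1.1 * amounts[fn]
    # The (1, 1) case contributes nothing, so those entries stay at 0.
    return float(cost.sum())


# Toy data, one instance per case (hypothetical values)
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 0, 1]
amounts = [10.0, 20.0, 30.0, 40.0]
total = calculate_costs_vectorized(y_true, y_pred, amounts)
print(total)  # roughly -11 + 20 + 33 + 0
```

The three arrays still have to be aligned row by row, which is exactly the part that is hard to guarantee inside GridSearchCV.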

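When I call this function with the three arrays after training, it computes the total cost incurred by the model perfectly. Integrating it into GridSearchCV is difficult, however, because a scoring function only takes two arguments. It is possible to pass additional kwargs to the scorer, but I do not see how to pass the subset that corresponds to the split GridSearchCV is currently working on.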

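What I have or have tried so far:

Wrapping the whole pipeline in a class that holds a globally stored pandas.Series object, which stores the cost of each instance by index. In theory, the cost of an instance could then be looked up by using the same index. Unfortunately this does not work, because scikit-learn converts everything to numpy arrays, so the index is lost.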

    # Method of the wrapper class; self.test_amt is the pandas.Series of per-instance costs
    def calculate_costs_class(self, y_test, y_test_pred):
        cost = 0
        # Series.items() replaces iteritems(), which newer pandas versions no longer provide
        for index, _ in y_test.items():
            y = y_test.loc[index]
            y_pred = y_test_pred.loc[index]
            x_amt = self.test_amt.loc[index]

            if y == 0 and y_pred == 0:
                cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
            elif y == 0 and y_pred == 1:
                cost += x_amt  # +revenue
            elif y == 1 and y_pred == 0:
                cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +charge cost
            elif y == 1 and y_pred == 1:
                cost += 0  # nothing
            else:
                print("ERROR! No cost could be assigned to the instance: " + str(index))
        return cost

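Creating a custom PseudoInt class as the data type of the labels. It inherits all properties of int, but can also store the cost of the instance (while keeping all the properties needed for comparisons and logical operations). This works outside of scikit-learn, but the check_classification_targets method in scikit-learn raises ValueError: Unknown label type: 'unknown'.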
```
class PseudoInt(int):
    def __new__(cls, x, cost, *args, **kwargs):
        instance = int.__new__(cls, x, *args, **kwargs)
        instance.cost = cost
        return instance
```
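
A quick check of how such a subclass behaves in plain Python, as a minimal sketch. The class is repeated here in a slightly simplified form so the snippet runs on its own; the label and cost values are made up.

```python
class PseudoInt(int):
    """An int that also carries a per-instance cost (same idea as the class above)."""

    def __new__(cls, x, cost):
        instance = super().__new__(cls, x)
        instance.cost = cost
        return instance


label = PseudoInt(1, cost=25.0)

print(label == 1)              # True: compares like a normal int
print(label.cost)              # 25.0: the attached cost is still there
print(isinstance(label, int))  # True

# Arithmetic returns a plain int, so the cost does not survive operations.
result = label + 0
print(type(result))            # <class 'int'>
```

So the cost only travels with the label as long as the original objects are passed around untouched, which is why the labels end up in an object array rather than an integer array.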
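What I have not tried yet but have thought about: since the cost is also a feature in the instance set X, it is available in the scorer's __call__ as well. If I reprogrammed the call function to pass the cost array, as a subset of X, on to score_func, I would have the costs there too.
Or: I could implement everything myself.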
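
The idea of reading the cost from X can be sketched without touching the scorer internals, because scikit-learn also accepts a plain callable with the signature scorer(estimator, X, y) as the scoring argument. The sketch below assumes NumPy is available, that the amount sits in column 0 of X (AMOUNT_COLUMN is a made-up name), and it reuses the cost rules from the first function in the question. AlwaysZero is a hypothetical stand-in estimator used only to exercise the scorer.

```python
import numpy as np

AMOUNT_COLUMN = 0  # assumption: position of the cost/amount feature in X


def cost_scorer(estimator, X, y):
    """Callable scorer: returns the negated total cost (larger score = better)."""
    X = np.asarray(X)
    y_true = np.asarray(y)
    y_pred = np.asarray(estimator.predict(X))
    # The amounts come from the same rows as y, so they stay aligned per fold.
    amounts = X[:, AMOUNT_COLUMN].astype(float)

    cost = np.zeros(len(y_true))
    tn = (y_true == 0) & (y_pred == 0)
    fp = (y_true == 0) & (y_pred == 1)
    fn = (y_true == 1) & (y_pred == 0)
    cost[tn] = -1.1 * amounts[tn]
    cost[fp] = amounts[fp]
    cost[fn] = 1.1 * amounts[fn]
    return -float(cost.sum())


class AlwaysZero:
    """Hypothetical stand-in estimator that predicts 0 for every row."""

    def predict(self, X):
        return np.zeros(len(X), dtype=int)


X_demo = np.array([[10.0, 1.0], [30.0, 2.0]])  # column 0 holds the amount
y_demo = np.array([0, 1])
score = cost_scorer(AlwaysZero(), X_demo, y_demo)
print(score)  # negated total cost of the two demo rows
```

With scikit-learn installed, such a callable could be passed as scoring=cost_scorer to GridSearchCV. The scorer then receives the X and y of the current validation fold, and when the estimator is a Pipeline it still sees the untransformed X, so the amount column remains readable.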

Does anyone have an "easier" solution? Thanks!

Answer 1

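I found a way to solve the problem by following the second approach proposed in the question: passing a PseudoInt to Scikit-Learn that behaves exactly like a normal integer when it is compared or used in arithmetic, but that also acts as a wrapper around int and can store instance variables (such as the cost of the instance). As already noted in the question, this leads Scikit-Learn to recognize that the values in the passed label array are actually of type object rather than int. So I simply replaced the check at line 273 of the type_of_target(y) method in Scikit-Learn's multiclass.py so that it returns 'binary' even though the input fails the check. As a result, Scikit-Learn treats the whole problem as a binary classification problem (as it should). Lines 269-273 of the type_of_target(y) method in multiclass.py now look like this: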

# Invalid inputs
if y.ndim > 2 or (y.dtype == object and len(y) and
                  not isinstance(y.flat[0], string_types)):
    # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    return 'binary' # Sneaky, modified to force binary classification.

My code looks like this:

import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer  # sklearn.metrics.scorer is not available in newer releases


class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass


def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
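    y = np.empty(len(y_normal), dtype=object)  # object array, so each element can be a PseudoInt
    # Fill by position so this also works when the Series index is not 0..n-1
    for position, (index, value) in enumerate(y_normal.items()):
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[position] = new_int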

    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()

    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)  # Custom cost function (Lower cost is better)

    # Define pipeline
    pipe = Pipeline([('clf', clf)])

    # Grid search grid with any hyper parameters or other settings
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}  # keys must use the pipeline step name 'clf'
    ]

    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)

    # run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.to_numpy(), y_train)  # as_matrix() was removed from pandas
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Score (negated cost): ' + str(gs.best_score_))

    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.to_numpy())

    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)


def get_scores(y_test, y_test_pred):
    print("Getting scores")

    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision      " + str(precision))
    print("Recall         " + str(recall))
    print("Accuracy       " + str(accuracy))
    print("F1_Score       " + str(f1_score))

    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings   " + str(-cost))

    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)


def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0

    for index in range(len(y_test)):
        # print(index)
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost

        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))

    # print("Cost: " + str(cost))
    return cost

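Update

Instead of changing the files in the package directly (which is a bit dirty), I now add the following as the first import lines of my project: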

import sklearn.utils.multiclass

def return_binary(y):
    return "binary"

sklearn.utils.multiclass.type_of_target = return_binary

This overrides the type_of_target(y) method in sklearn.utils.multiclass so that it always returns binary. Note that this has to come before all other sklearn imports.
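
The ordering requirement is presumably due to how Python binds names: a module that has already executed `from ... import type_of_target` keeps its own reference to the original function, and a later patch of the module attribute does not change that reference. The sketch below shows the effect with a made-up stand-in module named lib; it does not use scikit-learn itself.

```python
import types

# Hypothetical stand-in for a library module such as sklearn.utils.multiclass
lib = types.ModuleType("lib")
lib.type_of_target = lambda y: "unknown"

# Equivalent of a consumer doing `from lib import type_of_target` BEFORE the patch
early_ref = lib.type_of_target


def return_binary(y):
    return "binary"


# The monkeypatch: rebinds the attribute on the module object only
lib.type_of_target = return_binary

# Equivalent of a consumer importing the name AFTER the patch
late_ref = lib.type_of_target

print(early_ref([0, 1]))  # unknown: still the original function
print(late_ref([0, 1]))   # binary: sees the patched function
```

Code that looked up the name before the patch keeps calling the original function, so the patch only has the intended effect for modules imported afterwards.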

How to create a custom scoring function in scikit-learn that scores a set of instances based on their individual attributes?
