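I am trying to run GridSearchCV to optimize the hyperparameters of a classifier, and the search should optimize a custom scoring function. The problem is that the scoring function assigns a specific cost that differs per instance (the cost is also a feature of each instance). As the example below shows, a further array, test_amt, holding the cost of each instance is needed in addition to the usual scoring-function arguments y and y_pred.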
def calculate_costs(y_test, y_test_pred, test_amt):
    cost = 0
    # Start at 0 so the first instance is not skipped
    for i in range(len(y_test)):
        y = y_test.iloc[i]
        y_pred = y_test_pred.iloc[i]
        x_amt = test_amt.iloc[i]
        if y == 0 and y_pred == 0:
            cost -= x_amt * 1.1
        elif y == 0 and y_pred == 1:
            cost += x_amt
        elif y == 1 and y_pred == 0:
            cost += x_amt * 1.1
        elif y == 1 and y_pred == 1:
            cost += 0
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(i))
    return cost
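When I call this function with the three arrays after training, it computes the total cost caused by the model perfectly well. Integrating it into GridSearchCV is difficult, however, because a scoring function takes only two arguments. It is possible to pass additional kwargs to the scorer, but I don't see how to pass a subset that depends on the split GridSearchCV is currently working on.
What I have / have tried so far:
Wrapping the whole pipeline in a class that holds a globally stored pandas.Series object, which stores the cost of each instance by index. In theory, the cost of an instance could then be looked up by using the same index. Unfortunately this does not work, because scikit-learn converts everything to numpy arrays.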
# Method of the wrapper class; self.test_amt is the pandas.Series of per-instance costs
def calculate_costs_class(self, y_test, y_test_pred):
    cost = 0
    for index, _ in y_test.items():
        y = y_test.loc[index]
        y_pred = y_test_pred.loc[index]
        x_amt = self.test_amt.loc[index]
        if y == 0 and y_pred == 0:
            cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
        elif y == 0 and y_pred == 1:
            cost += x_amt  # +revenue
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +charge cost
        elif y == 1 and y_pred == 1:
            cost += 0  # nothing
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(index))
    return cost
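Creating a custom PseudoInt class as the data type of the labels. It inherits all properties of int but can also store the cost of the instance (while keeping all of its properties so that logical operations still apply). This works outside scikit-learn, but the check_classification_targets method in scikit-learn raises ValueError: Unknown label type: 'unknown'.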
class PseudoInt(int):
    def __new__(cls, x, cost, *args, **kwargs):
        instance = int.__new__(cls, x, *args, **kwargs)
        instance.cost = cost
        return instance
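To illustrate what such a subclass does and does not preserve, here is a minimal sketch in plain Python. The label value 1 and the cost 12.5 are made-up example values, and the class is repeated only so the snippet runs on its own. Note that arithmetic on the label returns a plain int, so the cost attribute does not survive such operations.

```python
class PseudoInt(int):
    """An int that also carries a per-instance cost."""

    def __new__(cls, x, cost, *args, **kwargs):
        instance = int.__new__(cls, x, *args, **kwargs)
        instance.cost = cost
        return instance


label = PseudoInt(1, cost=12.5)   # hypothetical label and cost

is_equal_to_one = (label == 1)    # comparisons behave like a normal int
stored_cost = label.cost          # the extra attribute is attached to the label
incremented = label + 1           # arithmetic returns a plain int
cost_survives = hasattr(incremented, "cost")  # the plain int has no cost
```

This is why the cost has to be read from the true labels (which are passed through unchanged) rather than from anything computed from them.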
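Something I have not tried but thought about: since the cost is also a feature in the instance set X, it is also available in the scorer's __call__. If I reprogrammed the call function to pass the cost array, as a subset of X, to score_func, I would have the costs there as well.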
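One way this idea might be sketched without touching the scorer internals: scikit-learn's scoring parameter also accepts a plain callable with the signature scorer(estimator, X, y), which receives the X of the current validation fold. The sketch below assumes the amount sits in column AMOUNT_COL of X (a hypothetical name and position) and reuses the simple 1.1 cost rules from the first function. ConstantModel is only a stand-in so the snippet runs without scikit-learn.

```python
AMOUNT_COL = 0  # assumption: column of X that holds the per-instance amount


def cost_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature; returns the negated total cost."""
    y_pred = estimator.predict(X)
    cost = 0.0
    for row, y_true, y_hat in zip(X, y, y_pred):
        amt = row[AMOUNT_COL]
        if y_true == 0 and y_hat == 0:
            cost -= amt * 1.1
        elif y_true == 0 and y_hat == 1:
            cost += amt
        elif y_true == 1 and y_hat == 0:
            cost += amt * 1.1
        # y_true == 1 and y_hat == 1 adds no cost
    return -cost  # negated so that a higher score means a lower cost


class ConstantModel:
    """Hypothetical stand-in estimator that always predicts class 0."""

    def predict(self, X):
        return [0 for _ in X]


X_demo = [[100.0, 3], [50.0, 7]]  # made-up rows: amount first, one other feature
y_demo = [0, 1]
demo_score = cost_scorer(ConstantModel(), X_demo, y_demo)
```

Such a callable could then be passed as GridSearchCV(..., scoring=cost_scorer). The amount has to stay in the X that reaches the scorer, which is the case here because it is a feature anyway.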
Or: I could implement everything myself.
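As a rough sketch of what implementing it myself could look like: a manual cross-validation loop in which the amounts are sliced with the same test indices as the labels. All names here (fold_indices, total_cost, cross_val_cost, always_one) are hypothetical, the folds are contiguous and unshuffled for simplicity, and the cost rules are the simple 1.1 rules from the first function.

```python
def fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    fold_size, remainder = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + fold_size + (1 if k < remainder else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def total_cost(y_true, y_pred, amounts):
    """Total cost over plain sequences, using the simple 1.1 rules."""
    cost = 0.0
    for y, y_hat, amt in zip(y_true, y_pred, amounts):
        if y == 0 and y_hat == 0:
            cost -= amt * 1.1
        elif y == 0 and y_hat == 1:
            cost += amt
        elif y == 1 and y_hat == 0:
            cost += amt * 1.1
    return cost


def cross_val_cost(fit_predict, X, y, amounts, n_splits=5):
    """Return the cost per fold; fit_predict(X_train, y_train, X_test) -> predictions."""
    costs = []
    for train_idx, test_idx in fold_indices(len(y), n_splits):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_pred = fit_predict(X_train, y_train, X_test)
        # The amounts are sliced with the same test indices as the labels.
        y_test = [y[i] for i in test_idx]
        amt_test = [amounts[i] for i in test_idx]
        costs.append(total_cost(y_test, y_pred, amt_test))
    return costs


def always_one(X_train, y_train, X_test):
    """Stand-in model that ignores the training data and predicts class 1."""
    return [1 for _ in X_test]


X = [[0], [1], [2], [3]]
y = [0, 1, 0, 1]
amounts = [10.0, 20.0, 30.0, 40.0]
fold_costs = cross_val_cost(always_one, X, y, amounts, n_splits=2)
```

A hyperparameter search would then loop over candidate settings and keep the one with the lowest summed cost. In practice the index generation could come from scikit-learn's KFold.split instead of the hand-written generator.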
Do you have an easier solution? Thanks!
Answer 0 (score: 0)
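I found a way to solve the problem by following the second approach proposed in the question: passing scikit-learn a PseudoInteger that has all the properties of a normal integer when it is compared or used in arithmetic. It also acts as a wrapper around int, however, and can store instance variables (such as the cost of the instance). As already stated in the question, this makes scikit-learn recognize that the values in the passed label array are actually of type object rather than int. So I simply replaced the test in the type_of_target(y) method in scikit-learn's multiclass.py, at line 273, so that it returns 'binary' even though the input does not pass the test. scikit-learn then simply treats the whole problem as a binary classification problem (as it should). Lines 269-273 of the type_of_target(y) method in multiclass.py now look like this: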
# Invalid inputs
if y.ndim > 2 or (y.dtype == object and len(y) and
                  not isinstance(y.flat[0], string_types)):
    # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    return 'binary'  # Sneaky, modified to force binary classification.
My code looks like this:
import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass
def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
    y = np.empty(len(y_normal), dtype=object)
    # Fill by position so a non-default pandas index cannot go out of bounds
    for position, (index, value) in enumerate(y_normal.items()):
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[position] = new_int
    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()
    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)
    # Define pipeline
    pipe = Pipeline([('clf', clf)])
    # Grid search grid with any hyper parameters or other settings
    # (parameter names are prefixed with the pipeline step name 'clf')
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}
    ]
    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)
    # run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.to_numpy(), y_train)
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Score (negated cost): ' + str(gs.best_score_))
    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.to_numpy())
    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)
def get_scores(y_test, y_test_pred):
    print("Getting scores")
    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision " + str(precision))
    print("Recall " + str(recall))
    print("Accuracy " + str(accuracy))
    print("F1_Score " + str(f1_score))
    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings " + str(-cost))
    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)
def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0
    for index in range(len(y_test)):
        # print(index)
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost
        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5  # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))
    # print("Cost: " + str(cost))
    return cost
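Instead of changing the file in the package directly (which is a bit dirty), I added the following to the first import lines of my project: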
import sklearn.utils.multiclass
def return_binary(y):
    return "binary"
sklearn.utils.multiclass.type_of_target = return_binary
This overrides the type_of_target(y) method in sklearn.utils.multiclass so that it always returns binary. Note that this has to come before all other sklearn imports.
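The reason the order matters can be shown with a small sketch in plain Python. Here fake_lib is a hypothetical stand-in for the library module: code that has already bound the function under its own name (as from ... import ... does) keeps the original, while code that looks it up after the patch gets the replacement.

```python
import types

# Hypothetical stand-in for a library module such as sklearn.utils.multiclass.
fake_lib = types.ModuleType("fake_lib")


def original_type_of_target(y):
    return "unknown"


fake_lib.type_of_target = original_type_of_target

# A consumer that ran `from fake_lib import type_of_target` BEFORE the patch
# holds its own reference to the original function.
early_reference = fake_lib.type_of_target


def return_binary(y):
    return "binary"


# The patch rebinds the attribute on the module object only.
fake_lib.type_of_target = return_binary

# A consumer that imports AFTER the patch picks up the replacement.
late_reference = fake_lib.type_of_target

early_result = early_reference([1, 0])  # still calls the original function
late_result = late_reference([1, 0])    # calls the patched function
```

Modules that imported the function before the patch ran would therefore keep calling the unpatched version, which is presumably why the override has to be the very first sklearn-related statement.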