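My data has about 70 classes, and I am using LightGBM to predict the correct class label.
In R, I would like a custom "metric" function in which I can evaluate whether LightGBM's top 3 predictions cover the true label.
The link here is encouraging: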
import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikit-learn's f1_score doesn't accept probabilities
    return 'f1', f1_score(y_true, y_hat), True
But I don't know the dimensions of the arguments the function receives. The data seems to be rearranged for some reason.
Answer 0: (score: 1)
Scikit-learn implementation
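import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_true, y_pred):
    # Assumes y_pred is flattened class by class; reshape to (num_class, num_data)
    preds = y_pred.reshape(len(np.unique(y_true)), -1)
    # Pick the highest-scoring class for each sample
    preds = preds.argmax(axis=0)
    print(preds.shape)
    print(y_true.shape)
    return 'f1', f1_score(y_true, preds, average='weighted'), True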
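The same reshape idea can be adapted to the metric the question actually asks for: whether the true label is among the top 3 predictions. The following is a minimal sketch, not a definitive implementation. The function name `top_k_coverage`, the parameter `k`, and the metric name are hypothetical. It assumes labels are integers from 0 to num_class - 1, and that `y_pred` is either a flat class-major array (as the reshape above assumes) or a 2D array of shape (num_data, num_class); which layout you receive may depend on your LightGBM version.

```python
import numpy as np

def top_k_coverage(y_true, y_pred, k=3):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred)
    num_data = len(y_true)
    if y_pred.ndim == 1:
        # Assumed flat class-major layout: y_pred[c * num_data + i]
        scores = y_pred.reshape(-1, num_data).T
    else:
        # Assumed 2D layout: (num_data, num_class)
        scores = y_pred
    # Indices of the k largest scores in each row (order within the k is arbitrary)
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    # Same (name, value, is_higher_better) tuple as the F1 metric above
    return 'top%d_coverage' % k, float(hits.mean()), True

# Toy example: 4 rows, 5 classes
example_scores = np.array([[0.10, 0.20, 0.30, 0.25, 0.15],
                           [0.50, 0.05, 0.15, 0.20, 0.10],
                           [0.05, 0.10, 0.15, 0.30, 0.40],
                           [0.30, 0.25, 0.20, 0.15, 0.10]])
example_labels = np.array([1, 4, 2, 0])
result_2d = top_k_coverage(example_labels, example_scores)
result_flat = top_k_coverage(example_labels, example_scores.T.ravel())
print(result_2d, result_flat)
```

Because the return value has the same shape as the F1 metric's, it should be usable wherever that function is passed, but check the layout of `y_pred` in your own setup first.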
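Answer 1: (score: 1)
After reading through the documentation for lgb.train and lgb.cv, I had to create a separate function, get_ith_pred, and then call it repeatedly in lgb_f1_score.
The function's docstring explains how it works. I used the same argument names as the LightGBM docs. This works for any number of classes greater than two, but not for binary classification. In the binary case, preds is a 1D array containing the probability of the positive class.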
import numpy as np
from sklearn.metrics import f1_score

def get_ith_pred(preds, i, num_data, num_class):
    """
    preds: 1D NumPy array
        A 1D numpy array containing predicted probabilities. Has shape
        (num_data * num_class,). So, for 3-class classification with
        100 rows of data in your training set, preds is shape (300,),
        i.e. (100 * 3,).
    i: int
        The row/sample in your training data you wish to calculate
        the prediction for.
    num_data: int
        The number of rows/samples in your training data
    num_class: int
        The number of classes in your classification task.
        Must be greater than 2.

    LightGBM docs tell us that to get the probability of class 0 for
    the 5th row of the dataset we do preds[0 * num_data + 5].
    For class 1 prediction of 7th row, do preds[1 * num_data + 7].

    sklearn's f1_score(y_true, y_pred) expects y_pred to be of the form
    [0, 1, 1, 1, 1, 0...] and not probabilities.

    This function translates preds into the form sklearn's f1_score
    understands.
    """
    # Only works for multiclass classification
    assert num_class > 2

    preds_for_ith_row = [preds[class_label * num_data + i]
                         for class_label in range(num_class)]

    # The element with the highest probability is predicted
    return np.argmax(preds_for_ith_row)

def lgb_f1_score(preds, train_data):
    y_true = train_data.get_label()
    num_data = len(y_true)
    num_class = 70  # number of classes in the question's dataset

    y_pred = []
    for i in range(num_data):
        ith_pred = get_ith_pred(preds, i, num_data, num_class)
        y_pred.append(ith_pred)

    return 'f1', f1_score(y_true, y_pred, average='weighted'), True
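The per-row loop over get_ith_pred can likely be replaced by a single reshape, since the index formula preds[class_label * num_data + i] is just row-major indexing into a (num_class, num_data) matrix. Below is a small sketch under that assumption; the helper name `decode_class_major` is hypothetical and not part of LightGBM.

```python
import numpy as np

def decode_class_major(preds, num_data, num_class):
    """Return the argmax class for every row of a flat, class-major score array."""
    preds = np.asarray(preds)
    # After the reshape, row c holds the class-c scores for all samples,
    # so element [c, i] equals preds[c * num_data + i].
    return preds.reshape(num_class, num_data).argmax(axis=0)

# Toy example: 4 rows, 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.3, 0.2]])
flat = probs.T.ravel()  # class-major flattening of the (row, class) table
labels = decode_class_major(flat, num_data=4, num_class=3)
print(labels)
```

The resulting array can be passed to f1_score in place of the y_pred list, which avoids a Python-level loop over every row.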