Question

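I have a text dataset of about 20,000 samples, split roughly 5,000 true against 15,000 false. A two-channel textCNN built with Keras and Theano is used for the classification, with the F1 score as the evaluation metric. The F1 score is not bad, but the confusion matrix shows that accuracy on the true samples is relatively low (~40%). In practice it is very important to predict the true samples accurately. I therefore want to design a custom binary cross-entropy loss function that increases the weight of misclassified true samples, so that the model focuses more on predicting the true samples correctly.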

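I tried class_weight (computed with sklearn) in the model.fit method, but it did not work very well, since the weights are applied to all samples rather than only to the misclassified ones.
I also tried and adapted the method mentioned here: https://github.com/keras-team/keras/issues/2115, but that loss function is a categorical cross-entropy and does not work well for a binary classification problem. I tried to modify it into a binary loss, but ran into problems with the input dimensions.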

The sample code for the cost-sensitive loss function that targets misclassified samples is:

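from itertools import product
from keras import backend as K

def w_categorical_crossentropy(y_true, y_pred, weights):
    # weights[c_t, c_p]: cost of predicting class c_p when the true class is c_t
    nb_cl = len(weights)
    final_mask = K.zeros_like(y_pred[:, 0])
    # one-hot mask of the predicted (argmax) class for each sample
    y_pred_max = K.max(y_pred, axis=1)
    y_pred_max = K.reshape(y_pred_max, (K.shape(y_pred)[0], 1))
    y_pred_max_mat = K.cast(K.equal(y_pred, y_pred_max), K.floatx())
    for c_p, c_t in product(range(nb_cl), range(nb_cl)):
        final_mask += (weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t])
    # Keras 2.x argument order is (target, output); Keras 1.x used (output, target)
    return K.categorical_crossentropy(y_true, y_pred) * final_mask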

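A custom loss function for binary classification, implemented with Keras and Theano, that focuses on misclassified samples would be very valuable for imbalanced datasets. Please help me troubleshoot this. Thanks!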

Answer 1

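Well, when I have to deal with imbalanced datasets in keras, what I do is first compute the weight for each class and then pass those weights to the model instance during training. It looks something like this: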

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(targets)
# recent scikit-learn versions require `classes` and `y` as keyword arguments
w = compute_class_weight('balanced', classes=classes, y=targets)

# here I am adding only two categories with their corresponding weights
# you can spin a loop or continue by hand until you include all of your categories
weights = {
    classes[0]: w[0],  # weight for the first class
    classes[1]: w[1],  # weight for the second class
}

# then during training you do like this (keep your other fit arguments as they are)
model.fit(x=features, y=targets, class_weight=weights)
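To get a feel for the numbers, here is a small pure-Python sketch of the 'balanced' heuristic, which scikit-learn documents as n_samples / (n_classes * count_of_class). The helper name `balanced_class_weights` and the label encoding (1 = true, 0 = false) are assumptions made for illustration; the class counts are the approximate ones from the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Approximate class sizes from the question: ~5,000 true vs ~15,000 false.
labels = [1] * 5000 + [0] * 15000
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the minority (true) class gets a weight three times larger than the majority class, which is what is handed to class_weight in model.fit.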

I believe this will solve your problem.
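If class weights alone are not enough, the idea from the question (penalising only the true samples that end up misclassified) can be written down as a plain reference computation. The following is only a sketch in pure Python, not a Keras loss: the function name `weighted_bce`, the `fn_weight` factor and the 0.5 threshold are all assumptions for illustration, and it would still need to be ported to backend tensor operations before it could be passed to model.compile.

```python
import math


def weighted_bce(y_true, y_pred, fn_weight=3.0, threshold=0.5, eps=1e-7):
    """Mean binary cross-entropy, scaled by fn_weight for missed positives.

    A sample counts as a missed positive when its label is 1 and the
    predicted probability falls below `threshold`.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        missed_positive = (y == 1) and (p < threshold)
        total += (fn_weight if missed_positive else 1.0) * bce
    return total / len(y_true)


# A missed positive is penalised fn_weight times harder than usual;
# a correctly classified positive keeps the ordinary cross-entropy.
missed = weighted_bce([1], [0.25], fn_weight=2.0)
caught = weighted_bce([1], [0.75], fn_weight=2.0)
print(missed, caught)
```

Note that the mask is a hard 0/1 decision, so in a tensor implementation it would act as a constant multiplier and gradients would flow only through the cross-entropy term.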

Dealing with an imbalanced dataset for text classification with Keras and Theano
