@tf.function is slowing down my training step

Time: 2019-12-12 13:41:18

Tags: python tensorflow keras

I'm using the following training step, decorated with tf.function:

@tf.function
def train_step(inputs, labels):
    X, F = inputs  # unpack the two model inputs
    with tf.GradientTape(persistent=True) as tape:
        predictions = model([X, F], training=True)
        losses = [l_f(tf.expand_dims(labels[:, i], axis=-1), predictions[i])
                  for i, l_f in enumerate(loss_functions)]
    gradients = [tape.gradient(l, model.trainable_variables) for l in losses]
    for g in gradients:
        # Replace None gradients (variables untouched by this loss) with zeros
        grads = [gg if gg is not None else tf.zeros_like(model.trainable_variables[i], dtype=tf.float32)
                 for i, gg in enumerate(g)]
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    del tape
    return losses


def weighted_loss(weights):
    @tf.function
    def loss_func(labels, predictions):
        # True where the label is positive (> 0.5); these rows form the "min" class
        min_class_filter = tfk.backend.greater(labels, 0.5)

        # Split labels and predictions by class
        y_min = tf.boolean_mask(labels, min_class_filter)
        y_max = tf.boolean_mask(labels, tf.math.logical_not(min_class_filter))
        y_pred_min = tf.boolean_mask(predictions, min_class_filter)
        y_pred_max = tf.boolean_mask(predictions, tf.math.logical_not(min_class_filter))

        # Weighted sum of per-class and overall binary cross-entropy
        loss_min_class = tfk.backend.mean(tfk.backend.binary_crossentropy(y_min, y_pred_min))
        loss_max_class = tfk.backend.mean(tfk.backend.binary_crossentropy(y_max, y_pred_max))
        loss_all = tfk.backend.mean(tfk.backend.binary_crossentropy(labels, predictions))
        return weights[0]*loss_min_class + weights[1]*loss_max_class + weights[2]*loss_all
    return loss_func

loss_functions = [weighted_loss(w) for w in target_weights]

This is a bit hacky, but the gist is: my network has multiple outputs, which means that in some cases it is correct for the gradients of some weights to be None, so I replace those gradients with zeros. I compute a separate loss for each of these outputs and then propagate all of them at every step.
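As a side note, GradientTape.gradient also accepts an unconnected_gradients argument that returns zeros instead of None directly; a minimal sketch of that alternative to the manual zeros_like loop, reusing model and losses from the code above:

# Sketch: the tape yields zeros for variables unconnected to the loss,
# replacing the explicit None-handling loop
gradients = [tape.gradient(l, model.trainable_variables,
                           unconnected_gradients=tf.UnconnectedGradients.ZERO)
             for l in losses]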

When I run this code as written, a single training step takes an extremely long time (more than 10 minutes), and I see the following message in the logs:

E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] function_optimizer failed: Invalid argument: Input 0 of node model/LSTM_forward_0/zeros_like was passed int32 from model/LSTM_forward_0/StatefulPartitionedCall:9 incompatible with expected variant.

When I remove the @tf.function decorator, it runs in about 10% of that time, and I don't see this log message. Is this a red herring, or does it legitimately point to a problem created by adding @tf.function?

Additional details:

  • TF 2.0
  • GPU enabled and available
  • CUDA 10.1
  • GPU utilization is 0% in both cases, but this is not a data-feeding bottleneck maxing out the CPU: when I generate the training data outside the training loop, it performs just as well as streaming TFRecords with plenty of prefetching and limited augmentation
  • The inputs, labels, gradients, and all model.trainable_variables are of type tf.float32

2 answers:

Answer 0 (score: 0)

From what I've read, for a tf.function to run smoothly it should not include any assignments to graph variables.

In your training step you are changing the model's weights, which violates that.

I'm not sure this is the cause, but you could try keeping tf.function only on the loss functions and leaving it off the training step.
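A minimal sketch of that suggestion, reusing the question's names; only the decorator placement changes (weighted_loss keeps its inner @tf.function, while the step runs eagerly):

def train_step(inputs, labels):  # @tf.function removed from the training step
    X, F = inputs  # unpack the two model inputs
    with tf.GradientTape(persistent=True) as tape:
        predictions = model([X, F], training=True)
        losses = [l_f(tf.expand_dims(labels[:, i], axis=-1), predictions[i])
                  for i, l_f in enumerate(loss_functions)]
    gradients = [tape.gradient(l, model.trainable_variables) for l in losses]
    for g in gradients:
        grads = [gg if gg is not None else tf.zeros_like(v)
                 for gg, v in zip(g, model.trainable_variables)]
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    del tape
    return losses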

Answer 1 (score: 0)

I found a fix. The problem was in overwriting the None gradients, not in the persistent gradient tape.

@tf.function
def train_step(inputs, labels):
    X, F = inputs  # unpack the two model inputs
    with tf.GradientTape(persistent=True) as tape:
        predictions = model([X, F], training=True)
        # Pass ALL labels and predictions to every loss; each loss indexes its own output
        losses = [l_f(labels, predictions, i) for i, l_f in enumerate(loss_functions)]
    gradients = [tape.gradient(l, model.trainable_variables) for l in losses]
    for g in gradients:
        # No None-replacement needed: the tape now returns zeros for unused branches
        optimizer.apply_gradients(zip(g, model.trainable_variables))
    del tape
    return losses


def weighted_loss(weights):
    @tf.function
    def loss_func(labs, preds, i):
        # Index into the full label/prediction sets inside the loss, so the
        # tape sees every output of the model
        labels = tf.expand_dims(labs[:, i], axis=-1)
        predictions = preds[i]
        min_class_filter = tfk.backend.greater(labels, 0.5)

        y_min = tf.boolean_mask(labels, min_class_filter)
        y_max = tf.boolean_mask(labels, tf.math.logical_not(min_class_filter))
        y_pred_min = tf.boolean_mask(predictions, min_class_filter)
        y_pred_max = tf.boolean_mask(predictions, tf.math.logical_not(min_class_filter))

        loss_min_class = tfk.backend.mean(tfk.backend.binary_crossentropy(y_min, y_pred_min))
        loss_max_class = tfk.backend.mean(tfk.backend.binary_crossentropy(y_max, y_pred_max))
        loss_all = tfk.backend.mean(tfk.backend.binary_crossentropy(labels, predictions))
        return weights[0]*loss_min_class + weights[1]*loss_max_class + weights[2]*loss_all
    return loss_func

loss_functions = [weighted_loss(w) for w in target_weights]

By passing all of the outputs and all of the labels into the loss function (even though I ignore a bunch of them), the tape returns proper gradients (zeros) for every branch, not just the one a particular loss focuses on.
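A hypothetical usage sketch of the fixed step, assuming a tf.data pipeline named dataset that yields ((X, F), labels) batches and a num_epochs setting, neither of which appears in the original answer:

for epoch in range(num_epochs):
    for inputs, labels in dataset:
        step_losses = train_step(inputs, labels)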