Following the TensorFlow documentation, I am trying to use automatic mixed precision (AMP) with Keras-style TensorFlow 2.0. Here is my code:
#!/usr/bin/env python
# coding: utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_hub as hub
import tensorflow.keras.mixed_precision.experimental as mixed_precision
import tensorflow.keras.layers as layers
import numpy as np
import tensorflow as tf
# enable mixed precision by setting the mixed_float16 policy below
policy = mixed_precision.Policy('mixed_float16')
# policy = mixed_precision.Policy('float32')  # baseline for comparison
mixed_precision.set_policy(policy)
print('Compute dtype: %s' % policy.compute_dtype)
print('Variable dtype: %s' % policy.variable_dtype)
num_samples = 1024
batch_size = 16
max_seq_len = 128
num_class = 16
epochs = 3
vocab_size = 30522
# BERT_PATH = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1'
BERT_PATH = '../input/bert-base-from-tfhub/bert_en_uncased_L-12_H-768_A-12'
def bert_model():
    input_ids = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_ids')
    input_masks = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_masks')
    input_segments = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_segments')
    bert_layer = hub.KerasLayer(BERT_PATH, trainable=True)
    print('bert_layer._dtype_policy:', bert_layer._dtype_policy)
    print('bert_layer._compute_dtype:', bert_layer._compute_dtype)
    print('bert_layer._dtype:', bert_layer._dtype)
    _, bert_sequence_output = bert_layer([input_ids, input_masks, input_segments])
    print("bert_sequence_output.dtype:", bert_sequence_output.dtype)
    x = layers.GlobalAveragePooling1D()(bert_sequence_output)
    logits = layers.Dense(num_class, name="logits")(x)
    print("logits.dtype:", logits.dtype)
    # when using mixed precision, regardless of what your model ends in, make sure the output is float32.
    output = layers.Activation('sigmoid', dtype='float32', name='output')(logits)
    print('output.dtype:', output.dtype)
    model = tf.keras.models.Model(inputs=[input_ids, input_masks, input_segments], outputs=output)
    return model
# make dummy inputs
train_X = []
train_X.append(np.random.randint(0, vocab_size, size=(num_samples, max_seq_len)))  # token ids
train_X.append(np.zeros(shape=(num_samples, max_seq_len)))  # input masks (all zeros, dummy data)
train_X.append(np.zeros(shape=(num_samples, max_seq_len)))  # segment ids
train_Y = np.random.randn(num_samples, num_class)  # labels
model = bert_model()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(loss="binary_crossentropy", optimizer=optimizer)
model.fit(train_X, train_Y, epochs=epochs, verbose=1, batch_size=batch_size)
What I expect: bert_sequence_output.dtype should be float16, because it is the output of bert_layer, a layer whose policy is mixed_float16.
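For comparison, here is a minimal sketch of the behavior I expect from an ordinary Keras layer under this policy (the Dense layer here is illustrative, not part of my model):

# with the mixed_float16 policy already set, as at the top of my script:
x = tf.keras.Input((8,))
y = tf.keras.layers.Dense(4)(x)
print(y.dtype)  # float16: the layer casts its float32 input to the policy's compute dtype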
But what I actually got: the code above tells me bert_sequence_output.dtype is float32. Here is the full log:
ssh://xiepengyu@192.168.0.200:22/home/xiepengyu/miniconda3/envs/tf2/bin/python -u /home/xiepengyu/google_quest/scripts/multi_bert_aug_mixed_precision_test.py
2020-01-05 11:30:50.951010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-05 11:30:51.380306: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xiepengyu/cuda/cuda-10.1/lib64:$LD_LIBRARY_PATH
2020-01-05 11:30:51.380387: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xiepengyu/cuda/cuda-10.1/lib64:$LD_LIBRARY_PATH
2020-01-05 11:30:51.380399: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-01-05 11:30:52.292392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-05 11:30:52.635553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-01-05 11:30:52.635599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-05 11:30:52.637236: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-05 11:30:52.638264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-05 11:30:52.638493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-05 11:30:52.640188: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-05 11:30:52.641278: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-05 11:30:52.644628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-05 11:30:52.650678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-01-05 11:30:52.650998: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-05 11:30:52.658229: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499720000 Hz
2020-01-05 11:30:52.658878: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562d05824cc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-05 11:30:52.658896: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-01-05 11:30:52.871435: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562d058cb200 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-01-05 11:30:52.871481: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-01-05 11:30:52.875039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-01-05 11:30:52.875109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-05 11:30:52.875137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-05 11:30:52.875149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-05 11:30:52.875161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-05 11:30:52.875172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-05 11:30:52.875183: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-05 11:30:52.875195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-05 11:30:52.876635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-01-05 11:30:53.444364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-05 11:30:53.444427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-01-05 11:30:53.444436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-01-05 11:30:53.450671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10392 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
Compute dtype: float16
Variable dtype: float32
bert_layer._dtype_policy: <Policy "mixed_float16", loss_scale=DynamicLossScale(current_loss_scale=32768.0, num_good_steps=0, initial_loss_scale=32768.0, increment_period=2000, multiplier=2.0)>
bert_layer._compute_dtype: float16
bert_layer._dtype: float32
bert_sequence_output.dtype: <dtype: 'float32'>
logits.dtype: <dtype: 'float16'>
output.dtype: <dtype: 'float32'>
Train on 1024 samples
Epoch 1/3
2020-01-05 11:31:06.079381: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 1161 in the outer inference context.
/home/xiepengyu/miniconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-01-05 11:31:08.348584: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 1161 in the outer inference context.
2020-01-05 11:31:18.719649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
1024/1024 [==============================] - 34s 33ms/sample - loss: 0.0720
Epoch 2/3
1024/1024 [==============================] - 15s 15ms/sample - loss: 0.0185
Epoch 3/3
1024/1024 [==============================] - 15s 15ms/sample - loss: 0.0042
Process finished with exit code 0
When I change the policy to float32, those same print statements give me the following (the rest of the log is identical to the mixed_float16 case):
Compute dtype: float32
Variable dtype: float32
bert_layer._dtype_policy: <Policy "float32", loss_scale=None>
bert_layer._compute_dtype: float32
bert_layer._dtype: float32
bert_sequence_output.dtype: <dtype: 'float32'>
logits.dtype: <dtype: 'float32'>
output.dtype: <dtype: 'float32'>
Based on the log, here are my conclusions:
1. The mixed_float16 policy does take effect on the other custom layers (for example, the Dense layer named "logits"), because their outputs have dtype float16.
2. The BERT layer's policy is also set to mixed_float16, but judging from bert_sequence_output.dtype being float32, the policy does not seem to take effect on it. Further evidence is that GPU memory usage (dominated by the variables in the BERT layer) is almost identical under both policies.
Personally, I suspect the layers defined inside BERT are hard-coded to dtype float32, so the mixed_float16 policy cannot change their behavior. Is that right? What else could cause this problem, and how can I fix it?
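For reference, a hypothetical way to quantify the memory observation above (tf.config.experimental.get_memory_info only exists in TF 2.5+, newer than the version in my log):

info = tf.config.experimental.get_memory_info('GPU:0')
print(info['peak'])  # peak bytes in use; compare this value under each policy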
Thanks in advance for any help!
Answer 0: (score: 1)
The GTX 1080 Ti does not support mixed precision training. You need an NVIDIA RTX card: the 2000 series has Tensor Cores and therefore supports mixed precision.
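A quick way to check whether your card has Tensor Cores (a hypothetical snippet; tf.config.experimental.get_device_details needs TF 2.4+, newer than the version in your log):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    details = tf.config.experimental.get_device_details(gpus[0])
    # Tensor Cores require compute capability 7.0 or higher (Volta/Turing and later);
    # the GTX 1080 Ti is Pascal, compute capability 6.1.
    print(details.get('compute_capability'))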
Answer 1: (score: 0)
As far as I know, in mixed precision training bert_sequence_output.dtype and output.dtype do not need to be float16. You can check this in the TensorFlow documentation you linked. The reason is overflow: some computations, such as normalization and softmax, should produce fp32 results, because summing over all the elements of a large matrix can overflow in fp16.
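To make the overflow point concrete, a toy example (mine, not from the docs): float16 can represent values only up to 65504, so even summing a modest number of ones overflows when the accumulation is done in fp16.

import numpy as np

x = np.ones(70000, dtype=np.float16)
print(np.sum(x, dtype=np.float16))  # inf: the running sum exceeds float16's max of 65504
print(np.sum(x, dtype=np.float32))  # 70000.0: accumulating in float32 is safe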
I think you would be better off judging whether the policy is effective by checking the training speed rather than the variable dtypes. If your GPU is a 1080 Ti, you will see little improvement, because it lacks the Tensor Cores designed for fast fp16 computation. But it does support mixed precision training; the difference is only in speed.
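If you want to quantify that speed difference, a rough sketch (hypothetical, reusing the model and data names from your question):

import time

# time one epoch under each policy and compare the results
start = time.time()
model.fit(train_X, train_Y, epochs=1, batch_size=batch_size, verbose=0)
print('seconds/epoch:', time.time() - start)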