Question

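I am developing an eye-tracking program with the OpenCV, dlib, and TensorFlow libraries, and I have run into a problem with a Keras function that uses the CPU instead of the GPU.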

My setup

I am working on a Jetson AGX Xavier (JetPack 4.4), which runs Ubuntu with CUDA 10.2.89. The libraries were installed following these links:

OpenCV: https://github.com/mdegans/nano_build_opencv
Dlib: https://medium.com/@tran.minh.hoang.april/install-dlib-with-cuda-9-0-34c0f61fcf74
TensorFlow + Keras: https://forums.developer.nvidia.com/t/official-tensorflow-for-jetson-agx-xavier/65523

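Problem

My code runs fine, so this is not a code problem. The problem is that one of its key functions runs on the CPU instead of the GPU, which severely hurts performance. The function is predict from TensorFlow Keras.

I was able to monitor GPU usage with the jtop command, and it stays close to 0. So I started digging for the cause.

What I have tried

I started by checking which devices TensorFlow can see. I ran the following: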

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

It gave me:

 [name: "/device:CPU:0"
device_type: "CPU"
...
name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
 ...
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
...
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
...

So I concluded that TensorFlow at least recognizes my GPU. Then I ran another test:

import tensorflow as tf
if tf.test.gpu_device_name():
   print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
   print("Please install GPU version of TF")

It gave me:

Name: /device:GPU:0

So at this point everything still looked fine. I went further by enabling TensorFlow's device placement logging:

tf.debugging.set_log_device_placement(True)

I then checked the detailed logs for the two TensorFlow functions used in my program. The first one is called as follows, and only once:

model = tf.keras.models.load_model('2018_12_17_22_58_35.h5', compile=True)

The associated log is:

...
2020-10-15 22:40:31.591951: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2020-10-15 22:40:31.633533: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2020-10-15 22:40:31.636725: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2020-10-15 22:40:31.666428: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2020-10-15 22:40:31.670077: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
...

So this function appears to use the GPU. The other function, predict, is called like this:

pred_l = model.predict(eye_input)

The log is:

...
RepeatDataset in device /job:localhost/replica:0/task:0/device:CPU:0
2020-10-15 22:40:38.143067: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op ZipDataset in device /job:localhost/replica:0/task:0/device:CPU:0
2020-10-15 22:40:38.161602: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op ParallelMapDataset in device /job:localhost/replica:0/task:0/device:CPU:0
2020-10-15 22:40:38.163806: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op ModelDataset in device /job:localhost/replica:0/task:0/device:CPU:0
2020-10-15 22:40:38.179115: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op RangeDataset in device /job:localhost/replica:0/task:0/device:CPU:0
...
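To compare the two logs at a glance, the placement lines can be tallied per device. The following is only a minimal sketch, assuming the log output has been captured as text; `summarize_placement` and `PLACEMENT` are hypothetical names introduced here, not part of TensorFlow, and the sample lines are copied from the logs above.

```python
import re
from collections import Counter, defaultdict

# Matches eager placement lines such as:
#   "... Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0"
PLACEMENT = re.compile(r"Executing op (\S+) in device \S*/device:(\w+):(\d+)")


def summarize_placement(log_text):
    """Count logged ops per device (hypothetical helper, not a TensorFlow API)."""
    summary = defaultdict(Counter)
    for match in PLACEMENT.finditer(log_text):
        op, dev_type, dev_index = match.groups()
        summary[f"{dev_type}:{dev_index}"][op] += 1
    return dict(summary)


sample = """\
2020-10-15 22:40:31.591951: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2020-10-15 22:40:38.143067: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op ZipDataset in device /job:localhost/replica:0/task:0/device:CPU:0
2020-10-15 22:40:38.161602: I tensorflow/core/common_runtime/eager/execute.cc:501] Executing op ParallelMapDataset in device /job:localhost/replica:0/task:0/device:CPU:0
"""

for device, ops in summarize_placement(sample).items():
    print(device, dict(ops))
# GPU:0 {'VarHandleOp': 1}
# CPU:0 {'ZipDataset': 1, 'ParallelMapDataset': 1}
```

Note that every CPU entry in the excerpt above is a Dataset op, which belongs to the input pipeline rather than to the model's layers, so a tally like this may help separate the two kinds of op.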

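In this case, the log shows that this function uses the CPU, which is consistent with what I first observed. Since the function is called inside a while loop (it is applied to every image), it is critical that it runs on the GPU for performance. I tried to force GPU use with

with tf.device('/device:GPU:0'):

but it still does not work.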
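For reference, here is a minimal sketch of how explicit placement could be checked on a single forward pass. It rests on several assumptions: the tiny model and the input shape are hypothetical stand-ins (the real .h5 architecture is not shown here), `tf.config.list_physical_devices` needs TensorFlow 2.1 or later, and the model is called directly instead of through predict so that no Dataset ops are involved in the call.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the loaded .h5 model; the real architecture is unknown.
inputs = tf.keras.Input(shape=(32, 32, 1))
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(8, activation="relu")(x)
outputs = tf.keras.layers.Dense(2)(x)
model = tf.keras.Model(inputs, outputs)

# Hypothetical batch holding one eye image (assumed shape).
eye_input = np.zeros((1, 32, 32, 1), dtype=np.float32)

# Fall back to the CPU when no GPU is visible, so the sketch still runs.
has_gpu = bool(tf.config.list_physical_devices("GPU"))
device = "/device:GPU:0" if has_gpu else "/device:CPU:0"

with tf.device(device):
    # Direct call: runs the layers without building a tf.data input pipeline.
    pred = model(tf.constant(eye_input), training=False)

print(pred.shape)   # shape of the prediction tensor
print(pred.device)  # device string of the output tensor
```

If `pred.device` ends in GPU:0 while jtop still shows little GPU load, the per-call overhead of predict in the loop could be the next thing to measure.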

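Since I installed the library by following NVIDIA's official instructions, and since the official website says that TensorFlow uses the GPU by default when one is available, I do not think this is an installation problem.

Does anyone have a solution to this problem?

Thanks.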

TensorFlow uses the CPU instead of the GPU
