Question

I'm currently working with Google's TensorFlow Object Detection API. When I try to retrain a model on the Oxford-IIIT Pet dataset, training is very slow.

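Here is what I have found so far:

GPU utilization is only about 2% most of the time.
CPU utilization, however, is around 60%, so it doesn't look like the GPU is starved for input; otherwise CPU utilization should be close to 100%.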

I tried to profile it with the TensorFlow profiler, but I'm in a bit of a hurry right now, so any ideas or suggestions would help.

Answer 1

From what I can see, it isn't using the GPU right now. Have you tried the GPU options that TensorFlow provides to tune GPU usage?

Answer 2

I found the problem. It was an input issue: my tfrecord file was corrupted somehow, so the input threads would sometimes hang.
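
For anyone who wants to rule out the same cause, here is a minimal sketch of an integrity check, assuming the file is an uncompressed TFRecord file in the documented layout (8-byte little-endian length, masked CRC-32C of the length, payload, masked CRC-32C of the payload). The function names `crc32c`, `masked_crc32c` and `count_valid_records` are made up for this example and are not part of TensorFlow. It is pure Python, so it will be slow on large files.

```python
import struct


def _make_crc32c_table():
    # CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data):
    """Plain CRC-32C of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """CRC-32C with the rotate-and-add mask used by the TFRecord format."""
    crc = crc32c(data)
    rotated = (crc >> 15) | ((crc << 17) & 0xFFFFFFFF)
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def count_valid_records(path):
    """Return (number_of_valid_records, error_or_None) for one file.

    Stops at the first damaged record, so the count says how far
    into the file the damage begins.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                return count, None  # clean end of file
            if len(header) < 12:
                return count, "truncated header"
            length_bytes = header[:8]
            (length_crc,) = struct.unpack("<I", header[8:])
            # Verify the length first so a damaged length is never trusted.
            if masked_crc32c(length_bytes) != length_crc:
                return count, "bad length checksum"
            (length,) = struct.unpack("<Q", length_bytes)
            payload = f.read(length + 4)
            if len(payload) < length + 4:
                return count, "truncated record"
            (data_crc,) = struct.unpack("<I", payload[length:])
            if masked_crc32c(payload[:length]) != data_crc:
                return count, "bad data checksum"
            count += 1
```

Calling `count_valid_records("train.record")` (the file name is a placeholder) returns the number of intact records and `None` if the whole file checks out, or a short error string describing the first damaged record.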

Answer 3

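This can happen for many reasons. The most common one is a problem with your record file. Before adding images and their contours (bounding boxes) to the record file, you need to run some checks. Some of them are:

Check each image before writing it to the record: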

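import tensorflow as tf

def checkJPG(fn):
    """Return True if fn can be read and decoded as a JPEG, else False."""
    # Use a throwaway graph so repeated calls don't grow the default graph.
    with tf.Graph().as_default():
        try:
            image_contents = tf.io.read_file(fn)
            image = tf.io.decode_jpeg(image_contents, channels=3)
            # tf.compat.v1.Session is available in recent TF 1.x and in TF 2.x.
            with tf.compat.v1.Session() as sess:
                sess.run(image)  # decoding raises here if the file is corrupted
        except Exception:
            print("Corrupted file: ", fn)
            return False
    return True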
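
Decoding every image through TensorFlow can be slow on a large dataset, so a cheap pre-filter may help. The sketch below only checks that a file starts with the JPEG start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9), which catches truncated downloads. It is a heuristic and not a replacement for a full decode: a file can pass and still be undecodable, and some valid JPEGs carry trailing bytes after the end marker and would be flagged. The name `looks_like_complete_jpeg` is made up for this example.

```python
import os


def looks_like_complete_jpeg(path):
    """Heuristic: True if the file starts with SOI and ends with EOI."""
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, os.SEEK_END)
        if f.tell() < 4:
            return False  # too small to hold both markers
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == b"\xff\xd8" and tail == b"\xff\xd9"
```

Files that fail this check can be inspected or dropped first, and the full decode check run on the rest.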

Also check each contour's width and height, and make sure no contour crosses the image border:

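# Runs inside the loop over one image's annotations;
# width and height are the image size in pixels.
boxW = xmax - xmin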
boxH = ymax - ymin
if boxW == 0 or boxH == 0:
    print("...ONE CONTOUR SKIPPED... (boxW | boxH) = 0")
    continue

if boxW*boxH < 100:
    print("...ONE CONTOUR SKIPPED... (boxW*boxH) < 100")
    continue

if xmin / width <= 0 or xmax / width <= 0 or ymin / height <= 0 or ymax / height <= 0:
    print("...ONE CONTOUR SKIPPED... (x | y) <= 0")
    continue
if xmin / width >= 1 or xmax / width >= 1 or ymin / height >= 1 or ymax / height >= 1:
    print("...ONE CONTOUR SKIPPED... (x | y) >= 1")
    continue
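
The same checks can be packed into one function so they can be tested outside the conversion loop. This is a sketch under some assumptions: coordinates are in pixels, `width` and `height` are positive, and the name `is_valid_box` and the `min_area` parameter are made up for this example. One deliberate difference from the snippet above: it rejects boxes with negative width or height (xmax < xmin), not only boxes whose width or height is exactly zero.

```python
def is_valid_box(xmin, ymin, xmax, ymax, width, height, min_area=100):
    """Return (ok, reason) for one bounding box in pixel coordinates."""
    box_w = xmax - xmin
    box_h = ymax - ymin
    if box_w <= 0 or box_h <= 0:
        return False, "empty or inverted box"
    if box_w * box_h < min_area:
        return False, "box area below min_area"
    # With box_w > 0 and box_h > 0, these four comparisons cover the
    # normalized (x | y) <= 0 and (x | y) >= 1 checks.
    if xmin <= 0 or ymin <= 0 or xmax >= width or ymax >= height:
        return False, "box touches or crosses the image border"
    return True, ""
```

In the conversion loop this becomes `ok, reason = is_valid_box(...)`, followed by a `print` and `continue` when `ok` is false.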

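Another possible cause is having too much data in the evaluation record file. It is better to put only 10 images in the evaluation record file and change the eval config as follows: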

eval_config {
  num_visualizations: 10
  num_examples: 10
  eval_interval_secs: 3000
  max_evals: 1
  use_moving_averages: false
}

TensorFlow Object Detection API: training is very slow