Question

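I am trying to use the Estimator API from tf.contrib.learn.estimator to build, fit, and evaluate a CNN image classifier. My code is based on abalone.py from the Creating Estimators tutorial. In addition, I import code from the cifar10 tutorial to provide the model and the input feed. The code is as follows: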

import tensorflow as tf
import cifar10


def model_fn(features, targets, mode, params):
    # Generate predictions from the cifar10 network
    logits = cifar10.inference(features)
    # Reduce the logits to class ids so predictions and metrics use labels
    predicted_classes = tf.argmax(logits, 1)
    prediction_dict = {"classes": predicted_classes}

    # Loss operation (targets are one-hot encoded)
    loss = tf.losses.softmax_cross_entropy(targets, logits, scope='loss')

    # Metrics for evaluation, computed on class ids / one-hot vectors
    # rather than on raw logits
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(
            tf.argmax(targets, 1), predicted_classes, name='accuracy'),
        "precision": tf.metrics.precision(
            targets, tf.one_hot(predicted_classes, 10), name='precision')
    }

    # Training operation
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.contrib.framework.get_global_step(),
        learning_rate=params["learning_rate"],
        optimizer="SGD")

    return tf.contrib.learn.ModelFnOps(
        mode=mode,
        predictions=prediction_dict,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops
    )


def input_fn():
    features, labels = cifar10.distorted_inputs()
    return features, tf.one_hot(labels, 10)


def eval_input_fn():
    # One-hot encode the labels here too, so that evaluation matches training
    features, labels = cifar10.inputs(eval_data=True)
    return features, tf.one_hot(labels, 10)


def main(args=None):
    # Set model params
    model_params = {"learning_rate": 0.1}

    # Create and fit estimator
    nn = tf.contrib.learn.Estimator(model_fn=model_fn, params=model_params)
    nn.fit(input_fn=input_fn, steps=5000)

    # Pass the function itself, not the result of calling it
    ev = nn.evaluate(input_fn=eval_input_fn, steps=1)
    print("Loss: %s" % ev["loss"])
    print("Accuracy: %s" % ev["accuracy"])
    print("Precision: %s" % ev["precision"])


if __name__ == '__main__':
    tf.app.run()

The error messages I get are as follows:

E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8507555840 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.13G (7656800256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 6.42G (6891120128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 5.78G (6202008064 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 5.20G (5581807104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

The error messages keep counting down the requested memory size and end with the following three lines:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
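
The countdown in the "failed to allocate" lines can be checked with a little arithmetic. The sketch below uses only the byte counts quoted in the log above and computes how much each request shrinks relative to the previous one. The function name `shrink_ratios` is made up for illustration. Each attempt comes out at roughly 90% of the one before it, which suggests a single allocation being retried with progressively smaller sizes rather than five unrelated failures. That reading is an inference from the numbers, not something the log states.

```python
from __future__ import division  # true division on Python 2 as well

# Byte counts copied from the five "failed to allocate" log lines above.
failed_allocations = [
    8507555840,
    7656800256,
    6891120128,
    6202008064,
    5581807104,
]


def shrink_ratios(sizes):
    """Return the ratio of each attempt to the attempt before it."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


ratios = shrink_ratios(failed_allocations)

for size, ratio in zip(failed_allocations[1:], ratios):
    # 2**30 bytes per GiB, matching the "G" figures printed in the log
    print("%d bytes (%.2f GiB) is %.4f of the previous attempt"
          % (size, size / 2**30, ratio))
```

If even the smallest request fails, the GPU most likely had very little free memory when the program started, for example because another process was already holding it.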

Answer 1

Add the following to the imports:

from tensorflow.contrib.learn.python.learn import run_config

In the main method:
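
One way the main method could make use of a run configuration is sketched below. It assumes the goal is to cap how much GPU memory the process claims. The sketch relies on `tf.contrib.learn.RunConfig` accepting a `gpu_memory_fraction` argument and on `Estimator` accepting a `config` argument, as they do in TensorFlow 1.x. The helper names `checked_gpu_fraction` and `build_estimator` and the value 0.5 are assumptions chosen for illustration, not taken from the answer.

```python
def checked_gpu_fraction(fraction):
    """Validate a per-process GPU memory fraction (hypothetical helper)."""
    fraction = float(fraction)
    if not 0.0 < fraction <= 1.0:
        raise ValueError(
            "gpu_memory_fraction must be in (0, 1], got %r" % fraction)
    return fraction


def build_estimator(model_fn, model_params, gpu_memory_fraction=0.5):
    """Create an Estimator that claims only part of the GPU memory.

    The default of 0.5 is an assumed value; tune it for your GPU.
    """
    # Imported here so the helper above works without TensorFlow installed.
    # Requires TensorFlow 1.x, where tf.contrib is still available.
    import tensorflow as tf

    config = tf.contrib.learn.RunConfig(
        gpu_memory_fraction=checked_gpu_fraction(gpu_memory_fraction))
    return tf.contrib.learn.Estimator(
        model_fn=model_fn, params=model_params, config=config)
```

In the question's `main`, the `tf.contrib.learn.Estimator(...)` line would then become `nn = build_estimator(model_fn, model_params)`. If another process already holds most of the GPU memory, lowering the fraction may not be enough and that process would need to be stopped first.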

Getting a CUDA_ERROR_OUT_OF_MEMORY when using images with the Estimator API r1.0
