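When I run a CNN on a set of TFRecords, I get an OOM error in which TensorFlow seems to be trying to create a very large tensor. My model is the MNIST model, slightly adapted for RGB images of size 200x200. I create the TFRecords with the Build_image_data.py script from the Inception model, then load them with the dataset.py and image_processing.py scripts from the Inception model.
I am running on an Nvidia 960m with 2GB of GPU memory and 16GB of system memory.
The error I get is: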
==== RESTART: C:\Users\User\stack\Projects\Neural Network\Nippler\cnn.py ====
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_max': 5, '_task_id': 0, '_session_config': None, '_master': '', '_tf_random_seed': None, '_environment': 'local', '_num_ps_replicas': 0, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600, '_evaluation_master': '', '_task_type': None, '_num_worker_replicas': 0, '_save_summary_steps': 100, '_model_dir': '/tmp/feature_model', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000013B77A3F9B0>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1
}
}
INFO:tensorflow:Create CheckpointSaverHook.
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[160000,1024]
[[Node: dense/kernel/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense/kernel, dense/kernel/Initializer/random_uniform)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\User\stack\Projects\Neural Network\Nippler\cnn.py", line 82, in <module>
tf.app.run()
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:\Users\User\stack\Projects\Neural Network\Nippler\cnn.py", line 71, in main
feature_classifier.fit(input_fn=lambda:image_processing.inputs(training_data, batch_size), steps=200000, monitors=[logging_hook])
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\util\deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1003, in _train_model
config=self._session_config
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 352, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 648, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 477, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 822, in __init__
_WrappedSession.__init__(self, self._create_session())
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 827, in _create_session
return self._sess_creator.create_session()
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 538, in create_session
self.tf_sess = self._session_creator.create_session()
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\monitored_session.py", line 412, in create_session
init_fn=self._scaffold.init_fn)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\session_manager.py", line 279, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[160000,1024]
[[Node: dense/kernel/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense/kernel, dense/kernel/Initializer/random_uniform)]]
Caused by op 'dense/kernel/Assign', defined at:
File "<string>", line 1, in <module>
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\idlelib\run.py", line 130, in main
ret = method(*args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\idlelib\run.py", line 357, in runcode
exec(code, self.locals)
File "C:\Users\User\stack\Projects\Neural Network\Nippler\cnn.py", line 82, in <module>
tf.app.run()
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:\Users\User\stack\Projects\Neural Network\Nippler\cnn.py", line 71, in main
feature_classifier.fit(input_fn=lambda:image_processing.inputs(training_data, batch_size), steps=200000, monitors=[logging_hook])
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\util\deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 955, in _train_model
model_fn_ops = self._get_train_ops(features, labels)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1162, in _get_train_ops
return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\estimator.py", line 1133, in _call_model_fn
model_fn_results = self._model_fn(features, labels, **kwargs)
File "C:\Users\User\stack\Projects\Neural Network\Nippler\cnn.py", line 30, in cnn_model_fn
dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\layers\core.py", line 215, in dense
return layer.apply(inputs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\layers\base.py", line 492, in apply
return self.__call__(inputs, *args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\layers\base.py", line 434, in __call__
self.build(input_shapes[0])
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\layers\core.py", line 118, in build
trainable=True)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\layers\base.py", line 374, in add_variable
trainable=trainable and self.trainable)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1065, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 962, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 367, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 352, in _true_getter
use_resource=use_resource)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 725, in _get_single_variable
validate_shape=validate_shape)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variables.py", line 200, in __init__
expected_shape=expected_shape)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\variables.py", line 309, in _init_from_args
validate_shape=validate_shape).op
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\state_ops.py", line 271, in assign
validate_shape=validate_shape)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 45, in assign
use_locking=use_locking, name=name)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[160000,1024]
[[Node: dense/kernel/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](dense/kernel, dense/kernel/Initializer/random_uniform)]]
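It seems to be trying to allocate a tensor of shape [160000,1024]. My records contain only about 1400 200x200 RGB images. Why does it run out of memory even with a batch_size of 10?
Here is my complete example: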
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib
import image_processing
import dataset
tf.logging.set_verbosity(tf.logging.INFO)
height = 200
width = 200
channels = 3
batch_size = 10
def cnn_model_fn(features, labels, mode):
    # Reshape the flat input into a batch of images.
    input_layer = tf.reshape(features, [-1, width, height, channels])
    conv1 = tf.layers.conv2d(inputs=input_layer, filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
    conv2 = tf.layers.conv2d(inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
    # Two 2x2 pools shrink each spatial dimension by 4: 50 * 50 * 64 = 160000 features.
    pool2_flat = tf.reshape(pool2, [-1, (int(width/4)) * (int(height/4)) * 64])
    # The kernel of this layer has shape [160000, 1024]; this is the tensor in the OOM error.
    dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
    dropout = tf.layers.dropout(inputs=dense, rate=0.4, training=mode == learn.ModeKeys.TRAIN)
    logits = tf.layers.dense(inputs=dropout, units=2)
    loss = None
    train_op = None
    # Compute the loss in both TRAIN and EVAL modes.
    if mode != learn.ModeKeys.INFER:
        onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=2)
        loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
    # Build the training op only in TRAIN mode.
    if mode == learn.ModeKeys.TRAIN:
        train_op = tf.contrib.layers.optimize_loss(loss=loss, global_step=tf.contrib.framework.get_global_step(), learning_rate=0.001, optimizer="SGD")
    predictions = {
        "classes": tf.argmax(input=logits, axis=1),
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }
    return model_fn_lib.ModelFnOps(mode=mode, predictions=predictions, loss=loss, train_op=train_op)


def main(unused_argv):
    training_data = dataset.Dataset("train-00000-of-00001", "train")
    validation_data = dataset.Dataset("validation-00000-of-00001", "validation")
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # Note: the Estimator below creates its own session, so this config is not
    # applied to training (the log above shows '_session_config': None).
    sess = tf.InteractiveSession(config=config)
    feature_classifier = learn.Estimator(model_fn=cnn_model_fn, model_dir="/tmp/feature_model")
    tensors_to_log = {"probabilities": "softmax_tensor"}
    logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=50)
    feature_classifier.fit(input_fn=lambda: image_processing.inputs(training_data, batch_size), steps=200000, monitors=[logging_hook])
    metrics = {
        "accuracy": learn.MetricSpec(metric_fn=tf.metrics.accuracy, prediction_key="classes"),
    }


if __name__ == "__main__":
    tf.app.run()
Is the way I am feeding in the data wrong?
Answer 0 (score: 1)
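As far as I understand, the large tensor comes from the first fully connected layer, dense, in cnn_model_fn. After the two pooling layers, the spatial size is reduced from 200x200 to 50x50 with 64 feature maps, so each example is flattened to 50 * 50 * 64 = 160000 features before it enters dense. The weight matrix (kernel) of dense must therefore have shape [64 * 50 * 50, 1024] = [160000, 1024], which is exactly what the error message reports. This is the size of the parameters and has nothing to do with batch_size. As float32 that one matrix takes about 625 MiB, and the failing op (dense/kernel/Assign) runs during variable initialization, before any batch is processed. Try reducing the number of parameters, or use a GPU with more memory.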
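To make the arithmetic above concrete, here is a small sketch in plain Python (no TensorFlow needed). The helper names `flattened_size` and `dense_kernel_mib` are made up for illustration, and the two alternatives at the end (a smaller dense layer, two extra pooling layers) are only examples of "reducing the number of parameters", not something tested against the asker's data. It assumes 2x2 max-pooling with stride 2 and no padding, as in the question's code, and float32 weights.

```python
def flattened_size(height, width, channels, num_pools):
    """Features per example after `num_pools` 2x2, stride-2 max-pools (no padding)."""
    for _ in range(num_pools):
        height //= 2  # each pool halves the spatial size, rounding down
        width //= 2
    return height * width * channels


def dense_kernel_mib(inputs, units, bytes_per_value=4):
    """Size in MiB of a dense layer's weight matrix of shape [inputs, units]."""
    return inputs * units * bytes_per_value / 2**20


# The model from the question: 200x200 input, two pools, 64 filters, 1024 units.
flat = flattened_size(200, 200, 64, num_pools=2)
print(flat)                                # 160000
print(dense_kernel_mib(flat, 1024))        # 625.0

# Hypothetical alternative 1: keep the conv stack, shrink the dense layer to 128 units.
print(dense_kernel_mib(flat, 128))         # 78.125

# Hypothetical alternative 2: add two more pooling layers before flattening.
smaller_flat = flattened_size(200, 200, 64, num_pools=4)
print(smaller_flat)                        # 9216
print(dense_kernel_mib(smaller_flat, 1024))  # 36.0
```

Note that 625 MiB is only the kernel itself. Training also needs a gradient of the same shape, and the initializer output (dense/kernel/Initializer/random_uniform in the error) is likely another tensor of that size, so a 2GB card can plausibly run out of memory at this step.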