I have a custom estimator for a two-layer neural network, and I want to extract the validation-set performance at each epoch so I can control early stopping. I am using TensorFlow 1.4.1, running on three GTX 970 GPUs. My estimator looks like this (in the future I want to customize the architecture of my network, so I can't use a canned estimator):
def my_estimator(features, labels, mode, params):
    """NN estimator model_fn."""
    # n_hidden_1, n_hidden_2 and p are defined in the enclosing scope
    # set up regularizer and dropout rate
    regularizer = tf.contrib.layers.l2_regularizer(scale=params["l2_lambda"])
    drop_rate = params["dropout"]
    is_training = mode == tf.estimator.ModeKeys.TRAIN
    # input layer is just the features
    input_layer = features
    # hidden fully connected layer with n_hidden_1 neurons
    layer_1 = tf.layers.dense(inputs=input_layer, units=n_hidden_1,
                              activation=tf.nn.relu, kernel_regularizer=regularizer)
    # dropout for hidden 1 (active only during training)
    dropout_1 = tf.layers.dropout(inputs=layer_1, rate=drop_rate, training=is_training)
    # hidden fully connected layer with n_hidden_2 neurons
    layer_2 = tf.layers.dense(inputs=dropout_1, units=n_hidden_2,
                              activation=tf.nn.relu, kernel_regularizer=regularizer)
    # dropout for hidden 2
    dropout_2 = tf.layers.dropout(inputs=layer_2, rate=drop_rate, training=is_training)
    # output fully connected layer with p neurons
    output_layer = tf.layers.dense(inputs=dropout_2, units=p,
                                   activation=None, kernel_regularizer=regularizer)
    # reshape output layer to a 1-D tensor of predictions
    predictions = tf.reshape(output_layer, [-1])
    predictions_dict = {"predictions": predictions}
    # in prediction mode, return early
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=predictions_dict)
    # calculate loss as mean squared error plus the regularization term
    labels = tf.reshape(labels, [-1])
    objective_loss = tf.losses.mean_squared_error(labels, predictions)
    reg_losses = tf.cast(tf.losses.get_regularization_loss(), tf.float32)
    loss = objective_loss + reg_losses
    # add eval metric ops
    eval_metric_ops = {
        "rmse": tf.metrics.root_mean_squared_error(labels, predictions),
        "mae": tf.metrics.mean_absolute_error(labels, predictions),
    }
    # build training op; compute gradients once and reuse them for the norm
    learning_rate = tf.train.exponential_decay(params["learning_rate"],
                                               tf.train.get_global_step(),
                                               100, 0.96, staircase=True)
    gdoptim = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    grads_and_vars = gdoptim.compute_gradients(loss)
    train_op = gdoptim.apply_gradients(grads_and_vars,
                                       global_step=tf.train.get_global_step())
    grad_norm = tf.add_n([tf.nn.l2_loss(g) for g, v in grads_and_vars
                          if g is not None])
    # add some additional information to the summary writer
    # (Estimator merges and writes summaries automatically)
    tf.summary.scalar("cost", objective_loss)
    tf.summary.scalar("reg_cost", reg_losses)
    tf.summary.scalar("grad_norm", grad_norm)
    tf.summary.scalar("learning_rate", learning_rate)
    return tf.estimator.EstimatorSpec(mode, predictions=predictions_dict, loss=loss,
                                      train_op=train_op, eval_metric_ops=eval_metric_ops)
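For reference, `tf.train.exponential_decay` with `staircase=True` computes `initial_lr * decay_rate ** (global_step // decay_steps)`, so with the arguments above the learning rate drops by a factor of 0.96 every 100 steps. A minimal pure-Python sketch of that schedule (the function name is my own):

```python
def staircase_decay(initial_lr, step, decay_steps=100, decay_rate=0.96):
    """Mirror tf.train.exponential_decay(..., staircase=True):
    the rate is held constant within each decay_steps-wide window."""
    return initial_lr * decay_rate ** (step // decay_steps)

# constant within the first 100-step window, then one decay factor applied
print(staircase_decay(0.1, 0))    # -> 0.1
print(staircase_decay(0.1, 99))   # -> 0.1
print(staircase_decay(0.1, 100))  # -> roughly 0.096
```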
My training loop looks like this:
nn = tf.estimator.Estimator(model_fn=my_estimator, params=model_params, config=run_config)
for i in range(training_epochs):
    # train for one epoch, then evaluate on the validation set
    nn.train(input_fn=train_input_fn, steps=num_batches)
    val_result = nn.evaluate(input_fn=eval_val_input_fn)
    # decide whether to break based on val_result
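One way to make the break decision concrete is a simple patience counter over the validation RMSE that `evaluate` returns (the `"rmse"` key matches the `eval_metric_ops` dict above; `should_stop`, `patience`, and `min_delta` are names of my own, not part of the Estimator API):

```python
def should_stop(history, patience=5, min_delta=1e-4):
    """Return True when none of the last `patience` values improved on the
    best value seen before them by at least `min_delta`."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) > best_before - min_delta

# RMSE still falling over the last 3 epochs: keep training
assert not should_stop([0.9, 0.8, 0.7, 0.65, 0.6, 0.58], patience=3)
# RMSE plateaued for the last 3 epochs: stop
assert should_stop([0.9, 0.5, 0.51, 0.5, 0.5], patience=3)
```

Inside the epoch loop this would be used as: append `val_result["rmse"]` to a list each epoch, then `break` when `should_stop` returns True.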
train_input_fn iterates over a tf.Dataset in batches, while eval_val_input_fn iterates over a much smaller set in a single batch. The data is all continuous features and the labels are scalars (it is a regression task). The problem I'm running into is that my output is flooded with hundreds of device-creation messages:
2018-01-29 17:33:49.721025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0, compute capability: 5.2)
2018-01-29 17:33:49.721047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: GeForce GTX 970, pci bus id: 0000:03:00.0, compute capability: 5.2)
2018-01-29 17:33:49.854761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
2018-01-29 17:33:49.854793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0, compute capability: 5.2)
2018-01-29 17:33:49.854811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: GeForce GTX 970, pci bus id: 0000:03:00.0, compute capability: 5.2)
2018-01-29 17:33:50.260143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
2018-01-29 17:33:50.260177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0, compute capability: 5.2)
tf.Estimator seems to hide a lot of the session and graph details from the user. I don't think it should be necessary to create the devices over and over again - surely this could all be done within a single session?
Answer (score: 1):
Every time you call estimator.train, it starts a new session and reloads the weights from the latest checkpoint. That is why it keeps creating new sessions and devices.