Question

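I've been trying to train a CNN written with TensorFlow's implementation of Keras (tf.keras). Training appears to stall on reaching the first epoch, even though nvidia-smi shows that my GPU memory is still in use. There are no error messages or tracebacks in the terminal either, which makes this tricky to debug. I also wrote this code using TF Estimators and Datasets, and when I left that version running overnight the network did not train. So I don't think this is simply a matter of letting the code run longer. It may be something I've done, but it could also be due to the (supposedly fixed) bug described in the second link below.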
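
Memory in use is not by itself evidence that training is running: by default TensorFlow reserves most of the GPU's memory as soon as a session is created, so nvidia-smi can show a nearly full card while the GPU sits idle. The utilization figure is more telling. A minimal sketch of polling it from Python follows; the helper names `parse_gpu_stats` and `poll_gpu` are hypothetical, and it assumes nvidia-smi is on the PATH, supports the `--query-gpu` option, and reports plain numbers for both fields.

```python
import subprocess
import time

# Ask nvidia-smi for utilization (%) and used memory (MiB) as bare CSV.
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(text):
    """Parse 'util, mem' CSV lines into a list of (util_percent, mem_mib) tuples."""
    stats = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def poll_gpu(samples=5, interval_s=2.0):
    """Print utilization and memory for each GPU a few times (needs nvidia-smi)."""
    for _ in range(samples):
        out = subprocess.check_output(QUERY).decode()
        for idx, (util, mem) in enumerate(parse_gpu_stats(out)):
            print("GPU %d: %d%% utilization, %d MiB used" % (idx, util, mem))
        time.sleep(interval_s)
```

Running `poll_gpu()` from a second terminal while the script appears stuck would show whether utilization stays near 0% despite a large memory figure, which would suggest the process has allocated memory but is not doing any work.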

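So far I've also tried using the "verbose" argument of model.fit() to track the training process and see whether anything is happening. I don't see anything in the terminal. Other people who have hit this problem still seem to get a progress bar.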
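
When a script hangs without printing anything, the standard-library `faulthandler` module can at least show where the Python side is stuck. The sketch below is a generic debugging aid, not something specific to Keras; the wrapper name `run_with_watchdog` and the default timeout are assumptions for illustration. It asks `faulthandler` to dump every thread's stack if the wrapped call has not returned within the timeout.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout_s=600, log_file=None, repeat=True):
    """Call fn(); if it runs longer than timeout_s, dump all thread stacks.

    The dump goes to log_file (an open file with a working fileno()), or to
    stderr when log_file is None. The call itself is not interrupted.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=target)
    try:
        return fn()
    finally:
        # Stop the watchdog once the call returns or raises.
        faulthandler.cancel_dump_traceback_later()

# Hypothetical usage with the training call from the script:
# run_with_watchdog(lambda: model.fit(data, labels, epochs=5, verbose=2))
```

If the dump shows the main thread sitting inside a TensorFlow session call, the stall is below the Python layer; if it is still in the data-loading loops, training has not started yet.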

I've also tried logging with TensorBoard and saving model checkpoints. No checkpoints are saved, and TensorBoard doesn't appear to have saved any graph either.
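
ModelCheckpoint saves at the end of an epoch by default, so missing checkpoints are consistent with the first epoch never finishing. It is still worth ruling out a path problem: the callback fills in the filepath template with the epoch number and the logged metrics, and depending on the Keras version the target directory may need to exist beforehand. A small standard-library sketch of both checks follows; the helper name `preview_checkpoint_path` and the sample metric value are made up for illustration.

```python
import os


def preview_checkpoint_path(template, epoch, logs, create_dir=True):
    """Expand a ModelCheckpoint-style template and optionally create its directory.

    epoch is 1-based here, matching the number that appears in saved file names.
    Raises KeyError if the template names a metric that is missing from logs.
    """
    path = template.format(epoch=epoch, **logs)
    directory = os.path.dirname(path)
    if create_dir and directory:
        os.makedirs(directory, exist_ok=True)
    return path


template = "./Checkpoints/weights.{epoch:02d}-{val_loss:.2f}.hdf5"
# Hypothetical metric value, only to show how the template expands.
example = preview_checkpoint_path(template, epoch=1, logs={"val_loss": 0.5349},
                                  create_dir=False)
print(example)  # ./Checkpoints/weights.01-0.53.hdf5
```

Because the template references `val_loss`, it only expands when validation data is passed to model.fit(), which the script does.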

Any ideas as to what might be causing this?

Can't get past first epoch -- just hangs [Keras Transfer Learning Inception]

Keras fit freezes at the end of the first epoch

import os
import tensorflow as tf
from tensorflow import keras
import cv2
import numpy as np
from tensorflow.python.framework.graph_util import convert_variables_to_constants
from tensorflow.python.keras import backend as K

cwd = os.getcwd()
log_dir = cwd + "/Keras_Model/"
callbacks = [keras.callbacks.ModelCheckpoint(filepath="./Checkpoints/weights.{epoch:02d}-{val_loss:.2f}.hdf5"),
         keras.callbacks.TensorBoard(log_dir="./logs")]

def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
    """
    TAKEN FROM HERE: https://stackoverflow.com/questions/45466020/how-to-export-keras-h5-to-tensorflow-pb
    Freezes the state of a session into a pruned computation graph. Used later to save model as TF pb file.

    Creates a new computation graph where variable nodes are replaced by
    constants taking their current value in the session. The new graph will be
    pruned so subgraphs that are not necessary to compute the requested
    outputs are removed.

    @param session The TensorFlow session to be frozen.
    @param keep_var_names A list of variable names that should not be frozen,
                          or None to freeze all the variables in the graph.
    @param output_names Names of the relevant graph outputs.
    @param clear_devices Remove the device directives from the graph for better portability.
    @return The frozen graph definition.
    """
    graph = session.graph
    with graph.as_default():
        freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
        output_names = output_names or []
        output_names += [v.op.name for v in tf.global_variables()]
        input_graph_def = graph.as_graph_def()
        if clear_devices:
            for node in input_graph_def.node:
                node.device = ""
        frozen_graph = convert_variables_to_constants(session, input_graph_def,
                                                      output_names, freeze_var_names)
        return frozen_graph

### IMPORT TRAINING IMAGES AS NUMPY ARRAY ###

t_dir = cwd + "/data-1/training/" 
e_dir = cwd + "/data-1/evaluation"

xtrain = []
ytrain = []

print(" - Collating training data and labels... - ")

for subdir, dirs, files in os.walk(t_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xtrain.append(x)
        ytrain.append(y)

data = np.asarray(xtrain)
print(" - Training data collated. - ")
labels = np.asarray(ytrain)
print(" - Training labels collated. - ")


### IMPORT EVALUATION IMAGES AS NUMPY ARRAY ###

xeval = []
yeval = []

print(" - Collating validation data and labels... - ")

for subdir, dirs, files in os.walk(e_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xeval.append(x)
        yeval.append(y)

val_data = np.asarray(xeval)
print(" - Validation data collated. - ")
val_labels = np.asarray(yeval)
print(" - Validation labels collated. - ")

### CREATE MODEL ###

model = keras.Sequential()

model.add(keras.layers.Conv2D(filters=32, kernel_size=5, strides=1, padding="same", data_format="channels_last", activation="relu", input_shape=(480, 640, 3)))

model.add(keras.layers.GlobalMaxPool2D(data_format="channels_last"))

model.add(keras.layers.Dense(64, activation="relu"))

model.add(keras.layers.Dropout(0.4)) # --> Change dropout rate here.

model.add(keras.layers.Dense(8, activation="softmax"))

model.compile(optimizer=tf.train.AdamOptimizer(0.001), # --> Choose learning rate here.
              loss=keras.losses.sparse_categorical_crossentropy,
              # Labels are integer class ids, so use the sparse accuracy metric.
              metrics=[keras.metrics.sparse_categorical_accuracy])

print(" - Model created... - ")
print(" - Model Summary - ")
model.summary() # --> Print model summary.

### TRAIN AND EVALUATE MODEL ###

print(" - Training model... - ")
model.fit(data, labels, epochs = 5, batch_size=32, callbacks=callbacks, validation_data=(val_data, val_labels), verbose = 2)
print(" - Model trained! - ")

### SAVE MODEL AS H5 AND PB FILES ###

model.save("./Keras_Model/model.h5", save_format="h5")
print(" - Saved model as h5. - ")

frozen_graph = freeze_session(K.get_session(), output_names=[out.op.name for out in model.outputs])
tf.train.write_graph(frozen_graph, "./Tensorflow_Model/", "model.pb", as_text=False)
print(" - Saved model as pb. - ")

print(" - Clearing session. - ")
K.clear_session() # --> clear_session lives in the Keras backend module, imported above as K.
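
One detail of the loading loops above may be worth checking: `cv2.imread` returns `None` instead of raising when it cannot read a file, and `os.walk` will pick up any non-image files in the data directories. A `None` entry, or an image of a different size, means the list can no longer be stacked into a single 4-D array for model.fit(). The sketch below flags such samples before training. `find_bad_samples` is a hypothetical helper, the expected shape is taken from the model's `input_shape`, and this is a sanity check rather than a confirmed cause of the hang.

```python
def find_bad_samples(paths, images, expected_shape=(480, 640, 3)):
    """Return (path, reason) pairs for images that are missing or mis-shaped.

    images holds whatever cv2.imread returned for each path: a NumPy array,
    or None when the file could not be read.
    """
    problems = []
    for path, img in zip(paths, images):
        if img is None:
            problems.append((path, "unreadable"))
        elif tuple(img.shape) != tuple(expected_shape):
            problems.append((path, "shape %s" % (tuple(img.shape),)))
    return problems
```

Using it would require keeping the file paths alongside `xtrain`, for example by appending `img` to a separate list inside the loop, and then checking that the returned list is empty before calling `np.asarray`.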

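I can also provide the version that uses TF Datasets and Estimators, or anything else that would help. Apologies if I've missed anything obvious; I've only just started using SO.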

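Update: I got home last night and ran this script on my own computer. It looks fairly clear that this isn't a usage problem, but rather an issue with TF itself or with the way it is configured on our server. That's a bit strange, since TF did work there at some point, but what can you do. Cheers.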

tf.keras - training not progressing past the first epoch despite GPU memory being in use

0 answers: