TensorFlow execution freezes for a small CNN

Time: 2017-01-27 07:51:42

Tags: tensorflow

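I am trying to run a simple convolutional neural network in TensorFlow, using the CPU version on Windows 10 with 40 GB of RAM. So far, however, execution hangs either immediately after the variables are initialized or after a few training iterations. Below is my code, along with a summary of what I am trying to achieve.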

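Each image contains a sequence of five letters, and I want to train a CNN to recognize the sequence in each image. To do this, I have two convolutional layers (height/width/channels: 4/4/5 and 4/4/10), each feeding into a ReLU layer, followed by two fully connected ReLU layers, and finally a softmax layer with a cross-entropy loss function.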
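
As a sanity check on the layer sizes, here is a minimal sketch of the shape arithmetic, assuming `SAME` padding and a stride of 2 in both convolutions, as in the code below. The helper name `same_conv_output` is made up for illustration and is not part of TensorFlow.

```python
import math


def same_conv_output(size, stride):
    """Output length along one axis for a SAME-padded convolution."""
    return math.ceil(size / stride)


height, width = 28, 150          # image_size from the question
for stride in (2, 2):            # strides of the two convolutional layers
    height = same_conv_output(height, stride)
    width = same_conv_output(width, stride)

channels = 10                    # out_channel * 2 after the second convolution
flattened = height * width * channels
print(height, width, flattened)  # 7 38 2660
```

With these numbers, each fully connected head sees 2660 input features per image.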

import numpy as np
import tensorflow as tf

graph = tf.Graph()

num_image = 5
image_size = (28, 150)
out_channel = 5
shape_conv1 = [4, 4, 1, out_channel]  # height, width, in_channel, out_channel
stride_conv1 = [1, 2, 2, 1]
shape_conv2 = [4, 4, out_channel, out_channel*2]  # height, width, in_channel, out_channel
stride_conv2 = [1, 2, 2, 1]
num_layer1 = 100
num_layer2 = 100

num_output = 10
num_batch = 200

size_intermediate = [1, np.ceil(np.ceil(image_size[0]/stride_conv1[1])/stride_conv2[1]), \
                 np.ceil(np.ceil(image_size[1]/stride_conv1[2])/stride_conv2[2]), out_channel*2]
size_trans = [int(i) for i in size_intermediate]

with graph.as_default():
    input_data = tf.placeholder(tf.float32, [num_batch-num_image+1, image_size[0], image_size[1], 1])
    input_labels = tf.placeholder(tf.float32, [num_image, num_batch-num_image+1, num_output])
    reg_coeff = tf.placeholder(tf.float32)

    weights_conv1 = tf.Variable(tf.truncated_normal(shape_conv1, 0.0, 0.1))
    bias_relu1 = tf.Variable(tf.zeros([out_channel]))
    weights_conv2 = tf.Variable(tf.truncated_normal(shape_conv2, 0.0, 0.1))
    bias_relu2 = tf.Variable(tf.zeros([out_channel*2]))
    weights_layer1 = tf.Variable(tf.truncated_normal(\
                        [num_image, size_trans[1]*size_trans[2]*size_trans[3], num_layer1], \
                        0.0, (num_layer1)**-0.5))
    bias_layer1 = tf.Variable(tf.zeros([num_image, 1, num_layer1]))
    weights_layer2 = tf.Variable(tf.truncated_normal([num_image, num_layer1, num_layer2], \
                        0.0, (num_layer2)**-0.5))
    bias_layer2 = tf.Variable(tf.zeros([num_image, 1, num_layer2]))
    weights_output = tf.Variable(tf.truncated_normal([num_image, num_layer2, num_output], 0.0, num_output**-0.5))
    bias_output = tf.Variable(tf.zeros([num_image, 1, num_output]))

    output_conv1 = tf.nn.conv2d(input_data, weights_conv1, stride_conv1, "SAME")
    output_relu1 = tf.nn.relu(output_conv1 + bias_relu1)
    output_conv2 = tf.nn.conv2d(output_relu1, weights_conv2, stride_conv2, "SAME")
    output_relu2 = tf.nn.relu(output_conv2 + bias_relu2)

    shape_inter = output_relu2.get_shape().as_list()
    input_inter = tf.reshape(output_relu2, [1, shape_inter[0], shape_inter[1]*shape_inter[2]*shape_inter[3]])
    ## One copy for each letter recognizer
    input_mid = tf.tile(input_inter, [num_image, 1, 1])

    input_layer1 = tf.matmul(input_mid, weights_layer1) + bias_layer1
    output_layer1 = tf.nn.relu(input_layer1)
    input_layer2 = tf.matmul(output_layer1, weights_layer2) + bias_layer2
    output_layer2 = tf.nn.relu(input_layer2)
    logits = tf.matmul(output_layer2, weights_output) + bias_output

    # Training prediction
    train_prediction = tf.nn.softmax(logits)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=input_labels))
    # Loss term for regularization
    loss_reg = reg_coeff*(tf.nn.l2_loss(weights_layer1)+tf.nn.l2_loss(bias_layer1)\
                                +tf.nn.l2_loss(weights_layer2)+tf.nn.l2_loss(bias_layer2)\
                                +tf.nn.l2_loss(weights_output)+tf.nn.l2_loss(bias_output))

    learning_rate = 0.1
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss+loss_reg)
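
The "one copy for each letter recognizer" step can be illustrated outside TensorFlow. The following is a minimal NumPy sketch of the tile-then-batched-matmul pattern, with small made-up sizes rather than the real ones; it only demonstrates the shapes involved and makes no claim about how any particular TensorFlow version handles the leading axis.

```python
import numpy as np

# Illustrative sizes only: 5 letter heads, batch of 4, 6 features, 3 hidden units.
num_heads, batch, features, hidden = 5, 4, 6, 3
rng = np.random.default_rng(0)

flat = rng.standard_normal((1, batch, features))              # shared conv features
weights = rng.standard_normal((num_heads, features, hidden))  # one matrix per letter

tiled = np.tile(flat, (num_heads, 1, 1))   # shape (5, 4, 6): one copy per head
per_head = np.matmul(tiled, weights)       # shape (5, 4, 3): one output per head

# In NumPy the leading axis broadcasts, so the explicit tile is optional here.
broadcast = np.matmul(flat, weights)
print(per_head.shape, np.allclose(per_head, broadcast))
```

Each slice `per_head[k]` is the output of head `k` applied to the same batch of features.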

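The CNN is fairly simple and fairly small, so I was very surprised to see it freeze right after all the variables were initialized, or at most after a few training steps. There is no output at all, and Ctrl+C does not interrupt the execution. I wonder whether it has anything to do with the Windows build of TensorFlow, but for now I am still looking for clues.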

Could anyone share suggestions or opinions on what might be causing my problem? Thanks!

Edit: As pointed out in the comments, there may be a problem with the way I feed data to the graph. So I am also posting that part of the code below.

num_steps = 20000

fixed_input = np.random.randint(0, 256, [num_batch-num_image+1, 28, 150, 1])
fixed_label = np.tile((np.random.choice(10, [num_batch-num_image+1, 1])==np.arange(10)).astype(np.float32), (5, 1, 1))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    loss1 = 0.0
    loss2 = 0.0

    for i in range(1, num_steps+1):
        feed_dict = {input_data : fixed_input, input_labels : fixed_label, reg_coeff : 2e-4}
        _, l1, l2, predictions = session.run([optimizer, loss, loss_reg, train_prediction], feed_dict=feed_dict)
        loss1 += l1
        loss2 += l2

        if i % 500 == 0:
            print("Batch/reg loss at step %d: %f, %f" % (i, loss1/500, loss2/500))
            loss1 = 0.0
            loss2 = 0.0
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, fixed_label))

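I am just using random inputs and labels to test whether the code runs. Unfortunately, execution again freezes within the first few training steps.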
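
The `accuracy` function called in the loop is not shown above. A minimal NumPy sketch of what it might look like, assuming it compares the arg-max class of the predictions with the one-hot labels along the last axis (this definition is an assumption, not the original helper):

```python
import numpy as np


def accuracy(predictions, labels):
    """Percentage of positions where the arg-max class matches the one-hot label."""
    predicted = np.argmax(predictions, axis=-1)
    expected = np.argmax(labels, axis=-1)
    return 100.0 * np.mean(predicted == expected)


# Labels built the same way as fixed_label: one-hot over 10 classes,
# repeated for each of the 5 letter positions.
classes = np.random.choice(10, [196, 1])
labels = np.tile((classes == np.arange(10)).astype(np.float32), (5, 1, 1))

print(labels.shape)              # (5, 196, 10)
print(accuracy(labels, labels))  # 100.0
```

The comparison `classes == np.arange(10)` relies on broadcasting a `(196, 1)` column against a length-10 row, which gives one one-hot row per sample.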

1 answer:

Answer 0: (score: 0)

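This is most likely due to queue runners. Could you try running your training steps in a MonitoredSession, which handles initialization and queue runners under the hood? It would look like this: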

with tf.train.MonitoredTrainingSession(...) as sess:
  sess.run(optimizer)