tensorflow: Jupyter kernel dies when running a convolutional network

Time: 2018-05-21 05:12:35

Tags: python tensorflow neural-network jupyter-notebook

I'm trying to run the demo convolutional neural network from the code samples in the book "Practical Convolutional Neural Networks" by Sewak et al. It's a simple dog/cat classifier built with TensorFlow. The problem is that I'm running this TensorFlow code in a Jupyter notebook, and the kernel keeps dying whenever I execute the cell that starts training the network. I'm not sure whether this is a problem with the notebook, whether something is missing from the demo code, or whether this is a known issue and I simply shouldn't be using Jupyter notebooks for training.


So let me give some details about the environment. I have a Docker container with TensorFlow GPU, Keras, and the other CUDA libraries installed. My machine has 3 GPUs. Miniconda is installed inside the container, so I can load and run notebooks and so on.
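
For what it's worth, this is the minimal check I use to confirm that TensorFlow can actually see the GPUs from inside the container (nothing book-specific, just a sanity check):

    from tensorflow.python.client import device_lib

    # List every device TensorFlow can see from inside the container;
    # with 3 GPUs I would expect three /device:GPU:N entries.
    print(device_lib.list_local_devices())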

Here are some of my thoughts on what might be causing the notebook's Python 3.6 kernel to die.

  1. I haven't explicitly specified which GPU the TensorFlow code should use.
  2. There may be an issue with allowing GPU memory growth inside the container (https://github.com/tensorflow/tensorflow/issues/9829); see the sketch after this list.
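
In case it matters, here is a minimal sketch of how I think both points could be handled in TF 1.x. The `session` name and the choice of GPU 0 are my own assumptions, not anything from the book's code:

    import tensorflow as tf

    config = tf.ConfigProto()
    # Point 1: pin the process to a single GPU instead of claiming all three.
    config.gpu_options.visible_device_list = "0"   # assumption: use the first GPU
    # Point 2: allocate GPU memory on demand instead of grabbing it all up front.
    config.gpu_options.allow_growth = True

    session = tf.Session(config=config)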

I'm not very familiar with TensorFlow yet, so I don't know the root of the problem. And since the code runs inside a container, the usual debugging tools are more limited.
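
One debugging idea I have (my own workaround, assuming the script name from the repository linked below) is to launch the training script as a separate process, so that a low-level crash message reaches stderr instead of just killing the kernel:

    import subprocess

    # Run the training script as a plain process so a C-level crash
    # (e.g. CUDA out-of-memory, segfault) prints its reason to stderr
    # instead of silently killing the Jupyter kernel.
    result = subprocess.run(
        ["python", "CNN_DogvsCat_Classifier.py"],  # script from the repo linked below
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)                   # Python 3.6-compatible capture
    print(result.stdout)
    print(result.stderr)  # the real crash message, if any, shows up here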

The full training code is in this GitHub repository: https://github.com/PacktPublishing/Practical-Convolutional-Neural-Networks/blob/master/Chapter03/Dog_cat_classification/CNN_DogvsCat_Classifier.py

Below is the optimize function used for training. I'm not sure whether anyone can spot some specific piece that is missing.

    # Excerpt from the linked script; it relies on globals defined there
    # (data, session, optimizer, the x/y_true placeholders, batch sizes, etc.).
    import time
    from datetime import timedelta

    def optimize(num_iterations):
        # Ensure we update the global variable rather than a local copy.
        global total_iterations
    
        # Start-time used for printing time-usage below.
        start_time = time.time()
    
        best_val_loss = float("inf")
        patience = 0
    
        for i in range(total_iterations, total_iterations + num_iterations):
    
            # Get a batch of training examples.
            # x_batch now holds a batch of images and
            # y_true_batch are the true labels for those images.
            x_batch, y_true_batch, _, cls_batch = data.train.next_batch(train_batch_size)
            x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(train_batch_size)
    
            # Convert shape from [num examples, rows, columns, depth]
            # to [num examples, flattened image shape]
    
            x_batch = x_batch.reshape(train_batch_size, img_size_flat)
            x_valid_batch = x_valid_batch.reshape(train_batch_size, img_size_flat)
    
            # Put the batch into a dict with the proper names
            # for placeholder variables in the TensorFlow graph.
            feed_dict_train = {x: x_batch, y_true: y_true_batch}        
            feed_dict_validate = {x: x_valid_batch, y_true: y_valid_batch}
    
            # Run the optimizer using this batch of training data.
            # TensorFlow assigns the variables in feed_dict_train
            # to the placeholder variables and then runs the optimizer.
            session.run(optimizer, feed_dict=feed_dict_train)        
    
            # Print status at end of each epoch (defined as full pass through training Preprocessor).
            if i % int(data.train.num_examples/batch_size) == 0: 
                val_loss = session.run(cost, feed_dict=feed_dict_validate)
                epoch = int(i / int(data.train.num_examples/batch_size))
    
                acc, val_acc = print_progress(epoch, feed_dict_train, feed_dict_validate, val_loss)
                msg = "Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
                print(msg.format(epoch + 1, acc, val_acc, val_loss))
                print(acc)
                acc_list.append(acc)
                val_acc_list.append(val_acc)
                iter_list.append(epoch+1)
    
                if early_stopping:    
                    if val_loss < best_val_loss:
                        best_val_loss = val_loss
                        patience = 0
                    else:
                        patience += 1
                    if patience == early_stopping:
                        break
    
        # Update the total number of iterations performed.
        total_iterations += num_iterations
    
        # Ending time.
        end_time = time.time()
    
        # Difference between start and end-times.
        time_dif = end_time - start_time
    
        # Print the time-usage.
        print("Time elapsed: " + str(timedelta(seconds=int(round(time_dif)))))
    

0 Answers:

There are no answers.