tensorflow: Jupyter kernel dies when running a convolutional network

Time: 2018-05-21 05:12:35

Tags: python tensorflow neural-network jupyter-notebook

I'm trying to run the demo convolutional neural network from the code samples in the book "Practical Convolutional Neural Networks" by Sewak et al. It's a simple dog/cat classifier built with TensorFlow. The problem is that I'm running this TensorFlow code in a Jupyter notebook, and the kernel keeps dying whenever I execute the cell that starts training the network. I'm not sure whether this is a problem with the notebook, whether something is missing from the demo code, or whether this is a known issue and I simply shouldn't be using Jupyter notebooks for training.


So let me give some details about the environment. I have a Docker container with TensorFlow GPU, Keras, and the other CUDA libraries installed. My machine has 3 GPUs. Miniconda is installed inside the container, so I can load and run notebooks and so on.
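
For what it's worth, this is the minimal check I use to confirm that TensorFlow can actually see the GPUs from inside the container (nothing book-specific, just a sanity check):

    from tensorflow.python.client import device_lib

    # List every device TensorFlow can see from inside the container;
    # with 3 GPUs I would expect three /device:GPU:N entries.
    print(device_lib.list_local_devices())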

Here are some of my thoughts on what might be causing the notebook's Python 3.6 kernel to die.

  1. I haven't explicitly specified which GPU the TensorFlow code should use.
  2. There may be an issue with allowing GPU memory growth inside the container (https://github.com/tensorflow/tensorflow/issues/9829); see the sketch after this list.
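
In case it matters, here is a minimal sketch of how I think both points could be handled in TF 1.x. The `session` name and the choice of GPU 0 are my own assumptions, not anything from the book's code:

    import tensorflow as tf

    config = tf.ConfigProto()
    # Point 1: pin the process to a single GPU instead of claiming all three.
    config.gpu_options.visible_device_list = "0"   # assumption: use the first GPU
    # Point 2: allocate GPU memory on demand instead of grabbing it all up front.
    config.gpu_options.allow_growth = True

    session = tf.Session(config=config)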

I'm not very familiar with TensorFlow yet, so I don't know the root of the problem. And since the code runs inside a container, the usual debugging tools are more limited.
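
One debugging idea I have (my own workaround, assuming the script name from the repository linked below) is to launch the training script as a separate process, so that a low-level crash message reaches stderr instead of just killing the kernel:

    import subprocess

    # Run the training script as a plain process so a C-level crash
    # (e.g. CUDA out-of-memory, segfault) prints its reason to stderr
    # instead of silently killing the Jupyter kernel.
    result = subprocess.run(
        ["python", "CNN_DogvsCat_Classifier.py"],  # script from the repo linked below
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)                   # Python 3.6-compatible capture
    print(result.stdout)
    print(result.stderr)  # the real crash message, if any, shows up here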

The full training code is in this GitHub repository: https://github.com/PacktPublishing/Practical-Convolutional-Neural-Networks/blob/master/Chapter03/Dog_cat_classification/CNN_DogvsCat_Classifier.py

Below is the optimize function used for training. I'm not sure whether anyone can spot some specific piece that is missing.

    # Excerpt from the linked script; it relies on globals defined there
    # (data, session, optimizer, the x/y_true placeholders, batch sizes, etc.).
    import time
    from datetime import timedelta

    def optimize(num_iterations):
        # Ensure we update the global variable rather than a local copy.
        global total_iterations
    
        # Start-time used for printing time-usage below.
        start_time = time.time()
    
        best_val_loss = float("inf")
        patience = 0
    
        for i in range(total_iterations, total_iterations + num_iterations):
    
            # Get a batch of training examples.
            # x_batch now holds a batch of images and
            # y_true_batch are the true labels for those images.
            x_batch, y_true_batch, _, cls_batch = data.train.next_batch(train_batch_size)
            x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(train_batch_size)
    
            # Convert shape from [num examples, rows, columns, depth]
            # to [num examples, flattened image shape]
    
            x_batch = x_batch.reshape(train_batch_size, img_size_flat)
            x_valid_batch = x_valid_batch.reshape(train_batch_size, img_size_flat)
    
            # Put the batch into a dict with the proper names
            # for placeholder variables in the TensorFlow graph.
            feed_dict_train = {x: x_batch, y_true: y_true_batch}        
            feed_dict_validate = {x: x_valid_batch, y_true: y_valid_batch}
    
            # Run the optimizer using this batch of training data.
            # TensorFlow assigns the variables in feed_dict_train
            # to the placeholder variables and then runs the optimizer.
            session.run(optimizer, feed_dict=feed_dict_train)        
    
            # Print status at end of each epoch (defined as full pass through training Preprocessor).
            if i % int(data.train.num_examples/batch_size) == 0: 
                val_loss = session.run(cost, feed_dict=feed_dict_validate)
                epoch = int(i / int(data.train.num_examples/batch_size))
    
                acc, val_acc = print_progress(epoch, feed_dict_train, feed_dict_validate, val_loss)
                msg = "Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
                print(msg.format(epoch + 1, acc, val_acc, val_loss))
                print(acc)
                acc_list.append(acc)
                val_acc_list.append(val_acc)
                iter_list.append(epoch+1)
    
                if early_stopping:    
                    if val_loss < best_val_loss:
                        best_val_loss = val_loss
                        patience = 0
                    else:
                        patience += 1
                    if patience == early_stopping:
                        break
    
        # Update the total number of iterations performed.
        total_iterations += num_iterations
    
        # Ending time.
        end_time = time.time()
    
        # Difference between start and end-times.
        time_dif = end_time - start_time
    
        # Print the time-usage.
        print("Time elapsed: " + str(timedelta(seconds=int(round(time_dif)))))
    

0 Answers:

There are no answers.