How to get predictions and probabilities from a CNN for new input data in TensorFlow

Time: 2017-10-07 14:15:33

Tags: python-3.x tensorflow neural-network conv-neural-network

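Let me preface this by saying that this is the first question I have posted on SO. I have only just started using TensorFlow, and I have been trying to apply a convolutional neural network (CNN) model to classify the records in a .csv file, where each record represents an image from a microarray scan. (FYI: a microarray is a grid of DNA spotted onto a glass slide; each spot represents a specific DNA target sequence and is used to determine whether that DNA target is present in a sample. Each pixel holds a fluorescence intensity value from 0 to 1.) The file has about 200,000 records in total. Each record (image) has 10816 pixels, representing DNA sequences from a known virus, plus one index label that identifies the virus species. The pixels form a pattern that is unique to each virus. There are 2165 different viruses across the 200,000 records. I have trained the network on images from the labeled microarray dataset, but I am not having much luck when I pass in a new dataset to classify each record as one of the 2165 viruses and to obtain the predicted value and its probability. Here is the code I am currently using: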

import tensorflow as tf
import numpy as np
import csv

def extract_data(filename):
    print("extracting data...")
    NUM_LABELS = 2165 
    NUM_FEATURES = 10816
    labels = []
    fvecs = []
    rowCount = 0

#iterate over the rows, split the label from the features
#convert the labels to integers and features to floats

    for line in open(filename):
        rowCount = rowCount + 1
        row = line.split(',')
        labels.append(row[3])#(int(row[7])) #<<<IT ALWAYS PREDICTS THIS VALUE!
        for x in row [4:10820]:
            fvecs.append(float(x))

#convert the array of float arrays into a numpy float matrix
    fvecs_np = np.matrix(fvecs).astype(np.float32)

#convert the array of int labels into a numpy array
    labels_np = np.array(labels).astype(dtype=np.uint8)

#convert the int numpy array into a one-hot matrix
    labels_onehot = (np.arange(NUM_LABELS) == labels_np[:, None]).astype(np.float32)
    print("arrays converted")
    return fvecs_np, labels_onehot

def TestModels():
    fvecs_np, labels_onehot = extract_data("MicroarrayTestData.csv")
    print('RESTORING NN MODEL')
    weights = {}
    biases = {}
    sess=tf.Session()  
    init = tf.global_variables_initializer()

    #Load meta graph and restore weights

    ModelID = "MicroarrayCNN_Data-1000.meta" 
    print("RESTORING:::", ModelID)
    saver = tf.train.import_meta_graph(ModelID)
    saver.restore(sess,tf.train.latest_checkpoint('./'))

    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name("x:0")
    y = graph.get_tensor_by_name("y:0")
    keep_prob = tf.placeholder(tf.float32) 
    y_ = tf.placeholder("float", shape=[None, 2165])

    wc1 = graph.get_tensor_by_name("wc1:0")
    wc2 = graph.get_tensor_by_name("wc2:0")
    wd1 = graph.get_tensor_by_name("wd1:0")
    Wout = graph.get_tensor_by_name("Wout:0")
    bc1 = graph.get_tensor_by_name("bc1:0")
    bc2 = graph.get_tensor_by_name("bc2:0")
    bd1 = graph.get_tensor_by_name("bd1:0")
    Bout = graph.get_tensor_by_name("Bout:0")

    weights = {wc1, wc2, wd1, Wout}
    biases = {bc1, bc2, bd1, Bout}

    print("NEXTArgmax") 
    prediction=tf.argmax(y,1)
    probabilities = y
    predY = prediction.eval(feed_dict={x: fvecs_np, y: labels_onehot}, session=sess)
    probY = probabilities.eval(feed_dict={x: fvecs_np, y: labels_onehot},  session=sess)

    accuracy = tf.reduce_mean(tf.cast(prediction, "float"))
    print(sess.run(accuracy, feed_dict={x: fvecs_np, y: labels_onehot}))
    print("%%%%%%%%%%%%%%%%%%%%%%%%%%")
    print("Predicted::: ", predY, accuracy)
    print("%%%%%%%%%%%%%%%%%%%%%%%%%%")

    feed_dictTEST = {y: labels_onehot}
    probabilities=probY
    print("probabilities", probabilities.eval(feed_dict={x: fvecs_np}, session=sess))    

########## Run Analysis ###########
TestModels()

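So, when I run this code I get the correct predictions for the test set, although I am not sure I believe them, because the value I append on line 14 (see below) appears to be exactly what the model outputs as its prediction: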

labels.append(row[3])#<<<IT ALWAYS PREDICTS THIS VALUE!

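I don't understand this, and it makes me wonder whether I have set up the CNN incorrectly, because I expected it to ignore my input labels and determine the best match from the trained network based on the patterns it learned. The only explanation I can think of is that when I pass the values in for prediction, it is instead also training the model on this data and then predicting itself. Is that a correct assumption, or am I misunderstanding how TensorFlow works?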

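The other problem is that when I try to output the probabilities for all 2165 possible outputs, using code based on other tutorials, I get this error: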

InvalidArgumentError (see above for traceback): Shape [-1,2165] has negative dimensions
[[Node: y = Placeholder[dtype=DT_FLOAT, shape=[?,2165], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

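To me it looks like it refers to the correct layer, given the 2165 value in the tensor shape, but I don't understand the -1 value. So, to summarize, my questions are: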

  1. Given that I am getting back the value from the labels of my input data, is this the correct way to classify with this model?

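  2. Am I missing a layer, have I configured the model incorrectly for extracting the probabilities of all possible output classes, or am I using the wrong code to extract that information? I tried printing the accuracy to check whether it was working, but it prints the description of a tensor instead, so clearly that is incorrect as well.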

  3. (Additional information)

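    As requested, I am also including the original code I used to train the model, which is now below. As I iterate through the file, you can see that I train on a limited number of related records at a time, grouped by their taxonomic relationship. This is mainly because the Mac I am training on (a Mac Pro w/ 64GB RAM) tends to give me "kill -9" errors from overusing resources if I don't do it this way. There may be a better way to do it, but this seems to work.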

    Original Author: Aymeric Damien
    Project: https://github.com/aymericdamien/TensorFlow-Examples/
    
    from __future__ import print_function
    
    import tensorflow as tf
    import numpy as np
    import csv
    import random
    
    # Parameters
    
    num_epochs = 2
    train_size = 1609
    learning_rate = 0.001  #(larger >speed, lower >accuracy)
    training_iters = 5000 # How much do you want to train (more = better trained)
    batch_size = 32   #How many samples to train on, size of the training batch
    display_step = 10 # How often to display what is going on during training
    
    # Network Parameters
    n_input = 10816 # MNIST data input (img shape: 28*28)...in my case 104x104 = 10816(rough array size)
    n_classes = 2165 #3280 #2307 #787# Switched to 100 taxa/training set, dynamic was too wonky. 
    dropout = 0.75 # Dropout, probability to keep units.  Geoffrey Hinton's group developed it; it reduces overfitting and gives a more generalized model.
    
    # Functions
    
    def extract_data(filename):
        print("extracting data...")
        # arrays to hold the labels and feature vectors.
        NUM_LABELS = 2165 
        NUM_FEATURES = 10826
    
        taxCount = 0
        taxCurrent = 0
    
        labels = []
        fvecs = []
        rowCount = 0
    
        #iterate over the rows, split the label from the features
        #convert the labels to integers and features to floats
        print("entering CNN loop")
        for line in open(filename):
    
            rowCount = rowCount + 1
            row = line.split(',')
    
            taxCurrent = row[3]
            print("profile:", row[0:12])
            labels.append(int(row[3]))
            fvecs.append([float(x) for x in row [4:10820]])
    
        #convert the array of float arrays into a numpy float matrix
        fvecs_np = np.matrix(fvecs).astype(np.float32)
        #convert the array of int labels into a numpy array
        labels_np = np.array(labels).astype(dtype=np.uint8)
        #convert the int numpy array into a one-hot matrix
        labels_onehot = (np.arange(NUM_LABELS) == labels_np[:, None]).astype(np.float32)
        print("arrays converted")
        return fvecs_np, labels_onehot
    
    # Create some wrappers for simplicity
    def conv2d(x, W, b, strides=1): #Layer 1 : Convolutional layer
        # Conv2D wrapper, with bias and relu activation
        print("conv2d")
        x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') # Strides are the tensors...list of integers.  Tensors=data
        x = tf.nn.bias_add(x, b)  #bias is the tuning knob
        return tf.nn.relu(x) #rectified linear unit (activation function)
    
    def maxpool2d(x, k=2): #Layer 2 : Takes samples from the image. (This is a 4D tensor)
        print("maxpool2d")
        # MaxPool2D wrapper
        return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                              padding='SAME')
    
    # Create model
    def conv_net(x, weights, biases, dropout):
        print("conv_net setup")
        # Reshape input picture
        x = tf.reshape(x, shape=[-1, 104, 104, 1])  #-->52x52 , -->26x26x64
    
        # Convolution Layer
        conv1 = conv2d(x, weights['wc1'], biases['bc1']) #defined above already
        # Max Pooling (down-sampling)
        conv1 = maxpool2d(conv1, k=2)
        print(conv1.get_shape)
    
        # Convolution Layer
        conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])  #wc2 and bc2 are just placeholders...could actually skip this layer...maybe
        # Max Pooling (down-sampling)
        conv2 = maxpool2d(conv2, k=2)
        print(conv2.get_shape)
    
        # Fully connected layer
        # Reshape conv2 output to fit fully connected layer input
        fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
        fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
        fc1 = tf.nn.relu(fc1) #activation function for the NN
        # Apply Dropout
        fc1 = tf.nn.dropout(fc1, dropout)
    
        # Output, class prediction
        out = tf.add(tf.matmul(fc1, weights['Wout']), biases['Bout'])
        return out
    
    def Train_Network(Txid_IN, Sess_File_Name):
    
        import tensorflow as tf
        tf.reset_default_graph()
    
        x,y = 0,0
        weights = {}
        biases = {}
    
        # tf Graph input
        print("setting placeholders")
        x = tf.placeholder(tf.float32, [None, n_input], name="x")  #Gateway for data (images)
        y = tf.placeholder(tf.float32, [None, n_classes], name="y") # Gateway for data (labels)
        keep_prob = tf.placeholder(tf.float32) #dropout # Gateway for dropout(keep probability)
    
        # Store layers weight & bias
        #CREATE weights
    
        weights = {
            # 5x5 conv, 1 input, 32 outputs
            'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32]),name="wc1"), #
            # 5x5 conv, 32 inputs, 64 outputs
            'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64]),name="wc2"),
            # fully connected, 26*26*64 inputs, 1024 outputs
            'wd1': tf.Variable(tf.random_normal([26*26*64, 1024]),name="wd1"),
            # 1024 inputs, n_classes outputs (class prediction)
            'Wout': tf.Variable(tf.random_normal([1024, n_classes]),name="Wout")
        }
    
        biases = {
            'bc1': tf.Variable(tf.random_normal([32]), name="bc1"),
            'bc2': tf.Variable(tf.random_normal([64]), name="bc2"),
            'bd1': tf.Variable(tf.random_normal([1024]), name="bd1"),
            'Bout': tf.Variable(tf.random_normal([n_classes]), name="Bout")
        }
    
        # Construct model
        print("constructing model")
        pred = conv_net(x, weights, biases, keep_prob)
    
        print(pred)
    
        # Define loss(cost) and optimizer
        #cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y)) Deprecated version of the statement
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = pred, labels=y)) #added reduce_mean 6/27
    
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
    
        # Evaluate model
        correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    
        print("%%%%%%%%%%%%%%%%%%%%")
        print ("%%   ", correct_pred)
        print ("%%   ", accuracy)
        print("%%%%%%%%%%%%%%%%%%%%")
    
        # Initializing the variables
        #init = tf.initialize_all_variables()
        init = tf.global_variables_initializer()
    
        saver = tf.train.Saver()
    
        fvecs_np, labels_onehot = extract_data("MicroarrayDataOUT.csv")  #CHAGE TO PICORNAVIRUS!!!!!AHHHHHH!!!
        print("starting session")
        # Launch the graph
        FitStep = 0
        with tf.Session() as sess:  #graph is encapsulated by its session
            sess.run(init)
            step = 1
            # Keep training until reach max iterations (training_iters)
            while step * batch_size < training_iters:
                if FitStep >= 5:
                    break
                else:
                    #iterate and train
                    print(step)
                    print(fvecs_np, labels_onehot)
                    for step in range(num_epochs * train_size // batch_size):
                        sess.run(optimizer, feed_dict={x: fvecs_np, y: labels_onehot, keep_prob:dropout})  #no dropout???...added Keep_prob:dropout
                        if FitStep >= 5:
                            break
                        #else:
                    ###batch_x, batch_y = mnist.train.next_batch(batch_size)
                    # Run optimization op (backprop)
                    ###sess.run(optimizer, feed_dict={x: batch_x, y: batch_y,
                    ###                               keep_prob: dropout})          <<<<SOMETHING IS WRONG IN HERE?!!!
                        if step % display_step == 0:
                            # Calculate batch loss and accuracy
                                loss, acc = sess.run([cost, accuracy], feed_dict={x: fvecs_np,
                                                                              y: labels_onehot,
                                                                              keep_prob: 1.})
    
                                print("Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
                                  "{:.6f}".format(np.mean(loss)) + ", Training Accuracy= " + \
                                  "{:.5f}".format(acc))
                                TrainAcc = float("{:.5f}".format(acc))
                                #print("******", TrainAcc)
                                if TrainAcc >= .99: #Changed from .95 temporarily
                                    print(FitStep)
                                    FitStep = FitStep+1
                                saver.save(sess, Sess_File_Name, global_step=1000) #
                                print("Saved Session:", Sess_File_Name)
                        step += 1
            print("Optimization Finished!")
    
            print("Testing Accuracy:", \
                sess.run(accuracy, feed_dict={x: fvecs_np[:256],
                                              y: labels_onehot[:256],
                                              keep_prob: 1.}))
    
            #feed_dictTEST = {x: fvecs_np[50]}
            #prediction=tf.argmax(y,1)
            #print(prediction)
            #best = sess.run([prediction],feed_dictTEST)
            #print(best)
            print("DONE")
    
        sess.close()
    
    
    def Tax_Iterator(CSV_inFile, CSV_outFile): #Deprecate
    
        #Need to copy *.csv file to MySQL for sorting
    
        resultFileINIT = open(CSV_outFile,'w') 
        resultFileINIT.close()
    
        TaxCount = 0
        TaxThreshold = 2165
        ThresholdStep = 2165
        PrevTax = 0
        linecounter = 0
        #Open all GenBank profile list
        for line in open(CSV_inFile):
            linecounter = linecounter+1
            print(linecounter)
            resultFile = open(CSV_outFile,'a') 
            wr = csv.writer(resultFile, dialect='excel')
    
            # Check for new TXID
            row = line.split(',')
            print(row[7], "===", PrevTax)
            if row[7] != PrevTax:
                print("X1")
                TaxCount = TaxCount+1
                PrevTax = row[7]
    
            #Check if current Tax count is < or > threshold
                # < threshold
            print(TaxCount,"=+=", TaxThreshold)
            if TaxCount<=3300:
                print("X2")
    
                CurrentTax= row[7]
    
                CurrTxCount = CurrentTax  
    
                print("TaxCount=", TaxCount)
                print( "Add to CSV")
                print("row:", CurrentTax, "***", row[0:15])
    
                wr.writerow(row[0:-1])
    
               # is > threshold
            else:
                print("X3")
                # but same TXID....
                print(row[7], "=-=", CurrentTax)
                if row[7]==CurrentTax:
                    print("X4")
                    CurrentTax= row[7]
                    print("TaxCount=", TaxCount)
                    print( "Add to CSV")
                    print("row:", CurrentTax, "***", row[0:15])
    
                    wr.writerow(row[0:-1])
    
                # but different TXID...
                else:
    
                    print(row[7], "=*=", CurrentTax)
                    if row[7]>CurrentTax:
                        print("X5")
                        TaxThreshold=TaxThreshold+ThresholdStep
                        resultFile.close()
    
                        Sess_File_Name = "CNN_VirusIDvSPECIES_XXALL"+ str(TaxThreshold-ThresholdStep)
                        print("<<<< Start Training >>>>")
                        print("Training on :: ", CurrTxCount, "Taxa", TaxCount, "data points.")
    
                        Train_Network(CurrTxCount, Sess_File_Name)
                        print("Training complete")
                        resultFileINIT = open(CSV_outFile,'w') 
                        resultFileINIT.close()
                        CurrentTax= row[7]
    
                        #reset tax count
    
                        CurrTxCount = 0
                        TaxCount = 0
        resultFile.close()
    
        Sess_File_Name = "MicroarrayCNN_Data"+ str(TaxThreshold+ThresholdStep)
        print("<<<< Start Training >>>>")
    
        print("Training on :: ", CurrTxCount, "Taxa", TaxCount, "data points.")
        Train_Network(CurrTxCount, Sess_File_Name)
        resultFileINIT = open(CSV_outFile,'w') 
        resultFileINIT.close()
        CurrentTax= row[7]   
    
    
    Tax_Iterator("MicroarrayInput.csv", "MicroarrayOutput.csv") 
    

1 Answer:

Answer 0 (score: 0)

You define your prediction as prediction=tf.argmax(y,1), and in the feed_dict you feed labels_onehot for y. Consequently, your "prediction" is always equal to the labels.
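The effect can be reproduced without TensorFlow. The following is a minimal NumPy sketch, not the asker's actual graph: `logits` is a made-up array standing in for the output of `conv_net`, and the `softmax` helper is written here purely for illustration. It shows that taking the argmax of the fed labels simply returns those labels, whereas class predictions and probabilities have to be derived from the network output.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical values: 3 examples, 4 classes (the real model has 2165 classes).
labels = np.array([2, 0, 3])
labels_onehot = (np.arange(4) == labels[:, None]).astype(np.float32)

# Stand-in for the network output; these numbers are invented for the sketch.
logits = np.array([[0.1, 2.0, 0.3, -1.0],
                   [1.5, 0.2, 0.1, 0.0],
                   [0.0, 0.1, 0.2, 3.0]])

# argmax over the fed labels just echoes the labels back.
echoed = np.argmax(labels_onehot, axis=1)

# Predictions and probabilities must come from the logits instead.
probabilities = softmax(logits)
predicted = np.argmax(probabilities, axis=1)
accuracy = float(np.mean(predicted == labels))

print("echoed labels:", echoed)
print("predicted    :", predicted)
print("accuracy     :", accuracy)
```

In the TensorFlow 1.x graph from the question, the analogous step would presumably be to apply `tf.nn.softmax` and `tf.argmax` to the tensor returned by `conv_net` rather than to the placeholder `y`. That tensor has no explicit name in the training code, so fetching it after `import_meta_graph` would likely require naming it at training time; this is an assumption about how the code could be restructured, not something shown in the question.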

Since you did not post the code you used to train the network, I cannot tell you what exactly you need to change.

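Edit: I have looked at the underlying problem you are trying to solve. Based on your code, you are trying to train a neural network with 2165 different classes using 1609 training examples. How is that even possible? Even if every example had a different class, there would still be some classes without any training example. Or does one image belong to multiple classes? From your statement at the beginning of the question, I had assumed you were trying to output a real-valued number between 0 and 1.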
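The counting argument can be checked directly. This is a small sketch using only the standard library, assuming `train_size` really reflects the number of examples used; `count_unseen_classes` is a hypothetical helper, and the label list is synthetic, chosen as the best case in which every one of the 1609 examples has a distinct class.

```python
from collections import Counter

def count_unseen_classes(labels, num_classes):
    """Return how many class indices in range(num_classes) never occur in labels."""
    seen = Counter(labels)
    return sum(1 for c in range(num_classes) if seen[c] == 0)

NUM_CLASSES = 2165   # n_classes in the training script
TRAIN_SIZE = 1609    # train_size in the training script

# Best case: each training example carries a different class label.
best_case_labels = list(range(TRAIN_SIZE))

unseen = count_unseen_classes(best_case_labels, NUM_CLASSES)
print(unseen)  # 2165 - 1609 = 556 classes with no example at all
```

Any real label distribution with repeated classes can only increase that number.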

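I am actually surprised that the code works at all, since you seem to append only a single number to the labels list for each example, whereas your model expects a fixed-length list for each training example.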
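For reference, the `extract_data` function in the question does turn each single label into a fixed-length vector through its one-hot step. The sketch below replays that step on invented labels with NumPy; the `one_hot` helper is a hypothetical wrapper around the same expression. It also illustrates a possible pitfall, offered as an observation rather than a confirmed cause of the problem: casting labels to `np.uint8` wraps values above 255, which would matter if the labels span 0 to 2164.

```python
import numpy as np

NUM_LABELS = 2165  # same constant as in extract_data

# Invented integer labels; 300 and 2164 do not fit in an unsigned 8-bit integer.
labels = [0, 3, 300, 2164]

# Same cast as in the question: values wrap modulo 256.
labels_u8 = np.array(labels).astype(np.uint8)

# Alternative cast that can hold every class index.
labels_i64 = np.array(labels).astype(np.int64)

def one_hot(label_array, num_labels):
    """Compare each label against 0..num_labels-1 to build a one-hot matrix."""
    return (np.arange(num_labels) == label_array[:, None]).astype(np.float32)

onehot_u8 = one_hot(labels_u8, NUM_LABELS)
onehot_i64 = one_hot(labels_i64, NUM_LABELS)

print(onehot_i64.shape)               # one row of length NUM_LABELS per example
print(np.argmax(onehot_u8, axis=1))   # class indices after the uint8 cast
print(np.argmax(onehot_i64, axis=1))  # class indices after the int64 cast
```

If the label column does contain indices above 255, a wider integer type such as `np.int64` would keep them distinct.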