Double-batching TensorFlow input data

Date: 2017-02-09 19:09:59

Tags: tensorflow nlp

I'm implementing a convnet for token classification of string data. I need to pull the string data from a TFRecord, shuffle it in batches, then do some processing that expands the data, and batch it a second time. Is this possible with two shuffle_batch operations?

Here is what I need to do:

  1. Enqueue the filenames into a filename queue
  2. Put each serialized Example into a shuffle_batch
  3. When I pull each example out of the shuffle batch, I need to PAD it, replicate it by its sequence length, and merge in position vectors; this creates multiple examples from each original example in the first batch. I then need to batch again.
  4. Of course, one solution is to just preprocess the data before loading it into TF, but that takes up much more disk space.

    Data

    Here is some sample data. I have two "examples". Each example contains the features of a tokenized sentence and a label for each token:

    sentences = [
                 ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.'],
                 ['then', 'the', 'lazy', 'dog', 'slept', '.']
               ]
    sent_labels = [
                ['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'O', 'ANIMAL', 'O'],
                ['O', 'O', 'O', 'ANIMAL', 'O', 'O']
              ]
    

    Each "example" now has features like the following (somewhat reduced for clarity):

    features {
      feature {
        key: "labels"
        value {
          bytes_list {
            value: "O"
            value: "O"
            value: "O"
            value: "ANIMAL"
            ...
           }
        }
      }
    
      feature {
        key: "sentence"
        value {
          bytes_list {
            value: "the"
            value: "quick"
            value: "brown"
            value: "fox"
            ...
          }
        }
      }
    }
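
    For reference, a minimal sketch of how such an Example could be built and written to the TFRecord file in the first place. This writer code is my own illustration under that assumption, not part of the original post:

    import tensorflow as tf

    def make_example(tokens, labels):
        # Mirror the proto above: one bytes_list feature per key.
        return tf.train.Example(features=tf.train.Features(feature={
            'sentence': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[t.encode() for t in tokens])),
            'labels': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[l.encode() for l in labels])),
        }))

    # Write one serialized Example per sentence.
    with tf.python_io.TFRecordWriter(outfilepath) as writer:
        for tokens, labels in zip(sentences, sent_labels):
            writer.write(make_example(tokens, labels).SerializeToString())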
    

    Transformation

    After batching the sparse data, I get back each sentence as a list of tokens:

    ['the', 'quick', 'brown', 'fox', ...]

    I need to first PAD the list to a predetermined SEQ_LEN, and then insert position indices into each example, rotating the positions so that the token I want to classify is at position 0 and every other position index is relative to position 0:

    [
     ['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4],  # classify 'the'
     ['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2, 'PAD', 3],  # classify 'quick'
     ['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1, 'PAD', 2],  # classify 'brown'
     ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1],  # classify 'fox'
    ]
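
    A plain-Python sketch of this expansion, as a reference implementation of the replicate_and_insert_positions helper used in the graph code below (the implementation is my assumption; only the input/output format comes from the post):

    def replicate_and_insert_positions(padded_tokens, sent_len):
        # One row per real (non-PAD) token; interleave every token with its
        # position relative to the token being classified.
        rows = []
        for target in range(sent_len):
            row = []
            for pos, token in enumerate(padded_tokens):
                row.extend([token, pos - target])
            rows.append(row)
        return rows

    replicate_and_insert_positions(['the', 'quick', 'brown', 'fox', 'PAD'], 4)
    # -> the four rows shown above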
    

    Batching and repacking the data

    Here is a simplified version of what I'm trying to do:

    # Enqueue the Filenames and serialize 
    filenames = [outfilepath]
    fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ')
    reader = tf.TFRecordReader()
    key, serialized_example = reader.read(fq)
    
    # Dequeue Examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time
    initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, capacity=capacity, min_after_dequeue=min_after_dequeue)
    
    
    # Parse Sparse Tensors, make into single dense Tensor
    # ['the', 'quick', 'brown', 'fox']
    parsed = tf.parse_example(initial_batch, features=feature_mapping)
    dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>')
    sent_len = tf.shape(dense_tensor_sentence)[1]
    
    SEQ_LEN = 5
    NUM_PADS = SEQ_LEN - sent_len
    #['the', 'quick', 'brown', 'fox', 'PAD']
    padded_sentence = pad(dense_tensor_sentence, NUM_PADS)
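    # pad() above is the question's placeholder. One possible implementation
    # (my sketch, not from the original post): tile a '<PAD>' cell and
    # concatenate it along the token axis:
    #   pad_block = tf.tile(tf.constant([['<PAD>']]), [1, NUM_PADS])
    #   padded_sentence = tf.concat([dense_tensor_sentence, pad_block], axis=1)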
    
    # make sent_len X SEQ_LEN copy of sentence, position vectors
    #[
    # ['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4],
    # ['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2, 'PAD', 3],
    # ['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1, 'PAD', 2],
    # ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1],
    # NOTE: There is no row where PAD is at position 0, because I don't
    # want to classify the PAD token
    #]
    examples_with_positions = replicate_and_insert_positions(padded_sentence)
    
    # While my SEQ_LEN will be constant, the sent_len will not. Therefore,
    # I don't know the number of rows, but I can guarantee the number of
    # columns. shape = (?, SEQ_LEN)
    
    dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN)
    
    # Try Random Shuffle Queue: 
    
    # Rebatch <-- This is where the problem is
    #reshape_concat.set_shape((None, SEQ_LEN))
    
    random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,))
    random_queue.enqueue_many(dynamic_input)
    batch = random_queue.dequeue_many(4)
    
    
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables())
    
    sess = create_session()
    sess.run(init_op)
    
    #tf.get_default_graph().finalize()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    
    try:
      i = 0  
      while True:
        print sess.run(batch)
    
        i += 1
    except tf.errors.OutOfRangeError as e:
      print "No more inputs."
    

    Edit

    I'm now trying to use a RandomShuffleQueue. On each enqueue_many I want to enqueue a batch with shape (None, SEQ_LEN). I've modified the code above to reflect this.

    I no longer get complaints about the input shape, but the enqueue hangs at sess.run(batch).
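
    A likely cause of the hang, sketched below: the enqueue_many op is built but never handed to a QueueRunner, so no thread ever fills the queue and dequeue_many blocks forever. The wiring here is my suggestion, not something the original post tried:

    enqueue_op = random_queue.enqueue_many(dynamic_input)
    # Register the op so tf.train.start_queue_runners spawns a thread for it.
    tf.train.add_queue_runner(tf.train.QueueRunner(random_queue, [enqueue_op]))
    batch = random_queue.dequeue_many(4)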

1 answer:

Answer 0 (score: 1)

I was approaching this whole problem incorrectly. I mistakenly believed that I had to define the full shape of the batch when enqueuing, but in fact I only needed to define the shape of each element being fed in, and set enqueue_many=True on tf.train.shuffle_batch.

Here is the correct code:
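
A minimal sketch of that fix, replacing the hand-built RandomShuffleQueue with tf.train.shuffle_batch; the batch_size, capacity, and min_after_dequeue values below are assumptions carried over from the question's queue:

    # Only the per-element shape (SEQ_LEN,) needs to be static; with
    # enqueue_many=True each row of dynamic_input becomes one queue element.
    dynamic_input.set_shape((None, SEQ_LEN))
    batch = tf.train.shuffle_batch(
        [dynamic_input],
        batch_size=4,
        capacity=10000,
        min_after_dequeue=50,
        enqueue_many=True)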