Question

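I am implementing a convnet for token classification of string data. I need to read string data from a TFRecord, shuffle-batch it, run some processing that expands the data, and then batch it again. Is this possible with two shuffle_batch operations?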

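This is what I need to do:

Enqueue the filenames into a filename queue
For each serialized example, put it into a shuffle_batch
As I pull each example out of the shuffle batch, I need to pad it, replicate it by the sequence length, and merge in a position vector. This creates multiple examples for each original example in the first batch, so I need to batch again.

Of course, one solution is to preprocess the data before loading it into TF, but that would take up more disk space.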
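
The steps above can be sketched in plain Python, without TensorFlow, to show the intended data flow: shuffle whole examples, expand each one into several rows, then shuffle and batch the rows. This is only an illustration under assumptions; `expand` and `rebatch` are hypothetical names, and `expand` is a stand-in for the real pad/replicate/position step.

```python
import random


def expand(tokens):
    # Hypothetical stand-in for the expansion step:
    # one output row per token of the original example.
    return [(i, tokens) for i in range(len(tokens))]


def rebatch(examples, batch_size, seed=0):
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)  # first shuffle: whole examples
    rows = [row for ex in examples for row in expand(ex)]
    rng.shuffle(rows)  # second shuffle: expanded rows
    # Second batching: fixed-size batches, dropping a short remainder.
    return [rows[i:i + batch_size]
            for i in range(0, len(rows) - batch_size + 1, batch_size)]


batches = rebatch([['the', 'quick', 'brown', 'fox'],
                   ['then', 'the', 'dog']], batch_size=3)
```

The point of the sketch is that the second batch size is independent of how many rows any single example expands into.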

Data

Here is some sample data. I have two "examples". Each example contains the features of a tokenized sentence and a label for each token:

sentences = [
             ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.'],
             ['then', 'the', 'lazy', 'dog', 'slept', '.']
            ]
sent_labels = [
             ['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'O', 'ANIMAL', 'O'],
             ['O', 'O', 'O', 'ANIMAL', 'O', 'O']
            ]

Each "example" now has the following features (somewhat reduced for clarity):

features {
  feature {
    key: "labels"
    value {
      bytes_list {
        value: "O"
        value: "O"
        value: "O"
        value: "ANIMAL"
        ...
       }
    }
  }

  feature {
    key: "sentence"
    value {
      bytes_list {
        value: "the"
        value: "quick"
        value: "brown"
        value: "fox"
        ...
      }
    }
  }
}

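Transformation

After batching the sparse data, I receive a sentence as a list:

['the', 'quick', 'brown', 'fox', ...]

I first need to pad the list to a predetermined SEQ_LEN, and then insert position indices into each example, rotating the positions so that the token I want to classify is at position 0 and every other position index is relative to position 0: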

[
 ['the',  0, 'quick',  1, 'brown',  2, 'fox', 3, 'PAD', 4],  # classify 'the'
 ['the', -1, 'quick',  0, 'brown',  1, 'fox', 2, 'PAD', 3],  # classify 'quick'
 ['the', -2, 'quick', -1, 'brown',  0, 'fox', 1, 'PAD', 2],  # classify 'brown'
 ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1],  # classify 'fox'
]
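
As a minimal sketch of this transformation in plain Python (not the TensorFlow ops the question needs), the function below pads a sentence and emits one row per real token, with positions relative to the token being classified. The signatures of `pad` and `replicate_and_insert_positions` here are assumptions made for illustration; they are not the helpers referenced later in the question.

```python
def pad(tokens, seq_len, pad_token='PAD'):
    """Right-pad a token list to exactly seq_len entries."""
    if len(tokens) > seq_len:
        raise ValueError('sentence is longer than seq_len')
    return tokens + [pad_token] * (seq_len - len(tokens))


def replicate_and_insert_positions(tokens, seq_len, pad_token='PAD'):
    """Return one row per real token, interleaving tokens and relative positions."""
    padded = pad(tokens, seq_len, pad_token)
    rows = []
    # Iterate over real tokens only, so PAD never sits at position 0.
    for target in range(len(tokens)):
        row = []
        for i, tok in enumerate(padded):
            row.extend([tok, i - target])
        rows.append(row)
    return rows


rows = replicate_and_insert_positions(['the', 'quick', 'brown', 'fox'], seq_len=5)
```

Each row has 2 * SEQ_LEN entries, and the number of rows equals the unpadded sentence length, which is why the row count varies from example to example.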

Batching and rebatching the data

Here is a simplified version of what I am trying to do:

import tensorflow as tf  # TF 1.x queue-based input pipeline

# Enqueue the filenames and read serialized examples
filenames = [outfilepath]
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ')
reader = tf.TFRecordReader()
key, serialized_example = reader.read(fq)

# Dequeue Examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time
initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, capacity=capacity, min_after_dequeue=min_after_dequeue)


# Parse Sparse Tensors, make into single dense Tensor
# ['the', 'quick', 'brown', 'fox']
parsed = tf.parse_example(initial_batch, features=feature_mapping)
dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>')
sent_len = tf.shape(dense_tensor_sentence)[1]

SEQ_LEN = 5
NUM_PADS = SEQ_LEN - sent_len
#['the', 'quick', 'brown', 'fox', 'PAD']
padded_sentence = pad(dense_tensor_sentence, NUM_PADS)

# make sent_len X SEQ_LEN copy of sentence, position vectors
#[
# ['the',  0, 'quick',  1, 'brown',  2, 'fox', 3, 'PAD', 4],
# ['the', -1, 'quick',  0, 'brown',  1, 'fox', 2, 'PAD', 3],
# ['the', -2, 'quick', -1, 'brown',  0, 'fox', 1, 'PAD', 2],
# ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1],
# NOTE: There is no row where PAD is with a position 0, because I don't
# want to classify the PAD token
#]
examples_with_positions = replicate_and_insert_positions(padded_sentence)

# While my SEQ_LEN will be constant, the sent_len will not. Therefore,
# I don't know the number of rows, but I can guarantee the number of
# columns. shape = (?, SEQ_LEN)

dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN)

# Try Random Shuffle Queue: 

# Rebatch <-- This is where the problem is
#reshape_concat.set_shape((None, SEQ_LEN))

random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,))
random_queue.enqueue_many(dynamic_input)
batch = random_queue.dequeue_many(4)


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables())

sess = create_session()
sess.run(init_op)

#tf.get_default_graph().finalize()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
  i = 0
  while True:
    print(sess.run(batch))

    i += 1
except tf.errors.OutOfRangeError as e:
  print("No more inputs.")

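Edit

I am now trying to use a RandomShuffleQueue. On each enqueue, I want to enqueue a batch with shape (None, SEQ_LEN). I have modified the code above to reflect this.

I no longer get complaints about the input shapes, but the queue does hang on sess.run(batch).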

Answer 1

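I was approaching the whole problem incorrectly. I mistakenly thought I had to define the full shape of the batch when inserting into tf.train.shuffle_batch, but in fact I only need to define the shape of each element I am feeding in, and set enqueue_many=True.

This is the setting that fixes it:

enqueue_many=True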
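
To make the effect of this flag concrete, here is a plain-Python toy model of a shuffle queue. It is not a TensorFlow class; `ToyShuffleQueue` and its methods are hypothetical names that only mimic the idea that the queue is declared with the shape of a single element, while each enqueue call may contribute a different number of rows.

```python
import random


class ToyShuffleQueue:
    """Plain-Python stand-in for a shuffle queue; not a TensorFlow class."""

    def __init__(self, element_len, seed=0):
        self.element_len = element_len  # shape of ONE element, e.g. SEQ_LEN
        self.items = []
        self.rng = random.Random(seed)

    def enqueue_many(self, rows):
        # Each row along the first axis becomes its own queue element,
        # so the number of rows may differ from call to call.
        for row in rows:
            if len(row) != self.element_len:
                raise ValueError('row does not match the element shape')
            self.items.append(list(row))

    def dequeue_many(self, n):
        if len(self.items) < n:
            raise RuntimeError('not enough elements (a real queue would block)')
        self.rng.shuffle(self.items)
        batch, self.items = self.items[:n], self.items[n:]
        return batch


SEQ_LEN = 5
queue = ToyShuffleQueue(element_len=SEQ_LEN)
queue.enqueue_many([[0, 1, 2, 3, 4], [-1, 0, 1, 2, 3]])  # a 2-token sentence
queue.enqueue_many([[0, 1, 2, 3, 4], [-1, 0, 1, 2, 3],
                    [-2, -1, 0, 1, 2], [-3, -2, -1, 0, 1]])  # a 4-token sentence
batch = queue.dequeue_many(4)
```

In TF 1.x, tf.train.shuffle_batch accepts enqueue_many and shapes arguments; with enqueue_many=True the first dimension of each input tensor is treated as the example index, so only the per-element shape needs to be fixed. Separately, in the question's RandomShuffleQueue attempt the op returned by enqueue_many is never run, which would be consistent with dequeue_many blocking.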

Double batching TensorFlow input data
