我正在实现一个用于字符串数据的令牌分类的convnet。一世 需要从TFRecord中获取字符串数据,批量洗牌,然后执行一些扩展数据的处理,然后再次批处理。这可能是两个batch_shuffle操作吗?
这就是我需要做的事情:
当然,一种解决方案是在将数据加载到TF之前对数据进行预处理,但这会占用更多的磁盘空间。
数据
以下是一些示例数据。我有两个"例子"。每个示例都包含标记化句子的特征和每个标记的标签:
sentences = [
[ 'the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog' '.'],
['then', 'the', 'lazy', 'dog', 'slept', '.']
]
sent_labels = [
['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'ANIMAL', 'O'],
['O', 'O', 'O', 'ANIMAL', 'O', 'O']
]
每个"示例"现在具有以下功能(为清晰起见,有些减少):
features {
feature {
key: "labels"
value {
bytes_list {
value: "O"
value: "O"
value: "O"
value: "ANIMAL"
...
}
}
}
feature {
key: "sentence"
value {
bytes_list {
value: "the"
value: "quick"
value: "brown"
value: "fox"
...
}
}
}
}
转化
批处理稀疏数据后,我收到一个句子列表:
[''' quick'' brown',' fox',...]
我需要首先将列表PAD到预定的SEQ_LEN,然后插入 每个例子中的位置索引,旋转位置使得 toke我想分类是在pos 0,每个位置标记都是相对的 到0位置:
[
['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4] # classify 'the'
['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2 'PAD', 3 ] # classify 'quick
['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1 'PAD', 2 ] # classify 'brown
['the', -3, 'quick', -2, 'brown', -1, 'fox', 0 'PAD', 1 ] # classify 'fox
]
批处理和重新打包数据
以下是我尝试做的简化版本:
# Enqueue the Filenames and serialize
filenames =[outfilepath]
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ')
reader = tf.TFRecordReader()
key, serialized_example = reader.read(fq)
# Dequeue Examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time
initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, capacity, min_after_dequeue)
# Parse Sparse Tensors, make into single dense Tensor
# ['the', 'quick', 'brown', 'fox']
parsed = tf.parse_example(data_batch, features=feature_mapping)
dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>')
sent_len = tf.shape(dense_tensor_sentence)[1]
SEQ_LEN = 5
NUM_PADS = SEQ_LEN - sent_len
#['the', 'quick', 'brown', 'fox', 'PAD']
padded_sentence = pad(dense_tensor_sentence, NUM_PADS)
# make sent_len X SEQ_LEN copy of sentence, position vectors
#[
# ['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4 ]
# ['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2 'PAD', 3 ]
# ['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1 'PAD', 2 ]
# ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0 'PAD', 1 ]
# NOTE: There is no row where PAD is with a position 0, because I don't
# want to classify the PAD token
#]
examples_with_positions = replicate_and_insert_positions(padded_sentence)
# While my SEQ_LEN will be constant, the sent_len will not. Therefore,
#I don't know the number of rows, but I can guarantee the number of
# columns. shape = (?,SEQ_LEN)
dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN)
# Try Random Shuffle Queue:
# Rebatch <-- This is where the problem is
#reshape_concat.set_shape((None, SEQ_LEN))
random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,))
random_queue.enqueue_many(dynamic_input)
batch = random_queue.dequeue_many(4)
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables())
sess = create_session()
sess.run(init_op)
#tf.get_default_graph().finalize()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
i = 0
while True:
print sess.run(batch)
i += 1
except tf.errors.OutOfRangeError as e:
print "No more inputs."
修改
我现在正在尝试使用RandomShuffleQueue。在每个入队列表中,我想将具有形状的批次排入队列(无,SEQ_LEN)。我修改了上面的代码以反映这一点。
我不再对输入形状抱怨,但排队确实挂起sess.run(batch)
答案 0 :(得分:1)
我正在接近整个问题。我错误地认为我必须在插入#DIV/0!
时定义批处理的完整形状,但实际上我只需要定义我输入的每个元素的形状,并设置tf.batch_shuffle
。
这是正确的代码:
enqueue_many=True