Question

I'm trying to feed data to an LSTM. I'm reviewing the code in Tensorflow's RNN tutorial here.

The code segment of interest is from the tutorial's reader.py file, in particular the ptb_producer function, which outputs the X and Y used by the LSTM.

raw_data is a list of word indices from the ptb.train.txt file. The length of raw_data is 929,589. batch_size is 20 and num_steps is 35. Both batch_size and num_steps come from the LARGEconfig, which feeds the data to the LSTM.

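I have walked through the code (and added comments for the values I printed), and I understand it up to tf.strided_slice. After the reshape, we have a matrix of indices of shape (20, 46497).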

The strided slice, in the first iteration of i, tries to take data from [0, i * num_steps + 1], which is [0, 1*35+1], to [batch_size, (i + 1) * num_steps + 1], which is [20, (1+1)*35+1].

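Two questions:

1. Where in the matrix are [0, 1*35+1] and [20, (1+1)*35+1]? Which positions in the (20, 46497) matrix are the begin and end arguments of strided_slice trying to access?
2. It seems that every iteration of i takes in data starting from 0, the very start of the (20, 46497) data matrix?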

I guess what I don't understand is how you would feed data into an LSTM, given the batch size and num_steps (the sequence length).

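I have read colah's blog on LSTM and Karpathy's blog on RNN, which help greatly in understanding LSTMs, but they don't seem to address the exact mechanics of getting data into an LSTM. (Maybe I missed something?)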

def ptb_producer(raw_data, batch_size, num_steps, name=None):
  """Iterate on the raw PTB data.
  This chunks up raw_data into batches of examples and returns Tensors that
  are drawn from these batches.
  Args:
    raw_data: one of the raw data outputs from ptb_raw_data.
    batch_size: int, the batch size.
    num_steps: int, the number of unrolls.
    name: the name of this operation (optional).
  Returns:
    A pair of Tensors, each shaped [batch_size, num_steps]. The second 
    element of the tuple is the same data time-shifted to the right by one.
  Raises:
    tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
  """
  with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
    data_len = tf.size(raw_data)   # prints 929,589
    batch_len = data_len // batch_size   # prints 46,497
    data = tf.reshape(raw_data[0 : batch_size * batch_len],
                      [batch_size, batch_len])      
    #this truncates raw data to a multiple of batch_size=20, 
    #then reshapes to [20, 46497].  prints (20,?)

    epoch_size = (batch_len - 1) // num_steps   # prints 1327 (number of batches per epoch)
    assertion = tf.assert_positive(
        epoch_size,
        message="epoch_size == 0, decrease batch_size or num_steps")
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")

    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    # i runs over the 1327 batch positions within one epoch
    x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps])   # prints (?, ?) 
    x.set_shape([batch_size, num_steps])  #prints (20,35) 
    y = tf.strided_slice(data, [0, i * num_steps + 1], [batch_size, (i + 1) * num_steps + 1])
    y.set_shape([batch_size, num_steps])
    return x, y

Answer 1

Where in the matrix are [0, 1*35+1] and [20, (1+1)*35+1]? Which positions in the (20, 46497) matrix are the begin and end arguments of strided_slice trying to access?

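So, the data matrix is arranged as 20 x 46497. Think of it as 20 sentences with 46497 characters each. The strided slice in your specific example takes characters 36 through 70 from each row (71 is excluded, typical range semantics). This is equivalent to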

strided = [data[i][36:71] for i in range(20)]
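To make the begin and end arguments concrete, here is a minimal pure-Python sketch of the same reshape and slice. It uses nested lists rather than TensorFlow tensors, the helper names reshape_to_batches and slice_block are hypothetical, and the toy raw_data (20 rows of 100 consecutive integers) is an assumption chosen so the selected columns are easy to read off.

```python
def reshape_to_batches(raw_data, batch_size):
    """Truncate raw_data to a multiple of batch_size and split it into rows."""
    batch_len = len(raw_data) // batch_size
    return [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]


def slice_block(data, begin, end):
    """Take rows begin[0]..end[0]-1 and columns begin[1]..end[1]-1 (end exclusive)."""
    return [row[begin[1]:end[1]] for row in data[begin[0]:end[0]]]


batch_size, num_steps, i = 20, 35, 1

# Toy stand-in for the word indices: row r holds r*100 .. r*100 + 99.
raw_data = list(range(batch_size * 100))
data = reshape_to_batches(raw_data, batch_size)

# Same begin/end as the y slice in the question, with i = 1.
y = slice_block(data,
                [0, i * num_steps + 1],
                [batch_size, (i + 1) * num_steps + 1])

print(len(y), len(y[0]))   # 20 rows, 35 columns
print(y[0][0], y[0][-1])   # columns 36 and 70 of row 0
```

The begin pair names the top-left corner of the block and the end pair names the corner just past the bottom-right, so all 20 rows are selected and only the column window moves with i.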

It seems that every iteration of i takes in data starting from 0, the very start of the (20, 46497) data matrix?

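That is correct. For each iteration we want batch_size elements in the first dimension, i.e. we want a matrix of size batch_size x num_steps. Hence the first dimension always goes from 0 to 19, essentially looping through each sentence. The second dimension advances in increments of num_steps, taking a segment of words from each sentence. If we look at the indices it extracts from data at each step for x, we get something like: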

- data[0:20,0:35]
- data[0:20,35:70]
- data[0:20,70:105]
- ...
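The column windows listed above can be enumerated with a short sketch. The function name x_column_ranges is hypothetical, and the code only reproduces the index arithmetic from ptb_producer, not the TensorFlow queue. One thing worth noting: in Python, 929589 // 20 evaluates to 46479, which is also the value consistent with the 1327 printed for epoch_size, so the 46,497 quoted in the question looks like a digit transposition.

```python
def x_column_ranges(batch_len, num_steps):
    """Return the (start, end) column range of x for every i in one epoch."""
    # One column is held back so that y (shifted by one) stays in bounds.
    epoch_size = (batch_len - 1) // num_steps
    return [(i * num_steps, (i + 1) * num_steps) for i in range(epoch_size)]


batch_len = 929589 // 20          # columns per row after the reshape
ranges = x_column_ranges(batch_len, 35)

print(batch_len, len(ranges))     # columns per row, batches per epoch
print(ranges[:3])                 # first three windows
print(ranges[-1])                 # last window of the epoch
```

Every window spans the full row range 0:20; only the column range advances, which is why each iteration appears to start from row 0.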

For y we add 1 because we want to predict the next character. So your (train, label) tuples will look like:

- (data[0:20,0:35], data[0:20, 1:36])
- (data[0:20,35:70], data[0:20,36:71])
- (data[0:20,70:105], data[0:20,71:106])
- ...
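As a rough illustration of these (train, label) pairs, the generator below mimics the slicing in ptb_producer with plain Python lists. The name ptb_batches is hypothetical, the toy input of 20 consecutive integers is an assumption, and the sketch leaves out the TensorFlow input queue entirely.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs where y is x shifted one position to the right."""
    batch_len = len(raw_data) // batch_size
    data = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y


# Toy data: 2 rows of 10 "words" each, windows of 3 steps.
batches = list(ptb_batches(list(range(20)), batch_size=2, num_steps=3))
for x, y in batches:
    print(x, y)
```

In each pair, y[r][t] is the element that follows x[r][t] in row r, so at every time step the label is simply the next token of the same row.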
